Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Michal Migurski
> On Mon, Oct 27, 2008 at 9:39 PM, Michal Migurski <[EMAIL PROTECTED]>  
> wrote:
>> I'm liking Jochen Topf's suggestion here:
>>
>>   "If the planet dump plus the diff from the same day is what
>> everybody wants anyway, why not do this on the server side and hold
>> the planet back after the first diff is available, run this over the
>> planet and then publish that as the planet?"
>
> 1. Because there are plenty of uses for the planet dump that don't
> need consistent snapshots.

Those uses would not be impacted by consistent snapshots.


> 2. Because such consistent snapshots have been available elsewhere for
> quite a while now and people who need them can get them. There's no
> particular reason why it has to be on the same site as the normal
> planet dumps.

Yet there is no link to these places from planet.openstreetmap.org  
that indicates that the files available there differ in some important  
or useful way. The telascience.org source you suggested is described  
as "extracts of NL, Scandinavia and Taiwan" at 
http://wiki.openstreetmap.org/index.php/Planet.osm 
, rather than a complete dump of Planet with different datetime  
boundaries.

I'm happy to keep bellying up to the trial & error bar here, but as I  
mentioned in a previous mail, the volume of data involved means that  
each attempt at the data (successful or not) has a multiple-day  
cost associated with it.


> Umm, yeah. I was of course assuming you were running the latest
> version; otherwise anything is possible. The creates-as-modifies fix
> was done two months ago.


I'll recompile and replace the two-month-old version of osm2pgsql I've  
been using.

-mike.


michal migurski- [EMAIL PROTECTED]
  415.558.1610




___
talk mailing list
talk@openstreetmap.org
http://lists.openstreetmap.org/listinfo/talk


Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Brett Henderson
On Tue, Oct 28, 2008 at 7:39 AM, Michal Migurski <[EMAIL PROTECTED]> wrote:

> >> Finally, the boundaries between the hourlies and dailies seem
> >> misaligned.
> >>
> >
> > This shouldn't be the case.
> >> After running the remaining hourlies for the 22nd, I attempted to
> >> pick  up on the 23rd with a daily. The final hourly I used was
> >> 2008102223-2008102300.osc.gz. It's my expectation that I should be
> >> able to immediately follow that with 20081023-20081024.osc.gz, but
> >> this led to duplicate key violation suggesting that there's an
> >> overlap  between the two files. Continuing with hourlies *works*,
> >> but is  tedious and I suspect slower than the dailies.
> >>
> >
> > You should have been able to do what you've suggested.  If you are
> > finding problems, please provide me with some example data which is
> > misaligned between the two types of changesets.
>
> Try the two files mentioned above - that's where I saw this behavior,
> they're quite recent.
>
>2008102223-2008102300.osc.gz
>20081023-20081024.osc.gz


I need you to provide some specific examples of broken data.  If you can say
that "way 27123456 is created in both of the above files even though they
are for different time periods" then I can take a look at why this may have
occurred.  Just saying that there is misalignment between those two files
doesn't help me at all.  Presumably you ran into a specific problem and
received a specific error message; this is the kind of information I need.
I only do this project in my spare time and can't go looking for problems
that I'm not sure even exist; I have enough known problems to look into
already :-)
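
A minimal sketch of how such a specific example could be found, assuming both
files are gzip-compressed osmChange XML in which new objects sit inside
<create> blocks. The function names are hypothetical and this is not part of
osmosis or osm2pgsql; it only lists objects that both files claim to create.

```python
import gzip
import xml.etree.ElementTree as ET

ELEMENT_TYPES = {"node", "way", "relation"}


def created_ids(path):
    """Collect (type, id) pairs found inside <create> blocks of one .osc.gz file."""
    found = set()
    in_create = False
    with gzip.open(path, "rb") as handle:
        for event, elem in ET.iterparse(handle, events=("start", "end")):
            if elem.tag == "create":
                # True while we are between <create> and </create>.
                in_create = event == "start"
            elif event == "start" and in_create and elem.tag in ELEMENT_TYPES:
                # Attributes are already available on the start event.
                found.add((elem.tag, elem.get("id")))
            elif event == "end" and elem.tag in ELEMENT_TYPES:
                elem.clear()  # free child elements to keep memory use low
    return found


def created_in_both(first_path, second_path):
    """Return the sorted list of objects that both files create."""
    return sorted(created_ids(first_path) & created_ids(second_path))


# Example call with the two files named in this thread:
# created_in_both("2008102223-2008102300.osc.gz", "20081023-20081024.osc.gz")
```

An empty result would suggest the duplicate key violation comes from somewhere
other than an overlap between the two files.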


>
>
>
> >> My sense from reading other people's experiences has been that it's
> >> a  common pattern to rely solely on the weekly planet dumps,
> >> incurring  the substantial overhead of parsing and importing the
> >> full 5GB dump  once every week, and then re-rendering the complete
> >> set of tiles.
> >>
> >
> > For a long time weekly planet dumps were the only bulk data
> > available.  Osmosis changesets have been on the scene for some time
> > now though and are gradually being utilised by more and more
> > clients.  As the planet grows, this will become more critical.  Who
> > knows, if the kinks gradually get ironed out of the osm2pgsql
> > program we may even begin to see the main mapnik tile generator move
> > to using changesets.
>
> I would love to rely on these exclusively, it's much more efficient.
> But, I was seeing a fair bit of information fall through the cracks so
> that's why I'm re-synching to planet every four weeks.


Again, please provide some specific examples.  If data is being missed I'd
like to know about it.  Osmosis provides some tools that may be useful
here.  You can download a planet, apply changesets for a week, then compare
against the next planet and see what the differences are.  Obviously both
planets would need appropriate changesets applied to make them consistent
before performing a comparison to eliminate noise.
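
The comparison step can be sketched roughly as follows, assuming both planets
have already been loaded into dictionaries keyed by (type, id). That loading
step, and the names used here, are hypothetical; osmosis has its own tooling
for this, which is not shown.

```python
def diff_snapshots(patched, reference):
    """Compare two snapshots that map (type, id) -> comparable object state.

    'patched' is the older planet with a week of changesets applied;
    'reference' is the next planet, made consistent the same way.
    """
    missing = sorted(k for k in reference if k not in patched)
    extra = sorted(k for k in patched if k not in reference)
    changed = sorted(
        k for k in patched if k in reference and patched[k] != reference[k]
    )
    return {"missing": missing, "extra": extra, "changed": changed}


# Tiny illustration with made-up objects.
patched = {("node", 1): {"lat": 0.0}, ("node", 2): {"lat": 5.0}}
reference = {("node", 1): {"lat": 0.0}, ("node", 2): {"lat": 6.0}, ("way", 9): {}}
report = diff_snapshots(patched, reference)
```

Anything listed under "missing" or "changed" would be a candidate for data
that fell through the cracks.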

I probably should do some of these comparisons myself, but again I just
haven't found the time yet and nobody else has complained about missing data.
The minute changesets run 5 minutes behind the API so could potentially miss
data if a lock is held for several minutes.  The daily and hourly changesets
run at least 20 minutes behind the API (I forget the exact figure off the top
of my head) and should be extremely unlikely to miss data.

Brett


Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Martijn van Oosterhout
On Mon, Oct 27, 2008 at 9:40 PM, Michal Migurski <[EMAIL PROTECTED]> wrote:
>> Now that I think about it though, I think what I did was take one of
>> the planet dumps from http://hypercube.telascience.org/planet/ (which
>> *are* consistent snapshots), and run the dailies from there.
>
> Is there any reason to not use those? They seem to be more frequent
> than the planet.openstreetmap.org ones - is there some disadvantage?
> How are they created?

Umm, they are created by taking the planet dumps and applying the
daily diffs every day. They are used to produce consistent snapshots
of, for example, NL, and by the coastline checker (which really likes
having consistent snapshots to work with).

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/



Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Martijn van Oosterhout
On Mon, Oct 27, 2008 at 9:39 PM, Michal Migurski <[EMAIL PROTECTED]> wrote:
> I'm liking Jochen Topf's suggestion here:
>
>"If the planet dump plus the diff from the same day is what everybody
> wants anyway, why not do this on the server side and hold the planet
> back after the first diff is available, run this over the planet and
> then publish that as the planet?"

1. Because there are plenty of uses for the planet dump that don't
need consistent snapshots.

2. Because such consistent snapshots have been available elsewhere for
quite a while now and people who need them can get them. There's no
particular reason why it has to be on the same site as the normal
planet dumps.

> Probably what I need to do is get a fresh update of osm2pgsql. I can
> see now that the revision I'm using is older than #10464, where some
> inconsistency resilience was added.

Umm, yeah. I was of course assuming you were running the latest
version; otherwise anything is possible. The creates-as-modifies fix
was done two months ago.

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/



Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Michal Migurski
On Oct 27, 2008, at 12:59 AM, Martijn van Oosterhout wrote:

> On Mon, Oct 27, 2008 at 1:10 AM, Michal Migurski <[EMAIL PROTECTED]>  
> wrote:
>> The final event in each weekly planet dump does not fall on an even
>> day boundary. In the case of the most recent Oct. 22nd planet.osm, it
>> was necessary to experiment with hourly diffs from that day to find
>> that the boundary was approx. 2:00pm. Hourlies up to and including
>> 2008102213-2008102214.osc.gz failed, hourlies after that succeeded. I
>> could go more granular here, checking the minute diffs as well for a
>> more precise breakpoint, but it seems odd that the planet dump does
>> not break cleanly on a midnight boundary so that it's possible to
>> pick up the differences moving forward.
>
> As I recall, osm2pgsql did support this kind of operation (or at least
> it did last time I tried, it was discussed on the list). All creates
> in diffs are treated as delete+insert. You don't actually say what the
> error was you ran into though so I can't be sure if you're talking
> about the same problem.
>
> Now that I think about it though, I think what I did was take one of
> the planet dumps from http://hypercube.telascience.org/planet/ (which
> *are* consistent snapshots), and run the dailies from there.

Is there any reason to not use those? They seem to be more frequent  
than the planet.openstreetmap.org ones - is there some disadvantage?  
How are they created?

-mike.


michal migurski- [EMAIL PROTECTED]
  415.558.1610






Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Michal Migurski
>> Yep, as others have commented there are two table types in the osm  
>> database: current tables and history tables.  The planet dumper  
>> just reads the current tables, which is the fastest approach.  
>> Unfortunately the current tables change constantly during the  
>> planet generation process, resulting in inconsistencies.  It is  
>> possible to produce a consistent snapshot by reading the history  
>> tables, and osmosis has the ability to do just that, but it is  
>> significantly slower.  It is also possible to produce a consistent  
>> snapshot by taking an inconsistent planet and applying changesets  
>> from a point in time prior to the planet dump beginning through to  
>> a point after completion.  This effectively produces the same  
>> result at much reduced load on the main database.
>>

I'm liking Jochen Topf's suggestion here:

"If the planet dump plus the diff from the same day is what everybody  
wants anyway, why not do this on the server side and hold the planet  
back after the first diff is available, run this over the planet and  
then publish that as the planet?"


>> Finally, the boundaries between the hourlies and dailies seem   
>> misaligned.
>>
>
> This shouldn't be the case.
>> After running the remaining hourlies for the 22nd, I attempted to  
>> pick  up on the 23rd with a daily. The final hourly I used was   
>> 2008102223-2008102300.osc.gz. It's my expectation that I should be   
>> able to immediately follow that with 20081023-20081024.osc.gz, but   
>> this led to duplicate key violation suggesting that there's an  
>> overlap  between the two files. Continuing with hourlies *works*,  
>> but is  tedious and I suspect slower than the dailies.
>>
>
> You should have been able to do what you've suggested.  If you are  
> finding problems, please provide me with some example data which is  
> misaligned between the two types of changesets.

Try the two files mentioned above - that's where I saw this behavior,  
they're quite recent.

2008102223-2008102300.osc.gz
20081023-20081024.osc.gz


>> My sense from reading other people's experiences has been that it's  
>> a  common pattern to rely solely on the weekly planet dumps,  
>> incurring  the substantial overhead of parsing and importing the  
>> full 5GB dump  once every week, and then re-rendering the complete  
>> set of tiles.
>>
>
> For a long time weekly planet dumps were the only bulk data  
> available.  Osmosis changesets have been on the scene for some time  
> now though and are gradually being utilised by more and more  
> clients.  As the planet grows, this will become more critical.  Who  
> knows, if the kinks gradually get ironed out of the osm2pgsql  
> program we may even begin to see the main mapnik tile generator move  
> to using changesets.

I would love to rely on these exclusively, it's much more efficient.  
But, I was seeing a fair bit of information fall through the cracks so  
that's why I'm re-synching to planet every four weeks.



>> I can see a few possible solutions.
>>
>> The cutoff times for files on planet.openstreetmap.org could  
>> behave  more consistently. A weekly dump should end at 11:59pm so  
>> that dailies  can immediately pick up user activity. Hourly and  
>> daily dumps should  be synchronized. This seems more difficult.
>>
>
> You only need a single consistent snapshot to get started.  You can  
> download a planet, then download the two daily changesets either  
> side of the planet generation window, then use osmosis to patch the  
> planet.  This will give you a consistent snapshot.  Once you've  
> imported that into your target database you can then start using  
> daily changesets to keep up to date (or hourly or minute as  
> appropriate).
>
> While it would be nice to have planet dumps already in consistent  
> form, it does add a significant overhead to the whole process.  It's  
> not terribly hard to fix on the client side.

Probably what I need to do is get a fresh update of osm2pgsql. I can  
see now that the revision I'm using is older than #10464, where some  
inconsistency resilience was added.


-mike.



michal migurski- [EMAIL PROTECTED]
  415.558.1610






Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Frederik Ramm
Hi,

Brett Henderson wrote:
>> Brett Henderson has offered to look into creating the dailies from 
>> history as well, but I don't know about the status of that.
>>   
> Are you referring to the daily changesets? 
[...]
> Or did you mean planets instead of dailies? 

Mix-up on my part, sorry, yes I meant the planets.

Bye
Frederik

-- 
Frederik Ramm  ##  eMail [EMAIL PROTECTED]  ##  N49°00'09" E008°23'33"



Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Brett Henderson
Jochen Topf wrote:
> If the planet dump plus the diff from the same day is what everybody
> wants anyway, why not do this on the server side and hold the planet
> back after the first diff is available, run this over the planet and
> then publish that as the planet?
>   
It would add delay to the planet creation process.  I don't know how 
much of an issue that would be.

How many people still download the full planet on a regular basis?  I 
would hope that people would begin to use changesets even if they only 
require a complete xml file.  For bandwidth reasons alone the gains are 
well worthwhile, plus you can get far more regular updates than weekly.  
The script below automates keeping a snapshot file in sync:
http://svn.openstreetmap.org/applications/utils/osmosis/script/contrib/replicate_osm_file.sh




Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Brett Henderson
Frederik Ramm wrote:
> Hi,
>
> Michal Migurski wrote:
>   
>> I've noticed some misalignments between the data in the dumps and the  
>> osm2pgsql importer that leads to unavoidable holes in the data.
>> 
>
> As TomH has already said, this is not a bug, it stems from the fact that 
> the full planet export reads the "current" tables and as such is subject 
> to changes that occur during the export process. (There may even be 
> inconsistencies when something like this happens: Exporter dumps nodes, 
> exporter starts dumping ways, user adds new node into way, new way 
> version is dumped referring to new node that is not in the dump.)
>
> The daily, hourly, and minutely diffs have a clean cutoff date because 
> they are taken from the history tables.
>
> Brett Henderson has offered to look into creating the dailies from 
> history as well, but I don't know about the status of that.
>   
Are you referring to the daily changesets?  The daily, hourly and minute 
changesets are all using the same osmosis-extract-mysql application and 
the only difference is the interval being used.  For a while it was 
using a shell script for dailies that spaetz initially created and I 
extended, but it was unreliable in the face of database outages and is 
no longer being used.  That was the reason for the switch from bzip to 
gzip compression.

Or did you mean planets instead of dailies?  I have a working 
implementation but it is slower than the existing planet dump process so 
I've never tried to introduce it.  It would be faster to automate 
patching of a current table planet with some changesets.

Brett




Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Brett Henderson
Others have already commented on most of your points but I'll add my 
thoughts in case there are some gaps.

Michal Migurski wrote:
> Hi,
>
> I've been trying to keep up to date with the dumps and diffs from 
> http://planet.openstreetmap.org/ 
> , and I'm running into a number of bugs related to cutoff dates.
>
> In keeping my Bay Area tiles 
> (http://mike.teczno.com/notes/cascadenik-openstreetmap.html 
> ) up to date, I've been grabbing complete planet.osm dumps about once  
> per month, and filling in the intervening time with daily diffs. I've  
> noticed some misalignments between the data in the dumps and the  
> osm2pgsql importer that leads to unavoidable holes in the data.
>
> It seems that they could be fixed in either osm2pgsql, the planet  
> files, or both.
>
> The final event in each weekly planet dump does not fall on an even  
> day boundary. In the case of the most recent Oct. 22nd planet.osm, it  
> was necessary to experiment with hourly diffs from that day to find  
> that the boundary was approx. 2:00pm. Hourlies up to and including  
> 2008102213-2008102214.osc.gz failed, hourlies after that succeeded. I  
> could go more granular here, checking the minute diffs as well for a  
> more precise breakpoint, but it seems odd that the planet dump does  
> not break cleanly on a midnight boundary so that it's possible to pick  
> up the differences moving forward.
>   
Yep, as others have commented there are two table types in the osm 
database: current tables and history tables.  The planet dumper just 
reads the current tables, which is the fastest approach.  Unfortunately the 
current tables change constantly during the planet generation process, 
resulting in inconsistencies.  It is possible to produce a consistent 
snapshot by reading the history tables, and osmosis has the ability to do 
just that, but it is significantly slower.  It is also possible to produce a 
consistent snapshot by taking an inconsistent planet and applying 
changesets from a point in time prior to the planet dump beginning 
through to a point after completion.  This effectively produces the same 
result at much reduced load on the main database.
> osm2pgsql itself notifies the user of inconsistencies by failing. I  
> can see that effort has been put into making it more resilient (e.g. 
> http://trac.openstreetmap.org/changeset/10464) 
> . Does osm2pgsql have something like a `--force` switch? I haven't  
> been able to find one. In looking at the diff files, it seems that it  
> should be possible to ignore possible conflicts by simply overwriting  
> whatever's in the DB with whatever's in the .osc file.
>   
Yes, that's true.  I can't comment on osm2pgsql but when osmosis 
processes changeset files it does exactly that.
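
A rough sketch of what "simply overwriting" means in practice. This is not
the osmosis or osm2pgsql code; the in-memory store and the function names are
hypothetical stand-ins for the real database. Creates and modifies are both
treated as an upsert and deletes of missing objects are ignored, so replaying
a diff that overlaps the planet leaves the same end state.

```python
def apply_change(store, action, kind, obj_id, data=None):
    """Apply one change to a dict keyed by (kind, obj_id)."""
    key = (kind, obj_id)
    if action == "delete":
        store.pop(key, None)  # deleting an absent object is not an error
    elif action in ("create", "modify"):
        store[key] = data  # overwrite whatever is already stored
    else:
        raise ValueError(f"unknown action: {action!r}")


def apply_changes(store, changes):
    """Apply (action, kind, obj_id, data) tuples in order and return the store."""
    for change in changes:
        apply_change(store, *change)
    return store


# A planet that already contains node 1, plus a diff that "creates" it again.
planet = {("node", 1): {"lat": 0.0}, ("node", 2): {"lat": 5.0}}
diff = [
    ("create", "node", 1, {"lat": 0.0}),
    ("modify", "node", 2, {"lat": 6.0}),
    ("delete", "node", 3, None),
    ("create", "way", 9, {"nodes": [1, 2]}),
]
once = apply_changes(dict(planet), diff)
twice = apply_changes(dict(once), diff)
```

This only holds when diffs are replayed in chronological order; applying an
older diff after a newer one would roll objects back to stale values.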
> Finally, the boundaries between the hourlies and dailies seem  
> misaligned.
>   
This shouldn't be the case.
> After running the remaining hourlies for the 22nd, I attempted to pick  
> up on the 23rd with a daily. The final hourly I used was  
> 2008102223-2008102300.osc.gz. It's my expectation that I should be  
> able to immediately follow that with 20081023-20081024.osc.gz, but  
> this led to duplicate key violation suggesting that there's an overlap  
> between the two files. Continuing with hourlies *works*, but is  
> tedious and I suspect slower than the dailies.
>   
You should have been able to do what you've suggested.  If you are 
finding problems, please provide me with some example data which is 
misaligned between the two types of changesets.  I've gone to a fair bit 
of trouble to ensure that timestamp management is correct.  For example, 
all changesets and file names are using UTC even though the database 
itself is using BST.  If I've made a mistake somewhere I'd like to know 
about it.  Given that daily, hourly and minute changesets are using 
*identical* code, I find it hard to believe they're inconsistent with 
each other.
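
To make the expected sequence concrete, here is a small sketch that generates
the file names a client would fetch. The naming pattern is inferred from the
two file names quoted in this thread and is an assumption, as are the function
names; all datetimes are naive values interpreted as UTC.

```python
from datetime import datetime, timedelta


def hourly_names(start, end):
    """Hourly changeset names covering [start, end), both on the hour (UTC)."""
    names = []
    current = start
    while current < end:
        following = current + timedelta(hours=1)
        names.append(f"{current:%Y%m%d%H}-{following:%Y%m%d%H}.osc.gz")
        current = following
    return names


def daily_name(day):
    """Daily changeset name for the UTC day starting at 'day'."""
    following = day + timedelta(days=1)
    return f"{day:%Y%m%d}-{following:%Y%m%d}.osc.gz"


# Finish the hourlies for the 22nd, then switch to the daily for the 23rd.
last_hourlies = hourly_names(datetime(2008, 10, 22, 22), datetime(2008, 10, 23, 0))
next_daily = daily_name(datetime(2008, 10, 23))
```

If the two series are aligned, the last hourly ends at the same UTC instant at
which the first daily begins, so no object should appear as a create in both.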
> My sense from reading other people's experiences has been that it's a  
> common pattern to rely solely on the weekly planet dumps, incurring  
> the substantial overhead of parsing and importing the full 5GB dump  
> once every week, and then re-rendering the complete set of tiles.
>   
For a long time weekly planet dumps were the only bulk data available.  
Osmosis changesets have been on the scene for some time now though and 
are gradually being utilised by more and more clients.  As the planet 
grows, this will become more critical.  Who knows, if the kinks 
gradually get ironed out of the osm2pgsql program we may even begin to 
see the main mapnik tile generator move to using changesets.
> My hope has been to proceed in a more incremental fashion, since this  
> makes it possible to track what specific tiles need to be re-rendered  
> on a near-constant schedule, based on actual content or activity, vs.  
> simple cache expiration. Right now I'm doing this daily, I'd like to  
> do it as often as hourly.
>   
Yep, that was one of my original aims.
> I can

Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Jochen Topf
On Mon, Oct 27, 2008 at 08:22:32AM +, Tom Hughes wrote:
> Shaun McDonald wrote:
> > On 27 Oct 2008, at 00:50, Michal Migurski wrote:
> > 
> >>> Planet dumps are not snapshots - they do not represent a consistent
> >>> view at any particular point in time because they take a number of
> >>> hours to generate, during which time new changes are constantly
> >>> being made to the contents of the database.
>  >>
> >> Shouldn't it be possible to ignore any changes that happen after the
> >> cutoff, though?
> > 
> > At the moment we don't look at the time stamps when dumping the planet  
> > file.
> 
> It's not as simple as that - you also have to switch to reading the 
> history tables rather than the current tables or you won't be able to 
> see what the state of the object used to be if it has changed since the 
> snapshot time.
> 
> Which means you're reading much more data, and either having to track 
> the state of each object (in order to find the most recent valid change) 
> or you have to index scan so that you're seeing things in timestamp order.

If the planet dump plus the diff from the same day is what everybody
wants anyway, why not do this on the server side and hold the planet
back after the first diff is available, run this over the planet and
then publish that as the planet?

Jochen
-- 
Jochen Topf  [EMAIL PROTECTED]  http://www.remote.org/jochen/  +49-721-388298




Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Tom Hughes
Shaun McDonald wrote:
> On 27 Oct 2008, at 00:50, Michal Migurski wrote:
> 
>>> Planet dumps are not snapshots - they do not represent a consistent
>>> view at any particular point in time because they take a number of
>>> hours to generate, during which time new changes are constantly
>>> being made to the contents of the database.
 >>
>> Shouldn't it be possible to ignore any changes that happen after the
>> cutoff, though?
> 
> At the moment we don't look at the time stamps when dumping the planet  
> file.

It's not as simple as that - you also have to switch to reading the 
history tables rather than the current tables or you won't be able to 
see what the state of the object used to be if it has changed since the 
snapshot time.

Which means you're reading much more data, and either having to track 
the state of each object (in order to find the most recent valid change) 
or you have to index scan so that you're seeing things in timestamp order.
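
The "track the state of each object" option can be sketched like this. The
row layout (object id, version, timestamp, visible flag) is a simplified,
hypothetical stand-in for the real history tables, and the function name is
made up for illustration.

```python
from datetime import datetime


def snapshot_at(history, cutoff):
    """Return {obj_id: version} for objects visible at 'cutoff'.

    'history' is an iterable of (obj_id, version, timestamp, visible) rows
    in any order; rows newer than the cutoff are ignored.
    """
    latest = {}
    for obj_id, version, timestamp, visible in history:
        if timestamp > cutoff:
            continue  # changed after the snapshot time
        best = latest.get(obj_id)
        if best is None or version > best[0]:
            latest[obj_id] = (version, visible)
    # Drop objects whose most recent valid version is a deletion.
    return {obj_id: row[0] for obj_id, row in latest.items() if row[1]}


rows = [
    (1, 1, datetime(2008, 10, 22, 10), True),
    (1, 2, datetime(2008, 10, 22, 15), True),   # edited after the cutoff
    (2, 1, datetime(2008, 10, 22, 9), True),
    (2, 2, datetime(2008, 10, 22, 11), False),  # deleted before the cutoff
    (3, 1, datetime(2008, 10, 22, 16), True),   # created after the cutoff
]
snapshot = snapshot_at(rows, datetime(2008, 10, 22, 14))
```

Every history row has to be read to build the dictionary, which is where the
extra cost compared with dumping the current tables comes from.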

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/



Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Martijn van Oosterhout
On Mon, Oct 27, 2008 at 1:10 AM, Michal Migurski <[EMAIL PROTECTED]> wrote:
> The final event in each weekly planet dump does not fall on an even
> day boundary. In the case of the most recent Oct. 22nd planet.osm, it
> was necessary to experiment with hourly diffs from that day to find
> that the boundary was approx. 2:00pm. Hourlies up to and including
> 2008102213-2008102214.osc.gz failed, hourlies after that succeeded. I
> could go more granular here, checking the minute diffs as well for a
> more precise breakpoint, but it seems odd that the planet dump does
> not break cleanly on a midnight boundary so that it's possible to pick
> up the differences moving forward.

As I recall, osm2pgsql did support this kind of operation (or at least
it did last time I tried, it was discussed on the list). All creates
in diffs are treated as delete+insert. You don't actually say what the
error was you ran into though so I can't be sure if you're talking
about the same problem.

Now that I think about it though, I think what I did was take one of
the planet dumps from http://hypercube.telascience.org/planet/ (which
*are* consistent snapshots), and run the dailies from there.

Have a nice day,
-- 
Martijn van Oosterhout <[EMAIL PROTECTED]> http://svana.org/kleptog/



Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-27 Thread Jochen Topf
On Sun, Oct 26, 2008 at 06:11:04PM -0700, Michal Migurski wrote:
> What is the difference between osmosis and osm2pgsql, with regards to  
> postGIS?

osm2pgsql creates the structure needed for Mapnik. Osmosis creates a
structure more similar to the one in the OSM central database.

> If I've been maintaining a dataset based on osm2pgsql with the  
> provided default.style, would a dataset based on osmosis result in a  
> substantially different table structure?

Yes.

Jochen
-- 
Jochen Topf  [EMAIL PROTECTED]  http://www.remote.org/jochen/  +49-721-388298




Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-26 Thread Shaun McDonald

On 27 Oct 2008, at 00:50, Michal Migurski wrote:

>>> The final event in each weekly planet dump does not fall on an
>>> even  day boundary. In the case of the most recent Oct. 22nd
>>> planet.osm, it  was necessary to experiment with hourly diffs from
>>> that day to find  that the boundary was approx. 2:00pm. Hourlies up
>>> to and including  2008102213-2008102214.osc.gz failed, hourlies
>>> after that succeeded. I  could go more granular here, checking the
>>> minute diffs as well for a  more precise breakpoint, but it seems
>>> odd that the planet dump does  not break cleanly on a midnight
>>> boundary so that it's possible to pick  up the differences moving
>>> forward.
>>
>> Planet dumps are not snapshots - they do not represent a consistent
>> view at any particular point in time because they take a number of
>> hours to generate, during which time new changes are constantly
>> being made to the contents of the database.
>
> Shouldn't it be possible to ignore any changes that happen after the
> cutoff, though?

At the moment we don't look at the time stamps when dumping the planet  
file.

> I may not understand the structure of the OSM
> database, but it seems like if it supports rollbacks, then in theory
> it ought to be possible to only include things before a given
> timestamp when creating the dump file. That, or make it clear what the
> actual cutoff time is in the dumpfile.

We currently don't support rollbacks. It would require a rewrite of  
the dump script, and more time and processing to be able to produce a  
consistent planet dump.

>
>
> I understand that in practice, practice is different from theory. =)

Have you got the rails port running?

>
>
>
>> I believe that it is supposed to be safe to apply diffs which
>> overlap with the planet dump in order to bring it to a consistent
>> state however.
>
> This is what I would have hoped, however osm2pgsql does not appear to
> allow it. It feels like the easiest solution would be to give
> osm2pgsql a --force option, and add some explanation of timing and
> cutoffs to http://planet.openstreetmap.org/README.
>

The initial import that you do with osm2pgsql must use a special  
mode that allows diff imports. Could it be that you need to update to the  
latest version of osm2pgsql? You should be able to happily apply the  
diffs to an inconsistent planet dump to get a consistent planet dump.

This will become easier when the version numbers are exposed in the  
0.6 API. The diff mechanism would then be able to look at the version  
numbers of the nodes/ways/relations and be able to deal with them  
appropriately.
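
One way a version-aware diff mechanism could behave, sketched under the
assumption that every object in a diff carries an integer version. The store
layout and function name are hypothetical; this is not how the 0.6 API or any
existing tool is known to work.

```python
def apply_versioned(store, key, version, data, deleted=False):
    """Apply a change only if it is newer than what the store already holds.

    Returns True if the change was applied and False if it was skipped.
    Deletions are kept as tombstones so that a late-arriving older change
    cannot bring a deleted object back.
    """
    current = store.get(key)
    if current is not None and current["version"] >= version:
        return False  # stale or duplicate change, safe to ignore
    store[key] = {"version": version, "data": data, "deleted": deleted}
    return True


store = {}
apply_versioned(store, ("node", 1), 2, {"lat": 1.0})      # applied
apply_versioned(store, ("node", 1), 1, {"lat": 0.0})      # skipped, older
apply_versioned(store, ("node", 1), 3, None, deleted=True)  # applied, tombstone
```

With a check like this the order and overlap of diffs stops mattering, since
an object can only ever move forward to a higher version.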

Shaun




Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-26 Thread Michal Migurski
On Oct 26, 2008, at 5:50 PM, Frederik Ramm wrote:

> Brett Henderson has offered to look into creating the dailies from  
> history as well, but I don't know about the status of that.
>
> If you use osmosis, it is safe (and in fact recommended) that, after  
> loading the database with a planet file initially, you should load  
> that same day's diff file as the first diff, creating a clean cutoff  
> point. It is possible that the same is not working with osm2pgsql, I  
> have no experience there.


What is the difference between osmosis and osm2pgsql, with regards to  
postGIS?

If I've been maintaining a dataset based on osm2pgsql with the  
provided default.style, would a dataset based on osmosis result in a  
substantially different table structure?

-mike.


michal migurski- [EMAIL PROTECTED]
  415.558.1610






Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-26 Thread Michal Migurski
>> The final event in each weekly planet dump does not fall on an
>> even day boundary. In the case of the most recent Oct. 22nd
>> planet.osm, it was necessary to experiment with hourly diffs from
>> that day to find that the boundary was approx. 2:00pm. Hourlies up
>> to and including 2008102213-2008102214.osc.gz failed, hourlies
>> after that succeeded. I could go more granular here, checking the
>> minute diffs as well for a more precise breakpoint, but it seems
>> odd that the planet dump does not break cleanly on a midnight
>> boundary so that it's possible to pick up the differences moving
>> forward.
>
> Planet dumps are not snapshots - they do not represent a consistent  
> view at any particular point in time because they take a number of  
> hours to generate, during which time new changes are constantly  
> being made to the contents of the database.

Shouldn't it be possible to ignore any changes that happen after the  
cutoff, though? I may not understand the structure of the OSM  
database, but it seems like if it supports rollbacks, then in theory  
it ought to be possible to only include things before a given  
timestamp when creating the dump file. That, or make it clear what the  
actual cutoff time is in the dumpfile.

I understand that in practice, practice is different from theory. =)


> I believe, however, that it is supposed to be safe to apply diffs
> which overlap with the planet dump in order to bring it to a
> consistent state.

This is what I would have hoped; however, osm2pgsql does not appear to
allow it. It feels like the easiest solution would be to give
osm2pgsql a --force option, and add some explanation of timing and
cutoffs to http://planet.openstreetmap.org/README.


> BTW I'm not sure why you CCed the OSMF board on this... I don't  
> think it needs their input at all.

Mikel Maron suggested that I cc: team@, when I spoke to him about this  
a few days ago, because it's connected to a *.openstreetmap.org service.

Thanks for your reply!

-mike.


michal migurski- [EMAIL PROTECTED]
  415.558.1610






Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-26 Thread Frederik Ramm
Hi,

Michal Migurski wrote:
> I've noticed some misalignments between the data in the dumps and the
> osm2pgsql importer that lead to unavoidable holes in the data.

As TomH has already said, this is not a bug; it stems from the fact that
the full planet export reads the "current" tables and as such is subject
to changes that occur during the export process. (There may even be
inconsistencies when something like this happens: the exporter dumps
nodes, the exporter starts dumping ways, a user adds a new node to a way,
and the new way version is dumped referring to a new node that is not in
the dump.)
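
The sequence in parentheses can be replayed with a toy model. This is
only an illustration with made-up data and names; it assumes a dump that
reads the live tables in two passes and is not the actual planet export
code.

```python
# Toy model of a dump that reads live ("current") tables in two passes.
# All data and names here are made up for illustration.

nodes = {1: (37.80, -122.27), 2: (37.81, -122.26)}
ways = {10: [1, 2]}

# Pass 1: the exporter writes out all nodes.
dumped_nodes = dict(nodes)

# Meanwhile a user adds node 3 and appends it to way 10.
nodes[3] = (37.82, -122.25)
ways[10] = [1, 2, 3]

# Pass 2: the exporter writes out all ways, seeing the new way version.
dumped_ways = {way_id: list(refs) for way_id, refs in ways.items()}

# Node references in the dump that point at nodes missing from the dump.
dangling = sorted(
    ref
    for refs in dumped_ways.values()
    for ref in refs
    if ref not in dumped_nodes
)
print(dangling)  # [3]
```

The dumped way refers to node 3, which only a later diff can supply.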

The daily, hourly, and minutely diffs have a clean cutoff date because 
they are taken from the history tables.

Brett Henderson has offered to look into creating the dailies from 
history as well, but I don't know about the status of that.

If you use osmosis, it is safe (and in fact recommended), after loading
the database with a planet file initially, to load that same day's diff
file as the first diff, creating a clean cutoff point. It is possible
that the same does not work with osm2pgsql; I have no experience there.

Bye
Frederik

-- 
Frederik Ramm  ##  eMail [EMAIL PROTECTED]  ##  N49°00'09" E008°23'33"



Re: [OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-26 Thread Tom Hughes
Michal Migurski wrote:

> The final event in each weekly planet dump does not fall on an even  
> day boundary. In the case of the most recent Oct. 22nd planet.osm, it  
> was necessary to experiment with hourly diffs from that day to find  
> that the boundary was approx. 2:00pm. Hourlies up to and including  
> 2008102213-2008102214.osc.gz failed, hourlies after that succeeded. I  
> could go more granular here, checking the minute diffs as well for a  
> more precise breakpoint, but it seems odd that the planet dump does  
> not break cleanly on a midnight boundary so that it's possible to pick  
> up the differences moving forward.

Planet dumps are not snapshots - they do not represent a consistent view 
at any particular point in time because they take a number of hours to 
generate, during which time new changes are constantly being made to the 
contents of the database.

I believe, however, that it is supposed to be safe to apply diffs which
overlap with the planet dump in order to bring it to a consistent state.

> The cutoff times for files on planet.openstreetmap.org could behave  
> more consistently. A weekly dump should end at 11:59pm so that dailies  
> can immediately pick up user activity. Hourly and daily dumps should  
> be synchronized. This seems more difficult.

As explained above, there is no cutoff time as such, and it isn't 
possible to implement one as things stand. It may be possible once we 
have working transactions, though it's not clear that a transaction that 
lasts many hours would be sensible or workable.

BTW I'm not sure why you CCed the OSMF board on this... I don't think it 
needs their input at all.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/



[OSM-talk] osm2pgsql & planet: frustrations, cutoffs, and idempotence

2008-10-26 Thread Michal Migurski
Hi,

I've been trying to keep up to date with the dumps and diffs from
http://planet.openstreetmap.org/, and I'm running into a number of bugs
related to cutoff dates.

In keeping my Bay Area tiles
(http://mike.teczno.com/notes/cascadenik-openstreetmap.html) up to
date, I've been grabbing complete planet.osm dumps about once per
month, and filling in the intervening time with daily diffs. I've
noticed some misalignments between the data in the dumps and the
osm2pgsql importer that lead to unavoidable holes in the data.

It seems that they could be fixed in either osm2pgsql, the planet  
files, or both.

The final event in each weekly planet dump does not fall on an even  
day boundary. In the case of the most recent Oct. 22nd planet.osm, it  
was necessary to experiment with hourly diffs from that day to find  
that the boundary was approx. 2:00pm. Hourlies up to and including  
2008102213-2008102214.osc.gz failed, hourlies after that succeeded. I  
could go more granular here, checking the minute diffs as well for a  
more precise breakpoint, but it seems odd that the planet dump does  
not break cleanly on a midnight boundary so that it's possible to pick  
up the differences moving forward.

osm2pgsql itself notifies the user of inconsistencies by failing. I
can see that effort has been put into making it more resilient (e.g.
http://trac.openstreetmap.org/changeset/10464). Does osm2pgsql have
something like a `--force` switch? I haven't
been able to find one. In looking at the diff files, it seems that it  
should be possible to ignore possible conflicts by simply overwriting  
whatever's in the DB with whatever's in the .osc file.

Finally, the boundaries between the hourlies and dailies seem  
misaligned.

After running the remaining hourlies for the 22nd, I attempted to pick  
up on the 23rd with a daily. The final hourly I used was  
2008102223-2008102300.osc.gz. It's my expectation that I should be  
able to immediately follow that with 20081023-20081024.osc.gz, but  
this led to a duplicate key violation, suggesting that there's an overlap
between the two files. Continuing with hourlies *works*, but is  
tedious and I suspect slower than the dailies.
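
To take some of the tedium out of continuing with hourlies, here is a
rough sketch that lists the hourly file names for a time range. It
assumes only the naming pattern visible in the file names above
(YYYYMMDDHH-YYYYMMDDHH.osc.gz); the function name is hypothetical, and
downloading and importing are left out.

```python
from datetime import datetime, timedelta


def hourly_diff_names(start, end):
    """Yield hourly diff file names covering [start, end), one per hour."""
    step = timedelta(hours=1)
    current = start
    while current < end:
        nxt = current + step
        yield f"{current:%Y%m%d%H}-{nxt:%Y%m%d%H}.osc.gz"
        current = nxt


# The remaining hourlies for the 22nd, starting after the 2:00pm boundary.
names = list(
    hourly_diff_names(datetime(2008, 10, 22, 14), datetime(2008, 10, 23, 0))
)
```

The last name in that list is 2008102223-2008102300.osc.gz, the final
hourly mentioned above.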

My sense from reading other people's experiences has been that it's a  
common pattern to rely solely on the weekly planet dumps, incurring  
the substantial overhead of parsing and importing the full 5GB dump  
once every week, and then re-rendering the complete set of tiles.

My hope has been to proceed in a more incremental fashion, since this  
makes it possible to track what specific tiles need to be re-rendered  
on a near-constant schedule, based on actual content or activity, vs.  
simple cache expiration. Right now I'm doing this daily; I'd like to
do it as often as hourly.
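
A minimal sketch of how changed points might be mapped to tiles needing
a re-render, assuming the usual spherical-Mercator slippy-map tile
numbering. The function names are hypothetical, and a real setup would
also have to expire every tile a changed way or polygon crosses, not
just the tiles containing individual nodes.

```python
import math


def tile_for(lat, lon, zoom):
    """Return the (x, y) slippy-map tile containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge stay inside the tile grid.
    return min(n - 1, max(0, x)), min(n - 1, max(0, y))


def dirty_tiles(changed_points, zoom):
    """Collect the set of tiles touched by a list of (lat, lon) points."""
    return {tile_for(lat, lon, zoom) for lat, lon in changed_points}
```

Feeding it the node coordinates found in an .osc file would give a set
of tiles to re-render at one zoom level.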

I can see a few possible solutions.

The cutoff times for files on planet.openstreetmap.org could behave  
more consistently. A weekly dump should end at 11:59pm so that dailies  
can immediately pick up user activity. Hourly and daily dumps should  
be synchronized. This seems more difficult.

Or, osm2pgsql could be more fault-tolerant, so that
potentially-overlapping .osm and .osc files can be safely used. As long as they
are applied in chronological order, repetitions should be idempotent.  
Is this just a matter of futzing with the SQL commands to suppress  
index key collisions?
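
A minimal sketch of the overwrite-on-collision idea. To stay
self-contained it uses Python's built-in sqlite3 and its INSERT OR
REPLACE rather than PostgreSQL, and the table layout and change tuples
are made up for illustration; none of this is osm2pgsql's schema or
code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")


def apply_changes(conn, changes):
    """Apply (action, id, lat, lon) tuples, overwriting on key collision."""
    for action, node_id, lat, lon in changes:
        if action == "delete":
            conn.execute("DELETE FROM nodes WHERE id = ?", (node_id,))
        else:  # "create" and "modify" are treated identically
            conn.execute(
                "INSERT OR REPLACE INTO nodes (id, lat, lon) VALUES (?, ?, ?)",
                (node_id, lat, lon),
            )


hourly = [("create", 1, 37.80, -122.27), ("create", 2, 37.81, -122.26)]
daily = hourly + [("modify", 1, 37.90, -122.30), ("delete", 2, None, None)]

apply_changes(conn, hourly)
apply_changes(conn, daily)  # overlaps with the hourly file
apply_changes(conn, daily)  # a second, repeated application
rows = conn.execute("SELECT id, lat, lon FROM nodes ORDER BY id").fetchall()
```

Because creates and modifies both overwrite and deletes tolerate missing
rows, replaying an overlapping file in chronological order leaves the
table in the same final state.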

-mike.


michal migurski- [EMAIL PROTECTED]
  415.558.1610



