Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-03-04 Thread Brett Henderson
Hi Frederik,

On Fri, Feb 18, 2011 at 7:57 PM, Frederik Ramm frede...@remote.org wrote:

 But I've been thinking: With the high performance of PBF reading, a
 two-pass operation should become possible. Simply read the input file twice,
 determining which objects to copy in pass 1, and actually copying them in
 pass 2. I'm just not sure how that could be made to fit in Osmosis. One way
 could be creating a special type of file, a selection list, from a given
 entity stream. A new task --write-selection-list would dump the IDs of all
 nodes, ways, and relations that were either present or referenced in the
 entity stream:


Your suggestion of creating a selection list file to support a two-pass
approach seems quite reasonable, and would avoid making any changes to the
pipeline design.
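
Roughly, I imagine such a task collecting something like this (a sketch
only, with made-up names, not tied to the real Osmosis Sink interface):

    // Hypothetical --write-selection-list sink (illustrative, not real code).
    // Pass 1: record the IDs of all entities present or referenced in the
    // stream; pass 2 then copies only matching entities from the input file.
    import java.io.PrintWriter;
    import java.util.HashSet;
    import java.util.Set;

    class SelectionListWriter {
        private final Set<Long> nodeIds = new HashSet<Long>();
        private final Set<Long> wayIds = new HashSet<Long>();
        private final Set<Long> relationIds = new HashSet<Long>();

        void processNode(long id) {
            nodeIds.add(id);
        }

        void processWay(long id, long[] nodeRefs) {
            wayIds.add(id);
            for (long ref : nodeRefs) {
                nodeIds.add(ref);  // referenced nodes count, even if absent
            }
        }

        void processRelation(long id, long[] memberIds, char[] memberTypes) {
            relationIds.add(id);
            for (int i = 0; i < memberIds.length; i++) {
                switch (memberTypes[i]) {
                case 'n': nodeIds.add(memberIds[i]); break;
                case 'w': wayIds.add(memberIds[i]); break;
                case 'r': relationIds.add(memberIds[i]); break;
                }
            }
        }

        // Dump the selection list for pass 2 to consume.
        void complete(PrintWriter out) {
            for (long id : nodeIds) out.println("n" + id);
            for (long id : wayIds) out.println("w" + id);
            for (long id : relationIds) out.println("r" + id);
            out.flush();
        }
    }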

I've often thought about ways of allowing tasks to receive their input data
stream twice, but can't think of a way to do it without greatly complicating
the entire pipeline.  My latest idea was for Sink tasks (eg. your --multi-bb
task) to throw an exception (eg. ResendDataException) that could be caught
by Source tasks (eg. --read-xml) triggering them to re-start processing.  It
wouldn't be terribly difficult to add support for it to tasks like
--read-xml, but it starts to get messy in the pipeline control tasks like
--tee and --buffer.  I'd have to re-write the DataPostbox class which forms
the basis of all thread interaction to fix --buffer, and --tee would have to
track which downstream pipes had requested a restart then trigger a single
restart when all downstream pipes had finished with the first pass.

One thing I like about the exception approach to restarts is that it could
be rolled into existing tasks incrementally.  Any tasks not supporting it
would abort cleanly when they received the exception.
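
As a sketch of what I had in mind (hypothetical names, nothing committed):

    // Hypothetical restart mechanism (sketch only, names made up).
    // A sink that wants the data twice throws this after its first pass:
    class ResendDataException extends RuntimeException {
    }

    // ...and a source's run loop catches it and replays the input:
    abstract class RestartableSource implements Runnable {
        public void run() {
            while (true) {
                try {
                    sendAllEntities();  // push the full stream downstream
                    return;             // consumed normally: we are done
                } catch (ResendDataException e) {
                    // Downstream requested another pass; loop and re-read
                    // the input file. A task that doesn't support restarts
                    // would simply not catch this and abort cleanly.
                }
            }
        }

        protected abstract void sendAllEntities();
    }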

I started experimenting with this idea this evening, but I've had to abandon
it because I don't have the time to see it through.

Brett


Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Frederik Ramm

Igor,

On 02/18/11 11:40, Igor Podolskiy wrote:

just a random thought: what's wrong with using --dataset-bounding-box?
Importing the planet file into a database and doing a bunch of queries
against it is equivalent to creating a single disk buffer for all bb tasks
(the database _is_ the disk buffer, if you want). This still isn't very
elegant as it requires two passes (import into DB and export to PBFs)
but is IMHO more elegant than selection lists.


I currently use bounding polygons so I'd have to add a 
--dataset-bounding-polygon task for that but that should be possible.


The major problem is finding a database that would be able to import the 
data, index it, and fulfil my requests, all within less than 24 hours. 
From what I hear, PostGIS takes a week for the import step alone but I 
want to produce fresh extracts every day. So I would have to import into 
PostGIS and then apply the daily diffs, but I could only do that if 
applying a daily diff and making all the bounding-polygon requests would 
take much less than a day and somehow I suspect that's not going to work.



Don't get me wrong, maybe there's a reason why dataset tasks are
unsuitable - but I've not found any... Maybe the dbb task itself needs
improvement - but in this case, this should be the way to go.


I doubt that a performant solution can ever be found with PostGIS but 
I'd love to hear from someone who can prove me wrong.


Bye
Frederik



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread André Joost

On 18.02.11 12:20, Frederik Ramm wrote:
I currently use bounding polygons so I'd have to add a
--dataset-bounding-polygon task for that but that should be possible.


The major problem is finding a database that would be able to import the
data, index it, and fulfil my requests, all within less than 24 hours.
 From what I hear, PostGIS takes a week for the import step alone but I
want to produce fresh extracts every day. So I would have to import into
PostGIS and then apply the daily diffs, but I could only do that if
applying a daily diff and making all the bounding-polygon requests would
take much less than a day and somehow I suspect that's not going to work.



You can use minutely diffs, which gives you a permanently current database.

Greetings,
André Joost





Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Brett Henderson
On Fri, Feb 18, 2011 at 10:20 PM, Frederik Ramm frede...@remote.org wrote:

 Igor,


 On 02/18/11 11:40, Igor Podolskiy wrote:

 just a random thought: what's wrong with using --dataset-bounding-box?
 Importing the planet file into a database and doing a bunch of queries
 against is equivalent to creating a single disk buffer for all bb tasks
 (the database _is_ the disk buffer, if you want) -- with "against it"
 meant, of course.
 (the database _is_ the disk buffer, if you want). This still isn't very
 elegant as it requires two passes (import into DB and export to PBFs)
 but is IMHO more elegant than selection lists.


 I currently use bounding polygons so I'd have to add a
 --dataset-bounding-polygon task for that but that should be possible.

 The major problem is finding a database that would be able to import the
 data, index it, and fulfil my requests, all within less than 24 hours. From
 what I hear, PostGIS takes a week for the import step alone but I want to
 produce fresh extracts every day. So I would have to import into PostGIS and
 then apply the daily diffs, but I could only do that if applying a daily
 diff and making all the bounding-polygon requests would take much less than
 a day and somehow I suspect that's not going to work.


  Don't get me wrong, maybe there's a reason why dataset tasks are
 unsuitable - but I've not found any... Maybe the dbb task itself needs
 improvement - but in this case, this should be the way to go.


 I doubt that a performant solution can ever be found with PostGIS but I'd
 love to hear from someone who can prove me wrong.


Processing a full day's worth of daily diffs is doable in much less than a
day.  The one time I did this on a server with only average IO performance
(data spread across two disks), it took about 6 hours to process a day of
data.  Extracting continent-sized chunks from the database will probably be
a problem though.  The schema works well for 1x1 or 2x2 degree bboxes, but
gets rapidly slower as the size of the returned data increases.

Given that each of your extracts covers a large percentage of the overall
planet, doing multiple passes across the entire planet will be tough to
beat performance-wise.

In other words, a database is best (and the only way possible) for small
up-to-date extracts, but sequential reads are still best for large extracts
where multi-hour time lags are acceptable.  I suspect that lots of
optimisation is possible for databases, because for one thing the PostGIS
data types take up humungous amounts of disk space.  Perhaps someone with
lots of time and some custom data types could improve on this ...

Brett


Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Frederik Ramm

Scott,

On 02/18/11 14:15, Scott Crosby wrote:

I used two bitsets for every output file: one indicating which nodes were
already output, and another (built when processing ways) indicating which
node IDs were missed and will need to be grabbed on the next pass. I have
another two pairs of bitsets for ways and relations. That's around 400 MB
of RAM per output.


I see. Means I overestimated a little ;)
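
So per output file, if I understand you correctly, something like this
(illustrative only -- my reading of your bookkeeping, not your actual code):

    // java.util.BitSet indexed by entity ID, one set of these per output.
    import java.util.BitSet;

    class OutputState {
        final BitSet nodesWritten = new BitSet();  // already in this output
        final BitSet nodesMissed = new BitSet();   // referenced by a way but
                                                   // not yet written; grab on
                                                   // the next pass
        final BitSet waysWritten = new BitSet();
        final BitSet waysMissed = new BitSet();
        final BitSet relationsWritten = new BitSet();
        final BitSet relationsMissed = new BitSet();

        // While processing ways:
        void noteWayNodeRef(long nodeId) {
            // BitSet is int-indexed; fine as long as IDs fit in 31 bits.
            int id = (int) nodeId;
            if (!nodesWritten.get(id)) {
                nodesMissed.set(id);
            }
        }
    }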


Really, almost everything you want is already done in the mkgmap splitter
crosby_integration branch. Just make it support non-rectangular regions,
and track output/missed entities with bitsets. I'd say about 4 hours and
2 GB + 400 MB per region to generate as many outputs as you want.


I'll have a look at that.


* all nodes in the bounding box
* all ways using any of these nodes
* all nodes used by any of these ways, even if outside
* all relations using any of these nodes or ways
  o all nodes and ways used by any of these relations, even if outside
  o but NOT all nodes used by a way drawn in through a relation.

(The points marked * are what the API does; the API does not do the
o-marked points even though users could be interested in them.)


I did not know that it was allowed to miss the o-marked things. That
makes the job easier.


The API skips them on purpose, even though, having access to an indexed
database, it could easily add the o-marked points. The reason is that
other relation members may not even be geographically nearby, and if you
load a small part of the European countryside you might not be interested
in receiving the full detail of the Madrid-Peking long distance cycle
route just because it happens to pass through your area of interest ;)



There's a second approach that could split the entire planet into
extracts simultaneously in 2-3 minutes: a new planet format that I'm
working on that is geographically sorted. Progress is ongoing, but
slow.


Sounds interesting - and difficult, in case of the Madrid-Peking long 
distance cycle route ;)


Bye
Frederik



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Nakor

On 2/18/2011 7:24 AM, Frederik Ramm wrote:
Provided that applying a minutely diff (possibly even while the
extraction requests are running) does indeed take less than a minute.
Which, I repeat, I believe to be unlikely.



  Frederik, André,

I have been playing with this lately on a machine that has 4 cores and
8 GB of RAM. The DB is stored on a RAID0 array of 3x200 GB physical disks
(I do not remember the I/O performance of the RAID array and do not have
access to the machine at the moment). It took me 7 days (give or take a
couple of hours) to import the planet. I then began to apply the daily
diffs, and each of the first ones took me 17 hours. I stopped at that
point since, assuming that rate stays constant, it would take a month to
reach an up-to-date DB.


I have not given the minutely diffs a try yet.

  My 2c,

N.
