Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-03-04 Thread Brett Henderson
Hi Frederik,

On Fri, Feb 18, 2011 at 7:57 PM, Frederik Ramm  wrote:

> But I've been thinking: With the high performance of PBF reading, a
> two-pass operation should become possible. Simply read the input file twice,
> determining which objects to copy in pass 1, and actually copying them in
> pass 2. I'm just not sure how that could be made to fit in Osmosis. One way
> could be creating a special type of file, a "selection list", from a given
> entity stream. A new task "--write-selection-list" would dump the IDs of all
> nodes, ways, and relations that were either present or referenced in the
> entity stream:
>

Your suggestion of creating a selection list file to support a two-pass
approach seems quite reasonable, and would avoid making any changes to the
pipeline design.
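
To make the idea concrete, here is a rough sketch of how a second pass might
consume such a selection list. None of this exists in Osmosis today; the
class, the file format and the task name are purely illustrative, and a real
implementation would use bitsets rather than HashSet<Long> to keep memory
usage sane at planet scale:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch only: load a selection list written during pass 1
    // and use it during pass 2 to decide whether an entity gets copied to the
    // output.
    public class SelectionList {

        private final Set<Long> nodeIds = new HashSet<Long>();
        private final Set<Long> wayIds = new HashSet<Long>();
        private final Set<Long> relationIds = new HashSet<Long>();

        // Assumed file format: one "n <id>", "w <id>" or "r <id>" per line.
        public static SelectionList load(String fileName) throws IOException {
            SelectionList list = new SelectionList();
            BufferedReader reader = new BufferedReader(new FileReader(fileName));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(" ");
                    long id = Long.parseLong(parts[1]);
                    switch (parts[0].charAt(0)) {
                    case 'n': list.nodeIds.add(id); break;
                    case 'w': list.wayIds.add(id); break;
                    default:  list.relationIds.add(id); break;
                    }
                }
            } finally {
                reader.close();
            }
            return list;
        }

        // Pass 2: copy a node/way/relation only if its id was recorded in pass 1.
        public boolean containsNode(long id)     { return nodeIds.contains(id); }
        public boolean containsWay(long id)      { return wayIds.contains(id); }
        public boolean containsRelation(long id) { return relationIds.contains(id); }
    }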

I've often thought about ways of allowing tasks to receive their input data
stream twice, but I can't think of a way to do it without greatly complicating
the entire pipeline.  My latest idea was for Sink tasks (e.g. your --multi-bb
task) to throw an exception (e.g. ResendDataException) that could be caught
by Source tasks (e.g. --read-xml), triggering them to restart processing.  It
wouldn't be terribly difficult to add support for it to tasks like
--read-xml, but it starts to get messy in the pipeline control tasks like
--tee and --buffer.  I'd have to re-write the DataPostbox class, which forms
the basis of all thread interaction, to fix --buffer, and --tee would have to
track which downstream pipes had requested a restart and then trigger a single
restart once all downstream pipes had finished with the first pass.
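
Roughly what I had in mind, as a sketch only (all class and method names here
are illustrative; none of this exists in the code base):

    // Illustrative only: none of these classes exist in Osmosis today.

    // Thrown by a sink that has finished its first pass over the data and
    // wants the source to send everything again.
    class ResendDataException extends RuntimeException {
    }

    // A stand-in for the osmosis Sink interface, kept hypothetical here.
    interface DownstreamSink {
        void process(Object entity);
        void complete();
    }

    // How a source task such as --read-xml might react to the exception.
    class RestartableReader {

        private final String fileName;
        private final DownstreamSink downstream;

        RestartableReader(String fileName, DownstreamSink downstream) {
            this.fileName = fileName;
            this.downstream = downstream;
        }

        public void run() {
            while (true) {
                try {
                    readFileAndPushEntities(fileName, downstream);
                    downstream.complete();
                    return;
                } catch (ResendDataException e) {
                    // The sink wants another pass; simply re-read the input
                    // file. A source that does not support restarts would let
                    // the exception propagate and abort cleanly instead.
                }
            }
        }

        private void readFileAndPushEntities(String file, DownstreamSink sink) {
            // ... parse the file and call sink.process(entity) for each entity ...
        }
    }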

One thing I like about the exception approach to restarts is that it could
be rolled into existing tasks incrementally.  Any tasks not supporting it
would abort cleanly when they received the exception.

I started experimenting with this idea this evening, but I've had to abandon
it because I don't have the time to see it through.

Brett


Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Nakor

On 2/18/2011 7:24 AM, Frederik Ramm wrote:
Provided that applying a minutely diff (possibly even while the
extraction requests are running) does indeed take less than a minute.
Which, I repeat, I believe to be unlikely.



  Frederik, André,

I have been playing with this lately on a machine that has 4 cores and
8 GB of RAM. The DB is stored on a RAID0 array of 3x200 GB physical disks
(I do not remember the I/O performance of the RAID array and do not have
access to the machine ATM). It took me 7 days (give or take a couple of
hours) to import the planet. I then began to apply the daily diffs, and
each of the first ones took about 17 hours. I stopped at that point
because, assuming that rate stays constant, it would take a month to get
an up-to-date DB.


I have not tried the minutely diffs yet.

  My 2c,

N.



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Frederik Ramm

Scott,

On 02/18/11 14:15, Scott Crosby wrote:

I used two bitsets for every output file. One indicating which nodes
were already output and another (built when processing ways)
indicating which node IDs were missed and will need to be grabbed on
the next pass. I have another two pairs of bitsets for ways and
relations. That's around 400 MB of RAM per output.


I see. Means I overestimated a little ;)


Really, almost everything you want is already done in the mkgmap
splitter crosby_integration branch. Just make it support
non-rectangular regions, and track output/missed entities with
bitsets. I'd say about 4 hours and 2 GB + 400 MB/region to generate as
many outputs as you want.


I'll have a look at that.


* all nodes in the bounding box
* all ways using any of these nodes
* all nodes used by any of these ways even if outside
* all relations using any of these nodes or ways
o all nodes and ways used by any of these relations even if outside
o but NOT all nodes used by a way drawn in through a relation.

(The points marked "*" are what the API does; the API does not do the "o"
marked points even though users could be interested in them.)


I did not know that it was allowed to miss the things in "o". That
makes the job easier.


The API skips them on purpose, even though, having access to an indexed
database, it could easily add the "o" points. The reason is that other
relation members may not even be geographically nearby, and if you load a
small part of the European countryside you might not be interested in
receiving the full detail of the Madrid-Peking long-distance cycle route
just because it happens to pass through your area of interest ;)



There's a second approach that could split the entire planet into
extracts, simultaneously, in 2-3 minutes; a new planet format that I'm
working on that is geographically sorted. Progress is ongoing, but
slow.


Sounds interesting - and difficult, in the case of the Madrid-Peking 
long-distance cycle route ;)


Bye
Frederik



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Scott Crosby
On Fri, Feb 18, 2011 at 2:57 AM, Frederik Ramm  wrote:
> Hi,
>
>   in the long run I'd like to change the Geofabrik extracts so that they
> have the completeWays/completeRelations feature enabled. It's a pain because
> that totally breaks the elegant and well-performing streaming mode in
> Osmosis but it would really make the extracts more usable, and more in line
> with what people get from the API.
>
> My biggest concern is the disk space used for temporary storage. If I read
> things correctly, a temporary storage of the input stream is created for
> each --bb or --bp task. So if you do something like
>
> osmosis --rb planet --tee 5
>  --bb ... --wb europe
>  --bb ... --wb asia
>  --bb ... --wb america
>  --bb ... --wb australia
>  --bb ... --wb africa
>
> then you will temporarily have 5 copies of the planet file lying around. If
> there were only one copy, I could still hope to make use of Linux file system
> buffers and a lot of RAM to soften the negative impact of file storage, but
> five copies will kill performance for sure.
>


> I wonder if there is a way to at least reduce this to *one* temporary
> storage. The easiest thing I could imagine would be a new "multi-bb" (or
> "multi-bp") task that basically combines the tee and bb. That would be less
> elegant and would probably also be less efficient because it would not use
> multiple threads, but it could easily use one shared temporary storage.

>
> But I've been thinking: With the high performance of PBF reading, a
> two-pass operation should become possible. Simply read the input file twice,
> determining which objects to copy in pass 1, and actually copying them in
> pass 2. I'm just not sure how that could be made to fit in Osmosis. One way
> could be creating a special type of file, a "selection list", from a given
> entity stream. A new task "--write-selection-list" would dump the IDs of all
> nodes, ways, and relations that were either present or referenced in the
> entity stream:

I planned out how to do this, but never got around to implementing it,
because implementing it within the osmosis piping framework
was...unclear.

I used two bitsets for every output file. One indicating which nodes
were already output and another (built when processing ways)
indicating which node IDs were missed and will need to be grabbed on
the next pass. I have another two pairs of bitsets for ways and
relations. That's around 400 MB of RAM per output.
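
As a rough illustration of that bookkeeping (a sketch, not the splitter code;
java.util.BitSet stands in for sparser structures, and node IDs still fit in
an int at current planet sizes):

    import java.util.BitSet;

    // Per-output bookkeeping: which nodes were written, and which nodes are
    // referenced by this output's ways but still need to be grabbed later.
    public class OutputTracker {

        private final BitSet writtenNodes = new BitSet();
        private final BitSet missingNodes = new BitSet();
        // ... analogous pairs of bitsets for ways and relations ...

        // Pass over nodes: remember which nodes went into this output.
        public void nodeWritten(long nodeId) {
            writtenNodes.set((int) nodeId);
        }

        // Pass over ways: a way in this output references a node we did not write.
        public void nodeReferenced(long nodeId) {
            if (!writtenNodes.get((int) nodeId)) {
                missingNodes.set((int) nodeId);
            }
        }

        // Next pass: does this node still need to be grabbed for this output?
        public boolean isNodeMissing(long nodeId) {
            return missingNodes.get((int) nodeId) && !writtenNodes.get((int) nodeId);
        }
    }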

In the first pass, assign nodes to one or more regions via geographic
location. To remember this mapping for assigning ways & relations, use a
sparse multi-hashmap, built by layering several sparsehashmaps
(http://code.google.com/p/google-sparsehash/). As the keys are really
densely packed integers, don't store them; use a sparse hash array.
(Actually, use a hybrid approach; see the mkgmap splitter.)
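
A crude version of that first pass, with a plain HashMap standing in for the
sparse structures (far too fat for a real planet run, and the Region interface
is just a placeholder, but it shows the shape of the data):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: assign each node to the regions whose geometry contains it, and
    // remember the assignment so ways and relations can be routed to the same
    // outputs later.
    public class NodeRegionIndex {

        public interface Region {
            boolean contains(double lat, double lon);
        }

        private final List<Region> regions;
        private final Map<Long, List<Integer>> nodeToRegions =
                new HashMap<Long, List<Integer>>();

        public NodeRegionIndex(List<Region> regions) {
            this.regions = regions;
        }

        // First pass: called once per node.
        public void assign(long nodeId, double lat, double lon) {
            List<Integer> hits = new ArrayList<Integer>(1);
            for (int i = 0; i < regions.size(); i++) {
                if (regions.get(i).contains(lat, lon)) {
                    hits.add(i);
                }
            }
            if (!hits.isEmpty()) {
                nodeToRegions.put(nodeId, hits);
            }
        }

        // Later: which outputs does a way referencing this node belong to?
        public List<Integer> regionsFor(long nodeId) {
            List<Integer> hits = nodeToRegions.get(nodeId);
            return hits != null ? hits : new ArrayList<Integer>();
        }
    }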

Really, almost everything you want is already done in the mkgmap
splitter crosby_integration branch. Just make it support
non-rectangular regions, and track output/missed entities with
bitsets. I'd say about 4 hours and 2 GB + 400 MB/region to generate as
many outputs as you want.

One caveat is that each file is no longer sorted by
(entityType, entityId), and would need to be re-processed to produce
sorted output.

My mechanism has two lists for each output. That's roughly what I
planned on, but one trick that can significantly reduce memory usage is
to instead track a 'missing list': which nodes does this extract need
that aren't dumped?

> Also,
> what I have sketched above would be able to give you
>
> * all nodes in the bounding box
> * all ways using any of these nodes
> * all nodes used by any of these ways even if outside
> * all relations using any of these nodes or ways
> o all nodes and ways used by any of these relations even if outside
> o but NOT all nodes used by a way drawn in through a relation.
>
> (The points marked "*" are what the API does; the API does not do the "o"
> marked points even though users could be interested in them.)

I did not know that it was allowed to miss the things in "o". That
makes the job easier.

>
> Does anybody have any thoughts about this; maybe a different approach still?

There's a second approach that could split the entire planet into
extracts, simultaneously, in 2-3 minutes; a new planet format that I'm
working on that is geographically sorted. Progress is ongoing, but
slow.

Scott



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Brett Henderson
On Fri, Feb 18, 2011 at 10:20 PM, Frederik Ramm  wrote:

> Igor,
>
>
> On 02/18/11 11:40, Igor Podolskiy wrote:
>
>> just a random thought: what's wrong with using --dataset-bounding-box?
>> Importing the planet file into a database and doing a bunch of queries
>> against it is equivalent to creating a single disk buffer for all bb tasks
>> (the database _is_ the disk buffer, if you want). This still isn't very
>> elegant as it requires two passes (import into DB and export to PBFs)
>> but is IMHO more elegant than selection lists.
>>
>
> I currently use bounding polygons so I'd have to add a
> --dataset-bounding-polygon task for that but that should be possible.
>
> The major problem is finding a database that would be able to import the
> data, index it, and fulfil my requests, all within less than 24 hours. From
> what I hear, PostGIS takes a week for the import step alone but I want to
> produce fresh extracts every day. So I would have to import into PostGIS and
> then apply the daily diffs, but I could only do that if applying a daily
> diff and making all the bounding-polygon requests would take much less than
> a day and somehow I suspect that's not going to work.
>
>
>> Don't get me wrong, maybe there's a reason why dataset tasks are
>> unsuitable - but I've not found any... Maybe the dbb task itself needs
>> improvement - but in this case, this should be the way to go.
>>
>
> I doubt that a performant solution can ever be found with PostGIS but I'd
> love to hear from someone who can prove me wrong.
>

Processing a full day's worth of daily diffs is doable in much less than a
day.  The one time I did this, on a server with only average IO performance
(data spread across two disks), it took about 6 hours to process a day of
data.  Extracting continent-sized chunks from the database will probably
be a problem though.  The schema works well for 1x1 or 2x2 degree bboxes,
but rapidly gets slower as the amount of returned data increases.

Given that each of your extracts covers a large percentage of the
overall planet, doing multiple passes across the entire planet will be
tough to beat performance-wise.

In other words, a database is best (and the only feasible way) for small
up-to-date extracts, but sequential reads are still best for large extracts
where multi-hour time lags are acceptable.  I suspect that lots of
optimisation is possible for databases, because for one thing the PostGIS
data types take up humongous amounts of disk space.  Perhaps someone with
lots of time and some custom data types could improve on this ...

Brett


Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Frederik Ramm

Hi,

On 02/18/11 12:47, André Joost wrote:

You can use minutely diffs, which gives you a permanently current database.


Provided that applying a minutely diff (possibly even while the
extraction requests are running) does indeed take less than a minute.
Which, I repeat, I believe to be unlikely.


Bye
Frederik



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread André Joost

On 18.02.11 12:20, Frederik Ramm wrote:
I currently use bounding polygons so I'd have to add a
--dataset-bounding-polygon task for that but that should be possible.


The major problem is finding a database that would be able to import the
data, index it, and fulfil my requests, all within less than 24 hours.
From what I hear, PostGIS takes a week for the import step alone but I
want to produce fresh extracts every day. So I would have to import into
PostGIS and then apply the daily diffs, but I could only do that if
applying a daily diff and making all the bounding-polygon requests would
take much less than a day and somehow I suspect that's not going to work.



You can use minutely diffs, which gives you a permanently current database.

Greetings,
André Joost





Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Frederik Ramm

Igor,

On 02/18/11 11:40, Igor Podolskiy wrote:

just a random thought: what's wrong with using --dataset-bounding-box?
Importing the planet file into a database and doing a bunch of queries
against it is equivalent to creating a single disk buffer for all bb tasks
(the database _is_ the disk buffer, if you want). This still isn't very
elegant as it requires two passes (import into DB and export to PBFs)
but is IMHO more elegant than selection lists.


I currently use bounding polygons so I'd have to add a 
--dataset-bounding-polygon task for that but that should be possible.


The major problem is finding a database that would be able to import the 
data, index it, and fulfil my requests, all within less than 24 hours. 
From what I hear, PostGIS takes a week for the import step alone but I 
want to produce fresh extracts every day. So I would have to import into 
PostGIS and then apply the daily diffs, but I could only do that if 
applying a daily diff and making all the bounding-polygon requests would 
take much less than a day and somehow I suspect that's not going to work.



Don't get me wrong, maybe there's a reason why dataset tasks are
unsuitable - but I've not found any... Maybe the dbb task itself needs
improvement - but in this case, this should be the way to go.


I doubt that a performant solution can ever be found with PostGIS but 
I'd love to hear from someone who can prove me wrong.


Bye
Frederik



Re: [osmosis-dev] Improving completeWays/completeRelations performance

2011-02-18 Thread Igor Podolskiy

Hi Frederik, hi everybody,

just a random thought: what's wrong with using --dataset-bounding-box? 
Importing the planet file into a database and doing a bunch of queries 
against it is equivalent to creating a single disk buffer for all bb tasks 
(the database _is_ the disk buffer, if you want). This still isn't very 
elegant as it requires two passes (import into DB and export to PBFs) 
but is IMHO more elegant than selection lists.
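
For reference, the kind of invocation I have in mind (parameters elided, and
the exact task and option names should be double-checked against the
documentation before relying on this):

    osmosis --read-pgsql host=... database=... user=... password=... \
            --dataset-bounding-box left=... right=... top=... bottom=... \
            --write-pbf extract.osm.pbf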


I think a solution with selection lists that wouldn't eat up 20 or more 
GB of RAM (the planet file won't shrink anytime soon, rather it will get 
bigger) would require you to implement something like a database anyway. 
So, why not use one that's already implemented and tested and so on? :)


Don't get me wrong, maybe there's a reason why dataset tasks are 
unsuitable - but I've not found any... Maybe the dbb task itself needs 
improvement - but in this case, this should be the way to go.


Best regards
Igor
