Re: [osmosis-dev] Improving completeWays/completeRelations performance
Hi Frederik,

On Fri, Feb 18, 2011 at 7:57 PM, Frederik Ramm wrote:
> But I've been thinking: With the high performance of PBF reading, a
> two-pass operation should become possible. Simply read the input file
> twice, determining which objects to copy in pass 1, and actually copying
> them in pass 2. I'm just not sure how that could be made to fit in
> Osmosis. One way could be creating a special type of file, a "selection
> list", from a given entity stream. A new task "--write-selection-list"
> would dump the IDs of all nodes, ways, and relations that were either
> present or referenced in the entity stream:

Your suggestion of creating a selection list file to support a two-pass approach seems quite reasonable, and would avoid making any changes to the pipeline design.

I've often thought about ways of allowing tasks to receive their input data stream twice, but can't think of a way to do it without greatly complicating the entire pipeline. My latest idea was for Sink tasks (e.g. your --multi-bb task) to throw an exception (e.g. ResendDataException) that could be caught by Source tasks (e.g. --read-xml), triggering them to restart processing. It wouldn't be terribly difficult to add support for it to tasks like --read-xml, but it starts to get messy in the pipeline control tasks like --tee and --buffer. I'd have to rewrite the DataPostbox class, which forms the basis of all thread interaction, to fix --buffer, and --tee would have to track which downstream pipes had requested a restart and then trigger a single restart once all downstream pipes had finished with the first pass.

One thing I like about the exception approach to restarts is that it could be rolled into existing tasks incrementally. Any tasks not supporting it would abort cleanly when they received the exception. I started experimenting with this idea this evening, but I've had to abandon it because I don't have the time to see it through.

Brett
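[Editor's note: a minimal sketch of the exception-based restart idea Brett describes. All class and method names here are illustrative assumptions, not the actual Osmosis API.]

    // Thrown by a Sink when it needs the upstream data to be sent a second time.
    class ResendDataException extends RuntimeException {
    }

    interface Sink {
        void process(Object entityContainer); // may throw ResendDataException
        void complete();
    }

    // A source task (such as a file reader) that supports a single restart.
    class RestartableReader {
        private final Sink sink;

        RestartableReader(Sink sink) {
            this.sink = sink;
        }

        void run(Iterable<Object> entities) {
            try {
                sendAll(entities);
            } catch (ResendDataException e) {
                // The downstream task asked for a second pass: replay from the start.
                sendAll(entities);
            }
            sink.complete();
        }

        private void sendAll(Iterable<Object> entities) {
            for (Object entity : entities) {
                sink.process(entity);
            }
        }
    }

A task that does not know about ResendDataException simply lets it propagate and aborts, which is the incremental-adoption property Brett mentions.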
Re: [osmosis-dev] Improving completeWays/completeRelations performance
On 2/18/2011 7:24 AM, Frederik Ramm wrote:
> Provided that applying a minutely diff (possibly even while the extraction
> requests are running) does indeed take less than a minute. Which, I repeat
> myself, I believe to be unlikely.

Frederik, André,

I have been playing with this lately on a machine that has 4 cores and 8 GB of RAM. The DB is stored on a RAID0 array of 3x200 GB physical disks (I do not remember the I/O performance of the RAID array and do not have access to the machine ATM). It took me 7 days (give or take a couple of hours) to import the planet. I then began to apply the daily diffs, and it took me 17 hours for each of the first ones. I stopped at that point because, assuming that time stays constant, it would take a month to have an up-to-date DB. I have not tried the minutely diffs yet.

My 2c,
N.
Re: [osmosis-dev] Improving completeWays/completeRelations performance
Scott,

On 02/18/11 14:15, Scott Crosby wrote:
> I used two bitsets for every output file: one indicating which nodes were
> already output and another (built when processing ways) indicating which
> node IDs were missed and will need to be grabbed on the next pass. I have
> another two pairs of bitsets for ways and relations. That's around 400 MB
> of RAM per output.

I see. Means I overestimated a little ;)

> Really, almost everything you want is already done in the mkgmap splitter
> crosby_integration branch. Just make it support non-rectangular regions,
> and track output/missed entities with bitsets. I'd say about 4 hours and
> 2 GB + 400 MB/region to generate as many outputs as you want.

I'll have a look at that.

>> * all nodes in the bounding box
>> * all ways using any of these nodes
>> * all nodes used by any of these ways even if outside
>> * all relations using any of these nodes or ways
>>   o all nodes and ways used by any of these relations even if outside
>>   o but NOT all nodes used by a way drawn in through a relation.
>>
>> (The points marked "*" are what the API does; the API does not do the
>> "o" marked points even though users could be interested in them.)
>
> I did not know that it was allowed to miss the things in "o". That makes
> the job easier.

The API doesn't do them on purpose, even though, having access to an indexed database, it could easily add the "o" points. The reason is that other relation members may not even be geographically nearby, and if you load a small part of the European countryside you might not be interested in receiving the full detail about the Madrid-Peking long-distance cycle route just because it happens to pass through your area of interest ;)

> There's a second approach that could split the entire planet into
> extracts, simultaneously, in 2-3 minutes; a new planet format that I'm
> working on that is geographically sorted. Progress is ongoing, but slow.

Sounds interesting - and difficult, in case of the Madrid-Peking long-distance cycle route ;)

Bye
Frederik
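[Editor's note: the "*" rules above can be written down quite compactly. Below is a minimal sketch assuming the input is ordered nodes, then ways, then relations (as in a planet file) and that entity IDs fit into a java.util.BitSet index; the class and method names are made up and this is not Osmosis code.]

    import java.util.BitSet;

    // Illustrative sketch of the "*" selection rules for one extract.
    class ExtractSelector {
        private final double minLat, maxLat, minLon, maxLon;
        private final BitSet nodesInBox = new BitSet();     // nodes geographically inside the box
        private final BitSet selectedNodes = new BitSet();  // nodes to write out (superset of the above)
        private final BitSet selectedWays = new BitSet();
        private final BitSet selectedRelations = new BitSet();

        ExtractSelector(double minLat, double maxLat, double minLon, double maxLon) {
            this.minLat = minLat;
            this.maxLat = maxLat;
            this.minLon = minLon;
            this.maxLon = maxLon;
        }

        // * all nodes in the bounding box
        void node(long id, double lat, double lon) {
            if (lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon) {
                nodesInBox.set((int) id);
                selectedNodes.set((int) id);
            }
        }

        // * all ways using any of these nodes
        // * all nodes used by any of these ways, even if outside the box
        void way(long id, long[] nodeRefs) {
            boolean usesBoxNode = false;
            for (long ref : nodeRefs) {
                if (nodesInBox.get((int) ref)) {
                    usesBoxNode = true;
                    break;
                }
            }
            if (usesBoxNode) {
                selectedWays.set((int) id);
                for (long ref : nodeRefs) {
                    selectedNodes.set((int) ref); // completeWays: pull in the outside nodes too
                }
            }
        }

        // * all relations using any of these nodes or ways
        // (the "o" points are deliberately not followed: members of a selected
        // relation are not pulled in, and neither are their nodes)
        void relation(long id, long[] nodeMembers, long[] wayMembers) {
            for (long ref : nodeMembers) {
                if (selectedNodes.get((int) ref)) {
                    selectedRelations.set((int) id);
                    return;
                }
            }
            for (long ref : wayMembers) {
                if (selectedWays.get((int) ref)) {
                    selectedRelations.set((int) id);
                    return;
                }
            }
        }
    }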
Re: [osmosis-dev] Improving completeWays/completeRelations performance
On Fri, Feb 18, 2011 at 2:57 AM, Frederik Ramm wrote:
> Hi,
>
> in the long run I'd like to change the Geofabrik extracts so that they
> have the completeWays/completeRelations feature enabled. It's a pain
> because that totally breaks the elegant and well-performing streaming mode
> in Osmosis, but it would really make the extracts more usable, and more in
> line with what people get from the API.
>
> My biggest concern is the disk space used for temporary storage. If I read
> things correctly, a temporary storage of the input stream is created for
> each --bb or --bp task. So if you do something like
>
> osmosis --rb planet --tee 5
>    --bb ... --wb europe
>    --bb ... --wb asia
>    --bb ... --wb america
>    --bb ... --wb australia
>    --bb ... --wb africa
>
> then you will temporarily have 5 copies of the planet file lying around.
> So while, if there was only one copy of it, I could still hope to make use
> of Linux file system buffers and a lot of RAM to soften the negative
> impact of file storage, that will kill performance for sure.
>
> I wonder if there is a way to at least reduce this to *one* temporary
> storage. The easiest thing I could imagine would be a new "multi-bb" (or
> "multi-bp") task that basically combines the tee and bb. That would be
> less elegant and would probably also be less efficient because it would
> not use multiple threads, but it could easily use one shared temporary
> storage.
>
> But I've been thinking: With the high performance of PBF reading, a
> two-pass operation should become possible. Simply read the input file
> twice, determining which objects to copy in pass 1, and actually copying
> them in pass 2. I'm just not sure how that could be made to fit in
> Osmosis. One way could be creating a special type of file, a "selection
> list", from a given entity stream. A new task "--write-selection-list"
> would dump the IDs of all nodes, ways, and relations that were either
> present or referenced in the entity stream:

I planned out how to do this, but never got around to implementing it, because implementing it within the osmosis piping framework was... unclear.

I used two bitsets for every output file: one indicating which nodes were already output and another (built when processing ways) indicating which node IDs were missed and will need to be grabbed on the next pass. I have another two pairs of bitsets for ways and relations. That's around 400 MB of RAM per output.

In the first pass, assign nodes to one or more regions via geographic location. To remember this mapping for assigning ways & relations, use a sparse multihashmap, built from layering several sparsehashmaps (http://code.google.com/p/google-sparsehash/). As the keys are really densely packed integers, don't store them; use a sparse-hash-array. (Actually, use a hybrid approach; see the mkgmap splitter.)

Really, almost everything you want is already done in the mkgmap splitter crosby_integration branch. Just make it support non-rectangular regions, and track output/missed entities with bitsets. I'd say about 4 hours and 2 GB + 400 MB/region to generate as many outputs as you want.

One caveat is that each file is no longer sorted by (entityType, entityId), and would need to be re-processed to have that order.

> ... sorted output. My mechanism has two lists for each output.

That's roughly what I planned on, but one trick that can significantly reduce memory usage is to instead track a 'missing list': what nodes does this extract need that aren't dumped?
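[Editor's note: a minimal sketch of the per-output "written"/"missing" node bitsets Scott describes; names and structure are illustrative, not taken from the splitter code, and IDs are assumed to fit into a BitSet index.]

    import java.util.BitSet;

    // One pair of bitsets per output file; ways and relations would get
    // analogous pairs.
    class OutputRegionTracker {
        private final BitSet writtenNodes = new BitSet();  // nodes already dumped to this output
        private final BitSet missingNodes = new BitSet();  // nodes to pick up on the second pass

        // Pass 1: a node inside this region's polygon is written immediately.
        void nodeWritten(long nodeId) {
            writtenNodes.set((int) nodeId);
        }

        // Pass 1, while processing ways: any referenced node that was not
        // written is recorded as missing instead of buffering the input stream.
        void wayWritten(long[] nodeRefs) {
            for (long ref : nodeRefs) {
                if (!writtenNodes.get((int) ref)) {
                    missingNodes.set((int) ref);
                }
            }
        }

        // Pass 2: re-read the planet and copy exactly the nodes still missing.
        boolean isMissing(long nodeId) {
            return missingNodes.get((int) nodeId) && !writtenNodes.get((int) nodeId);
        }
    }

For scale: a BitSet spanning the node ID range of the time comes to on the order of 100 MB, which is roughly consistent with the 400 MB per output figure above.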
> Also, what I have sketched above would be able to give you
>
> * all nodes in the bounding box
> * all ways using any of these nodes
> * all nodes used by any of these ways even if outside
> * all relations using any of these nodes or ways
>   o all nodes and ways used by any of these relations even if outside
>   o but NOT all nodes used by a way drawn in through a relation.
>
> (The points marked "*" are what the API does; the API does not do the "o"
> marked points even though users could be interested in them.)

I did not know that it was allowed to miss the things in "o". That makes the job easier.

> Does anybody have any thoughts about this; maybe a different approach
> still?

There's a second approach that could split the entire planet into extracts, simultaneously, in 2-3 minutes; a new planet format that I'm working on that is geographically sorted. Progress is ongoing, but slow.

Scott
Re: [osmosis-dev] Improving completeWays/completeRelations performance
On Fri, Feb 18, 2011 at 10:20 PM, Frederik Ramm wrote:
> Igor,
>
> On 02/18/11 11:40, Igor Podolskiy wrote:
>> just a random thought: what's wrong with using --dataset-bounding-box?
>> Importing the planet file into a database and doing a bunch of queries
>> against it is equivalent to creating a single disk buffer for all bb
>> tasks (the database _is_ the disk buffer, if you want). This still isn't
>> very elegant as it requires two passes (import into DB and export to
>> PBFs) but is IMHO more elegant than selection lists.
>
> I currently use bounding polygons so I'd have to add a
> --dataset-bounding-polygon task for that, but that should be possible.
>
> The major problem is finding a database that would be able to import the
> data, index it, and fulfil my requests, all within less than 24 hours.
> From what I hear, PostGIS takes a week for the import step alone, but I
> want to produce fresh extracts every day. So I would have to import into
> PostGIS and then apply the daily diffs, but I could only do that if
> applying a daily diff and making all the bounding-polygon requests would
> take much less than a day, and somehow I suspect that's not going to work.
>
>> Don't get me wrong, maybe there's a reason why dataset tasks are
>> unsuitable - but I've not found any... Maybe the dbb task itself needs
>> improvement - but in this case, this should be the way to go.
>
> I doubt that a performant solution can ever be found with PostGIS, but I'd
> love to hear from someone who can prove me wrong.

Processing a full day's worth of daily diffs is doable in much less than a day. The one time I did this on a server with only average IO performance (data spread across two disks), it would take about 6 hours to process a day of data.

Extracting continent-sized chunks from the database will probably be a problem though. The schema works well for 1x1 or 2x2 degree bboxes, but gets rapidly slower as the size of the returned data increases. Given that each of your extracts covers a large percentage of the overall planet, doing multiple passes across the entire planet will be tough to beat performance-wise. In other words, a database is best (and the only way possible) for small up-to-date extracts, but sequential reads are still best for large extracts where multi-hour time lags are acceptable.

I suspect that lots of optimisation is possible for databases, because for one thing the PostGIS data types take up humungous amounts of disk space. Perhaps someone with lots of time and some custom data types could improve on this ...

Brett
Re: [osmosis-dev] Improving completeWays/completeRelations performance
Hi,

On 02/18/11 12:47, André Joost wrote:
> You can use minutely diffs, which gives you a permanently current database.

Provided that applying a minutely diff (possibly even while the extraction requests are running) does indeed take less than a minute. Which, I repeat myself, I believe to be unlikely.

Bye
Frederik
Re: [osmosis-dev] Improving completeWays/completeRelations performance
On 18.02.11 12:20, Frederik Ramm wrote:
> I currently use bounding polygons so I'd have to add a
> --dataset-bounding-polygon task for that, but that should be possible.
>
> The major problem is finding a database that would be able to import the
> data, index it, and fulfil my requests, all within less than 24 hours.
> From what I hear, PostGIS takes a week for the import step alone, but I
> want to produce fresh extracts every day. So I would have to import into
> PostGIS and then apply the daily diffs, but I could only do that if
> applying a daily diff and making all the bounding-polygon requests would
> take much less than a day, and somehow I suspect that's not going to work.

You can use minutely diffs, which gives you a permanently current database.

Greetings,
André Joost
Re: [osmosis-dev] Improving completeWays/completeRelations performance
Igor,

On 02/18/11 11:40, Igor Podolskiy wrote:
> just a random thought: what's wrong with using --dataset-bounding-box?
> Importing the planet file into a database and doing a bunch of queries
> against it is equivalent to creating a single disk buffer for all bb tasks
> (the database _is_ the disk buffer, if you want). This still isn't very
> elegant as it requires two passes (import into DB and export to PBFs) but
> is IMHO more elegant than selection lists.

I currently use bounding polygons so I'd have to add a --dataset-bounding-polygon task for that, but that should be possible.

The major problem is finding a database that would be able to import the data, index it, and fulfil my requests, all within less than 24 hours. From what I hear, PostGIS takes a week for the import step alone, but I want to produce fresh extracts every day. So I would have to import into PostGIS and then apply the daily diffs, but I could only do that if applying a daily diff and making all the bounding-polygon requests would take much less than a day, and somehow I suspect that's not going to work.

> Don't get me wrong, maybe there's a reason why dataset tasks are
> unsuitable - but I've not found any... Maybe the dbb task itself needs
> improvement - but in this case, this should be the way to go.

I doubt that a performant solution can ever be found with PostGIS, but I'd love to hear from someone who can prove me wrong.

Bye
Frederik
Re: [osmosis-dev] Improving completeWays/completeRelations performance
Hi Frederik, hi everybody,

just a random thought: what's wrong with using --dataset-bounding-box? Importing the planet file into a database and doing a bunch of queries against it is equivalent to creating a single disk buffer for all bb tasks (the database _is_ the disk buffer, if you want). This still isn't very elegant as it requires two passes (import into DB and export to PBFs) but is IMHO more elegant than selection lists.

I think a solution with selection lists that wouldn't eat up 20 or more GB of RAM (the planet file won't shrink anytime soon, rather it will get bigger) would require you to implement something like a database anyway. So, why not use one that's already implemented and tested and so on? :)

Don't get me wrong, maybe there's a reason why dataset tasks are unsuitable - but I've not found any... Maybe the dbb task itself needs improvement - but in this case, this should be the way to go.

Best regards
Igor
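[Editor's note: the dataset-based workflow Igor suggests would look roughly like the invocation below. Task and option names are written from memory and untested; treat the whole invocation as an assumption rather than a verified command line.]

  # one-time import of the planet into the pgsql "simple" schema
  osmosis --read-pbf planet.osm.pbf --write-pgsql database=osm user=osm

  # per extract: query the database by bounding box and write a PBF
  osmosis --read-pgsql database=osm user=osm
          --dataset-bounding-box left=-10 right=30 bottom=35 top=72
          --write-pbf europe.osm.pbf

The import and the diff application are the expensive steps discussed elsewhere in this thread; the per-extract queries reuse the same database rather than re-reading the planet once per bounding box.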