Re: [osmosis-dev] Changes to Osmosis Pgsql Schema

2010-08-09 Thread Brett Henderson
Hi David,

On Sat, Aug 7, 2010 at 10:10 PM, David Carmean d...@halibut.com wrote:


 Brett, et al:

 Please consider leaving the current simple schema intact (with
 the addition of the index clustering) and instead create a third
 schema for the HSTORE version.

 Losing the current schema would break at least two of my use cases:

 1: I use postGIS as back-end storage for GIS client software, and
 use views and tables derived from queries against the OSM simple
 schema to create various layers of OSM data in the GIS client,
 e.g. roads, footpaths/trails/bike routes, hydrography, landuse,
 structures (buildings), etc.  This is such an obvious technique
 that I'd be very surprised not to find a lot of others doing the
 same.


It's great to hear you're finding this schema useful.  I don't hear a lot
from people using it, so getting feedback is always appreciated.

Is it enough to have the existing 0.36 version of Osmosis to do this?  Is it
possible to change your queries to use the hstore data instead?

I'm hoping to avoid maintaining lots of different schemas if possible.



 2: I am beginning a project to parallelize OSM data processing
 with Hadoop, and the postgreSQL copy-format output is perfect
 for loading into HDFS. (If this goes well, I'd want to discuss
 ideas for adapting Osmosis to talk to Hadoop, eventually.)

 I also think that the simple schema lends itself better to
 research uses, especially when the input is a country/state
 extract or smaller.

 That said, I can also see cases where it would be useful to have
 the option to create a simple++ schema, with the separate
 tags table *and* the HSTORE columns as well.


I did think about adding the hstore columns as an optional add-on to the
existing schema.  I already have a couple of cases of this (ie. action
table, way linestring column, way bbox column).  But this change is fairly
invasive and given that I have to re-work a large amount of existing code
I'd end up having to implement everything twice.

If the only bit you need here is the COPY format, you may be better off with
a new task that creates a file in exactly the format you want.



 The CLUSTER operation is indeed a big performance booster.  Also,
 creating the optional ways.bbox column is at least two orders of
 magnitude faster when performed against an indexed ways.linestring
 column rather than as-shipped, which uses the nodes geometry before
 that has been indexed.


The general intent is that you only create one of the linestring or bbox
columns.  If you already have a linestring column is there any point
creating the bbox column?

As for using the node geometry before indexing, you shouldn't have to do
that.  The fastest way to create the linestring or bbox columns will be to
use the enableLinestringBuilder or enableBboxBuilder options on the various
write-pgsql tasks.

Cheers,
Brett
___
osmosis-dev mailing list
osmosis-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/osmosis-dev


Re: [osmosis-dev] Changes to Osmosis Pgsql Schema

2010-08-07 Thread David Carmean

Brett, et al:

Please consider leaving the current simple schema intact (with 
the addition of the index clustering) and instead create a third 
schema for the HSTORE version.  

Losing the current schema would break at least two of my use cases:

1: I use postGIS as back-end storage for GIS client software, and 
use views and tables derived from queries against the OSM simple 
schema to create various layers of OSM data in the GIS client, 
e.g. roads, footpaths/trails/bike routes, hydrography, landuse, 
structures (buildings), etc.  This is such an obvious technique 
that I'd be very surprised not to find a lot of others doing the 
same.

2: I am beginning a project to parallelize OSM data processing 
with Hadoop, and the postgreSQL copy-format output is perfect 
for loading into HDFS. (If this goes well, I'd want to discuss 
ideas for adapting Osmosis to talk to Hadoop, eventually.)

I also think that the simple schema lends itself better to 
research uses, especially when the input is a country/state 
extract or smaller.  

That said, I can also see cases where it would be useful to have 
the option to create a simple++ schema, with the separate 
tags table *and* the HSTORE columns as well.

The CLUSTER operation is indeed a big performance booster.  Also, 
creating the optional ways.bbox column is at least two orders of 
magnitude faster when performed against an indexed ways.linestring 
column rather than as-shipped, which uses the nodes geometry before 
that has been indexed.



On Sat, Aug 07, 2010 at 09:00:07PM +1000, Brett Henderson wrote:
 Hi All,
 
 I'm currently working on some changes to the Osmosis simple schema which
 may be of interest to others.  I'd be interested to hear if anybody has any
 major issues with this, or any better suggestions.
 
 The current schema performs poorly, largely due to the data for typical
 queries being spread across the disk.  It is well indexed, but retrieving
 large numbers of rows requires huge numbers of disk seeks.  Performance
 would be better if data was physically grouped according to geospatial
 location.  I am planning several changes to address this:
 
- CLUSTER the nodes table by the geom column index, and ways column by
the (optional) linestring column index.  I've already tested this out for
bbox style queries and it makes queries on these tables significantly
quicker.  It takes a long time to perform the CLUSTER operation, but
subsequent queries are then improved.
- Move the tags tables into hstore tags columns on the nodes, ways and
relations tables.  This will avoid the need to join to external tables, and
will allow the tags data to also be clustered geospatially by the 
 geospatial
indexes.  For entities with large numbers of tags or large tags the data 
 may
be stored externally (
http://www.postgresql.org/docs/8.4/interactive/storage-toast.html), but
this should be the exception and most tags should fit inline in the table.
- Create a nodes column on the ways table.  This will contain an array
which holds only the ids of nodes that make up the way.  For typical
bounding box style queries this will allow completeWays style
functionality to be performed more efficiently without having to join to
large numbers of rows in the way_nodes table.  For bbox style queries in
some use cases it will also be possible to create synthetic node entities
(without tag or user info) for missing nodes lying outside the bounding box
which will further improve performance.
 
 So far I've written a migration script for moving tags data into hstore
 columns, and I've figured out how to get Java and JDBC playing nicely with
 hstore columns.  The next step is to update existing tasks to use these
 columns.  As part of this change I will also change the way the bounding box
 queries work so that they store more data in the temporary tables to avoid
 having to join back to the main data tables.  Again, this will significantly
 reduce disk seeking.
 
 I'll move onto the addition of a way.nodes column after I've finished the
 tags changes.
 
 I'm not sure when I'll find time to finish all of this, but it's the main
 thing I'm working on.
 
 Brett

 ___
 osmosis-dev mailing list
 osmosis-dev@openstreetmap.org
 http://lists.openstreetmap.org/listinfo/osmosis-dev


___
osmosis-dev mailing list
osmosis-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/osmosis-dev