Re: Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-31 Thread Alexander Farber
Thank you for the insights


Re: Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-31 Thread hector vass
On Sat, 30 Mar 2024, 10:04 Alexander Farber, 
wrote:

> Thank you, Justin -
>
> On Sat, Mar 30, 2024 at 4:33 AM Justin Clift 
> wrote:
>
>> On 2024-03-30 05:53, Alexander Farber wrote:
>> > I use the following postgresql.conf in my Dockerfile
>> > ( the full version at https://stackoverflow.com/a/78243530/165071 ),
>> > when loading a 28 GByte large europe-latest.osm.pbf
>>
>> Not specific conf file improvements, but for an initial data load
>> have you done things like turning off fsync(), deferring index
>> creating until after the data load finishes, and that kind of thing?
>>
>
> I will try the following commands in my Dockerfile then
> and later report back on any improvements:
>
> RUN set -eux && \
> pg_ctl init && \
> echo "shared_buffers = 1GB">> $PGDATA/postgresql.conf
> && \
> echo "work_mem = 50MB" >> $PGDATA/postgresql.conf
> && \
> echo "maintenance_work_mem = 10GB" >> $PGDATA/postgresql.conf
> && \
> echo "autovacuum_work_mem = 2GB"   >> $PGDATA/postgresql.conf
> && \
> echo "wal_level = minimal" >> $PGDATA/postgresql.conf
> && \
> echo "checkpoint_timeout = 60min"  >> $PGDATA/postgresql.conf
> && \
> echo "max_wal_size = 10GB" >> $PGDATA/postgresql.conf
> && \
> echo "checkpoint_completion_target = 0.9"  >> $PGDATA/postgresql.conf
> && \
> echo "max_wal_senders = 0" >> $PGDATA/postgresql.conf
> && \
> echo "random_page_cost = 1.0"  >> $PGDATA/postgresql.conf
> && \
> echo "password_encryption = scram-sha-256" >> $PGDATA/postgresql.conf
> && \
> echo "fsync = off">>
> $PGDATA/postgresql.conf && \
> pg_ctl start && \
> createuser --username=postgres $PGUSER && \
> createdb --username=postgres --encoding=UTF8 --owner=$PGUSER
> $PGDATABASE && \
> psql --username=postgres $PGDATABASE --command="ALTER USER $PGUSER
> WITH PASSWORD '$PGPASSWORD';" && \
> psql --username=postgres $PGDATABASE --command='CREATE EXTENSION IF
> NOT EXISTS postgis;' && \
> psql --username=postgres $PGDATABASE --command='CREATE EXTENSION IF
> NOT EXISTS hstore;' && \
> osm2pgsql --username=$PGUSER --database=$PGDATABASE --create
> --cache=6 --hstore --latlong /data/map.osm.pbf && \
> rm -f /data/map.osm.pbf && \
> pg_ctl stop && \
> echo "fsync = on">>
> $PGDATA/postgresql.conf && \
> echo '# TYPE DATABASE USER ADDRESS METHOD'>
> $PGDATA/pg_hba.conf && \
> echo "local all postgres peer"   >>
> $PGDATA/pg_hba.conf && \
> echo "local $PGDATABASE $PGUSER   scram-sha-256" >>
> $PGDATA/pg_hba.conf && \
> echo "host  $PGDATABASE $PGUSER 0.0.0.0/0 scram-sha-256" >>
> $PGDATA/pg_hba.conf
>
> The later fsync = on will override the former, right?
>
> Best regards
> Alex
>
>
>

2hrs sounds reasonable for Europe, it's a big place in terms of osm data
and osm2pgsql is doing processing to convert to geometry objects prior to
doing anything on the Postgresql side.
If you examine the --log--sql output for a small test country you can see
what it does in terms of the postgresql.
osm2pgsql gives options to trim the output to only what you need (so if you
don't want waterways, traffic features, parking places or places of worship
etc.. why load them)
Hopefully you have found the excellent geofabrik
https://download.geofabrik.de/ source for osm data.
Rather than load this data afresh each update cycle you would be better off
simply loading the changes so the .osc files or ... osm osmosis will create
the equivalent of a diff file for you
Looks like you are already using osm2psql's recommended postgresql.config
settings, I'd be surprised if this was way off.  Getting as close to tin
rather than virtual machines and containers will also help, lots of io
going on here.
If you are only interested in the geography you might consider geofabrik's
shapefile available for many countries, they have already done some of the
work for you.

Apologies if you are already a long way down this route & just asking about
the final stage of loading the osm2pgsql output to Postgresql but however
well you do here I would only expect small marginal gains.


Re: Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-31 Thread Justin Clift

On 2024-03-31 04:07, Alexander Farber wrote:
Turning fsync = off has resulted in no noticable build time reduction 
for
my Dockerfile with OSM Europe data, but still thank you for the 
suggestion!


No worries. :)

With this import you're doing, is it something that will be repeated a 
lot

with the exact same data set, or is this a once off thing?

If it's something that'll be repeated a lot (maybe part of some 
automated

process?), then it might be worth making a backup / snapshot / something
of the database after the import has completed.

With a backup or snapshot in place (depends on the storage you're 
using),
you could potentially load things from that backup / snapshot (etc) 
instead

of having to do the import all over again each time.

Regards and best wishes,

Justin Clift




Re: Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-30 Thread Alexander Farber
Turning fsync = off has resulted in no noticable build time reduction for
my Dockerfile with OSM Europe data, but still thank you for the suggestion!

>


Re: Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-30 Thread Alexander Farber
Thank you, Justin -

On Sat, Mar 30, 2024 at 4:33 AM Justin Clift  wrote:

> On 2024-03-30 05:53, Alexander Farber wrote:
> > I use the following postgresql.conf in my Dockerfile
> > ( the full version at https://stackoverflow.com/a/78243530/165071 ),
> > when loading a 28 GByte large europe-latest.osm.pbf
>
> Not specific conf file improvements, but for an initial data load
> have you done things like turning off fsync(), deferring index
> creating until after the data load finishes, and that kind of thing?
>

I will try the following commands in my Dockerfile then
and later report back on any improvements:

RUN set -eux && \
pg_ctl init && \
echo "shared_buffers = 1GB">> $PGDATA/postgresql.conf
&& \
echo "work_mem = 50MB" >> $PGDATA/postgresql.conf
&& \
echo "maintenance_work_mem = 10GB" >> $PGDATA/postgresql.conf
&& \
echo "autovacuum_work_mem = 2GB"   >> $PGDATA/postgresql.conf
&& \
echo "wal_level = minimal" >> $PGDATA/postgresql.conf
&& \
echo "checkpoint_timeout = 60min"  >> $PGDATA/postgresql.conf
&& \
echo "max_wal_size = 10GB" >> $PGDATA/postgresql.conf
&& \
echo "checkpoint_completion_target = 0.9"  >> $PGDATA/postgresql.conf
&& \
echo "max_wal_senders = 0" >> $PGDATA/postgresql.conf
&& \
echo "random_page_cost = 1.0"  >> $PGDATA/postgresql.conf
&& \
echo "password_encryption = scram-sha-256" >> $PGDATA/postgresql.conf
&& \
echo "fsync = off">>
$PGDATA/postgresql.conf && \
pg_ctl start && \
createuser --username=postgres $PGUSER && \
createdb --username=postgres --encoding=UTF8 --owner=$PGUSER
$PGDATABASE && \
psql --username=postgres $PGDATABASE --command="ALTER USER $PGUSER WITH
PASSWORD '$PGPASSWORD';" && \
psql --username=postgres $PGDATABASE --command='CREATE EXTENSION IF NOT
EXISTS postgis;' && \
psql --username=postgres $PGDATABASE --command='CREATE EXTENSION IF NOT
EXISTS hstore;' && \
osm2pgsql --username=$PGUSER --database=$PGDATABASE --create
--cache=6 --hstore --latlong /data/map.osm.pbf && \
rm -f /data/map.osm.pbf && \
pg_ctl stop && \
echo "fsync = on">> $PGDATA/postgresql.conf
&& \
echo '# TYPE DATABASE USER ADDRESS METHOD'>
$PGDATA/pg_hba.conf && \
echo "local all postgres peer"   >>
$PGDATA/pg_hba.conf && \
echo "local $PGDATABASE $PGUSER   scram-sha-256" >>
$PGDATA/pg_hba.conf && \
echo "host  $PGDATABASE $PGUSER 0.0.0.0/0 scram-sha-256" >>
$PGDATA/pg_hba.conf

The later fsync = on will override the former, right?

Best regards
Alex


Re: Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-29 Thread Justin Clift

On 2024-03-30 05:53, Alexander Farber wrote:


I use the following postgresql.conf in my Dockerfile
( the full version at https://stackoverflow.com/a/78243530/165071 ),
when loading a 28 GByte large europe-latest.osm.pbf



Is anybody please able to spot any improvements I could
apply to the postgresql.conf config values at the top of my mail,
that could reduce the loading time of almost 2 hours?


Not specific conf file improvements, but for an initial data load
have you done things like turning off fsync(), deferring index
creating until after the data load finishes, and that kind of thing?

You don't want fsync() off when you're using the database in
production, but for long data load scenarios it seems like it'd
be a decent fit.

With .pbf files, from skimming over how they're described here:

  https://wiki.openstreetmap.org/wiki/PBF_Format

... they don't seem to be optimised for loading into a database. (?)

It kind of looks like they'd be stored into individual records,
which probably means they'd be getting imported as individual
INSERT statements rather than something that's optimised for
bulk loading. :(

Regards and best wishes,

Justin Clift




Please recommend postgresql.conf improvements for osm2pgsql loading Europe

2024-03-29 Thread Alexander Farber
Good evening,

I use the following postgresql.conf in my Dockerfile
( the full version at https://stackoverflow.com/a/78243530/165071 ),
when loading a 28 GByte large europe-latest.osm.pbf
into PostgreSQL 16 with PostGIS extension:

echo "shared_buffers = 1GB">> $PGDATA/postgresql.conf
&& \
echo "work_mem = 50MB" >> $PGDATA/postgresql.conf
&& \
echo "maintenance_work_mem = 10GB" >> $PGDATA/postgresql.conf
&& \
echo "autovacuum_work_mem = 2GB"   >> $PGDATA/postgresql.conf
&& \
echo "wal_level = minimal" >> $PGDATA/postgresql.conf
&& \
echo "checkpoint_timeout = 60min"  >> $PGDATA/postgresql.conf
&& \
echo "max_wal_size = 10GB" >> $PGDATA/postgresql.conf
&& \
echo "checkpoint_completion_target = 0.9"  >> $PGDATA/postgresql.conf
&& \
echo "max_wal_senders = 0" >> $PGDATA/postgresql.conf
&& \
echo "random_page_cost = 1.0"  >> $PGDATA/postgresql.conf
&& \
echo "password_encryption = scram-sha-256" >> $PGDATA/postgresql.conf
&& \

And after/during osm2pgsql is loading the data into the database,
the following messages are displayed by PostgreSQL:

2024-03-29T14:50:19.6909027Z 2024-03-29 14:50:19Processed 3355283586
nodes in 432s (7m 12s) - 7767k/s
2024-03-29T14:50:19.6909478Z 2024-03-29 14:50:19Processed 406646166
ways in 2720s (45m 20s) - 150k/s
2024-03-29T14:50:19.6909785Z 2024-03-29 14:50:19Processed 7265827
relations in 672s (11m 12s) - 11k/s
2024-03-29T14:50:20.0292733Z [0m [91m2024-03-29 14:50:20  node cache:
stored: 3355283586(100.00%), storage efficiency: 56.37% (dense blocks:
268264, sparse nodes: 1877553733), hit rate: 100.00%
2024-03-29T14:50:20.0293430Z 2024-03-29 14:50:20  Clustering table
'planet_osm_line' by geometry...
2024-03-29T14:50:20.0293787Z 2024-03-29 14:50:20  Clustering table
'planet_osm_point' by geometry...
2024-03-29T14:50:20.1056438Z [0m [91m2024-03-29 14:50:20  Clustering table
'planet_osm_polygon' by geometry...
2024-03-29T14:50:20.1056819Z 2024-03-29 14:50:20  Clustering table
'planet_osm_roads' by geometry...
2024-03-29T15:15:27.3834184Z [0m [91m2024-03-29 15:15:24  Creating geometry
index on table 'planet_osm_roads'...
2024-03-29T15:15:56.7850293Z [0m2024-03-29 15:15:56.784 UTC [344] LOG:
skipping vacuum of "planet_osm_roads" --- lock not available
2024-03-29T15:16:56.7864755Z 2024-03-29 15:16:56.785 UTC [347] LOG:
skipping vacuum of "planet_osm_roads" --- lock not available
2024-03-29T15:17:56.7891503Z 2024-03-29 15:17:56.788 UTC [350] LOG:
skipping vacuum of "planet_osm_roads" --- lock not available
2024-03-29T15:18:16.5830638Z [91m2024-03-29 15:18:16  Analyzing table
'planet_osm_roads'...
2024-03-29T15:18:18.1485666Z [0m [91m2024-03-29 15:18:18  All
postprocessing on table 'planet_osm_roads' done in 1678s (27m 58s).
2024-03-29T15:22:17.1463669Z [0m [91m2024-03-29 15:22:17  Creating geometry
index on table 'planet_osm_point'...
2024-03-29T15:22:56.8860848Z [0m2024-03-29 15:22:56.885 UTC [365] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:23:56.8890669Z 2024-03-29 15:23:56.888 UTC [368] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:24:56.9521355Z 2024-03-29 15:24:56.951 UTC [371] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:25:56.9166127Z 2024-03-29 15:25:56.915 UTC [374] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:26:57.4606148Z 2024-03-29 15:26:57.460 UTC [377] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:27:56.9555832Z 2024-03-29 15:27:56.955 UTC [380] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:28:56.9718398Z 2024-03-29 15:28:56.971 UTC [383] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:29:56.9912768Z 2024-03-29 15:29:56.990 UTC [386] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:30:57.1482580Z 2024-03-29 15:30:57.147 UTC [389] LOG:
skipping vacuum of "planet_osm_point" --- lock not available
2024-03-29T15:31:01.2674499Z [91m2024-03-29 15:31:01  Analyzing table
'planet_osm_point'...
2024-03-29T15:31:02.3310950Z [0m [91m2024-03-29 15:31:02  All
postprocessing on table 'planet_osm_point' done in 2442s (40m 42s).
2024-03-29T15:41:13.1544481Z [0m [91m2024-03-29 15:41:13  Creating geometry
index on table 'planet_osm_line'...
2024-03-29T15:41:57.2434689Z [0m2024-03-29 15:41:57.242 UTC [422] LOG:
skipping vacuum of "planet_osm_line" --- lock not available
2024-03-29T15:42:57.2519273Z 2024-03-29 15:42:57.251 UTC [425] LOG:
skipping vacuum of "planet_osm_line" --- lock not available
2024-03-29T15:43:57.2626514Z 2024-03-29 15:43:57.262 UTC [428] LOG:
skipping vacuum of "planet_osm_line" --- lock not available
2024-03-29T15:44:57.2531401Z 2024-03-29 15:44:57.252 UTC [431] LOG:
skipping vacuum of "planet_osm_line" --- lock not available