Re: [GENERAL] Advise on memory usage limitation by PostgreSQL on Windows

2015-09-23 Thread Daniel Begin
Venkata , 

Not really advice, but the documentation has specific comments about shared_buffers 
values in a Windows environment. See…

http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-SHARED-BUFFERS

 

I use Windows 7 64-bit, 16 GB RAM, 4 cores. According to the above link I set 
shared_buffers = 128MB with an effective_cache_size of 12GB.
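
For reference, the corresponding postgresql.conf lines would look roughly as 
follows (a sketch only; the file location and the rest of the configuration 
depend on the installation):

# memory-related settings mentioned above
shared_buffers = 128MB          # the documentation recommends small values on Windows
effective_cache_size = 12GB     # a planner hint, not an actual allocation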

Daniel

 

From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Venkata Balaji N
Sent: September-22-15 21:33
To: pgsql-general
Subject: [GENERAL] Advise on memory usage limitation by PostgreSQL on Windows

 

Hello,

 

We would like to know if there is any limitation around memory utilization by 
PostgreSQL on Windows systems.

 

As of now, we do not have the details about the exact Windows version in use.

 

Windows server is with 5 GB RAM and 4 CPUs (4 cores). PostgreSQL Version is 
9.1.x which will possibly be upgraded to 9.4.x.

 

After a detailed analysis of memory usage by the OS and other processes, is it safe 
to advise configuring shared_buffers to 2 GB?

 

Any advice will be appreciated.

 

Regards,

Venkata Balaji N

 

Fujitsu Australia



[GENERAL] PostgreSQL and Windows 10 upgrade

2015-09-23 Thread Daniel Begin
Has anyone running PostgreSQL 9.3 on Windows 7 upgraded to Windows 10?
Successfully or not?

 

Daniel



Re: [GENERAL] Planner cost adjustments

2015-06-19 Thread Daniel Begin
Here is a last follow-up on adjusting the planner's cost calculations.

1 - Francisco, you provided me with the clue I was looking for at this point!

I learned enough PL/pgSQL over the last week to reprogram all my processes (not 
always elegant, but it works). By processing individual records in a loop, I 
will be able to move forward with my work and save individual results as they 
are processed. 
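
As an illustration only, the per-record loop could look something like the 
sketch below (the results table is hypothetical; the other names are taken 
from queries quoted later in this thread):

DO $$
DECLARE
    r record;
BEGIN
    -- one user at a time, so every lookup gets its own small, predictable plan
    FOR r IN SELECT id FROM sample.users LOOP
        INSERT INTO results (user_id, changeset_id, closed)
        SELECT c.user_id, c.id, c.closed
        FROM changesets c
        WHERE c.user_id = r.id;   -- indexed single-user lookup
    END LOOP;
END;
$$;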

Furthermore, it is now easier to predict job completion times, since the DB 
uses similar plans for each record. I did a test on a stratified sample of 
records and I expect the jobs will run for about two weeks (as opposed to 
months in previous estimates!-)

2 - PL/pgSQL helped me bypass the problem. As PT suggested, I will try to 
set up a test case that demonstrates the problem and post it to the developers' 
list so they might figure out what to do in such a situation.

Thanks to all,
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: June-11-15 10:05
To: Daniel Begin
Cc: Bill Moran; pgsql-general@postgresql.org; Tomas Vondra; Melvin Davidson
Subject: Re: [GENERAL] Planner cost adjustments

Hi Daniel:

On Thu, Jun 11, 2015 at 2:38 PM, Daniel Begin jfd...@hotmail.com wrote:
 The remaining problem seems related to the statistics of some large tables.
 On one hand, I might increase the statistics target for these tables to 
 500, or even to 1000, and look at the results (but I have doubts it 
 will help). On the other hand, I could simply force enable_seqscan to 
 OFF for queries dealing with them.

 Still not sure about the best solution but the problem is definitely 
 narrower :-)

One somewhat unrelated point. IIRC your problems were related to queries doing 
joins between selected sets of indexed IDs on a smaller table and some very 
large tables in which they then have to be looked up (I'm not able to follow 
too well which is which, users vs. changesets, I'm a bit lost). Given that your 
runtimes are always high (in the seconds range, so it seems wire speed / 
latencies are not too much of an issue) and that selectivity estimates on huge 
tables are always problematic and may be thwarting your plans, you may be able 
to get faster results by splitting your query.

If I read your plans correctly, that would be selecting your 600 users in one 
query, then preparing the changeset query for a single user_id, which should be 
indexed, and looping it for every user. Given that current machines can easily 
send and receive 600 queries in a second, it may lead to a simpler solution. 
This means you're using the DB as a somewhat intelligent plain old indexed 
file, but sometimes this is the simpler approach (heck, some of my code uses 
algorithms from the tape era because they were the fastest way I could do it).
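
A minimal sketch of that split approach, reusing table names that appear 
elsewhere in this thread (the prepared-statement name and the literal id are 
placeholders):

-- prepared once per connection
PREPARE changesets_for_user (bigint) AS
    SELECT user_id, id AS changeset_id, closed
    FROM changesets
    WHERE user_id = $1;           -- single-user lookup, should stay on the index

-- the application first fetches the ~600 ids with:  SELECT id FROM sample.users;
-- then, for each returned id (12345 is only a placeholder):
EXECUTE changesets_for_user(12345);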

I needed to do this in one of my programs: the optimizer kept selecting bad 
plans, so I did the first query, held the results in memory, and then prepared 
and executed in a loop from the app. My query was selecting a couple thousand 
values from a sub-million-row table and joining with a multimillion-row one, 
getting a couple hundred matches per original value. Splitting it made the 
queries on the big table always run indexed and fast (and, as a side bonus, 
avoided duplicating parts of the first record on the wire a hundred times, 
which was nice since the short table was wide and I only needed 3 short fields 
from the second one; that made the first query run at wire speed and the 
second at disk speed).

Francisco Olarte.







Re: [GENERAL] Planner cost adjustments

2015-06-11 Thread Daniel Begin
Hi Bill, you are right about not changing two variables at the same time.

I first increased the statistics target for problematic columns. It helped,
but did not provide significant improvement. I then changed
random_page_cost, which really improved the planner's choices between Seq/Index
scans.

Since I now get the expected Seq/Index scan selection on most tables except the
larger ones, I am not sure that pushing random_page_cost further is a good idea.
I wonder if it may trigger index scans on smaller tables when a Seq scan would
have been less expensive.

The remaining problem seems related to the statistics of some large tables.
On one hand, I might increase the statistics target for these tables to 500,
or even to 1000, and look at the results (but I have doubts it will help). On
the other hand, I could simply force enable_seqscan to OFF for queries
dealing with them.
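
For illustration, either option amounts to something like the following (the 
table/column and the target value are only examples):

-- raise the per-column statistics target, then re-analyze
ALTER TABLE nodes ALTER COLUMN changeset_id SET STATISTICS 500;
ANALYZE nodes;

-- or, as a last resort, disable seq scans around the affected queries only
SET enable_seqscan = off;
-- run the problematic query here
RESET enable_seqscan;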

Still not sure about the best solution but the problem is definitely
narrower :-)
Daniel


-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Bill Moran
Sent: June-11-15 06:51
To: Daniel Begin
Cc: pgsql-general@postgresql.org; 'Tomas Vondra'; 'Melvin Davidson'
Subject: Re: [GENERAL] Planner cost adjustments

On Wed, 10 Jun 2015 17:20:00 -0400
Daniel Begin jfd...@hotmail.com wrote:

 Here is a follow-up on adjusting the planner's cost calculations
 
 -The statistics target of problematic columns was increased from 100 to 200.
 -Analyze was run on all concerned tables (actually on the whole DB).
 -Random_page_cost was decreased from 4 to 2.

Since you changed two things, there's no way to be sure which change led to
the improvement. You seem to be assuming that changing the statistics target
was what helped. While that _may_ be right, it might also have been the
change to random_page_cost.

 As a result, about 80% of queries are now using what I would consider 
 an optimal plan. However, the planner keeps using full table scans for 
 large tables...
 
 For instance, I ran a query that should have used an index scan on two 
 similar test tables. The planner had selected an index scan for the 
 smaller one and a Seq Scan for larger one. Except for their sizes and 
 for one field not used in the exercise, the test tables were identical 
 and indexed on the same field. The smaller test table had 3.26E+10 
 records and the larger one
 3.78E+11 records.  
 
 The query looked like...
 SELECT cs.user_id, cs.changeset_id, nd.id, nd.version FROM 
 changesets_selection cs, a_test_table nd WHERE 
 nd.changeset_id=cs.changeset_id;
 
 In order to understand why the planner selected the Seq Scan instead 
 of an Index Scan on the large table (nodes), I ran an EXPLAIN ANALYSE 
 (on warm
 cache) using enable_seqscan set to OFF/ON.
 
 -- QUERY PLAN on larger table (nodes) with SET enable_seqscan=OFF -----;
 Nested Loop  (cost=100.58..10210480648.03 rows=194204592 width=40) (actual time=74.088..278008.579 rows=140112 loops=1)
    ->  Seq Scan on changesets_selection cs  (cost=100.00..1000110.44 rows=6644 width=24) (actual time=0.015..4.904 rows=6903 loops=1)
    ->  Index Scan using nodes_changesetidndx on nodes nd  (cost=0.58..31387.49 rows=29230 width=24) (actual time=12.655..40.259 rows=20 loops=6903)
          Index Cond: (changeset_id = cs.changeset_id)
 Total runtime: 278026.196 ms
 (5 rows)
 
 -Completed after less than 5 minutes of processing.
 -I assume that cost=100.00..1000110.44 for the Seq Scan of changesets_selection
 is an artefact of setting enable_seqscan=OFF.
 -From what I see, the estimated row count is still way off (1400X) for the
 large table, even though the statistics target was doubled to 200.
 
 -- QUERY PLAN on larger table (nodes) with SET enable_seqscan=ON ------;
 Hash Join  (cost=156171782.28..185673195.13 rows=194204592 width=40)
    Hash Cond: (cs.changeset_id = n.changeset_id)
    ->  Seq Scan on changesets_selection cs  (cost=0.00..110.44 rows=6644 width=24)
    ->  Hash  (cost=84959952.68..84959952.68 rows=3878771968 width=24)
          ->  Seq Scan on nodes nd  (cost=0.00..84959952.68 rows=3878771968 width=24)
 (5 rows)
 
 -Still running after 2:30 hours of processing! That is why I did not provide
 the actual time and rows (however, actual rows are provided in the first query
 plan).
 -Not surprisingly, the estimated row count is way off again for the large
 table - same stats, same results...
 
 It seems there is a problem with my large table statistics, even after
 increasing them to 200. Should I increase the statistics target to 500, or
 even to 1000?
 Is there something else I can adjust to get the appropriate plan?
 
 Comments/explanations would be appreciated.
 Daniel
 
 
 -Original Message-
 From: pgsql-general-ow...@postgresql.org
 [mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Daniel Begin
 Sent: June-03-15 06:32
 To: 'Bill Moran'
 Cc: pgsql-general

Re: [GENERAL] Planner cost adjustments

2015-06-10 Thread Daniel Begin
Here is a follow-up on adjusting the planner's cost calculations

-The statistics target of problematic columns was increased from 100 to 200.
-Analyze was run on all concerned tables (actually on the whole DB).
-Random_page_cost was decreased from 4 to 2.
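
In concrete terms, the three adjustments above amount to something like this
(the column is only an example; random_page_cost can also be changed in
postgresql.conf):

ALTER TABLE changesets ALTER COLUMN user_id SET STATISTICS 200;
ANALYZE;                      -- refresh statistics for every table in the database
SET random_page_cost = 2;     -- per session; edit postgresql.conf to make it permanent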

As a result, about 80% of queries are now using what I would consider an
optimal plan. However, the planner keeps using full table scans for large
tables... 

For instance, I ran a query that should have used an index scan on two
similar test tables. The planner had selected an index scan for the smaller
one and a Seq Scan for the larger one. Except for their sizes and for one field
not used in the exercise, the test tables were identical and indexed on the
same field. The smaller test table had 3.26E+10 records and the larger one
3.78E+11 records.  

The query looked like...
SELECT cs.user_id, cs.changeset_id, nd.id, nd.version  
FROM changesets_selection cs, a_test_table nd
WHERE nd.changeset_id=cs.changeset_id;

In order to understand why the planner selected the Seq Scan instead of an
Index Scan on the large table (nodes), I ran an EXPLAIN ANALYSE (on warm
cache) using enable_seqscan set to OFF/ON.

-- QUERY PLAN on larger table (nodes) with SET enable_seqscan=OFF ---------;
 Nested Loop  (cost=100.58..10210480648.03 rows=194204592 width=40)
(actual time=74.088..278008.579 rows=140112 loops=1)
   ->  Seq Scan on changesets_selection cs
(cost=100.00..1000110.44 rows=6644 width=24) (actual
time=0.015..4.904 rows=6903 loops=1)
   ->  Index Scan using nodes_changesetidndx on nodes nd
(cost=0.58..31387.49 rows=29230 width=24) (actual time=12.655..40.259
rows=20 loops=6903)
 Index Cond: (changeset_id = cs.changeset_id)
 Total runtime: 278026.196 ms
(5 rows) 

-Completed after less than 5 minutes of processing.
-I assume that cost=100.00..1000110.44 for the Seq Scan of
changesets_selection is an artefact of setting enable_seqscan=OFF.
-From what I see, the estimated row count is still way off (1400X) for
the large table, even though the statistics target was doubled to 200.

-- QUERY PLAN on larger table (nodes) with SET enable_seqscan=ON ----------;
 Hash Join  (cost=156171782.28..185673195.13 rows=194204592 width=40)
   Hash Cond: (cs.changeset_id = n.changeset_id)
   ->  Seq Scan on changesets_selection cs  (cost=0.00..110.44 rows=6644
width=24)
   ->  Hash  (cost=84959952.68..84959952.68 rows=3878771968 width=24)
         ->  Seq Scan on nodes nd  (cost=0.00..84959952.68 rows=3878771968
width=24)
(5 rows) 

-Still running after 2:30 hours of processing! That is why I did not provide
the actual time and rows (however, actual rows are provided in the first query
plan).
-Not surprisingly, the estimated row count is way off again for the
large table - same stats, same results...

It seems there is a problem with my large table statistics, even after
increasing them to 200. Should I increase the statistics target to 500, or even
to 1000?
Is there something else I can adjust to get the appropriate plan?

Comments/explanations would be appreciated
Daniel


-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Daniel Begin
Sent: June-03-15 06:32
To: 'Bill Moran'
Cc: pgsql-general@postgresql.org; 'Tomas Vondra'; 'Melvin Davidson'
Subject: Re: [GENERAL] Planner cost adjustments

Thanks, Bill,

About disk performance, all drives are identical and connected using USB3,
and yes, I can tweak values and restart Postgres without any hardship!-)
About seq_page_cost and random_page_cost, I am about to test different lower
values, as you and Tomas propose.
Raising the statistics target is a good idea. Since most of the data have a
power-law distribution, it might affect the statistics.

I will do as suggested and get back to the community for further comments. I
wanted to follow the procedure proposed by PT, just in case I eventually had
to go to step 4 (contact the PostgreSQL developers so they can improve the
planner).

And I am still open to other proposals.
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Bill Moran
Sent: June-02-15 23:31
To: Daniel Begin
Cc: pgsql-general@postgresql.org; 'Tomas Vondra'; 'Melvin Davidson'
Subject: Re: [GENERAL] Planner cost adjustments

On Tue, 2 Jun 2015 14:01:35 -0400
Daniel Begin jfd...@hotmail.com wrote:

 Here is a follow-up on the step-by-step procedure proposed by PT
 
 #1 - setup postgresql planner's cost estimate settings for my hardware. 
 --
 --
 --
 Current parameters values described in section 18.7.2 haven't been 
 changed except for the effective_cache_size seq_page_cost = 1 
 random_page_cost = 4 cpu_tuple_cost = 0.01 cpu_index_tuple_cost =
 0.005 cpu_operator_cost = 0.0025 effective_cache_size = 10GB
 
 I did a bunch

Re: [GENERAL] Planner cost adjustments

2015-06-03 Thread Daniel Begin
Thanks, Bill,

About disk performance, all drives are identical and connected using USB3,
and yes, I can tweak values and restart Postgres without any hardship!-)
About seq_page_cost and random_page_cost, I am about to test different lower
values, as you and Tomas propose.
Raising the statistics target is a good idea. Since most of the data have a
power-law distribution, it might affect the statistics.

I will do as suggested and get back to the community for further comments. I
wanted to follow the procedure proposed by PT, just in case I eventually had
to go to step 4 (contact the PostgreSQL developers so they can improve the
planner).

And I am still open to other proposals.
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Bill Moran
Sent: June-02-15 23:31
To: Daniel Begin
Cc: pgsql-general@postgresql.org; 'Tomas Vondra'; 'Melvin Davidson'
Subject: Re: [GENERAL] Planner cost adjustments

On Tue, 2 Jun 2015 14:01:35 -0400
Daniel Begin jfd...@hotmail.com wrote:

 Here is a follow-up on the step-by-step procedure proposed by PT
 
 #1 - setup postgresql planner's cost estimate settings for my hardware. 
 --
 --
 --
 Current parameter values described in section 18.7.2 haven't been changed,
 except for effective_cache_size:
 seq_page_cost = 1
 random_page_cost = 4
 cpu_tuple_cost = 0.01
 cpu_index_tuple_cost = 0.005
 cpu_operator_cost = 0.0025
 effective_cache_size = 10GB
 
 I did a bunch of tests on frequently used queries to see how well they 
 perform - using SET enable_seqscan = ON/OFF.
 As described earlier in this thread, the planner uses a Seq Scan on tables 
 even when using an Index Scan is, in this case, 5 times faster!
 Here are the logs of EXPLAIN ANALYSE on a query...

In an earlier message you mention that the drives are external to the
computer. I don't remember details, but external drives can be quite dicey
as far as performance goes, depending on the specific technology (USB vs.
external SATA vs. NAS, for example) as well as some pretty wild variances
between different brands of the same technology.
See:
http://www.databasesoup.com/2012/05/random-page-cost-revisited.html

As a result, I'm suspicious that the default values you're using for
random_page_cost and seq_page_cost are throwing things off because your
disks aren't performing like internally connected disks.

Correct me if I'm wrong on any of the assumptions I'm making here, but I got
the impression that you can tweak values and restart Postgres without any
hardship. If that's the case, I'm guessing that raising seq_page_cost
(possibly to 2) will cause Postgres to make better decisions about what are
good plans. My suggestion is to try some different values for those two
settings, doing several tests after each change, and see if you can find a
set of values that starts getting you good plans. It appears that the
planner thinks that it can get better performance by reading from the disk
in sequence than by picking random pages, and that makes me think that the
difference between seq_page_cost and random_page_cost is bigger than the
actual behavior of the drives.
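
One way to try such values without touching postgresql.conf between tests is at
the session level, for example (the numbers and the query are only
illustrations, reusing names from earlier in the thread):

SET seq_page_cost = 2;
SET random_page_cost = 4;
EXPLAIN ANALYZE
SELECT user_id, id AS changeset_id, closed
FROM changesets
WHERE changesets.user_id IN (SELECT id FROM sample.users);
-- repeat with other value pairs; make the winning pair permanent in postgresql.conf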

More comments below.

 osmdump=# SET enable_seqscan = ON;
 osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed 
 FROM changesets WHERE changesets.user_id IN(SELECT id FROM 
 sample.users);
 --
 --
 -
 Hash Semi Join  (cost=21.50..819505.27 rows=726722 width=24) (actual time=1574.914..7444.938 rows=338568 loops=1)
   Hash Cond: (changesets.user_id = users.id)
   ->  Seq Scan on changesets  (cost=0.00..745407.22 rows=25139722 width=24) (actual time=0.002..4724.578 rows=25133929 loops=1)
   ->  Hash  (cost=14.00..14.00 rows=600 width=8) (actual time=0.165..0.165 rows=600 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on users  (cost=0.00..14.00 rows=600 width=8) (actual time=0.003..0.073 rows=600 loops=1)
 Total runtime: 7658.715 ms
 (7 rows)
 
 osmdump=# SET enable_seqscan = OFF;
 osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed 
 FROM changesets WHERE changesets.user_id IN(SELECT id FROM 
 sample.users);
 --
 --
 
 Nested Loop  (cost=115.94..10001072613.45 rows=726722 width=24) (actual time=0.268..1490.515 rows=338568 loops=1)
   ->  HashAggregate  (cost=115.50..121.50 rows=600 width=8) (actual time=0.207..0.531 rows=600 loops=1)
         ->  Seq Scan on users  (cost=100.00..114.00 rows=600 width=8) (actual time=0.003..0.037 rows=600 loops=1)
   ->  Index Scan using changesets_useridndx on changesets (cost=0.44..1775.54 rows

[GENERAL] Problem when temp_tablespace get full?

2015-06-03 Thread Daniel Begin
Context.

A query almost filled an SSD drive (temp_tablespace) and my PC became
unresponsive.

I ran SELECT pg_cancel_backend(6600); (6600 was the pid of the process) and
got a successful answer (t).

However, just after that, the disk simply turned off and the system crashed.

I was able to restart the PC - The SSD drive and the data are still there
but the drive keeps going off without warning.

 

I know I will have to change my drive!-)

But my question is: what happens when the temp_tablespace drive gets full?

 

Daniel



Re: [GENERAL] Problem when temp_tablespace get full?

2015-06-03 Thread Daniel Begin
Sorry, my question might not have been clear…

 

I set temp_tablespaces to that location myself, but did not expect the drive 
could get full.

Multiple factors may have caused the drive to turn off (not necessarily 
PostgreSQL).

So, if that temp_tablespace gets full, how will PostgreSQL react to / manage the 
situation?

 

Daniel

 

From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of David G. Johnston
Sent: June-03-15 11:54
To: Daniel Begin
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Problem when temp_tablespace get full?

 

On Wed, Jun 3, 2015 at 11:14 AM, Daniel Begin jfd...@hotmail.com wrote:

 

A query

 

OK...

 

But my question is: what happens when the temp_tablespace drive gets full?

 

The query, probably...

 

There is also a temp_tablespaces parameter, which determines the placement of 
temporary tables and indexes, as well as temporary files that are used for 
purposes such as sorting large data sets.

 

http://www.postgresql.org/docs/9.4/static/manage-ag-tablespaces.html
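
For illustration, pointing temporary files at a specific drive looks roughly
like this (the tablespace name and directory are hypothetical):

CREATE TABLESPACE temp_space LOCATION 'E:/pg_temp';  -- the directory must already exist
SET temp_tablespaces = 'temp_space';                 -- per session, or set it in postgresql.conf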

 

David J.




Re: [GENERAL] Planner cost adjustments

2015-06-02 Thread Daniel Begin
Here is a follow-up on the step-by-step procedure proposed by PT

#1 - Set up the PostgreSQL planner's cost estimate settings for my hardware. 

--
Current parameter values described in section 18.7.2 haven't been changed,
except for effective_cache_size:
seq_page_cost = 1
random_page_cost = 4
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025 
effective_cache_size = 10GB

I did a bunch of tests on frequently used queries to see how well they
perform - using SET enable_seqscan = ON/OFF. 
As described earlier in this thread, the planner uses a Seq Scan on tables even
when using an Index Scan is, in this case, 5 times faster!
Here are the logs of EXPLAIN ANALYSE on a query...
 
osmdump=# SET enable_seqscan = ON;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);

-
 Hash Semi Join  (cost=21.50..819505.27 rows=726722 width=24) (actual
time=1574.914..7444.938 rows=338568 loops=1)
   Hash Cond: (changesets.user_id = users.id)
   ->  Seq Scan on changesets  (cost=0.00..745407.22 rows=25139722 width=24)
(actual time=0.002..4724.578 rows=25133929 loops=1)
   ->  Hash  (cost=14.00..14.00 rows=600 width=8) (actual time=0.165..0.165
rows=600 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on users  (cost=0.00..14.00 rows=600 width=8) (actual
time=0.003..0.073 rows=600 loops=1)
 Total runtime: 7658.715 ms
(7 rows)

osmdump=# SET enable_seqscan = OFF;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);


 Nested Loop  (cost=115.94..10001072613.45 rows=726722 width=24)
(actual time=0.268..1490.515 rows=338568 loops=1)
   ->  HashAggregate  (cost=115.50..121.50 rows=600 width=8)
(actual time=0.207..0.531 rows=600 loops=1)
         ->  Seq Scan on users  (cost=100.00..114.00
rows=600 width=8) (actual time=0.003..0.037 rows=600 loops=1)
   ->  Index Scan using changesets_useridndx on changesets
(cost=0.44..1775.54 rows=1211 width=24) (actual time=0.038..2.357 rows=564
loops=600)
 Index Cond: (user_id = users.id)
 Total runtime: 1715.517 ms
(6 rows)
 

#2 - Run ANALYZE DATABASE and look at performance/planning improvement.

--
I ran ANALYZE on the whole database and then reran the query. It did not
produce any significant improvement, according to the EXPLAIN ANALYSE below...

osmdump=# SET enable_seqscan = ON;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);

-
 Hash Semi Join  (cost=21.50..819511.42 rows=729133 width=24) (actual
time=1538.100..7307.743 rows=338568 loops=1)
   Hash Cond: (changesets.user_id = users.id)
   ->  Seq Scan on changesets  (cost=0.00..745390.84 rows=25138084 width=24)
(actual time=0.027..4620.691 rows=25133929 loops=1)
   ->  Hash  (cost=14.00..14.00 rows=600 width=8) (actual time=0.300..0.300
rows=600 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on users  (cost=0.00..14.00 rows=600 width=8) (actual
time=0.022..0.187 rows=600 loops=1)
 Total runtime: 7519.254 ms
(7 rows)

osmdump=# SET enable_seqscan = OFF;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);


 Nested Loop  (cost=115.94..10001090810.49 rows=729133 width=24)
(actual time=0.268..1466.248 rows=338568 loops=1)
   ->  HashAggregate  (cost=115.50..121.50 rows=600 width=8)
(actual time=0.205..0.530 rows=600 loops=1)
         ->  Seq Scan on users  (cost=100.00..114.00
rows=600 width=8) (actual time=0.003..0.035 rows=600 loops=1)
   ->  Index Scan using changesets_useridndx on changesets
(cost=0.44..1805.83 rows=1215 width=24) (actual time=0.036..2.314 rows=564
loops=600)
 Index Cond: (user_id = users.id)
 Total runtime: 1677.447 ms
(6 rows)

#3 - Run EXPLAIN ANALYZE and look for discrepancies between the estimated
and actual times

--
Looking at the above results, there are obvious discrepancies between the
expected/actual rows and times!
I dug a 

[GENERAL] Planner cost adjustments

2015-05-29 Thread Daniel Begin
Hi all,

Running some queries, I found that the planner often selects a sequential scan
instead of an index scan, even if the latter is way faster (one order of
magnitude faster if I consider some tests I made by setting enable_seqscan =
ON/OFF). How can I figure out which parameters I should adjust to get the
planner to select an appropriate plan that better reflects my DB setup?

 

I had a look at
http://www.postgresql.org/docs/9.3/static/runtime-config-query.html but at
this point it is too much information for me;-)  Any rules of thumb or recipes
I could use to decide which parameters I should look at first?

 

Daniel



Re: [GENERAL] Planner cost adjustments

2015-05-29 Thread Daniel Begin
Thanks, Tomas,
Here are a few answers to your questions/comments.

- The write cache is the on-drive write cache, and I agree on the risks... I
plugged the drives into a UPS.
- Spread over is done with tablespaces.
- The 30% / 40% ratios mean that 3 indexes out of 10 and 4 tables out of 10 are
larger than 10GB.
- Confused comments about the random_page_cost setting? Well, I was just citing
the doc I found, which was confusing for me (at this stage!-)
- Sadly I haven't provided any example queries? Well, you are right. When I
found this strange behavior a couple of weeks ago, I did not question it too
much and did not keep comparison stats, but as things are moving forward, I am
expecting the problem to arise again, so I asked the list hoping there was a
simple just do that answer floating around. However, as PT suggested, at one
point I will have to run new tests and I'll then let you know :-)
- Why are you tuning temp_buffers? Simply because, looking at others' configs
(on the web), they tend to have more than the default value when dealing with
large and complex queries.
- Shouldn't you tune shared_buffers instead? Yes, but I forgot to mention
that...
shared_buffers = 128MB, since it seems that on Windows the useful range is
64MB to 512MB. 


Best regards,
Daniel
 


-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Tomas Vondra
Sent: May-29-15 20:19
To: Daniel Begin; 'PT'
Cc: pgsql-general@postgresql.org; 'Melvin Davidson'
Subject: Re: [GENERAL] Planner cost adjustments

Hi,

On 05/29/15 22:56, Daniel Begin wrote:
 Omg! I was not expecting such a step-by-step procedure, thanks!
 I'll follow the guide :-)

 Since I was about to provide a bit of context as asked by Tomas, here 
 it is for those who are interested...
 Best regards,
 Daniel

 A bit of the required context...
 I am running all this on my personal PC:  Windows 64b, i7 chip, 16GB ram.
 The PostgreSQL 9.3 cluster is spread over 3X3TB external drives with 
 write caching. Most tables are static (no insert).

 My largest table looks like this...
 Records composed of:  3 bigint, 2 boolean, 1 timestamp and 1 geography
type.
 Number of records: 387013
 Table size: 369GB
 Indexes size: 425GB
   - btree(primary key): 125GB
   - btree(another field): 86GB
   - gist(geography): 241GB


Huh, I hadn't really expected that. Especially on a Windows laptop with
external drives (I assume 7.2k SATA drives connected using USB or maybe
eSATA?). Write cache is the on-drive write cache? Not really a good idea to
leave that enabled (volatile cache, so a risk of data loss or data
corruption).

Also, what do you mean by spread over? Are you using tablespaces or some
sort of RAID?

  Overall, 40% of my table and 30% of indexes do not fit in cache  
(effective_cache_size=10GB) but looking at mostly used tables and  
indexes, more than 90% of what I use doesn't fit.

I don't really understand how you compute the 40% and 30%? You have ~800GB
of data+indexes, and only 16GB of RAM, so that's more like 2% of the
database size. Or do you measure the hit ratios somehow?

 On one hand, according to the documentation 
 (http://www.postgresql.org/docs/9.3/static/runtime-config-query.html),
 with a cache rate like mine, I should probably increase 
 random_page_cost to better reflect the true cost of random storage reads.

I don't follow. Haven't you said in the first post that the database often
chooses sequential scans while index scans are way faster? 
Increasing random_page_cost will only push it further towards sequential
scans, making it worse.

 On the other hand however, I found that...
 (https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server)
 This is not where you should start to search for plan problems.
 Note that random_page_cost is pretty far down this list (at the end in 
 fact). If you are getting bad plans, this shouldn't be the first thing 
 you look at, even though lowering this value may be effective.
 Instead, you should start by making sure autovacuum is working 
 properly, that you are collecting enough statistics, and that you have 
 correctly sized the memory parameters for your server--all the things 
 gone over above. After you've done all those much more important 
 things, ifyou're still getting bad plans then you should see if 
 lowering random_page_cost is still useful.

Well, so maybe you're at the point where tuning random_page_cost is the right
next step ... but sadly you haven't provided any example queries, so it's
hard to say. Can you choose a few queries and run EXPLAIN ANALYZE on them
(and post it to explain.depesz.com, and only put the links here)?


 Please find below some the database config's parameters that might be 
 of interest...

 Best regards,
 Daniel

 General config parameters I have modified temp_buffers = 512MB

Why are you tuning temp_buffers? Shouldn't you tune shared_buffers instead?
I'm not very familiar with Windows, and I vaguely remember issues with
larger shared_buffers values

Re: [GENERAL] Planner cost adjustments

2015-05-29 Thread Daniel Begin
Omg! I was not expecting such a step-by-step procedure, thanks!
I'll follow the guide :-)

Since I was about to provide a bit of context as asked by Tomas, here it is
for those who are interested...
Best regards,
Daniel

A bit of the required context...
I am running all this on my personal PC: Windows 64-bit, i7 chip, 16GB RAM.
The PostgreSQL 9.3 cluster is spread over 3 x 3TB external drives with write
caching. Most tables are static (no inserts).

My largest table looks like this...
Records composed of:  3 bigint, 2 boolean, 1 timestamp and 1 geography type.
Number of records: 387013 
Table size: 369GB 
Indexes size: 425GB
 - btree(primary key): 125GB
 - btree(another field): 86GB
 - gist(geography): 241GB

Overall, 40% of my table and 30% of my indexes do not fit in cache
(effective_cache_size=10GB), but looking at the most-used tables and indexes,
more than 90% of what I use doesn't fit. 

On one hand, according to the documentation
(http://www.postgresql.org/docs/9.3/static/runtime-config-query.html), with
a cache rate like mine, I should probably increase random_page_cost to
better reflect the true cost of random storage reads.

On the other hand however, I found that...
(https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server) 
This is not where you should start to search for plan problems. Note that
random_page_cost is pretty far down this list (at the end in fact). If you
are getting bad plans, this shouldn't be the first thing you look at, even
though lowering this value may be effective. Instead, you should start by
making sure autovacuum is working properly, that you are collecting enough
statistics, and that you have correctly sized the memory parameters for your
server--all the things gone over above. After you've done all those much
more important things, if you're still getting bad plans then you should see
if lowering random_page_cost is still useful. 

Please find below some of the database config parameters that might be of
interest...

Best regards,
Daniel

General config parameters I have modified
temp_buffers = 512MB
work_mem = 16MB
maintenance_work_mem = 256MB
checkpoint_segments = 64
checkpoint_completion_target = 0.8
effective_cache_size = 10GB
logging_collector = on
track_counts = on
autovacuum = on

Here are config parameters related to autovacuum 
 autovacuum_analyze_scale_factor = 0.1  
 autovacuum_analyze_threshold  = 50   
 autovacuum_freeze_max_age   = 2
 autovacuum_max_workers  = 3
 autovacuum_multixact_freeze_max_age = 4
 autovacuum_naptime = 1min 
 autovacuum_vacuum_cost_delay  = 20ms 
 autovacuum_vacuum_cost_limit   = -1   
 autovacuum_vacuum_scale_factor = 0.2  
 autovacuum_vacuum_threshold = 50   
 vacuum_cost_delay  = 0 
vacuum_cost_limit = 200   
vacuum_cost_page_dirty = 20
vacuum_cost_page_hit = 1 
vacuum_cost_page_miss = 10
vacuum_defer_cleanup_age = 0 
vacuum_freeze_min_age = 5000  
vacuum_freeze_table_age = 15000 
vacuum_multixact_freeze_min_age = 500   
vacuum_multixact_freeze_table_age= 15000

-Original Message-
From: PT [mailto:wmo...@potentialtech.com] 
Sent: May-29-15 16:35
To: Daniel Begin
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Planner cost adjustments

On Fri, 29 May 2015 09:39:00 -0400
Daniel Begin jfd...@hotmail.com wrote:

 Hi all,
 
 Running some queries, I found that the planner often selects 
 sequential scan instead of an index scan, even if the latter is way 
 faster (one order of magnitude faster if I consider some tests I made 
 by setting enable_seqscan = ON/OFF). How can I figure out what 
 parameter I should adjust to get the planner select an appropriate plan
that would better consider my DB setup?
 
 I had a look at
 http://www.postgresql.org/docs/9.3/static/runtime-config-query.html 
 but at this point it is too much information for me;-)  Any rules of 
 thumb, recipes I could use to select which parameters I should look at
first?

Here's the correct way to handle this:

1) As mentioned elsewhere, first take the time to ensure that your
   cost estimate settings are reasonable for your hardware.  See
   section 18.7.2:
   http://www.postgresql.org/docs/9.4/static/runtime-config-query.html
2) If #1 doesn't fix it, don't change enable_seqscan. Run a bunch of
   tests on the query(s) to see how well it performs. Then do
   ANALYZE DATABASE $insert_name_here; and run all the tests again.
   If performance/planning improves, then the analyze settings on
   your server aren't aggressive enough. Make changes to related
   config settings to fix.
3) If #2 doesn't uncover the problem, run EXPLAIN ANALYZE on all the
   queries in your test. It takes a bit of understanding to do this
   step, so you'll want to read up a bit and possibly ask questions
   if you have trouble interpreting the output, but you're looking
   for discrepancies between the estimated and actual times for any
   particular table. If you find them

Re: [GENERAL] FW: Constraint exclusion in partitions

2015-05-25 Thread Daniel Begin
Thanks for your patience :-)

- About using pgAdmin: anecdotal problems or not, I did the whole set of tests 
again in plain PostgreSQL.

- About running queries once or not: Bill and Francisco both pointed out that I 
should run each query multiple times to get appropriate statistics. I did it 
for all queries - the first trial is always longer, all others stabilize around 
the same values. The EXPLAIN ANALYSE output for the first and second trials of 
each query I ran on the original table and on the partitioned one can be found 
below.

However, in my case, I will have to run most of my queries only once, since I 
simply need to extract sample data for a research topic - there are no 
inserts/updates in the DB (that is why I thought looking at the first trial was 
more appropriate). 

- About the constraint_exclusion setting: thanks for reminding me of such a 
simple thing that could have caused all this!-) But sadly, it was already set 
to partition, as expected. However, I decided to run all the queries again 
after having set the parameter to ON and restarted the database, just in case.

Even after doing all this, I did not find any improvement in execution times 
between my original fat table and the partitioned version (sometimes even 
worse). If partitioning the table had significantly improved query running 
times, I could have partitioned the tables differently to accommodate other 
query types I will have to run later in my research (I have the same problem 
for half a dozen tables). 

Since it does not seem that partitioning will do the job, I will get back to 
the original table to run my queries...

However, just in case someone knows a magical trick that can significantly 
improve the speed of my queries (but hasn't told me yet!-), here are the 
details about the concerned table/indexes 

(Using https://wiki.postgresql.org/wiki/Index_Maintenance query ...)
Number of records: 387013
Table size: 369GB
Indexes size: 425GB
 - nodes_idversion_pk: 125GB 
 - nodes_changesetid_idx: 86GB
 - nodes_geom_idx: 241GB

Each record has 3 bigint, 2 boolean, 1 timestamp and 1 geography type.

I am running all this on my personal PC: Windows 64-bit, i7 chip, 16GB RAM.
I am using PostgreSQL 9.3, and the database cluster is spread over 2 x 3TB 
external drives with write caching.

Best regards,
Daniel


Results/explain/analyse follow...

--Constant ids-----
--Explain analyse on original table for a query that will look into different 
partitions on the new table
--First attempt;
db=# EXPLAIN ANALYSE SELECT * FROM old_nodes WHERE id 
IN(10005000,105000,205000,305000);
 Index Scan using nodes_idversion_pk on old_nodes  (cost=0.71..17739.18 
rows=6726 width=66) (actual time=52.226..288.700 rows=6 loops=1)
   Index Cond: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
 Total runtime: 288.732 ms
--Second attempt;
db=# EXPLAIN ANALYSE SELECT * FROM old_nodes WHERE id 
IN(10005000,105000,205000,305000);
 Index Scan using nodes_idversion_pk on old_nodes  (cost=0.71..17739.18 
rows=6726 width=66) (actual time=0.014..0.035 rows=6 loops=1)
   Index Cond: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
 Total runtime: 0.056 ms

-- Explain analyse on partitioned table for a query that will look into 
different partitions
--First attempt;
db=# EXPLAIN ANALYSE SELECT * FROM nodes WHERE id 
IN(10005000,105000,205000,305000); 
 Append  (cost=0.00..933.40 rows=223 width=66) (actual time=108.903..287.068 
rows=6 loops=1)
   ->  Seq Scan on nodes  (cost=0.00..0.00 rows=1 width=66) (actual 
time=0.001..0.001 rows=0 loops=1)
 Filter: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
   ->  Index Scan using nodes01_idversion_pk on nodes_01  (cost=0.57..622.78 
rows=156 width=66) (actual time=108.900..108.916 rows=1 loops=1)
 Index Cond: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
   ->  Index Scan using nodes38_idversion_pk on nodes_38  (cost=0.57..138.25 
rows=31 width=66) (actual time=89.523..89.543 rows=1 loops=1)
 Index Cond: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
   ->  Index Scan using nodes63_idversion_pk on nodes_63  (cost=0.57..119.01 
rows=26 width=66) (actual time=49.978..49.998 rows=3 loops=1)
 Index Cond: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
   ->  Index Scan using nodes85_idversion_pk on nodes_85  (cost=0.57..53.37 
rows=9 width=66) (actual time=38.600..38.603 rows=1 loops=1)
 Index Cond: (id = ANY 
('{10005000,105000,205000,305000}'::bigint[]))
 Total runtime: 287.144 ms
--Second attempt;
db=# EXPLAIN ANALYSE SELECT * FROM nodes WHERE id 
IN(10005000,105000,205000,305000); 
 Append  (cost=0.00..933.40 rows=223 width=66) (actual time=0.012..0.065 rows=6 
loops=1)
   ->  Seq 

Re: [GENERAL] FW: Constraint exclusion in partitions

2015-05-23 Thread Daniel Begin
Hello Bill, 
You wrote that my testing methodology is flawed - I hope you are right!  

However, I am a bit confused about your comments. Yes, I did edit the names
of the tables for clarity, but in case I missed the point, I will do it again
here, writing without modifying anything. Here are the procedure I follow and
the results...

I use the pgAdmin III SQL window. I write the following query (I have changed
the id to make sure it does not use previous results still in memory)...

Select * from nodes where id=345678912; -- nodes is the real partitioned
table name

Now I select explain query from the menu and I get the following result...
Append  (cost=0.00..384.08 rows=99 width=66)
  ->  Seq Scan on nodes  (cost=0.00..0.00 rows=1 width=66)
Filter: (id = 345678912)
  ->  Index Scan using nodes19_idversion_pk on nodes_19  (cost=0.56..384.08
rows=98 width=66)
Index Cond: (id = 345678912)

Now, I select run and I get one record as a result and the following
message in history tab...
-- Executing query:
Select * from nodes where id=345678912; 
Total query runtime: 62 ms.
1 row retrieved.

Now, if I use the same query on the original table using the same procedure,
here is what I get...
Select * from old_nodes where id=345678912; -- old_nodes is the real
original table name

Explain gives me the following
Index Scan using nodes_idversion_pk on old_nodes  (cost=0.70..4437.15
rows=1682 width=66)
  Index Cond: (id = 345678912)

Running the query gives me the same record with the following message in
history tab...
-- Executing query:
select * from old_nodes where id=345678912; 
Total query runtime: 62 ms.
1 row retrieved.

This time, the history tab shows that both took the same time to run (an
improvement!?)
Let's try this one using the same procedure...

Select * from old_nodes where id IN
(1005,15,25,35)
Index Scan using nodes_idversion_pk on old_nodes  (cost=0.71..17739.18
rows=6726 width=66)
  Index Cond: (id = ANY
('{1005,15,25,35}'::bigint[]))
-- Executing query:
Select * from old_nodes where id IN
(1005,15,25,35)
Total query runtime: 171 ms.
5 rows retrieved.

Select * from nodes where id IN (1005,15,25,35)
Append  (cost=0.00..933.40 rows=223 width=66)
  ->  Seq Scan on nodes  (cost=0.00..0.00 rows=1 width=66)
Filter: (id = ANY
('{1005,15,25,35}'::bigint[]))
  ->  Index Scan using nodes01_idversion_pk on nodes_01  (cost=0.57..622.78
rows=156 width=66)
Index Cond: (id = ANY
('{1005,15,25,35}'::bigint[]))
  ->  Index Scan using nodes38_idversion_pk on nodes_38  (cost=0.57..138.25
rows=31 width=66)
Index Cond: (id = ANY
('{1005,15,25,35}'::bigint[]))
  ->  Index Scan using nodes63_idversion_pk on nodes_63  (cost=0.57..119.01
rows=26 width=66)
Index Cond: (id = ANY
('{1005,15,25,35}'::bigint[]))
  ->  Index Scan using nodes85_idversion_pk on nodes_85  (cost=0.57..53.37
rows=9 width=66)
Index Cond: (id = ANY
('{1005,15,25,35}'::bigint[]))
-- Executing query:
Select * from nodes where id IN (1005,15,25,35)
Total query runtime: 140 ms.
5 rows retrieved.

This time the history tab shows that the query was faster on the
partitioned table (a real improvement!?)

I know, this is different from what I referred to in my original email (at
least both show similar running times), but I swear, I did not change the
times when editing the table names!-)

Do you see any glitch/flaw in the procedure I am using?
Does anyone have an idea about the problem behind this unexpected behavior? 

I really need to get much faster results with my queries on this large table
and partitioning was my last option...

Best regards,
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Bill Moran
Sent: May-23-15 15:23
To: Daniel Begin
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] FW: Constraint exclusion in partitions


A large portion of what you describe below is the exact opposite of my own
testing (unfortunately, I don't have the actual test results any more
because I did the tests for a former employer).

In my tests, single lookups against the same column being used to partition
improved performance in direct proportion to the number of partitions. I.e.
if the tables are partitioned on id, and the lookup is for id, and the table
has 10 partitions, the query is 10x faster on the partitioned version than
the non-partitioned version.

Queries against indexed columns other than the partition column were slightly
slower in my tests, but nowhere near the degree that you're showing below.

I can't help but think that your testing methodology is flawed, but since
you're not showing us what you actually did, it's difficult to be sure. See
below for some specifics on what I'm

Re: [GENERAL] FW: Constraint exclusion in partitions

2015-05-23 Thread Daniel Begin
I am working with PostgreSQL 9.3 and I understand from the documentation that 
constraint_exclusion is set to “partition” by default. Looking at my 
postgresql.conf file, the concerned line is “#constraint_exclusion = partition”. 
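
For what it is worth, the effective value can be confirmed from any session; a
commented-out line in postgresql.conf simply means the built-in default applies:

SHOW constraint_exclusion;   -- returns 'partition' unless it has been overridden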

 

Furthermore, the execution plan shows that constraint exclusion was used, at 
least for a constant id and a range of ids.

 

Daniel

 

 

 

From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of melvin6925
Sent: May-23-15 15:15
To: Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] FW: Constraint exclusion in partitions

 

Did you remember to set  constraint_exclusion = on and reload the .conf ?

 

 

 

 

Sent via the Samsung Galaxy S® 6, an AT&T 4G LTE smartphone

 Original message 
From: Daniel Begin jfd...@hotmail.com 
Date: 05/23/2015 14:37 (GMT-05:00) 
To: pgsql-general@postgresql.org 
Subject: Re: [GENERAL] FW: Constraint exclusion in partitions 

Following Francisco suggestion, I was able to do some tests earlier this 
morning when the partitioning process completed and all the resulting tables 
analyzed.

Here is what I got on both the original table and its partitioned counterpart 
while running the same queries. I tested them only for a couple of values but 
in summary...

Using a constant id: 
All the queries I tried took longer on the partitioned table! I got similar 
results for multiple records using IN (id value1, id value2 ...)

Using a range of ids: 
Surprisingly again, all the queries I tried took longer on the partitioned 
table!

Using a list of ids from a select clause: 
More surprisingly, the queries I tried took less time on the partitioned table 
at least when using the primary key. Using an indexed field took so long 
compared to the old table that I cancelled the execution for the new one!

Guess what, I will get back to my old fat table unless someone tells me I 
missed something obvious!
Daniel

Note: Tables/indexes description, queries and execution plans are below.






Tables/indexes description 
--
The original table has 387013 records. Primary key/index on each partition 
queries are
ALTER TABLE oldtable ADD CONSTRAINT oldtable_idversion_pk PRIMARY KEY (id, 
version); 
CREATE INDEX oldtable_groupid_idx ON oldtable USING btree (group_id);

The partitioned table has 387013 records distributed over 87 partitions. 
Primary key/index on each partition queries are
ALTER TABLE newtable_xx ADD CONSTRAINT newtablexx_idversion_pk PRIMARY KEY (id, 
version);
CREATE INDEX newtablexx_groupid_idx ON newtable_xx USING btree (group_id);
Where xx is the partition's number suffix

constant id 
---
select * from oldtable where id=123456789; 
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..4437.15 
rows=1682 width=66)
  Index Cond: (id = 123456789::bigint)
--Total query runtime: 62 ms. 1 rows retrieved


select * from newtable where id=123456789; 
Append  (cost=0.00..20.19 rows=5 width=66)
  -  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
Filter: (id = 123456789::bigint)
  -  Index Scan using newtable72_idversion_pk on newtable_72  
(cost=0.56..20.19 rows=4 width=66)
Index Cond: (id = 123456789::bigint)
--Total query runtime: 156 ms. 1 rows retrieved

I got similar results for multiple records...
select * from oldtable where id IN(1000,10,20,30); 
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.71..17739.18 
rows=6726 width=66)
  Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
--Total query runtime: 187 ms. 4 rows retrieved

select * from newtable where id IN(1000,10,20,30); 
Append  (cost=0.00..933.40 rows=223 width=66)
  -  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
Filter: (id = ANY 
('{1000,10,20,30}'::bigint[]))
  -  Index Scan using newtable01_idversion_pk on newtable_01  
(cost=0.57..622.78 rows=156 width=66)
Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
...
  -  Index Scan using newtable85_idversion_pk on newtable_85  
(cost=0.57..53.37 rows=9 width=66)
Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
--Total query runtime: 421 ms. 4 rows retrieved


range of ids 
---
select * from oldtable where id between 152249 and 152349;
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..383.51 
rows=144 width=66)
  Index Cond: ((id = 152249) AND (id = 152349))
Total query runtime: 47 ms. 53 rows retrieved.

select * from newtable where id between 152249 and 152349;
Append  (cost=0.00..408.16 rows=104

Re: [GENERAL] FW: Constraint exclusion in partitions

2015-05-23 Thread Daniel Begin
Oops, I was responding to the email below from melvin6925

 

 

From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of David G. Johnston
Sent: May-23-15 19:32
To: Daniel Begin
Cc: melvin6925; pgsql-general@postgresql.org
Subject: Re: [GENERAL] [NOVICE] Constraint exclusion in partitions

 

On Saturday, May 23, 2015, Daniel Begin jfd...@hotmail.com wrote:

I am working with postgresql 9.3 and I understand from the documentation that 
constraint_exclusion is set to “partition” by default. Looking at my 
postgres.conf file, the concerned line is “#constraint_exclusion = partition”. 

Furthermore, the execution plan shows that constraint_exclusion was used at 
least for constant id and range of ids

What is your question/concern? 

 

 

Did you remember to set  constraint_exclusion = on and reload the .conf ?

 

 

 

From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of melvin6925
Sent: May-23-15 15:15
To: Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] FW: Constraint exclusion in partitions

 

Did you remember to set  constraint_exclusion = on and reload the .conf ?

 

 

 

 

Sent via the Samsung Galaxy S® 6, an AT&T 4G LTE smartphone

 Original message 
From: Daniel Begin jfd...@hotmail.com 
Date: 05/23/2015 14:37 (GMT-05:00) 
To: pgsql-general@postgresql.org 
Subject: Re: [GENERAL] FW: Constraint exclusion in partitions 

Following Francisco suggestion, I was able to do some tests earlier this 
morning when the partitioning process completed and all the resulting tables 
analyzed.

Here is what I got on both the original table and its partitioned counterpart 
while running the same queries. I tested them only for a couple of values but 
in summary...

Using a constant id: 
All the queries I tried took longer on the partitioned table! I got similar 
results for multiple records using IN (id value1, id value2 ...)

Using a range of ids: 
Surprisingly again, all the queries I tried took longer on the partitioned 
table!

Using a list of ids from a select clause: 
More surprisingly, the queries I tried took less time on the partitioned table 
at least when using the primary key. Using an indexed field took so long 
compared to the old table that I cancelled the execution for the new one!

Guess what, I will get back to my old fat table unless someone tells me I 
missed something obvious!
Daniel

Note: Tables/indexes description, queries and execution plans are below.






Tables/indexes description 
--
The original table has 387013 records. Primary key/index on each partition 
queries are
ALTER TABLE oldtable ADD CONSTRAINT oldtable_idversion_pk PRIMARY KEY (id, 
version); 
CREATE INDEX oldtable_groupid_idx ON oldtable USING btree (group_id);

The partitioned table has 387013 records distributed over 87 partitions. 
Primary key/index on each partition queries are
ALTER TABLE newtable_xx ADD CONSTRAINT newtablexx_idversion_pk PRIMARY KEY (id, 
version);
CREATE INDEX newtablexx_groupid_idx ON newtable_xx USING btree (group_id);
Where xx is the partition's number suffix

constant id 
---
select * from oldtable where id=123456789; 
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..4437.15 
rows=1682 width=66)
  Index Cond: (id = 123456789::bigint)
--Total query runtime: 62 ms. 1 rows retrieved


select * from newtable where id=123456789; 
Append  (cost=0.00..20.19 rows=5 width=66)
  -  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
Filter: (id = 123456789::bigint)
  -  Index Scan using newtable72_idversion_pk on newtable_72  
(cost=0.56..20.19 rows=4 width=66)
Index Cond: (id = 123456789::bigint)
--Total query runtime: 156 ms. 1 rows retrieved

I got similar results for multiple records...
select * from oldtable where id IN(1000,10,20,30); 
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.71..17739.18 
rows=6726 width=66)
  Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
--Total query runtime: 187 ms. 4 rows retrieved

select * from newtable where id IN(1000,10,20,30); 
Append  (cost=0.00..933.40 rows=223 width=66)
  -  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
Filter: (id = ANY 
('{1000,10,20,30}'::bigint[]))
  -  Index Scan using newtable01_idversion_pk on newtable_01  
(cost=0.57..622.78 rows=156 width=66)
Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
...
  -  Index Scan using newtable85_idversion_pk on newtable_85  
(cost=0.57..53.37 rows=9 width=66)
Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
--Total query runtime: 421

Re: [GENERAL] FW: Constraint exclusion in partitions

2015-05-23 Thread Daniel Begin
Following Francisco's suggestion, I was able to do some tests earlier this 
morning, once the partitioning process had completed and all the resulting 
tables were analyzed.

Here is what I got on both the original table and its partitioned counterpart 
while running the same queries. I tested them only for a couple of values but 
in summary...

Using a constant id: 
All the queries I tried took longer on the partitioned table! I got similar 
results for multiple records using IN (id value1, id value2 ...)

Using a range of ids: 
Surprisingly again, all the queries I tried took longer on the partitioned 
table!

Using a list of ids from a select clause: 
More surprisingly, the queries I tried took less time on the partitioned table 
at least when using the primary key. Using an indexed field took so long 
compared to the old table that I cancelled the execution for the new one!

Guess what, I will get back to my old fat table unless someone tells me I 
missed something obvious!
Daniel

Note: Tables/indexes description, queries and execution plans are below.






Tables/indexes description 
--
The original table has 387013 records. The primary key/index creation queries 
are:
ALTER TABLE oldtable ADD CONSTRAINT oldtable_idversion_pk PRIMARY KEY (id, 
version); 
CREATE INDEX oldtable_groupid_idx ON oldtable USING btree (group_id);

The partitioned table has 387013 records distributed over 87 partitions. The 
primary key/index creation queries on each partition are:
ALTER TABLE newtable_xx ADD CONSTRAINT newtablexx_idversion_pk PRIMARY KEY (id, 
version);
CREATE INDEX newtablexx_groupid_idx ON newtable_xx USING btree (group_id);
Where xx is the partition's number suffix

constant id 
---
select * from oldtable where id=123456789; 
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..4437.15 
rows=1682 width=66)
  Index Cond: (id = 123456789::bigint)
--Total query runtime: 62 ms. 1 rows retrieved


select * from newtable where id=123456789; 
Append  (cost=0.00..20.19 rows=5 width=66)
  ->  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
        Filter: (id = 123456789::bigint)
  ->  Index Scan using newtable72_idversion_pk on newtable_72  (cost=0.56..20.19 rows=4 width=66)
        Index Cond: (id = 123456789::bigint)
--Total query runtime: 156 ms. 1 rows retrieved

I got similar results for multiple records...
select * from oldtable where id IN(1000,10,20,30); 
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.71..17739.18 
rows=6726 width=66)
  Index Cond: (id = ANY 
('{1000,10,20,30}'::bigint[]))
--Total query runtime: 187 ms. 4 rows retrieved

select * from newtable where id IN(1000,10,20,30); 
Append  (cost=0.00..933.40 rows=223 width=66)
  ->  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
        Filter: (id = ANY ('{1000,10,20,30}'::bigint[]))
  ->  Index Scan using newtable01_idversion_pk on newtable_01  (cost=0.57..622.78 rows=156 width=66)
        Index Cond: (id = ANY ('{1000,10,20,30}'::bigint[]))
...
  ->  Index Scan using newtable85_idversion_pk on newtable_85  (cost=0.57..53.37 rows=9 width=66)
        Index Cond: (id = ANY ('{1000,10,20,30}'::bigint[]))
--Total query runtime: 421 ms. 4 rows retrieved


range of ids 
---
select * from oldtable where id between 152249 and 152349;
Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..383.51 rows=144 width=66)
  Index Cond: ((id >= 152249) AND (id <= 152349))
Total query runtime: 47 ms. 53 rows retrieved.

select * from newtable where id between 152249 and 152349;
Append  (cost=0.00..408.16 rows=104 width=66)
  ->  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)
        Filter: ((id >= 152249) AND (id <= 152349))
  ->  Index Scan using newtable51_idversion_pk on newtable_51  (cost=0.56..183.52 rows=46 width=66)
        Index Cond: ((id >= 152249) AND (id <= 152349))
  ->  Index Scan using newtable52_idversion_pk on newtable_52  (cost=0.56..224.64 rows=57 width=66)
        Index Cond: ((id >= 152249) AND (id <= 152349))
Total query runtime: 78 ms. 53 rows retrieved.


list of ids from a select clause 
---
--Subset provides 4 ids similar but not identical to the previous query
select * from oldtable where id IN (select * from subset); 
Nested Loop  (cost=37.45..886298.00 rows=2028512050 width=66)
  ->  HashAggregate  (cost=36.75..38.75 rows=200 width=8)
        ->  Seq Scan on subset  (cost=0.00..31.40 rows=2140 width=8)
  ->  Index 

[GENERAL] FW: Constraint exclusion in partitions

2015-05-22 Thread Daniel Begin
Sent that on pgsql-novice list but did not get any answers yet. 

Maybe someone could help me understand here :-)

 

 

Hi all, 

 

I have split a large table (billions of records) into multiple partitions, hoping the access would be faster. I used an ID to define the partitions' check constraints (CHECK (id >= 100 AND id < 200), ...) and created over 80 tables (children) that are now filled with data.

 

However, after I did it, I read a second time the following sentence in the documentation and started wondering what it actually means: "Constraint exclusion only works when the query's WHERE clause contains constants (or externally supplied parameters)."

 

I understand that the following query will use constraint exclusion and will
run faster.  

a-  Select * from parent_table where id =; -- using a constant

 

But how would constraint exclusion react with the following queries?

b-  Select * from parent_table where id between 2345 and 6789; -- using
a range of ids

c-   Select * from parent_table where id in(select ids from
anothertable); -- using a list of ids from a select

 

Since I mostly use queries of type b and c, I am wondering if partitioning the large table was appropriate and if the queries are going to take longer to run. 

Thanks in advance

 

Daniel

 

Doc: http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
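
A minimal sketch of the three cases, using the names from the examples above (the constant values are only illustrative). With constraint_exclusion = partition, the default, cases a and b can be pruned at plan time because the planner sees literal constants, while case c cannot, since the list of ids only becomes known when the subquery runs:

SET constraint_exclusion = partition;  -- the default setting

-- a) constant: only the matching child table should appear in the plan
EXPLAIN SELECT * FROM parent_table WHERE id = 150;

-- b) range of constants: only children whose CHECK range overlaps are scanned
EXPLAIN SELECT * FROM parent_table WHERE id BETWEEN 2345 AND 6789;

-- c) ids coming from a subquery: every child table is scanned
EXPLAIN SELECT * FROM parent_table WHERE id IN (SELECT ids FROM anothertable);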



Re: [GENERAL] Restarting DB after moving to another drive

2015-05-16 Thread Daniel Begin
A follow-up...

As expected, I was able to copy the database cluster to the new drive during the night. I changed the drive letter to match the original database drive and restarted the DB. Everything is now running on the new drive and I have been able to recreate the tablespaces.

However, queries on some tables seem to take longer than expected.
- Could copying the tables and indexes have had an effect on the indexes?
- How can I verify that some of the indexes were not corrupted?

Daniel
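
Two low-tech checks that may help with the corruption question, sketched with hypothetical table and index names: compare counts obtained through the index and through a forced sequential scan, and simply rebuild any index that looks suspect (REINDEX locks the table and can take a long time on large tables):

-- count through the index, then through a sequential scan, and compare
SET enable_seqscan = off;
SELECT count(*) FROM mytable WHERE group_id = 42;
RESET enable_seqscan;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT count(*) FROM mytable WHERE group_id = 42;
RESET ALL;

-- rebuild a suspect index from the table data
REINDEX INDEX mytable_groupid_idx;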


-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Daniel Begin
Sent: May-15-15 15:17
To: 'Francisco Olarte'
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Thanks for that comprehensive response!

And you are right about practicing restore, I never had to :-) 

However, I use pg_dump on a regular basis (custom format) but I did not know 
the difference between database/database cluster (and pg_dumpall) until I had 
to move everything because the PGDATA drive started overheating. 

Now that I better understand what is going on with backup/restore processes, 
and considering...
- The time it is going to  take to rebuild the whole cluster ;
- That I am the only user of the database;
- That everything was just fine with the database, except the temperature of 
the drive
- And considering that the initial concern of this thread was about a bad copy of symbolic links with Windows 

I will make another attempt to copy everything on another drive from windows, 
unless someone tells me it is not possible.
- I will move my external tablespaces content back to pgdata and drop them for 
the time I copy the db to the new drive. 
- Doing so, I will get rid of the symbolic link (from tablespaces) from where 
originated the initial error message
- Without symbolic links, I should be able to copy the db using standard 
windows commands.
- Setting up the new drive's letter to the old one before restarting the db is easy
- The whole process should take 12 hours instead of a week.

Hoping it makes sense and that I have not missed something important (again).
Thanks for your patience :-)
Daniel


-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: May-15-15 12:20
To: Daniel Begin
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

HI Daniel:

On Fri, May 15, 2015 at 5:35 PM, Daniel Begin jfd...@hotmail.com wrote:
 Bonjour Francisco.
Buenos dias.

 Skimming the documentation sequentially is a cleaver advice, 
 especially since the doc is much of the time well done and exhaustive.
 Unfortunately, even if I actually did it about 1 year ago, it seems 
 this specific item slipped out of my mind :-(

Bad luck, you managed to get one of the most important commands, but it's 
solved now.

 About dump/restore operation, restoring the database cluster is running for 
 24hrs now (psql -f pgdumpallOutputfile  postgres). Since it took 13hrs to 
 dump the cluster, I begin to wonder how long it is going to take to restore 
 it...

Much longer than this, especially if as you say below you have a lot of 
indexes. It's one of the reasons many of us do not use pg_dumpall for anything 
but global objects, but use something like  the crontab which John R. Pierce 
posted ( I use a similar thing, but with an intermediate script with dumps 
critical databases more frequently, skips recreatable ( may be someone can 
confirm if that word is right ?
I mean test things which can be created from scratch ( as they come from a 
script ) ) databases and keeps several numbered copies ). Doing it this way ensures we can restore in criticality order if it needs to be done ( I even move unmodified partitions to a 'historic' schema, which gets dumped only after a change, which cuts my backup times to a tenth )

One thing. I do not know how you are restoring the database, but when doing 
these things we use a specially tuned postgresql.conf ( fsync off, minimal logging, lots of work_mem and similar things, as we do not care about durability ( you can just rerun initdb and redo the restore, and there is only 1 session connected, the restoring one ).
This cuts the restore times easily to a tenth, then after ending it we restart the server with the normal config. It is a must when doing this sort of thing.


 My main concern is about how the indexes are managed in dump/restore 
 operations. I understand that pg_dumpall actually uses pg_dump where the doc 
 says Post-data items include definitions of indexes, triggers... I would 
 not worry if the doc said that indexes are simply copied but it says 
 includes definition of indexes.
 Since some of the indexes took days to build... does someone could confirm 
 indexes are rebuilt instead of copied?
 If indexes are actually rebuilt

Re: [GENERAL] Restarting DB after moving to another drive

2015-05-15 Thread Daniel Begin
Bonjour Francisco.

Skimming the documentation sequentially is clever advice, especially since 
the doc is much of the time well done and exhaustive. Unfortunately, even if I 
actually did it about 1 year ago, it seems this specific item slipped out of my 
mind :-(

About dump/restore operation, restoring the database cluster is running for 
24hrs now (psql -f pgdumpallOutputfile  postgres). Since it took 13hrs to dump 
the cluster, I begin to wonder how long it is going to take to restore it... 
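
As an aside, a common alternative to restoring one big pg_dumpall script is to dump the globals separately and each database in custom format, which pg_restore can then reload with several parallel jobs; a rough sketch with placeholder names:

pg_dumpall --globals-only > globals.sql
pg_dump -Fc -f mydb.dump mydb

psql -f globals.sql postgres
createdb mydb
pg_restore -d mydb -j 4 mydb.dump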

My main concern is about how the indexes are managed in dump/restore operations. I understand that pg_dumpall actually uses pg_dump, where the doc says "Post-data items include definitions of indexes, triggers...". I would not worry if the doc said that indexes are simply copied, but it says it includes the definition of indexes.

Since some of the indexes took days to build... could someone confirm that indexes are rebuilt instead of copied?
If indexes are actually rebuilt, why should it be done that way? - There must be a good reason!

Best regards,
Daniel


-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: May-15-15 05:12
To: Daniel Begin
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Hi Daniel:

On Wed, May 13, 2015 at 8:06 PM, Daniel Begin jfd...@hotmail.com wrote:
...
 - I still have a lot to learn on database management (it was simpler 
 on user's side!-)

Yep, we all do, even if we've been using it since it was called Postgres.

 Fortunately, I have found that pg_dumpall could do the job (I did not have a 
 problem with it, I just did not know about it!-).

If you didn't know about it I'll urge you to take the manual and do a 
sequential reading ( maybe not in full, this would take a long time, but at 
least skim through all of it sequentially, it's full of very interesting info 
and it's very useful and when you hit a problem you'll probably know there is 
something for it and search for it ).
For me the manual is one of the major points for using pg . pg_dumpall is a 
fundamental tool for backups, as it's the only one that dumps the global 
objects.

Good luck.

Francisco Olarte.




Re: [GENERAL] Restarting DB after moving to another drive

2015-05-15 Thread Daniel Begin
Thanks for that comprehensive response!

And you are right about practicing restore, I never had to :-) 

However, I use pg_dump on a regular basis (custom format) but I did not know 
the difference between database/database cluster (and pg_dumpall) until I had 
to move everything because the PGDATA drive started overheating. 

Now that I better understand what is going on with backup/restore processes, 
and considering...
- The time it is going to  take to rebuild the whole cluster ;
- That I am the only user of the database;
- That everything was just fine with the database, except the temperature of 
the drive
- And considering that the initial concern of this thread was about a bad copy of symbolic links with Windows 

I will make another attempt to copy everything on another drive from windows, 
unless someone tells me it is not possible.
- I will move my external tablespaces content back to pgdata and drop them for 
the time I copy the db to the new drive. 
- Doing so, I will get rid of the symbolic link (from tablespaces) from where 
originated the initial error message
- Without symbolic links, I should be able to copy the db using standard 
windows commands.
- Setting up the new drive's letter to the old one before restarting the db is 
easy 
- The whole process should take 12 hours instead of a week.

Hoping it makes sense and that I have not missed something important (again)
Thanks for your patience :-)
Daniel


-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: May-15-15 12:20
To: Daniel Begin
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

HI Daniel:

On Fri, May 15, 2015 at 5:35 PM, Daniel Begin jfd...@hotmail.com wrote:
 Bonjour Francisco.
Buenos dias.

 Skimming the documentation sequentially is a cleaver advice, 
 especially since the doc is much of the time well done and exhaustive. 
 Unfortunately, even if I actually did it about 1 year ago, it seems 
 this specific item slipped out of my mind :-(

Bad luck, you managed to get one of the most important commands, but it's 
solved now.

 About dump/restore operation, restoring the database cluster is running for 
 24hrs now (psql -f pgdumpallOutputfile  postgres). Since it took 13hrs to 
 dump the cluster, I begin to wonder how long it is going to take to restore 
 it...

Much longer than this, especially if as you say below you have a lot of 
indexes. It's one of the reasons many of us do not use pg_dumpall for anything 
but global objects, but use something like  the crontab which John R. Pierce 
posted ( I use a similar thing, but with an intermediate script with dumps 
critical databases more frequently, skips recreatable ( may be someone can 
confirm if that word is right ?
I mean test things which can be created from scratch ( as they come from a 
script ) ) databases and keeps several numbered copies ). Doing it this way ensures we can restore in criticality order if it needs to be done ( I even move unmodified partitions to a 'historic' schema, which gets dumped only after a change, which cuts my backup times to a tenth )

One thing. I do not know how you are restoring the database, but when doing 
these things we use a specially tuned postgresql.conf ( fsync off, minimal logging, lots of work_mem and similar things, as we do not care about durability ( you can just rerun initdb and redo the restore, and there is only 1 session connected, the restoring one ).
This cuts the restore times easily to a tenth, then after ending it we restart the server with the normal config. It is a must when doing this sort of thing.
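
A sketch of what such a restore-only postgresql.conf could contain on 9.3 (values are only illustrative; these settings trade durability for speed and must be reverted once the restore is done):

fsync = off
synchronous_commit = off
full_page_writes = off
wal_buffers = 16MB
checkpoint_segments = 64       # 9.3-era setting, replaced by max_wal_size in 9.5+
maintenance_work_mem = 2GB     # speeds up the index/constraint rebuild phase
autovacuum = off               # re-enable and ANALYZE afterwards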


 My main concern is about how the indexes are managed in dump/restore 
 operations. I understand that pg_dumpall actually uses pg_dump where the doc 
 says Post-data items include definitions of indexes, triggers... I would 
 not worry if the doc said that indexes are simply copied but it says 
 includes definition of indexes.
 Since some of the indexes took days to build... does someone could confirm 
 indexes are rebuilt instead of copied?
 If indexes are actually rebuilt, why should it be done that way? - There must 
 be good reason!

You are out of luck, and it has a reason. First, pg_dump does not copy, it dumps. Its simpler behaviour ( text output ) just outputs a SQL script which recreates everything and inserts all the data ( normally using copy for speed, but it's the same as inserting ). It takes care of generating a fast script ( meaning it creates the tables, then inserts the data, then creates indexes and reactivates constraints, which is faster than defining everything and inserting with indexes and constraints active ).
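
One way to see this for yourself is to extract the post-data section of a schema-only dump; it is essentially a list of CREATE INDEX and ALTER TABLE ... ADD CONSTRAINT statements, which is why the restore rebuilds every index from the freshly loaded data rather than copying index files (mydb is a placeholder):

pg_dump --schema-only --section=post-data mydb > post_data.sql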

The reason to do it in text mode is you can dump between different version 
and/or architectures, and also the dump is much smaller than the db, specially 
if you compress it ( I always do it, testing a bit you can

Re: [GENERAL] Restarting DB after moving to another drive

2015-05-13 Thread Daniel Begin
Thanks John,
I was naive hoping I just had to run $ psql -f pgdumpall.output postgres...
but it makes sense since the drive on which it is all stored won't be there 
anymore.

So let's read initdb documentation :-)

Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of John R Pierce
Sent: May-13-15 14:16
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

On 5/13/2015 11:06 AM, Daniel Begin wrote:
 I am then currently running pg_dumpall on the database. I will restore the 
 cluster on the new drive once completed. However, there is still something 
 obscure in the process. The doc says pg_dumpall requires all needed 
 tablespace directories to exist before the restore. External tablespaces 
 directories are easy to create but what's about pg_default and pg_global  
 tablespace since I never created specific tablespaces for them?

those are created when you initdb the target cluster prior to restoring the 
pg_dumpall output.



-- 
john r pierce, recycling bits in santa cruz





Re: [GENERAL] Restarting DB after moving to another drive

2015-05-13 Thread Daniel Begin
Thanks Francisco, 
After having poked around for a while to move it with Windows, I think I've 
finally understood a couple of things...

- Since my old drive contains the PGDATA folder, I actually need to move the 
whole PostgreSQL cluster (almost).
- Windows backup/copy/xcopy failed to do the job, even using options that 
should have copied symbolic links properly.
- I am not comfortable about creating the symbolic links manually because the 
links created by PostgreSQL...
   a) Actually seem to be junctions instead of symbolic links as stated
   b) Use target folders that look like [\??\M:\pgsqlData] instead of 
[M:\pgsqlData] as I was expected
   c) The impacts of not creating them properly are not clear to me
- I still have a lot to learn on database management (it was simpler on user's 
side!-)

Fortunately, I have found that pg_dumpall could do the job (I did not have a 
problem with it, I just did not know about it!-). 

I am then currently running pg_dumpall on the database. I will restore the 
cluster on the new drive once completed. However, there is still something 
obscure in the process. The doc says pg_dumpall requires all needed tablespace 
directories to exist before the restore. External tablespaces directories are 
easy to create, but what about the pg_default and pg_global tablespaces since I never created specific tablespaces for them? 

Thanks for your patience
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: May-13-15 06:57
To: Daniel Begin
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Hi Daniel:

On Mon, May 11, 2015 at 5:30 PM, Daniel Begin jfd...@hotmail.com wrote:
 How big? According to PgAdmin my personal database is about 2TB...

I do not know what pgAdmin reports, I was thinking of how big a dump ( using -Fc, which compresses on the fly ) is, to see whether you can restore it.

 How critical? Well, about a year of work!-)

Well, my fault. I know its valuable, or you wouldn't be bothering with all of 
this. By critical I meant if you can take it offline for the time needed to do 
a dump/restore, but I suppose after all this time you have a problem with this 
approach.

 Francisco wrote: just did a stop/cp/change pgdata /restart, I suppose 
 windows must have comparable ways
 This is what I have just tried when I got “Could not read symbolic link 
 “pg_tblspc/100589”: Invalid argument”

I do not know windows, but I suppose you must check this error, I mean, check 
the link in the original and the copy and see if they look right ( ls -l does 
it in linux, I suppose the windows ls or whatever tool you use to list a 
directory in windows can do it too ).

 Considering both drives are identical, could an image backup have done the 
 job properly instead of a plane copy?

I know the builtin windows copy ( in cmd ) was not good copying complex setups. 
cp for windows did it in my times, but as I say it was a long time ago and 
windows is too complex for me. I also do not know what an image backup is. In 
Linux I've done a sector-by-sector copy ( only on nearly full disks, otherwise 
cp is faster ) to a bigger disks and the OS didn't notice after the swap ( and 
neither Postgres ). On identical disks, a sector copy should work, but I 
wouldn't be surprised if windows kept some kind of info and need some other 
adjustement.

Anyway, what I mean is a normal copy should suffice, but windows is complex and 
a normal copy is very difficult to make ( at least for me ). But the error says 
it is getting a problem with a readlink. The first thing should be checking it, 
it is simple in Linux, you just do ls -l on both of them and you are done, you even have a readlink program to use in scripts, I assume windows has a similar command, just check it. Given the error I would bet for an EINVAL which 
normally is due to the named file ( pg_tblspc/100589 ) not being a symlink, the 
windows equivalent to ls -l should tell you that ( my thought is somehow it is 
a directory, or a plain file, but you should be able to find it easily ). I 
cannot tell you more, also bear in mind I abandoned windows in 2001 an use 
pgadmin only for seeing queries in a grid, I do all my admin tasks with 
psql/pg_dump/pg_restore, so I cannot tell you zilch about windows or pgadmin 
specific stuff.

Regards.
Francisco Olarte.




[GENERAL] Restarting DB after moving to another drive

2015-05-11 Thread Daniel Begin
I am working on Windows and I had to move my database to another hard drive after the original one started overheating. In order to move the DB I did the following.

 

-Stop the postgresql-x64-9.3 service - and wait until there was no more system access to the original drive

-Copy the entire content of the original drive to the new one (the drive is
dedicated to the DB)

-Turn off the original hard drive and reassign the old drive letter to the
new one

-Restart the DB

 

I tried to connect to the database by using PgAdmin III and I got the
following error message: 

Could not read symbolic link pg_tblspc/100589: Invalid argument

 

I concluded something went wrong and I decided to get back to the old drive.
I stopped the DB, turned off the new drive, turned on the old one and tried
to restart the postgresql service but it does not start anymore. The only
message I get is:

Postgresql-x64-9.3 service on local computer started and then stopped. Some
services stop automatically if they are not in use by other services or
programs

 

I am the only user of the database 

 

Hoping someone can help

Daniel



Re: [GENERAL] Restarting DB after moving to another drive

2015-05-11 Thread Daniel Begin
I just got it back running with the old drive - it was some hidden Windows behavior!

However, could someone tell me what went wrong with the procedure I
used to move the DB? 
And/or what procedure I should have used in order to get it right?

Daniel

-Original Message-
From: Raymond O'Donnell [mailto:r...@iol.ie] 
Sent: May-11-15 07:50
To: Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

On 11/05/2015 12:03, Daniel Begin wrote:
 I am working on windows and I had to move my database on another hard 
 drive after the original one started overheating. In order to move the 
 DB I did the following.
 
  
 
 -Stop postgresql-x64-9.3 service - and wait until there were no more 
 system access to on the original drive
 
 -Copy the entire content of the original drive to the new one (the 
 drive is dedicated to the DB)
 
 -Turn off the original hard drive and reassign the old drive letter to 
 the new one
 
 -Restart the DB
 
  
 
 I tried to connect to the database by using PgAdmin III and I got the 
 following error message:
 
 Could not read symbolic link pg_tblspc/100589: Invalid argument
 
  
 
 I concluded something went wrong and I decided to get back to the old 
 drive. I stopped the DB, turned off the new drive, turned on the old 
 one and tried to restart the postgresql service but it does not start 
 anymore. The only message I get is:
 
 Postgresql-x64-9.3 service on local computer started and then stopped.
 Some services stop automatically if they are not in use by other 
 services or programs

Hi there,

Sounds like you're on Windows - you can get more information from
PostgreSQL's own logs, which by default on Windows are in a directory called
pg_log under the data directory.

Ray.

--
Raymond O'Donnell :: Galway :: Ireland
r...@iol.ie





Re: [GENERAL] Restarting DB after moving to another drive

2015-05-11 Thread Daniel Begin
Thanks for the link! 

Just to make sure I understand properly...
When I installed PostgreSQL, I set $PGDATA to point to my old drive and I must now move everything to the new one. 

In order to move everything on the new drive I must create a tablespace on
the new drive and then explicitly define this tablespace as the new location
of...

-  the postgres database
-  my personal database
- pg_default
- pg_global

Anything I missed or put in the wrong sequence?

Regards,
Daniel
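
For the object-moving part of the list above, a minimal SQL sketch (tablespace name and path are only illustrative). Note that pg_global cannot be relocated this way - it always lives inside the data directory - so moving an entire cluster is usually done by moving PGDATA itself or by dump/initdb/restore:

-- run while connected to a database other than the one being moved
CREATE TABLESPACE newdrive LOCATION 'E:/pgsqlData';
ALTER DATABASE mydb SET TABLESPACE newdrive;
ALTER DATABASE postgres SET TABLESPACE newdrive;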

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Raymond O'Donnell
Sent: May-11-15 09:19
To: Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

On 11/05/2015 13:38, Daniel Begin wrote:
 I just get it back running with the old drive - was some Windows 
 hidden behavior!
 
 However, does someone could tell me what went wrong with the procedure 
 I used to move the DB?
 And/or what procedure I should have used in order to get it right?

I've never done it myself, but I understand that one way to do it is with
tablespaces:

  http://www.postgresql.org/docs/9.4/static/manage-ag-tablespaces.html

As I understand it, you create a new tablespace on the new disk, then move
your database objects to it.

How that plays with Windows, I've no idea... I'd guess that changing the
drive letter might well cause Bad Things to happen.

Ray.

--
Raymond O'Donnell :: Galway :: Ireland
r...@iol.ie




Re: [GENERAL] Restarting DB after moving to another drive

2015-05-11 Thread Daniel Begin
Francisco wrote: How big/critical is your database?

How big? According to PgAdmin my personal database is about 2TB...
How critical? Well, about a year of work!-)

Francisco wrote: just did a stop/cp/change pgdata /restart, I suppose windows 
must have comparable ways
This is what I have just tried when I got “Could not read symbolic link 
“pg_tblspc/100589”: Invalid argument” 

Considering both drives are identical, could an image backup have done the job 
properly instead of a plain copy?
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: May-11-15 11:01
To: Daniel Begin
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Hi Daniel.

On Mon, May 11, 2015 at 4:42 PM, Daniel Begin jfd...@hotmail.com wrote:
 Just to make sure I understand properly...
 When I installed Postgresql, I set $PGDATA to point on my old drive 
 and I must now move everything on the new one.
 In order to move everything on the new drive I must create a 
 tablespace on the new drive and then explicitly define this tablespace 
 as the new location of...
 -  the postgres database
 -  my personal database
 - pg_default
 - pg_global
 Anything I missed or put in the wrong sequence?

I do not think it is that easy. You can move nearly everything, but it will be 
slow and you are not going to get rid of the old disk.

One question, ¿ How big/critical is your database ? because all your problems 
can be solved with a dump/initdb/restore easily, and if you can leave it doing 
overnight it is the easier way. Also, I've moved directories across disks 
before, and it has worked, but I use linux which is much simpler, and just did 
an stop / cp / change pgdata / restart, I suppose windows must have comparable 
ways.

Francisco Olarte.




Re: [GENERAL] Restarting DB after moving to another drive

2015-05-11 Thread Daniel Begin
Interesting, 

The symbolic links on the old drive -still used by the DB- look like
windows' shortcuts to parent folder, while they are empty folders in the
copy of the database I have on the new drive...

When I do a plain copy of those links on another drive I also get the same
empty folders.  I am getting closer to the initial problem, and closer to
the solution!

Maybe an image backup of the drive would have done the job properly instead
of using a plain copy?
Daniel


-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Marc Mamin
Sent: May-11-15 13:10
To: Daniel Begin; 'Francisco Olarte'
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Hi,
have you checked that the links in $PGDATA\pg_tblspc on the new drive are
valid ?
They possibly still point to the old drive.
I guess you have to correct them by hand before starting the moved DB.

regards,

Marc Mamin
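
On Windows those tablespace links are NTFS junctions, so they can be listed and, if needed, recreated from a command prompt; the paths below are illustrative, 100589 being the entry named in the error message (stop the service before touching them):

rem list the junctions and their targets (each entry shows as <JUNCTION>)
dir /aL "E:\pgdata\pg_tblspc"

rem recreate one by hand if the copy lost it
mklink /J "E:\pgdata\pg_tblspc\100589" "F:\pgsqlTablespace"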

From: pgsql-general-ow...@postgresql.org [pgsql-general-ow...@postgresql.org] on behalf of Daniel Begin [jfd...@hotmail.com]
Sent: Monday, 11 May 2015 17:30
To: 'Francisco Olarte'
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Francisco wrote: How big/critical is your database?

How big? According to PgAdmin my personal database is about 2TB...
How critical? Well, about a year of work!-)

Francisco wrote: just did a stop/cp/change pgdata /restart, I suppose
windows must have comparable ways
This is what I have just tried when I got “Could not read symbolic link
“pg_tblspc/100589”: Invalid argument”

Considering both drives are identical, could an image backup have done the
job properly instead of a plane copy?
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Francisco Olarte
Sent: May-11-15 11:01
To: Daniel Begin
Cc: r...@iol.ie; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Restarting DB after moving to another drive

Hi Daniel.

On Mon, May 11, 2015 at 4:42 PM, Daniel Begin jfd...@hotmail.com wrote:
 Just to make sure I understand properly...
 When I installed Postgresql, I set $PGDATA to point on my old drive 
 and I must now move everything on the new one.
 In order to move everything on the new drive I must create a 
 tablespace on the new drive and then explicitly define this tablespace 
 as the new location of...
 -  the postgres database
 -  my personal database
 - pg_default
 - pg_global
 Anything I missed or put in the wrong sequence?

I do not think it is that easy. You can move nearly everything, but it will
be slow an you are not going to get rid of the old disk.

One question, ¿ How big/critical is your database ? because all your
problems can be solved with a dump/initdb/restore easily, and if you can
leave it doing overnight it is th easier way. Also, I've moved directories
across disks before, and it has worked, but I use linux which is much
simpler, and just did an stop / cp / change pgdata / restart, I suppose
windows must have comparable ways.

Francisco Olarte.




Re: [GENERAL] Indexing large table of coordinates with GiST

2015-01-16 Thread Daniel Begin
Nathan,
I have to verify a few things first but it might be possible to proceed as you suggest. I will also dig a bit into Paul's suggestion on geohashing. I should get back to you once it's in place.

Thanks all
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Nathan Clayton
Sent: January-15-15 19:19
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Indexing large table of coordinates with GiST


On 1/15/2015 12:36 PM, Daniel Begin wrote:

 Thank, there is a lot of potential ways to resolve this problem!

 For Rob, here is a bit of context concerning my IT environment…

 Windows 7 64b Desktop, running with an Intel i7 core and 16GB ram. The 
 PostgreSQL 9.3 database is stored on a 3TB external drive (USB-3 
 connection with write cache enabled and backup battery) and a 
 temp_tablespaces is pointing to a 1TB internal drive.

 Now, let me answered/questioned given proposals in the order I 
 received them…

 1-Andy, I will set maintenance_work_mem as large as I can unless 
 someone points to an important caveat.

 2-Vick, partitioning the table could have been very interesting. 
 However, I will have to query the table using both the node ID (which 
 could have provided a nice partition criterion) and/or the node 
 location (find nodes within a polygon). I am not familiar with table 
 partition but I suspect I can’t create a spatial index on a table that 
 have been partitioned (split into multiple tables that inherit from 
 the “master table). Am I right?

 3-Rémi, so many rows does not necessarily mean either raster or points 
 cloud (but it’s worth asking!-).  As I mentioned previously, I must be 
 able to query the table not only using nodes location (coordinates) 
 but also using the few other fields the table contains (but mainly 
 node IDs). So, I don’t think it could work, unless you tell me otherwise?

 4-Paul, the nodes distribution is all over the world but mainly over 
 inhabited areas. However, if I had to define a limit of some sort, I 
 would use the dateline.  Concerning spatial queries, I will want to 
 find nodes that are within the boundary of irregular polygons (stored 
 in another table). Is querying on irregular polygons is compatible 
 with geohashing?

 Regards,

 Daniel


Provided you have an integer primary key on both your node tables and polygon 
tables, would it make sense to preprocess the overlaps and have a many-to-many 
table with the node-id and polygon-id? Depending on the speed in which data is 
ingested, you could easily build triggers to run after inserts/updates to keep 
the table updated, or you could create a globally unique autoincrement field 
that tracks revisions and update everything after a given high-water mark.

Lookups and joins would be using integers and should give you much better 
performance than searching through the polygons.

For the many-to-many table, something like (you can obviously parse it out into 
smaller batches on the insert if you need to so you don't blow up your memory 
usage. If needed you can have two tables partitioned on either the node-id or 
the polygon-id to speed up lookups, as this table has the potential to carry 
many times the records in either table - worst case would be a cartesian join 
if all nodes fall within all polygons):

create table node_polygon (
   node_id bigint not null,
   polygon_id bigint not null,
   primary key (node_id, polygon_id)
);

insert into node_polygon (node_id, polygon_id) select
   node_id,
   polygon_id
from
   node
   inner join polygon
 on node.point @ polygon.shape;

create index ix_node_polygon_polygon on node_polygon (polygon_id);




Re: [GENERAL] Indexing large table of coordinates with GiST

2015-01-16 Thread Daniel Begin
Nathan, and all others,

I already have links between each node and each polygon at an initial state. 
Creating a many-to-many table (nodes_polygons) and indexing it should be easy 
at this point.  However, while nodes are expected to remain static, new irregular polygons will be added without my being able to preprocess the nodes.

Instead of querying the nodes table with these new polygons, I might rather query the polygons table (which is much smaller and already has a GiST index) to find polygons that intersect a new one. From there, I can easily get the related subset of nodes, which will be many orders of magnitude smaller - in which case spatial indexing shouldn't be necessary. 

Once done for a new polygon, I could then update the polygon and nodes_polygons 
tables.

Am I missing something?

Daniel
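
A minimal sketch of that polygon-first lookup, assuming PostGIS and the names used earlier in the thread (polygon.shape, node_polygon.node_id/polygon_id); the new_polygon table and the polygon_id key on polygon are assumptions:

-- use the small, GiST-indexed polygon table to find polygons touching the new one,
-- then pull only the nodes already linked to them (a much smaller candidate set)
SELECT DISTINCT np.node_id
FROM polygon p
JOIN node_polygon np ON np.polygon_id = p.polygon_id
WHERE ST_Intersects(p.shape, (SELECT shape FROM new_polygon LIMIT 1));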



-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Nathan Clayton
Sent: January-15-15 19:19
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Indexing large table of coordinates with GiST


On 1/15/2015 12:36 PM, Daniel Begin wrote:

 Thank, there is a lot of potential ways to resolve this problem!

 For Rob, here is a bit of context concerning my IT environment…

 Windows 7 64b Desktop, running with an Intel i7 core and 16GB ram. The 
 PostgreSQL 9.3 database is stored on a 3TB external drive (USB-3 
 connection with write cache enabled and backup battery) and a 
 temp_tablespaces is pointing to a 1TB internal drive.

 Now, let me answered/questioned given proposals in the order I 
 received them…

 1-Andy, I will set maintenance_work_mem as large as I can unless 
 someone points to an important caveat.

 2-Vick, partitioning the table could have been very interesting. 
 However, I will have to query the table using both the node ID (which 
 could have provided a nice partition criterion) and/or the node 
 location (find nodes within a polygon). I am not familiar with table 
 partition but I suspect I can’t create a spatial index on a table that 
 have been partitioned (split into multiple tables that inherit from 
 the “master table). Am I right?

 3-Rémi, so many rows does not necessarily mean either raster or points 
 cloud (but it’s worth asking!-).  As I mentioned previously, I must be 
 able to query the table not only using nodes location (coordinates) 
 but also using the few other fields the table contains (but mainly 
 node IDs). So, I don’t think it could work, unless you tell me otherwise?

 4-Paul, the nodes distribution is all over the world but mainly over 
 inhabited areas. However, if I had to define a limit of some sort, I 
 would use the dateline.  Concerning spatial queries, I will want to 
 find nodes that are within the boundary of irregular polygons (stored 
 in another table). Is querying on irregular polygons is compatible 
 with geohashing?

 Regards,

 Daniel


Provided you have an integer primary key on both your node tables and polygon 
tables, would it make sense to preprocess the overlaps and have a many-to-many 
table with the node-id and polygon-id? Depending on the speed in which data is 
ingested, you could easily build triggers to run after inserts/updates to keep 
the table updated, or you could create a globally unique autoincrement field 
that tracks revisions and update everything after a given high-water mark.

Lookups and joins would be using integers and should give you much better 
performance than searching through the polygons.

For the many-to-many table, something like (you can obviously parse it out into 
smaller batches on the insert if you need to so you don't blow up your memory 
usage. If needed you can have two tables partitioned on either the node-id or 
the polygon-id to speed up lookups, as this table has the potential to carry 
many times the records in either table - worst case would be a cartesian join 
if all nodes fall within all polygons):

create table node_polygon (
   node_id bigint not null,
   polygon_id bigint not null,
   primary key (node_id, polygon_id)
);

insert into node_polygon (node_id, polygon_id) select
   node_id,
   polygon_id
from
   node
   inner join polygon
 on node.point @ polygon.shape;

create index ix_node_polygon_polygon on node_polygon (polygon_id);




Re: [GENERAL] Indexing large table of coordinates with GiST

2015-01-15 Thread Daniel Begin
Thanks, there are a lot of potential ways to resolve this problem! 

 

For Rob, here is a bit of context concerning my IT environment…

Windows 7 64b Desktop, running with an Intel i7 core and 16GB ram. The 
PostgreSQL 9.3 database is stored on a 3TB external drive (USB-3 connection 
with write cache enabled and backup battery) and a temp_tablespaces is pointing 
to a 1TB internal drive.

 

Now, let me answer/question the given proposals in the order I received them…

 

1-  Andy, I will set maintenance_work_mem as large as I can unless someone 
points to an important caveat.

2-  Vick, partitioning the table could have been very interesting. However, 
I will have to query the table using both the node ID (which could have 
provided a nice partition criterion) and/or the node location (find nodes 
within a polygon). I am not familiar with table partitioning but I suspect I can’t create a spatial index on a table that has been partitioned (split into multiple tables that inherit from the “master” table). Am I right?

3-  Rémi, so many rows does not necessarily mean either raster or points 
cloud (but it’s worth asking!-).  As I mentioned previously, I must be able to 
query the table not only using nodes location (coordinates) but also using the 
few other fields the table contains (but mainly node IDs). So, I don’t think it 
could work, unless you tell me otherwise?

4-  Paul, the nodes distribution is all over the world but mainly over 
inhabited areas. However, if I had to define a limit of some sort, I would use 
the dateline.  Concerning spatial queries, I will want to find nodes that are 
within the boundary of irregular polygons (stored in another table). Is 
querying on irregular polygons compatible with geohashing? 

 

Regards,

Daniel

 

__

On Thu, Jan 15, 2015 at 7:44 AM, Daniel Begin jfd...@hotmail.com wrote:

Hi, I'm trying to create an index on coordinates (geography type) over a
large table (4.5 billion records) using GiST...

CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);

The command ran for 5 days until my computer stops because a power outage!
Before restarting the index creation, I am asking the community if there are
ways to shorten the time it took the first time :-)

Any idea?

Daniel





[GENERAL] Indexing large table of coordinates with GiST

2015-01-15 Thread Daniel Begin
Hi, I'm trying to create an index on coordinates (geography type) over a
large table (4.5 billion records) using GiST...

CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);

The command ran for 5 days until my computer stopped because of a power outage!
Before restarting the index creation, I am asking the community if there are
ways to shorten the time it took the first time :-) 

Any idea?

Daniel
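
The main knob worth raising for a one-off GiST build is the session's maintenance_work_mem; the value below is only an example and has to fit in RAM, and on 9.3 the build runs in a single backend either way:

SET maintenance_work_mem = '2GB';
CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);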





Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Daniel Begin
Seems promising but could you provide me a reference to PostgreSQL
documentation regarding this a%8=* feature?
Best

Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Marc Mamin
Sent: December-12-14 06:41
To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
Cc: 'Andy Colson'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)


Thank Tom,
I understand that the rationale behind choosing to create a new table 
from distinct records is that, since both approaches need full table 
scans, selecting distinct records is faster (and seems more straight 
forward) than finding/deleting duplicates;

Hi,
on a large table you may get it faster while using more than one thread.
e.g.:

select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
select a,b,c into newtable from oldtable where a%8 =1 group by a,b,c;
...
select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;

This will/should use a shared full table scan on oldtable.

HTH

Marc Mamin



Best regards,
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Tom Lane
Sent: December-08-14 21:52
To: Scott Marlowe
Cc: Andy Colson; Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload 
(rationale behind selecting a method)

Scott Marlowe scott.marl...@gmail.com writes:
 If you're de-duping a whole table, no need to create indexes, as it's 
 gonna have to hit every row anyway. Fastest way I've found has been:

 select a,b,c into newtable from oldtable group by a,b,c;

 On pass, done.

 If you want to use less than the whole row, you can use select 
 distinct on (col1, col2) * into newtable from oldtable;

Also, the DISTINCT ON method can be refined to control which of a set 
of duplicate keys is retained, if you can identify additional columns 
that constitute a preference order for retaining/discarding dupes.  See 
the latest weather reports example in the SELECT reference page.

In any case, it's advisable to crank up work_mem while performing this 
operation.

regards, tom lane




Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Daniel Begin
Thanks Marc (and all others)

 

I knew that nothing was for free and understanding where the costs come from
would provide me with some rationale to make my choice! 

However, I understand from your answer that there is no definitive approach
to do it right at this time (considering my specific context). Since the
tables are quite similar (number of fields and data types), I will then
probably explore some of the promising approaches proposed so far (below) to
get facts.

 

a) select distinct * into newtable from oldtable;

b) select a,b,c into newtable from oldtable group by a,b,c; 

c) select distinct a,b,c into newtable from oldtable where a%6=* group by
a,b,c;  

d) select a,b,c into newtable from oldtable where a%6=* group by a,b,c;

 

c and d will be run after having set the visibility bit with a select count(*).
Running on 8 cores, only 6 will be used (a%6=*)

Something else to add/replace on scenarios above (a-d) ?

 

Before going further, here are some context/concerns you may wish to comment
.

- Most of the tables do not fit in my computer's 32GB memory - since they
are usually between 35GB and 350GB;

- All the tables have a bigint ID I can use with the modulo operator
(id%6=*);

 

Working on Windows platform.

- work_mem: 1024MB

- maintenance_work_mem: 16384 (Understand the value must be significantly
larger than work_mem) 

- shared_buffers: 128MB (Understand that on Windows the useful range is 64MB
to 512MB)

 

Waiting comments and hoping to get back soon with useful results 

 

Daniel

 

 

From: Marc Mamin [mailto:m.ma...@intershop.de] 
Sent: December-12-14 14:25
To: John McKown; Daniel Begin
Cc: Tom Lane; Scott Marlowe; Andy Colson; PostgreSQL General
Subject: AW: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)

 


Seems promising but could you provide me a reference to PostgreSQL
documentation regarding this a%8=* feature?
Best


 % is the modulus operator. 

Assuming a is an integer (I don't remember), then doing 8 selects of a modulus 8, one for each of the possible results (0..7), will each select about 1/8 of the entire table (I would guess), and the end results put together will end up selecting all of the original table. I don't know, myself,
why this would be faster. But I'm not any kind of a PostgreSQL expert
either.


yes.
Extracting unique values from very large sets is not for free,
neither from the I/O nor from the cpu point of view
spreading the tasks on more CPUs should reduce I/O.

(Does your huge table fit in RAM ?)

If you don't have int values available for the % operator, you may also
consider 
(hashtext(value)) % N, but the extra work may result in no benefit. hashtext is not documented as it is not guaranteed to stay stable in future Postgres releases, but is safe in such a case
(http://lmgtfy.com/?q=postgres+hashtext+partition).

Another point is that I'm not sure that all threads will grep on a shared
scan on a freshly created table
where the visibility hint bit is not yet set:

   (see head comment in
http://doxygen.postgresql.org/syncscan_8c_source.html)

...because reading a table for the first time implies to rewrite it:

http://www.cybertec.at/speeding-up-things-with-hint-bits/


You'll be able to avoid this extra I/O in upcoming 9.4 thanks to the new
COPY FREEZE option:
http://www.postgresql.org/docs/9.4/interactive/sql-copy.html

hmmm, 
on the other hand, I suppose that you will avoid the extra I/O for the hint
bit 
if you first copy your data in a temp table, but then you won't be able to
parallelize the job
as other transactions won't see the data.
Moreover you need to pay attention to how much work_mem you can afford to
each transaction, 
knowing you have x of them running concurrently.

So at the end I'm not sure if multiple threads will help here.
I'm using this approach in aggregations which are more cpu intensive than a
simple distinct.

I'm looking forward to see your tests results :)

Marc Mamin









Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Marc Mamin
Sent: December-12-14 06:41
To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
Cc: 'Andy Colson'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)


Thank Tom,
I understand that the rationale behind choosing to create a new table
from distinct records is that, since both approaches need full table
scans, selecting distinct records is faster (and seems more straight
forward) than finding/deleting duplicates;

Hi,
on a large table you may get it faster while using more than one
thread.
e.g.:

select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
select a,b,c into newtable from oldtable where a%8 =1 group by a,b,c;
...
select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;

Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-09 Thread Daniel Begin
Thanks Tom, 
I understand that the rationale behind choosing to create a new table from
distinct records is that, since both approaches need full table scans,
selecting distinct records is faster (and seems more straight forward) than
finding/deleting duplicates; 

Best regards,
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Tom Lane
Sent: December-08-14 21:52
To: Scott Marlowe
Cc: Andy Colson; Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)

Scott Marlowe scott.marl...@gmail.com writes:
 If you're de-duping a whole table, no need to create indexes, as it's 
 gonna have to hit every row anyway. Fastest way I've found has been:

 select a,b,c into newtable from oldtable group by a,b,c;

 On pass, done.

 If you want to use less than the whole row, you can use select 
 distinct on (col1, col2) * into newtable from oldtable;

Also, the DISTINCT ON method can be refined to control which of a set of
duplicate keys is retained, if you can identify additional columns that
constitute a preference order for retaining/discarding dupes.  See the
latest weather reports example in the SELECT reference page.

In any case, it's advisable to crank up work_mem while performing this
operation.

regards, tom lane
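
A sketch of that DISTINCT ON refinement, with illustrative column names - a, b, c as the duplicate key and version as the column deciding which duplicate is kept:

SET work_mem = '1GB';
CREATE TABLE newtable AS
SELECT DISTINCT ON (a, b, c) *
FROM oldtable
ORDER BY a, b, c, version DESC;  -- keeps the row with the highest version per (a,b,c)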




Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-08 Thread Daniel Begin
Thanks for your answers Andy; I will keep in mind the procedure you proposed.
About the fields required to find duplicate records, all of them (5-9 fields,
depending on the table) are required.

Considering that the tables are not indexed yet, am I right to think that both 
approaches will need a full table scan?
- Deleting duplicate records would need a full table scan to create temporary 
indexes to select/remove duplicate records;
- Inserting distinct records into an empty table will also need a full table 
scan to select distinct (*) from big_table;

That said, is indexing and then selecting/deleting duplicate records faster
than rewriting a whole table from distinct records? I am trying to find a
rationale behind the choice - execution time, simplicity? I just don't know
what the criteria should be and how/why they would affect the selection of an
approach :-|

Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Andy Colson
Sent: December-08-14 11:39
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload

On 12/8/2014 10:30 AM, Andy Colson wrote:
 On 12/7/2014 9:31 PM, Daniel Begin wrote:
 I have just completed the bulk upload of a large database. Some 
 tables have billions of records and no constraints or indexes have 
 been applied yet. About 0.1% of these records may have been 
 duplicated during the upload and I need to remove them before applying 
 constraints.

 I understand there are (at least) two approaches to get a table 
 without duplicate records…

 -   Delete duplicate records from the table based on an
 appropriate select clause;

 -   Create a new table with the results from a select distinct
 clause, and then drop the original table.

 What would be the most efficient procedure in PostgreSQL to do the 
 job considering …

 -   I do not know which records were duplicated;

 -   There are no indexes applied on tables yet;

 -   There is no OIDS on tables yet;

 -   The database is currently 1TB but I have plenty of disk
 space.

 Daniel


 How would you detect a duplicate?  Is there a single field that would be 
 duplicated?  Or do you have to test a bunch of different fields?

 If it's a single field, you could find dups in a single pass of the table
 with:

 create index bigtable_key on bigtable(key);
 select key, count(*) from bigtable group by key having count(*) > 1;

 Save that list, and decide on some way of deleting the dups.

 The index might help the initial select, but will really help re-query 
 and delete statements.

 -Andy



I just thought of a more generic way.

1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only dups
insert into temp table
select * from bigtable where (it's a duplicate);

4)
delete from bigtable where keys in (select key from temp);

5)
insert into bigtable
select distinct * from temp;


This would minimize the amount of data you have to move around.  It depends on how 
hard step 3 is to write.  The index is not required but would help make both steps 
3 and 4 faster.

-Andy
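
A minimal sketch of those steps, assuming the table's columns are (a, b, c),
duplicates are exact copies, and none of these columns contains NULLs:

-- 1) non-unique index to speed up the lookups below
CREATE INDEX bigtable_abc ON bigtable (a, b, c);

-- 2/3) temp table holding one copy of each duplicated combination
CREATE TEMP TABLE dups AS
SELECT a, b, c FROM bigtable
GROUP BY a, b, c
HAVING count(*) > 1;

-- 4) remove every duplicated row from the big table
DELETE FROM bigtable b
USING dups d
WHERE b.a = d.a AND b.b = d.b AND b.c = d.c;

-- 5) put a single copy of each back
INSERT INTO bigtable
SELECT a, b, c FROM dups;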





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Removing duplicate records from a bulk upload

2014-12-07 Thread Daniel Begin
I have just completed the bulk upload of a large database. Some tables have 
billions of records and no constraints or indexes have been applied yet. About 
0.1% of these records may have been duplicated during the upload and I need to 
remove them before applying constraints.

 

I understand there are (at least) two approaches to get a table without 
duplicate records…

-   Delete duplicate records from the table based on an appropriate 
select clause;

-   Create a new table with the results from a select distinct clause, 
and then drop the original table.

 

What would be the most efficient procedure in PostgreSQL to do the job 
considering …

-   I do not know which records were duplicated;

-   There are no indexes applied on tables yet;

-   There is no OIDS on tables yet;

-   The database is currently 1TB but I have plenty of disk space.

Daniel

 



Re: [GENERAL] Inconsistency between PgAdmin III GUI and SQL window ?

2014-10-23 Thread Daniel Begin
Thanks Adrian,

I just found what went wrong in my script...
As described in the initial email, I set the search path to the destination
schema (xxx) prior to executing the script. Doing so, I was excluding the
public schema from the search and therefore could not access the PostGIS
extension. By setting the search path to the xxx AND public schemas (set
search_path to xxx, public;) everything works!

Daniel
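
For reference, a small sketch of that fix (xxx is the placeholder schema used in
this thread; the table and column names are hypothetical, and PostGIS is assumed
to be installed in public):

SET search_path TO xxx, public;   -- keep public visible so the PostGIS types resolve

CREATE TABLE my_cloned_table (
    id   bigint,
    geom geography(Point, 4326)   -- type found through the public schema
);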

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Adrian Klaver
Sent: October-22-14 09:46
To: Daniel Begin; 'David G Johnston'
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Inconsistency between PgAdmin III GUI and SQL window
?

On 10/22/2014 03:25 AM, Daniel Begin wrote:
 David, Adrian,
 I am new to databases and since PgAdmin displays Catalogs, Event 
 Triggers, Extensions and Schemas as children of the database, I assumed 
 that Extensions were linked to a specific database, not to a specific
schema.

Yes and no. If they are installed in the pg_catalog schema then they are
available database wide:

http://www.postgresql.org/docs/9.3/static/ddl-schemas.html

In addition to public and user-created schemas, each database contains a
pg_catalog schema, which contains the system tables and all the built-in
data types, functions, and operators. pg_catalog is always effectively part
of the search path. If it is not named explicitly in the path then it is
implicitly searched before searching the path's schemas. This ensures that
built-in names will always be findable. However, you can explicitly place
pg_catalog at the end of your search path if you prefer to have user-defined
names override built-in names.

This is where you will find the procedural languages. So in psql you can do
the below to see what is installed and where:

test=# \dx
                                 List of installed extensions
   Name    | Version |   Schema   |                         Description
-----------+---------+------------+-------------------------------------------------------------
 hstore    | 1.2     | public     | data type for storing sets of (key, value) pairs
 plpgsql   | 1.0     | pg_catalog | PL/pgSQL procedural language
 tablefunc | 1.0     | public     | functions that manipulate whole tables, including crosstab




 After reading your answers, I had another look at PostGIS extension 
 properties and it is pretty clear it belongs to the public schema, 
 which explains the error message I got.

 Thanks
 Daniel



--
Adrian Klaver
adrian.kla...@aklaver.com





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Inconsistency between PgAdmin III GUI and SQL window ?

2014-10-22 Thread Daniel Begin
David, Adrian,
I am new to databases and since PgAdmin displays Catalogs, Event Triggers,
Extensions and Schemas as children of the database, I assumed that Extensions
were linked to a specific database, not to a specific schema. 

After reading your answers, I had another look at PostGIS extension
properties and it is pretty clear it belongs to the public schema, which
explains the error message I got.

Thanks
Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Adrian Klaver
Sent: October-21-14 18:58
To: Daniel Begin; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Inconsistency between PgAdmin III GUI and SQL window
?

On 10/21/2014 03:33 PM, Daniel Begin wrote:
 I have first to admit the inconsistency is probably on my side!-)

 The task - I want to clone a table I created in public schema in 
 another schema (xxx) of the same database.
 The problem - I get an error message when creating the table using the 
 original SQL script: ERROR: type geography does not exist.

 I understand from
 http://stackoverflow.com/questions/9067335/how-to-create-table-inside-specific-schema-by-default-in-postgres
 that the original script should work if I set the search path to the
 destination schema (xxx) prior to executing the script (set search_path to
 xxx;) but the PgAdmin 
 III SQL window does not seem aware of the geography type.

 What I do not understand is that using the GUI (contextual menu. new table.
 new column.) within schema xxx, I can create the table as expected, 
 and the geography type is recognized.  Any idea about what is going on 
 when using the SQL window?

A quick test here showed that in the GUI New Object(table) wizard the types
are schema qualified if needed. For instance I did:

CREATE TYPE public.test_type AS (my_int  int);

When I went to another schema and created a table and then a column, the
test_type was shown as public.test_type and I could use it as the column
type. The table was created with no problem. So pgAdmin searches the schemas
for you to retrieve types when you use the creation wizard.

When you are using the SQL window it has no context other then what is
provided by the search_path. If the search_path does not include the schema
that holds the type you want, then you will get an error.

Probably the best way to see this is tail the Postgres log file where the
postgresql.conf file has log_statement = 'mod' or 'all'.


 Daniel





--
Adrian Klaver
adrian.kla...@aklaver.com





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Inconsistency between PgAdmin III GUI and SQL window ?

2014-10-21 Thread Daniel Begin
I have first to admit the inconsistency is probably on my side!-)

The task - I want to clone a table I created in public schema in another
schema (xxx) of the same database.
The problem - I get an error message when creating the table using the
original SQL script: ERROR: type geography does not exist.

I understand from
http://stackoverflow.com/questions/9067335/how-to-create-table-inside-specific-schema-by-default-in-postgres
that the original script should work if I set the search path to the
destination schema (xxx) prior to executing the script (set search_path to
xxx;) but the PgAdmin III SQL window does not
seem aware of the geography type.

What I do not understand is that using the GUI (contextual menu. new table .
new column.) within schema xxx, I can create the table as expected, and the
geography type is recognized.  Any idea about what is going on when using
the SQL window?

Daniel



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Best practices for Large import in empty database?

2014-10-01 Thread Daniel Begin
I am about to feed an empty database with a large import of data (the size
of csv files is about 500 GB). Tables are created but I haven't added any
constraints or indexes yet. I wonder whether the best practice is to add
them before or after the import. Are there other good practices that would
ease the import process?

 

Cheers,

Daniel



Re: [GENERAL] Best practices for Large import in empty database?

2014-10-01 Thread Daniel Begin
Oops, right to the point! Thanks...

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Tom Lane
Sent: October-01-14 10:38
To: Daniel Begin
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Best practices for Large import in empty database?

Daniel Begin jfd...@hotmail.com writes:
 I am about to feed an empty database with a large import of data (the 
 size of csv files is about 500 GB). Tables are created but I haven't 
 added any constraints or indexes yet. I wonder whether the best 
 practice is to add them before or after the import. Are there other 
 good practices that would ease the import process?

http://www.postgresql.org/docs/9.3/static/populate.html

regards, tom lane
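
For readers skimming the archive, the gist of that page, sketched with hypothetical
names and paths: load first with COPY, then add indexes and constraints afterwards.

-- load the raw data first; COPY is far faster than row-by-row INSERTs
-- (the path, table and column names below are placeholders)
COPY my_table FROM 'C:/data/my_table.csv' WITH (FORMAT csv, HEADER true);

-- only then build indexes and constraints, once per table
CREATE INDEX my_table_key_idx ON my_table (key_col);
ALTER TABLE my_table ADD PRIMARY KEY (id);

-- refresh planner statistics when the load is done
ANALYZE my_table;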





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] PostgreSQL Portable

2014-09-11 Thread Daniel Begin
Wow, I was not expecting so much skillful feedback - thanks to all

 

I am not closing the point yet since, as Steve Crawford suggested, the
solution I am looking for (as a newbie) might not be optimal!-) So here is
more context.

 

All PCs run Windows 7 64-bit (different hardware) and I will be the only user
accessing the DB. Once the setup is completed, the DB will mainly be used for
reading the data (queries). The results will be used for statistical
analysis/data representation.

 

Thanks again.

Daniel 



[GENERAL] PostgreSQL Portable

2014-09-10 Thread Daniel Begin
First, I am a newbie regarding PostgreSQL.

 

I just started to look at PostgreSQL to implement a large GIS DB (1 TB).  The
data must reside on an external disk with an eSATA connection and may be moved
to different locations (and Windows desktops/laptops). I was looking to
install PostgreSQL and the PostGIS extension on each PC (setting up the proper
PGDATA directory on the external disk) until I read about PostgreSQL and
PgAdmin Portable.

 

http://sourceforge.net/projects/pgadminportable/

http://sourceforge.net/projects/postgresqlportable/

 

Is that a viable alternative considering the expected size of the DB? Any
comments or proposals would be appreciated :-)

Daniel