Re: [GENERAL] select DISTINCT

2013-09-09 Thread pg noob
Thank you Kevin and Jeff for the responses.
These are very helpful.


On Fri, Sep 6, 2013 at 10:48 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Friday, September 6, 2013, pg noob wrote:


 Hi all,

 I'm curious about some of the query estimates that I'm seeing with
 queries that use DISTINCT.
 I am using postgres 8.4.13

 I did a couple of quick tests, and found that PostgreSQL seems to do some
 expensive work to return DISTINCT rows.  This is contrary to what I expected:
 if it knows that it is building a distinct result set, I assumed it would
 maintain some kind of ordering as it built the results to be able to quickly
 throw away duplicates without any real overhead to the query.


 There is no cost-free way of doing that.   You either have to sort, or
 hash, or walk in index order, and none of those things are free.

 ...


 But it gets a bit stranger than that.
 According to the docs,
 “DISTINCT ON column will eliminate all duplicates in the specified
 column; this is equivalent to using GROUP BY column.”


 DISTINCT ON is not the same thing as DISTINCT, which is not the same as
 count(distinct ...)





 Note that it says that using distinct is EQUIVALENT to using GROUP BY.
 And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

 db=# explain analyze select count(distinct(column1)) from table1;
   QUERY
 PLAN

 --
  Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual
 time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8)
 (actual time=0.003..2391.644 rows=16151368 loops=1)
  Total runtime: 18596.631 ms
 (3 rows)

 db=# explain analyze select count(foo.column1) from (select column1 from
 table1 group by column1) as foo;


 Compare that to:

 explain analyze select count(*) from (select DISTINCT column1 from table1)
 as foo;



 Any ideas why this is?  Or what I am doing wrong?


 The code that executes count(distinct ...) has never been taught how to
 use hash aggregation, whereas DISTINCT and GROUP BY have been.  It always
 sorts, which is often much slower than a hash, and also often much less
 memory efficient.   I speculate that this is because the implementation of
 count(distinct ...) is really ancient code that has never been brought up to
 date, either about hashing or about more thorough EXPLAIN estimates--I have
 been meaning to dig into it but haven't gotten around to it yet.

 You can get the increased performance by just spelling it in the more
 verbose way, but be aware that count (distinct ...) deals with NULL
 differently than select count(*) from (select DISTINCT...)  does, so they
 are not exactly identical.
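
 A small illustration of the NULL-handling difference mentioned above (a
 sketch using a throwaway table; count(DISTINCT col) ignores NULLs, while a
 NULL row survives SELECT DISTINCT and is then counted by count(*)):

 CREATE TEMP TABLE t (c int);
 INSERT INTO t VALUES (1), (1), (NULL);

 SELECT count(DISTINCT c) FROM t;                     -- returns 1 (NULL ignored)
 SELECT count(*) FROM (SELECT DISTINCT c FROM t) foo; -- returns 2 (NULL row counted)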

 Cheers,

 Jeff



[GENERAL] select DISTINCT

2013-09-06 Thread pg noob
Hi all,

I'm curious about some of the query estimates that I'm seeing with queries
that use DISTINCT.
I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some
expensive work to return DISTINCT rows.  This is contrary to what I expected:
if it knows that it is building a distinct result set, I assumed it would
maintain some kind of ordering as it built the results to be able to quickly
throw away duplicates without any real overhead to the query.  This is
different from an ORDER BY, which requires knowledge of all rows in the table
before it can determine the ordering.

I would think that distinct is something that can be determined even from a
partial result set.

But what I find is that there is a significant difference between a count
and a count distinct:

db=# explain analyze select count(column1) from table1;
  QUERY
PLAN
--
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual
time=5265.143..5265.143 rows=1 loops=1)
   ->  Seq Scan on table1 (cost=0.00..591411.65 rows=15994265 width=8)
(actual time=0.005..3392.991 rows=16163849 loops=1)
 Total runtime: 5265.170 ms
(3 rows)

db=# explain analyze select count(distinct(column1)) from table1;
  QUERY
PLAN
--
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual
time=21828.644..21828.644 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8)
(actual time=0.005..2742.391 rows=16163937 loops=1)
 Total runtime: 21828.736 ms
(3 rows)

Somehow the estimated query plan cost of the two queries is _exactly_ the
same, but one takes 4 times longer than the other to actually execute.
And it is the query returning fewer rows in the result which takes more time.
That says that there is work being done to return distinct results that is
not accounted for by the query plan.
(I tried this several times on different systems because I could not
believe it.
But it seems consistent.)

But it gets a bit stranger than that.
According to the docs,
“DISTINCT ON column will eliminate all duplicates in the specified column;
this is equivalent to using GROUP BY column.”

Note that it says that using distinct is EQUIVALENT to using GROUP BY.
And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
  QUERY
PLAN
--
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual
time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8)
(actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from
table1 group by column1) as foo;

 QUERY
PLAN

 Aggregate  (cost=631419.39..631419.40 rows=1 width=8) (actual
time=4954.187..4954.188 rows=1 loops=1)
   ->  HashAggregate  (cost=631397.31..631407.12 rows=981 width=8) (actual
time=4953.477..4954.044 rows=2389 loops=1)
         ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265
width=8) (actual time=0.004..2050.133 rows=16151805 loops=1)
 Total runtime: 4954.223 ms
(4 rows)

The GROUP BY performs much better than DISTINCT even though both queries
return the exact same count result.

In the above examples, column1 is type bigint and it is not indexed.

Any ideas why this is?  Or what I am doing wrong?

Thank you.


[GENERAL] postgres 9.2

2013-08-28 Thread pg noob
Hi all,

I recently ran a couple of tests where I took one of my production
systems and did a drop-in replacement of postgres 8.4 with 9.2.4.
I was expecting to see some performance improvement given the release
notes describing 9.2 as a largely performance related release.

At least for my application, which is an embedded postgresql install
with a relatively small number of client connections, I'm not seeing much
of a measurable difference at all.

I'm just wondering if others have had a similar experience where upgrading
from 8.x to 9.x has or has not improved overall performance?

Thanks.


[GENERAL] unique index corruption

2013-07-24 Thread pg noob
Hi all,

In PostgreSQL 8.4...

I am wondering if autovacuum will periodically rebuild indexes?

If not, how advisable is it to reindex periodically?

We recently had a case of unique index corruption which ended up allowing
duplicate
primary key IDs to get inserted and caused widespread data model integrity
havoc.

The system ran for several days before it was noticed.

After the problem was discovered we found that a reindex would fail because
there were
duplicate ID values.

Our thought is that if we had a scheduled cron job doing a periodic reindex,
this corruption might have been caught sooner by detecting the reindex
failure.
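
As a rough sketch of what such a job could run (the table and database names
here are placeholders, not from our system; note that REINDEX takes exclusive
locks on the table it is rebuilding, so it would have to run off-hours):

REINDEX TABLE my_table;
-- or, to rebuild every index in the database:
REINDEX DATABASE mydb;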

If a reindex is something that should be done frequently as part of regular
maintenance, why isn't there a mode of autovacuum that does this
automatically?  Or maybe there is and I just don't know about it...?

Thank you.


Fwd: [GENERAL] odd locking behaviour

2013-07-23 Thread pg noob
Thank you Jeff and others for the responses.

One concern that I have is that even in cases where there is no deadlock it is
still acquiring stronger locks than necessary.  I only discovered it
because of the deadlock issue, but I presume that there are many cases where
it is acquiring a lock on the foreign table and really doesn't need to.
That would seem to lead to higher lock contention in general, which, although
it may not cause a deadlock in every case, does affect overall performance.

In my opinion this should be a case where it ought not to acquire any locks
on the foreign table at all.
It's not as though the columns relevant to the foreign key reference have
changed.

Anyway, at least it's good to understand it more.  I'll have to find some
way to work around this in my application.

Thank you!




On Mon, Jul 22, 2013 at 12:48 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Sun, Jul 21, 2013 at 9:15 PM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  hello
 
  It can be an artefact of the RI implementation.
 
  see
 http://michael.otacoo.com/postgresql-2/postgres-9-3-feature-highlight-for-key-share-and-for-no-key-update/
 
  Try 9.3, please, where RI uses gentler locks

 It still behaves this way in 9.4dev.

 
  On Mon, Jul 8, 2013 at 9:54 AM, pg noob pgn...@gmail.com wrote:
 
 
  Thank you for the responses.  Is it a bug?

 I don't think so.  While PostgreSQL of course strives for maximum
 concurrency, it makes no guarantee that it uses the weakest
 theoretically possible locking in all possible cases.  But it is kind
 of unfortunate that updating the same row twice causes a lock
 escalation when it is not obvious it should do so, because, as you
 found, that makes avoiding deadlocks quite difficult.

 I'm rather surprised it doesn't block at the first update of the 2nd
 session, rather than waiting for the 2nd update of that session.

 Anyway, when the 2nd session re-updates the same row in the same
 transaction, it uses a 'multixact' to record this.   Doing that
 apparently defeats some locking optimization that takes place under
 simpler cases.

 Sorry, that probably isn't the definitive answer you were hoping for.

 Cheers,

 Jeff



Re: [GENERAL] odd locking behaviour

2013-07-08 Thread pg noob
Thank you for the responses.  Is it a bug?
I discovered this because of a db deadlock that shows up in my application
logs.
I can probably work around it to avoid the deadlock (with some amount of
work) but I really don't understand why it behaves as it does.



On Thu, Jul 4, 2013 at 8:40 AM, Moshe Jacobson mo...@neadwerx.com wrote:

 Confirmed reproducible on version 9.1 as well. Very odd.


 On Wed, Jul 3, 2013 at 1:30 PM, pg noob pgn...@gmail.com wrote:


 Hi all,

 I am trying to understand some odd locking behaviour.
 I apologize in advance if this is a basic question and should be widely
 understood but
 I don't see it described in the documentation as far as I could find.

 I'm using Postgres 8.4.13

 I have two tables, call them A and B for example purposes.

 Table A, with column id

 Table B
   - foreign key reference a_id matches A.id FULL
   - some other columns blah1, blah2, blah3

 I do this:

 db1: begin
   db2: begin
 db1: select A FOR UPDATE
   db2: update B set blah1 = 42; --- OK, UPDATE 1
   db2: update B set blah2 = 42; --- This blocks waiting for a lock on A!!

 Here are the exact steps to reproduce:

 CREATE TABLE A (id bigint NOT NULL);
 CREATE TABLE B (id bigint NOT NULL, a_id bigint NOT NULL, blah1 bigint,
 blah2 bigint, blah3 bigint);
 ALTER TABLE ONLY A ADD CONSTRAINT a__pkey PRIMARY KEY (id);
 ALTER TABLE B ADD CONSTRAINT fkrefa FOREIGN KEY (a_id) REFERENCES A(id)
 MATCH FULL;
 INSERT INTO A VALUES (1);
 INSERT INTO B VALUES (1, 1, 1, 2, 3);

 Now, in two DB connections, CON1 and CON2.

 CON1:
   BEGIN;
   SELECT * FROM A WHERE id = 1 FOR UPDATE;

 CON2:
   BEGIN;
   UPDATE B SET blah1 = 42 WHERE id = 1;
   UPDATE B SET blah2 = 42 WHERE id = 1; -- this blocks

 I have verified that if I drop the foreign key constraint requiring
 B.a_id match A.id
 that this behaviour does not happen and both updates succeed without
 blocking.

 I can perhaps understand why it acquires a shared lock on A when updating
 B because of
 the foreign key reference, even though it doesn't seem like it should
 require it because
 the columns being updated are not relevant to the foreign key constraint.

 That behaviour would be less than ideal but at least it would be
 understandable.

 However, why does it only try to acquire the lock on the second update?

 If I do a ROLLBACK in CON1, then I see CON2 finish the UPDATE and it
 acquires a
 lock on table A.  Why?

 Thank you.




 --
 Moshe Jacobson
 Nead Werx, Inc. | Manager of Systems Engineering
 2323 Cumberland Parkway, Suite 201 | Atlanta, GA 30339
 mo...@neadwerx.com | www.neadwerx.com

 Quality is not an act, it is a habit. -- Aristotle



[GENERAL] odd locking behaviour

2013-07-03 Thread pg noob
Hi all,

I am trying to understand some odd locking behaviour.
I apologize in advance if this is a basic question and should be widely
understood but
I don't see it described in the documentation as far as I could find.

I'm using Postgres 8.4.13

I have two tables, call them A and B for example purposes.

Table A, with column id

Table B
  - foreign key reference a_id matches A.id FULL
  - some other columns blah1, blah2, blah3

I do this:

db1: begin
  db2: begin
db1: select A FOR UPDATE
  db2: update B set blah1 = 42; --- OK, UPDATE 1
  db2: update B set blah2 = 42; --- This blocks waiting for a lock on A!!

Here are the exact steps to reproduce:

CREATE TABLE A (id bigint NOT NULL);
CREATE TABLE B (id bigint NOT NULL, a_id bigint NOT NULL, blah1 bigint,
blah2 bigint, blah3 bigint);
ALTER TABLE ONLY A ADD CONSTRAINT a__pkey PRIMARY KEY (id);
ALTER TABLE B ADD CONSTRAINT fkrefa FOREIGN KEY (a_id) REFERENCES A(id)
MATCH FULL;
INSERT INTO A VALUES (1);
INSERT INTO B VALUES (1, 1, 1, 2, 3);

Now, in two DB connections, CON1 and CON2.

CON1:
  BEGIN;
  SELECT * FROM A WHERE id = 1 FOR UPDATE;

CON2:
  BEGIN;
  UPDATE B SET blah1 = 42 WHERE id = 1;
  UPDATE B SET blah2 = 42 WHERE id = 1; -- this blocks
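
While the second UPDATE in CON2 is blocked, one way to see what it is waiting
on is to look at pg_locks from a third session (a sketch; on 8.4 the join
column in pg_stat_activity is procpid and the query column is current_query):

SELECT l.locktype, l.relation::regclass, l.mode, l.granted, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;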

I have verified that if I drop the foreign key constraint requiring B.a_id
match A.id
that this behaviour does not happen and both updates succeed without
blocking.

I can perhaps understand why it acquires a shared lock on A when updating B
because of
the foreign key reference, even though it doesn't seem like it should
require it because
the columns being updated are not relevant to the foreign key constraint.
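
For context, my understanding is that the RI trigger fired by the UPDATE on B
checks the referenced row with something roughly like the following (the exact
query text is internal to the backend and may differ by version):

SELECT 1 FROM ONLY a x WHERE id = $1 FOR SHARE OF x;

so it is that row-level FOR SHARE lock on A that collides with CON1's
SELECT ... FOR UPDATE.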

That behaviour would be less than ideal but at least it would be
understandable.

However, why does it only try to acquire the lock on the second update?

If I do a ROLLBACK in CON1, then I see CON2 finish the UPDATE and it
acquires a
lock on table A.  Why?

Thank you.


[GENERAL] unique constraint violations

2013-06-26 Thread pg noob
Hi all,

There are some places in our application where unique constraint violations
are difficult to avoid due to multithreading.
What we've done in most places to handle this is to retry in a loop.

Generally it starts by checking if a value already exists; if not, it tries to
insert it, which may cause a unique violation, and if a constraint violation
occurs, the process is retried.
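
Schematically, the per-value logic looks something like this fragment of a
plpgsql function body (a sketch along the lines of the merge example in the
docs; table and column names are placeholders -- when the retry happens at
the application level instead, it is the failed INSERT that produces the log
errors described below):

LOOP
    -- does the value already exist?
    PERFORM 1 FROM my_table WHERE key_col = p_key;
    IF FOUND THEN
        RETURN;
    END IF;
    BEGIN
        INSERT INTO my_table (key_col) VALUES (p_key);
        RETURN;
    EXCEPTION WHEN unique_violation THEN
        -- another session inserted the same key first; loop and re-check
        NULL;
    END;
END LOOP;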

The trouble that we have with this approach is twofold.
First, it causes errors to show up in the Postgres log about unique
constraint violations.  These errors are misleading to our support folks
because they look like errors but in fact they are handled (usually) by
retries in the code and don't cause any real problems.  We'd like these to
be handled without causing errors to show up in the logs.

Second, in some cases we've done a lot of work on a transaction before
hitting a unique constraint violation.
If this happens the entire transaction gets rolled back and all the work
leading up to the constraint violation has to be redone.

As a workaround for this, I have proposed the following function, which
will execute an arbitrary query (with no results returned) and catch a
unique constraint violation without causing errors to show up in the
postgres log or the transaction to get rolled back.  The code now calls
this function with the query to execute.  If a unique constraint
violation happens, the caller gets a non-zero return code back and this can
be handled, but the DB transaction does not get rolled back.


Schema              | public
Name                | executehandleunique
Result data type    | bigint
Argument data types | querystring character varying
Type                | normal
Volatility          | volatile
Owner               | sysdba
Language            | plpgsql
Source code         |  --
                    : -- This function executes an arbitrary query and catches any unique violation
                    : -- that may occur without causing the database transaction to get rolled back.
                    : -- Warning: Don't use this for queries that may be generated from user input
                    : -- as it can create a security problem.
                    : --
                    : BEGIN
                    :    BEGIN
                    :       RAISE NOTICE 'executing %', querystring;
                    :       execute querystring;
                    :    EXCEPTION
                    :       WHEN unique_violation THEN
                    :          return 2; -- status value 2 indicates unique constraint violation occurred
                    :    END;
                    :    RETURN 0; -- status 0 indicates success
                    : END;
                    :
Description         |
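
For readability, here is the same function reconstructed as a CREATE FUNCTION
statement (a reconstruction from the \df+ output above, not the original DDL;
the psql display does not show the exact options it was created with):

CREATE OR REPLACE FUNCTION executehandleunique(querystring character varying)
RETURNS bigint
LANGUAGE plpgsql
VOLATILE
AS $$
-- This function executes an arbitrary query and catches any unique violation
-- that may occur without causing the database transaction to get rolled back.
-- Warning: Don't use this for queries that may be generated from user input
-- as it can create a security problem.
BEGIN
   BEGIN
      RAISE NOTICE 'executing %', querystring;
      EXECUTE querystring;
   EXCEPTION
      WHEN unique_violation THEN
         RETURN 2; -- status value 2 indicates unique constraint violation occurred
   END;
   RETURN 0; -- status 0 indicates success
END;
$$;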


Our application is a closed system, meaning that it does not accept input
from untrusted sources.

Is this function safe?  (Knowing that we do not handle any untrusted input.)
Are there any potential problems with this approach?
Is there a better way to avoid unique constraint errors and the transaction
rollback?

Thank you.


Re: [GENERAL] unique constraint violations

2013-06-26 Thread pg noob
Thank you for the suggestion Steven.
Originally I did implement a solution using savepoints and that worked as a
way to keep all the work done on the transaction leading up to the
constraint violation, but errors would still show up in the Postgres log.
With this new function approach there are no errors in the log.

As far as the serialization, that might be an option, but it probably would
require substantial work to redesign our db access patterns.
Also we're still currently on 8.4 (but hope to upgrade soon!)


On Wed, Jun 26, 2013 at 2:39 PM, Steven Schlansker ste...@likeness.com wrote:


 On Jun 26, 2013, at 11:04 AM, pg noob pgn...@gmail.com wrote:

 
  Hi all,
 
  There are some places in our application where unique constraint
 violations are difficult to avoid due to multithreading.
  What we've done in most places to handle this is to retry in a loop.
 
  Generally it starts by checking if a value already exists, if not - try
 to insert it, which may cause a unique violation, and if a constraint
 violation occurs, retrying the process again.
 
  The trouble that we have with this approach is twofold.
  First, it causes errors to show up in the Postgres log about unique
 constraint violations.  These errors are misleading to our support folks
 because they look like errors but in fact they are handled (usually) by
 retries in the code and don't cause any real problems.  We'd like these to
 be handled without causing errors to show up in the logs.
 

 We have solved this problem by leveraging the new SSI feature in Postgres
 9.1 (http://wiki.postgresql.org/wiki/SSI)

 By running your insert / update inside of a serializable transaction,
 instead of getting unique failures, you get serialization failures.  These
 are expected and we just retry them.  Much less confusing log clutter,
 and no need to differentiate between expected and unexpected unique
 violations.
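
 To make that concrete, the SQL side looks something like the following sketch
 (the table and column names are placeholders; the application catches SQLSTATE
 40001, serialization_failure, and re-runs the whole transaction):

 BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
 -- check-then-insert as usual; per the approach described above, a conflicting
 -- concurrent insert is expected to surface as a serialization failure rather
 -- than a unique violation
 INSERT INTO my_table (key_col)
 SELECT 1 WHERE NOT EXISTS (SELECT 1 FROM my_table WHERE key_col = 1);
 COMMIT;  -- on SQLSTATE 40001, roll back and retry the whole transaction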

  Second, in some cases we've done a lot of work on a transaction before
 hitting a unique constraint violation.
  If this happens the entire transaction gets rolled back and all the work
 leading up to the constraint violation has to be redone.
 
  As a work around for this, I have proposed the following function which
 will execute an arbitrary query (with no results returned) and catch a
 unique constraint violation without causing errors to show up in the
 postgres log or the transaction to get rolled back.  Now what the code does
 is to call this function with the query to execute.  If a unique constraint
 violation happens, the caller gets a non-zero return code back and this can
 be handled but the DB transaction does not get rolled back.
 

 The approach we took wouldn't solve this problem for you.  Serialization
 failures also require restarting the transaction.


 You could also consider savepoints as a more lightweight way of undoing
 the bad updates --
 http://www.postgresql.org/docs/9.2/interactive/sql-savepoint.html
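
 Roughly, that pattern looks like this (a sketch; table and column names are
 placeholders):

 BEGIN;
 -- ... a lot of earlier work in the transaction ...
 SAVEPOINT before_insert;
 INSERT INTO my_table (key_col) VALUES (1);   -- may raise a unique violation
 -- if the INSERT failed, the client issues this to keep the earlier work:
 ROLLBACK TO SAVEPOINT before_insert;
 -- ... and the transaction can continue and eventually COMMIT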


 Sounds like you might have a slightly different use case and our approach
 won't help you much, but I figured I'd mention it.