[PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Reuven M. Lerner
Hi, everyone.  I'm maintaining an application that exists as a black 
box in manufacturing plants.  The system is based on Windows, .NET, and 
PostgreSQL 8.3.  I'm a Unix kind of guy myself, but the application 
layer and system administration are being handled by other people; I'm 
just the PostgreSQL guy.


Because of the nature of the application, we don't have direct control 
over what happens.  And it turns out that at one installation, we're 
quickly running out of disk space.  The database is already taking up 
about 200 GB of space, and is growing by 1 GB or so a day.  Switching 
disks, either to a larger/faster traditional drive, or even to a SSD, is 
not an option.  (And yes, I know that SSDs have their own risks, but I'm 
just throwing that out as one option.)


Right now, the best solution to the space problem is to delete 
information associated with old records, where "old" is from at least 30 
days ago.  The old records are spread across a few tables, including 
many large objects.  (The application was written by people who were new 
to PostgreSQL, and didn't realize that they could use BYTEA.) 
Basically, given a foreign key B.a_id that points to table A, I want to 
DELETE all in B where A's creation date is at least 30 days ago.


Unfortunately, when we implemented this simple delete, it executed 
slower than molasses, taking about 9 hours to do its thing.   Not only 
does this seem like a really, really long time to do such deleting, but 
we have only a 4-hour window in which to run this maintenance activity, 
before the factory starts to use our black box again.


I've tried a few approaches so far, none of which have been hugely 
successful.  The fact that it takes several hours to test each theory is 
obviously a bit of a pain, and so I'm curious to hear suggestions from 
people here.


I should note that my primary concern is available RAM.  The database, 
as I wrote, is about 200 GB in size, and PostgreSQL is reporting 
(according to Windows) use of about 5 GB RAM, plus another 25 GB of 
virtual memory.  I've told the Windows folks on this project that 
virtual memory kills a database, and that it shouldn't surprise us to 
have horrible performance if the database and operating system are both 
transferring massive amounts of data back and forth.  But there doesn't 
seem to be a good way to handle this.


This is basically what I'm trying to execute:

DELETE FROM B
WHERE r_id IN (SELECT R.id
                 FROM R, B
                WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                  AND r.id = b.r_id);

(1) I tried to write this as a join, rather than a subselect.  But B has 
an oid column that points to large objects, and on which we have a rule 
that removes the associated large object when a row in B is removed. 
Doing the delete as a join resulted in "no such large object with an oid 
of xxx" errors.  (I'm not sure why, although it might have to do with 
the rule.)


(2) I tried to grab the rows that *do* interest me, put them into a 
temporary table, TRUNCATE the existing table, and then copy the rows 
back.   I only tested that with a 1 GB subset of the data, but that took 
longer than other options.


(3) There are some foreign-key constraints on the B table.  I thought 
that perhaps doing a mass DELETE was queueing up all of those 
constraints, and possibly using up lots of memory and/or taking a long 
time to execute.  I thus rewrote my queries such that they first removed 
the constraints, then executed the DELETE, and then restored the 
constraints.  That didn't seem to improve things much either, and took a 
long time (30 minutes) just to remove the constraints.  I expected 
re-adding the constraints to take a while, but shouldn't removing them 
be relatively quick?


(4) I tried chunking the deletes, such that instead of trying to 
delete all of the records from the B table, I would instead delete just 
those associated with 100 or 200 rows from the R table.  On a 1 GB 
subset of the data, this seemed to work just fine.  But on the actual 
database, it was still far too slow.


I've been surprised by the time it takes to delete the records in 
question.  I keep trying to tell the others on this project that 
PostgreSQL isn't inherently slow, but that a 200 GB database running on 
a non-dedicated machine, with an old version (8.3), and while it's 
swapping RAM, will be slow regardless of the database software we're 
using.  But even so, 9 hours to delete 100 GB of data strikes me as a 
very long process.


Again, I continue to believe that given our hard deadlines, and the 
fact that we're using a large amount of virtual memory, there isn't 
really a solution that will work quickly and easily.  But I'd be 
delighted to be wrong, and welcome any and all comments and suggestions 
for how to deal with this.


Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner


Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Glyn Astill
Do you have any more detailed information about the hardware, what sort of disk 
configuration does it have?

Can you get onto the machine to look at what is using those resources?  You 
mention the 25 GB of virtual memory; is that being used?  If so, is it being used 
by postgres or something else?  If it's being used by postgres you should 
change postgresql.conf to work within your 5 GB; otherwise, can you stop the 
other applications to do your delete?  A snapshot from task manager or process 
monitor of process resource usage would be useful, even better some logging from 
perfmon including physical disk usage.


What would be even more useful is the table definitions; you mention trying to 
drop constraints to speed it up but is there anything else at play, e.g. 
triggers?




 From: Reuven M. Lerner reu...@lerner.co.il
To: pgsql-performance@postgresql.org 
Sent: Thursday, 23 February 2012, 8:39
Subject: [PERFORM] Very long deletion time on a 200 GB database
 


Hi, everyone.  I'm maintaining an application that exists as a black box in 
manufacturing plants.  The system is based on Windows, .NET, and PostgreSQL 
8.3.  I'm a Unix kind of guy myself, but the application layer and system 
administration are being handled by other people; I'm just the PostgreSQL guy.

Because of the nature of the application, we don't have direct control over 
what happens.  And it turns out that at one installation, we're quickly 
running out of disk space.  The database is already taking up about 200 GB of 
space, and is growing by 1 GB or so a day.  Switching disks, either to a 
larger/faster traditional drive, or even to a SSD, is not an option.  (And 
yes, I know that SSDs have their own risks, but I'm just throwing that out as 
one option.)

Right now, the best solution to the space problem is to delete information 
associated with old records, where "old" is from at least 30 days ago.  The 
old records are spread across a few tables, including many large objects.  
(The application was written by people who were new to PostgreSQL, and didn't 
realize that they could use BYTEA.) Basically, given a foreign key B.a_id that 
points to table A, I want to DELETE all in B where A's creation date is at 
least 30 days ago.

Unfortunately, when we implemented this simple delete, it executed slower than 
molasses, taking about 9 hours to do its thing.   Not only does this seem like 
a really, really long time to do such deleting, but we have only a 4-hour 
window in which to run this maintenance activity, before the factory starts to 
use our black box again.

I've tried a few approaches so far, none of which have been hugely 
successful.  The fact that it takes several hours to test each theory is 
obviously a bit of a pain, and so I'm curious to hear suggestions from people 
here.

I should note that my primary concern is available RAM.  The database, as I 
wrote, is about 200 GB in size, and PostgreSQL is reporting (according to 
Windows) use of about 5 GB RAM, plus another 25 GB of virtual memory.  I've 
told the Windows folks on this project that virtual memory kills a database, 
and that it shouldn't surprise us to have horrible performance if the database 
and operating system are both transferring massive amounts of data back and 
forth.  But there doesn't seem to be a good way to handle this.

This is basically what I'm trying to execute:

DELETE FROM B
WHERE r_id IN (SELECT R.id
                 FROM R, B
                WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                  AND r.id = b.r_id);

(1) I tried to write this as a join, rather than a subselect.  But B has an 
oid column that points to large objects, and on which we have a rule that 
removes the associated large object when a row in B is removed. Doing the 
delete as a join resulted in "no such large object with an oid of xxx" 
errors.  (I'm not sure why, although it might have to do with the rule.)

(2) I tried to grab the rows that *do* interest me, put them into a temporary 
table, TRUNCATE the existing table, and then copy the rows back.   I only 
tested that with a 1 GB subset of the data, but that took longer than other 
options.

(3) There are some foreign-key constraints on the B table.  I thought that 
perhaps doing a mass DELETE was queueing up all of those constraints, and 
possibly using up lots of memory and/or taking a long time to execute.  I thus 
rewrote my queries such that they first removed the constraints, then executed 
the DELETE, and then restored the constraints.  That didn't seem to improve 
things much either, and took a long time (30 minutes) just to remove the 
constraints.  I expected re-adding the constraints to take a while, but 
shouldn't removing them be relatively quick?

(4) I tried chunking the deletes, such that instead of trying to delete all 
of the records from the B table, I would instead delete just those associated 
with 100 or 200 rows from the R table.  On a 1 GB subset of the data, this 
seemed to work 

Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Marcin Mańk
 DELETE FROM B
 WHERE r_id IN (SELECT R.id
                  FROM R, B
                 WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                   AND r.id = b.r_id);


How about:

 DELETE FROM B
 WHERE r_id IN (SELECT distinct R.id
                  FROM R
                 WHERE r.end_date < (NOW() - (interval '1 day' * 30)));

?

Greetings
Marcin



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Richard Huxton

On 23/02/12 08:39, Reuven M. Lerner wrote:
(4) I tried chunking the deletes, such that instead of trying to 
delete all of the records from the B table, I would instead delete 
just those associated with 100 or 200 rows from the R table.  On a 1 
GB subset of the data, this seemed to work just fine.  But on the 
actual database, it was still far too slow.


This is the approach I'd take. You don't have enough control / access to 
come up with a better solution. Build a temp table with 100 ids to 
delete. Time that, and then next night you can increase to 200 etc until 
it takes around 3 hours.
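
Something along these lines (untested sketch; table and column names follow 
the ones in the original post):

BEGIN;
-- Grab a small batch of expired parent ids; the batch size can grow night by night
CREATE TEMPORARY TABLE ids_to_delete AS
    SELECT id
      FROM R
     WHERE end_date < NOW() - interval '30 days'
     LIMIT 100;

DELETE FROM B
 WHERE r_id IN (SELECT id FROM ids_to_delete);

DROP TABLE ids_to_delete;
COMMIT;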


Oh - and get the Windows admins to take a look at disk activity - the 
standard performance monitor can tell you more than enough. If it is 
swapping constantly, performance will be atrocious but even if the disks 
are just constantly busy then updates and deletes can be very slow.


--
  Richard Huxton
  Archonet Ltd




[PERFORM] : Cost calculation for EXPLAIN output

2012-02-23 Thread Venkat Balaji
Hello,

I am trying to understand the analysis behind the cost attribute in
EXPLAIN output.

postgres = # explain select * from table_event where seq_id=8520960;


QUERY PLAN
-------------------------------------------------------------------------------
 Index Scan using te_pk on table_event  (cost=0.00..13.88  rows=1  width=62)
   Index Cond: (sequence_id = 8520960)

The cost is 13.88 to fetch 1 row by scanning a primary-key-indexed
column.

Isn't the cost of fetching 1 row too high?

On the same table, the cost calculation for scanning the full table
looks justified --

postgres=# explain select * from table_event;

  QUERY PLAN

 Seq Scan on table_event  (cost=0.00..853043.44 rows=38679544 width=62)
(1 row)

(disk pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost) = (466248 *
1) + (38679544 * 0.01) = 853043.44

By the way below are the details -

Version - Postgres-9.0

Table size is 3643 MB
Including indexes, the size is 8898 MB

I am looking for a way to reduce cost as much as possible because the query
executes 10+ times a day.

Any thoughts ?

Thanks,
VB


[PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
I have a database where I virtually never delete and almost never do
updates. (The updates might change in the future but for now it's okay to
assume they never happen.) As such, it seems like it might be worth it to
set autovacuum=off or at least make it so vacuuming hardly ever occurs.
Actually, the latter is probably the more robust solution, though I don't
know how to do that (hence me writing this list). I did try turning
autovacuum off but got:

ERROR: parameter autovacuum cannot be changed now
SQL state: 55P02
Not sure what, if anything, I can do about that.

Thanks,
-Alessandro


Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Thom Brown
On 22 February 2012 23:50, Alessandro Gagliardi alessan...@path.com wrote:
 I have a database where I virtually never delete and almost never do
 updates. (The updates might change in the future but for now it's okay to
 assume they never happen.) As such, it seems like it might be worth it to
 set autovacuum=off or at least make it so vacuuming hardly ever occurs.
 Actually, the latter is probably the more robust solution, though I don't
 know how to do that (hence me writing this list). I did try turning
 autovacuum off but got:

 ERROR: parameter autovacuum cannot be changed now
 SQL state: 55P02

 Not sure what, if anything, I can do about that.

Autovacuum is controlled by how much of a table has changed, so if a
table never changes, it never gets vacuumed (with the exceptional case
being a forced vacuum freeze to mitigate the transaction id
wrap-around issue).  The settings which control this are
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
Therefore it isn't necessary to disable autovacuum.
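
If you do want to tune rather than disable it, on recent versions those 
thresholds can also be set per table as storage parameters; a sketch, where 
"my_big_table" is just a placeholder name:

-- Make autovacuum much less eager on one specific table
ALTER TABLE my_big_table SET (
    autovacuum_vacuum_threshold    = 100000,
    autovacuum_vacuum_scale_factor = 0.4
);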

But if you are adamant about disabling it, you need to change it in
your postgresql.conf file and restart the server.

-- 
Thom



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Claudio Freire
On Thu, Feb 23, 2012 at 5:39 AM, Reuven M. Lerner reu...@lerner.co.il wrote:
 Unfortunately, when we implemented this simple delete, it executed slower
 than molasses, taking about 9 hours to do its thing.   Not only does this
 seem like a really, really long time to do such deleting, but we have only a
 4-hour window in which to run this maintenance activity, before the factory
 starts to use our black box again.

PG 8.3 had horrible hash joins for big tables, so you might try set
enable_hashjoin=false prior to your query. I suspect this is not your
biggest problem, though...
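
Something like this, just for the session running the delete (a sketch using 
the tables from the original post):

SET enable_hashjoin = off;   -- affects only this session
EXPLAIN
DELETE FROM B
 WHERE r_id IN (SELECT R.id
                  FROM R
                 WHERE R.end_date < NOW() - interval '30 days');
RESET enable_hashjoin;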

 I should note that my primary concern is available RAM.  The database, as I
 wrote, is about 200 GB in size, and PostgreSQL is reporting (according to
 Windows) use of about 5 GB RAM, plus another 25 GB of virtual memory.

You really *have* to look into that situation. 25GB of *active*
virtual memory? That would mean a thrashing server, and an utterly
unresponsive one from my experience. Your data is probably wrong,
because if I had a server *using* 30G of RAM only 5 of which are
physical, I wouldn't even be able to open a remote desktop to it.

I bet your numbers are wrong. In any case, you have to look into the
matter. Any amount of swapping will kill performance for postgres,
throwing plans out of whack, turning sequential I/O into random I/O, a
mess.

 I've told the Windows folks on this project that virtual memory kills a database,
 and that it shouldn't surprise us to have horrible performance if the
 database and operating system are both transferring massive amounts of data
 back and forth.  But there doesn't seem to be a good way to handle this

There is, tune postgresql.conf to use less memory.

 This is basically what I'm trying to execute:

 DELETE FROM B
 WHERE r_id IN (SELECT R.id
                  FROM R, B
                 WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                   AND r.id = b.r_id);

DELETE is way faster when you have no constraints, no primary key, no indices.

If you could stop all database use in that 4-hour period, it's
possible dropping indices, FKs and PK, delete, and recreating the
indices/FKs/PK will run fast enough.
You'll have to test that on some test server though...
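
A rough outline of that approach; the constraint, index, and column names here 
are made up and would need to match the real schema:

BEGIN;
-- Invented names: use the real ones from \d B
ALTER TABLE B DROP CONSTRAINT b_r_id_fkey;
ALTER TABLE B DROP CONSTRAINT b_pkey;
DROP INDEX b_r_id_idx;

DELETE FROM B
 WHERE r_id IN (SELECT R.id
                  FROM R
                 WHERE R.end_date < NOW() - interval '30 days');

ALTER TABLE B ADD CONSTRAINT b_pkey PRIMARY KEY (id);
ALTER TABLE B ADD CONSTRAINT b_r_id_fkey
    FOREIGN KEY (r_id) REFERENCES R (id);
CREATE INDEX b_r_id_idx ON B (r_id);
COMMIT;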

 (3) There are some foreign-key constraints on the B table.  I thought that
 perhaps doing a mass DELETE was queueing up all of those constraints, and
 possibly using up lots of memory and/or taking a long time to execute.  I
 thus rewrote my queries such that they first removed the constraints, then
 executed the DELETE, and then restored the constraints.

That's not enough. As I said, you have to drop the PK and all indices too.

 That didn't seem to
 improve things much either, and took a long time (30 minutes) just to remove
 the constraints.  I expected re-adding the constraints to take a while, but
 shouldn't removing them be relatively quick?

That means your database is locked (a lot of concurrent access), or
thrashing, because dropping a constraint is a very quick task.
If you stop your application, I'd expect dropping constraints and
indices to take almost no time.

 (4) I tried chunking the deletes, such that instead of trying to delete
 all of the records from the B table, I would instead delete just those
 associated with 100 or 200 rows from the R table.  On a 1 GB subset of the
 data, this seemed to work just fine.  But on the actual database, it was
 still far too slow.

Check the hash join thing.
Check/post an explain of the delete query, to see if it uses hash
joins, and which tables are hash-joined. If they're big ones, 8.3 will
perform horribly.

 I've been surprised by the time it takes to delete the records in question.
  I keep trying to tell the others on this project that PostgreSQL isn't
 inherently slow, but that a 200 GB database running on a non-dedicated
 machine, with an old version (8.3), and while it's swapping RAM, will be
 slow regardless of the database software we're using.  But even so, 9 hours
 to delete 100 GB of data strikes me as a very long process.

Deletes in MVCC are more like updates. It's a complex procedure to
make them transactional; that's why TRUNCATE is so much faster.



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Andrew Dunstan



On 02/23/2012 05:07 AM, Marcin Mańk wrote:

DELETE FROM B
WHERE r_id IN (SELECT R.id
                 FROM R, B
                WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                  AND r.id = b.r_id);


How about:

  DELETE FROM B
  WHERE r_id IN (SELECT distinct R.id
                   FROM R
                  WHERE r.end_date < (NOW() - (interval '1 day' * 30)));

?



Or possibly without the DISTINCT.  But I agree that the original query 
shouldn't have B in the subquery - that alone could well make it crawl.


What is the distribution of end_dates? It might be worth running this in 
several steps, deleting records older than, say, 90 days, 60 days, 30 days.
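
For example (untested; placeholder names as in the earlier queries), working 
from the oldest band forward so each pass stays smaller:

DELETE FROM B WHERE r_id IN
    (SELECT id FROM R WHERE end_date < NOW() - interval '90 days');
DELETE FROM B WHERE r_id IN
    (SELECT id FROM R WHERE end_date < NOW() - interval '60 days');
DELETE FROM B WHERE r_id IN
    (SELECT id FROM R WHERE end_date < NOW() - interval '30 days');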


cheers

andrew



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Andy Colson

On 2/23/2012 6:34 AM, Thom Brown wrote:

On 22 February 2012 23:50, Alessandro Gagliardialessan...@path.com  wrote:

I have a database where I virtually never delete and almost never do
updates. (The updates might change in the future but for now it's okay to
assume they never happen.) As such, it seems like it might be worth it to
set autovacuum=off or at least make it so vacuuming hardly ever occurs.
Actually, the latter is probably the more robust solution, though I don't
know how to do that (hence me writing this list). I did try turning
autovacuum off but got:

ERROR: parameter autovacuum cannot be changed now
SQL state: 55P02

Not sure what, if anything, I can do about that.


Autovacuum is controlled by how much of a table has changed, so if a
table never changes, it never gets vacuumed (with the exceptional case
being a forced vacuum freeze to mitigate the transaction id
wrap-around issue).  The settings which control this are
autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
Therefore it isn't necessary to disable autovacuum.

But if you are adamant about disabling it, you need to change it in
your postgresql.conf file and restart the server.



Agreed, don't disable autovacuum.  It's not that demanding, and if you 
do need it and forget to run it, it might cause you more problems.


I have a db that's on a VM that doesn't get hit very much.  I've noticed 
IO is a little busy (we are talking fractions of a percent) but still 
more than I thought should be happening on a db with next 
to no usage.


I found setting autovacuum_naptime = 6min made the IO all but vanish.

And if I ever get a wild hair and blow some stuff away, the db will 
clean up after me.


-Andy



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Reuven M. Lerner
Hi, everyone.  Thanks for all of the help and suggestions so far; I'll 
try to respond to some of them soon.  Andrew wrote:



How about:

DELETE FROM B
WHERE r_id IN (SELECT distinct R.id
                 FROM R
                WHERE r.end_date < (NOW() - (interval '1 day' * 30)));

?



Or possibly without the DISTINCT. But I agree that the original query
shouldn't have B in the subquery - that alone could well make it crawl.


I put B in the subquery so as to reduce the number of rows that would be 
returned, but maybe that was indeed backfiring on me.  Now that I think 
about it, B is a huge table, and R is a less-huge one, so including B in 
the subselect was probably a mistake.




What is the distribution of end_dates? It might be worth running this in
several steps, deleting records older than, say, 90 days, 60 days, 30 days.


I've suggested something similar, but was told that we have limited time 
to execute the DELETE, and that doing it in stages might not be possible.


Reuven


--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread k...@rice.edu
On Thu, Feb 23, 2012 at 05:25:46PM +0200, Reuven M. Lerner wrote:
 
 What is the distribution of end_dates? It might be worth running this in
 several steps, deleting records older than, say, 90 days, 60 days, 30 days.
 
 I've suggested something similar, but was told that we have limited
 time to execute the DELETE, and that doing it in stages might not be
 possible.
 
 Reuven
 

In cases like this, I have often found that doing the delete in smaller
pieces goes faster, sometimes much faster, than the bigger delete.

Regards,
Ken



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Steve Crawford

On 02/23/2012 12:39 AM, Reuven M. Lerner wrote:

Hi, everyone...
This is basically what I'm trying to execute:

DELETE FROM B
WHERE r_id IN (SELECT R.id
                 FROM R, B
                WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                  AND r.id = b.r_id);


I don't recall which versions like which approach, but have you tried 
"... WHERE EXISTS (SELECT ..." instead of "WHERE ... IN"? Depending on the version 
of PostgreSQL, one or the other may yield a superior result.
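
An untested sketch of the EXISTS form, using the same table names:

DELETE FROM B
 WHERE EXISTS (SELECT 1
                 FROM R
                WHERE R.id = B.r_id
                  AND R.end_date < NOW() - interval '30 days');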



(2) I tried to grab the rows that *do* interest me, put them into a 
temporary table, TRUNCATE the existing table, and then copy the rows 
back.   I only tested that with a 1 GB subset of the data, but that 
took longer than other options.




Was the 1GB subset the part you were keeping or the part you were 
deleting? Which part was slow (creating the temp table or copying it back)?


Try running EXPLAIN on the SELECT query that creates the temporary table 
and try to optimize that. Also, when copying the data back, you are 
probably having to deal with index and foreign keys maintenance. It will 
probably be faster to drop those, copy the data back then recreate them.


I know you are a *nix-guy in a Windows org so your options are limited, 
but word-on-the-street is that for high-performance production use, 
install PostgreSQL on *nix.


Cheers,
Steve




Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Shaun Thomas

On 02/23/2012 02:39 AM, Reuven M. Lerner wrote:


I should note that my primary concern is available RAM. The database, as
I wrote, is about 200 GB in size, and PostgreSQL is reporting (according
to Windows) use of about 5 GB RAM, plus another 25 GB of virtual memory.


O_o

That... that would probably swap just constantly. No end. Just swap all 
day long. But maybe not. Please tell us the values for these settings:


* shared_buffers
* work_mem
* maintenance_work_mem
* checkpoint_segments
* checkpoint_timeout

It also wouldn't be a bad idea to see how many concurrent connections 
there are, because that may determine how much memory all the backends 
are consuming. In any case, if it's actually using 25GB of virtual 
memory, any command you run that doesn't happen to be in cache, will 
just immediately join a giant logjam.
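
If it's easier, something like this run through psql should pull it all at 
once (pg_settings and pg_stat_activity are both available on 8.3):

-- The settings listed above, plus the current connection count
SELECT name, setting
  FROM pg_settings
 WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                'checkpoint_segments', 'checkpoint_timeout');

SELECT count(*) AS connections FROM pg_stat_activity;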



I've told the Windows folks on this project that virtual memory kills a
database, and that it shouldn't surprise us to have horrible performance
if the database and operating system are both transferring massive
amounts of data back and forth. But there doesn't seem to be a good way
to handle this


You kinda can, by checking those settings and sanitizing them. If 
they're out of line, or too large, they'll create the need for more 
virtual memory. Having the virtual memory there isn't necessarily bad, 
but using it is.



DELETE FROM B
WHERE r_id IN (SELECT R.id
                 FROM R, B
                WHERE r.end_date < (NOW() - (interval '1 day' * 30))
                  AND r.id = b.r_id);


Just to kinda help you out syntactically, have you ever tried a DELETE 
FROM ... USING? You can also collapse your interval notation.


DELETE FROM B
 USING R
 WHERE R.id = B.r_id
   AND R.end_date < CURRENT_DATE - INTERVAL '30 days';

But besides that, the other advice you've received is sound. Since your 
select-truncate-insert attempt was also slow, I suspect you're having 
problems with foreign key checks, and updating the index trees. 
Maintaining an existing index can be multiples slower than filling an 
empty table and creating the indexes afterwards.


So far as your foreign keys, if any of the child tables don't have an 
index on the referring column, your delete performance will be 
atrocious. You also need to make sure the types of the columns are 
identical. Even a numeric/int difference will be enough to render an 
index unusable.
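
For example, something like this would add the missing index and list the 
foreign keys that point at B ("child_table" and "b_id" are placeholders):

-- Each table that references B needs an index on its referring column
CREATE INDEX CONCURRENTLY child_table_b_id_idx ON child_table (b_id);

-- List the foreign keys pointing at B, to know which columns to check
SELECT conname,
       conrelid::regclass  AS referencing_table,
       confrelid::regclass AS referenced_table
  FROM pg_constraint
 WHERE contype = 'f'
   AND confrelid = 'b'::regclass;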


We have a 100GB *table* with almost 200M rows and even deleting from 
that in many of our archive tests doesn't take anywhere near 9 hours. 
But I *have* seen a delete take that long when we had a numeric primary 
key, and an integer foreign key. Even a handful of records can cause a 
nested loop sequence scan, which will vastly inflate delete time.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@peak6.com




Re: [PERFORM] : Cost calculation for EXPLAIN output

2012-02-23 Thread Kevin Grittner
Venkat Balaji venkat.bal...@verse.in wrote:
 

 The cost is 13.88 to fetch 1 row by scanning an Primary Key
 indexed column.
 
 Isn't the cost for fetching 1 row is too high ?
 
I don't know, how many index pages will need to be randomly accessed
in addition to the random heap access?  How many dead versions of
the row will need to be visited besides the row which is actually
visible?  How many of these pages are in shared_buffers?  How many
of these pages are in OS cache?
 
 I am looking for a way to reduce cost as much as possible because
 the query executes 10+ times a day.
 
Well, you can reduce the cost all you want by dividing all of the
costing factors in postgresql.conf by the same value, but that won't
affect query run time.  That depends on the query plan which is
chosen.  The cost is just an abstract number used for comparing the
apparent resources needed to run a query through each of the
available plans.  What matters is that the cost factors accurately
reflect the resources used; if not you should adjust them.
 
If you calculate a ratio between run time and estimated cost, you
should find that it remains relatively constant (like within an
order of magnitude) for various queries.  Since you didn't show
actual run times, we can't tell whether anything needs adjustment.
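
For example, EXPLAIN ANALYZE gives both numbers for the same statement:

-- Shows the estimated cost and the actual execution time side by side
EXPLAIN ANALYZE SELECT * FROM table_event WHERE seq_id = 8520960;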
 
-Kevin



Re: [PERFORM] : Cost calculation for EXPLAIN output

2012-02-23 Thread Shaun Thomas

On 02/23/2012 06:21 AM, Venkat Balaji wrote:


The cost is 13.88 to fetch 1 row by scanning an Primary Key
indexed column.

Isn't the cost for fetching 1 row is too high ?


Not really. The cost is really just an estimate to rank alternate 
query plans so the database picks the least expensive plan. The number 
'13.88' is basically meaningless. It doesn't translate to any real-world 
equivalent. What you actually care about is the execution time. If it 
takes 0.25ms or something per row, that's what really matters.


For what it's worth, it looks like you have the right query plan, there. 
Scan the primary key for one row. What's wrong with that? Our systems 
have tables far larger than yours, handling 300M queries per day that 
are far more expensive than a simple primary key index scan. You'll be 
fine. :)


I suggest you set log_min_duration_statement to something like 1000, to 
send any query that takes longer than 1 second to the PG logs. 
Concentrate on those queries, because ones like this are already working 
right.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@peak6.com




Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
I should have been more clear. I virtually never delete or do updates, but
I insert *a lot*. So the table does change quite a bit, but only in one
direction.

I was unable to disable autovacuum universally (due to the
cant_change_runtime_param error) but I was able to disable it on individual
tables. Still, I know this is heavy handed and sub-optimal. I tried set
autovacuum_naptime='6min' but got the same 55P02 error. Should/can I set
that per table?

I did look at autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor
but couldn't make sense out of them. (Besides, I'd probably get the
same 55P02 error if I tried to change them.)

On Thu, Feb 23, 2012 at 7:18 AM, Andy Colson a...@squeakycode.net wrote:

 On 2/23/2012 6:34 AM, Thom Brown wrote:

 On 22 February 2012 23:50, Alessandro Gagliardialessan...@path.com
  wrote:

 I have a database where I virtually never delete and almost never do
 updates. (The updates might change in the future but for now it's okay to
 assume they never happen.) As such, it seems like it might be worth it to
 set autovacuum=off or at least make it so vacuuming hardly ever occurs.
 Actually, the latter is probably the more robust solution, though I don't
 know how to do that (hence me writing this list). I did try turning
 autovacuum off but got:

 ERROR: parameter autovacuum cannot be changed now
 SQL state: 55P02

 Not sure what, if anything, I can do about that.


 Autovacuum is controlled by how much of a table has changed, so if a
 table never changes, it never gets vacuumed (with the exceptional case
 being a forced vacuum freeze to mitigate the transaction id
 wrap-around issue).  The settings which control this are
 autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
 Therefore it isn't necessary to disable autovacuum.

 But if you are adamant about disabling it, you need to change it in
 your postgresql.conf file and restart the server.


 Agreed, don't disable autovacuum.  It's not that demanding, and if you do
 need it and forget to run it, it might cause you more problems.

 I have a db that's on a VM that doesnt get hit very much.  I've noticed IO
 is a little busy (we are talking small percents of percents less than one)
 but still more that I thought should be happening on a db with next to no
 usage.

 I found setting autovacuum_naptime = 6min made the IO all but vanish.

 And if I ever get a wild hair and blow some stuff away, the db will clean
 up after me.

 -Andy



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Thom Brown
On 23 February 2012 17:35, Alessandro Gagliardi alessan...@path.com wrote:
 I should have been more clear. I virtually never delete or do updates, but I
 insert a lot. So the table does change quite a bit, but only in one
 direction.

The same thing applies.  VACUUM cleans up dead tuples, which INSERTs
don't create; only DELETEs and UPDATEs do.

-- 
Thom



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Greg Spiegelberg
On Thu, Feb 23, 2012 at 8:25 AM, Reuven M. Lerner reu...@lerner.co.il wrote:


 I've suggested something similar, but was told that we have limited time
 to execute the DELETE, and that doing it in stages might not be possible.


Just so happens I had this exact problem last week on a rather large table.
 * DELETE FROM x WHERE id IN (SELECT id FROM y ...) was horrible at best.
 * DELETE FROM x USING y WHERE ... was nearly as bad
Both of the above were taking hours and looking more like it would stretch
into days.

What I discovered, however, was this little technique that I'm sure all the
Pg gods will smite me for using; however, it worked for me.

BEGIN;
LOCK TABLE x IN SHARE UPDATE EXCLUSIVE MODE;  -- to prevent VACUUMs
SELECT x.ctid INTO TEMPORARY TABLE recs_to_delete FROM x,y WHERE x.id=y.id;
DELETE FROM x USING recs_to_delete r WHERE r.ctid=x.ctid;
COMMIT;

I know there are perils in using ctid but with the LOCK it should be safe.
 This transaction took perhaps 30 minutes and removed 100k rows and once
the table was VACUUM'd afterward it freed up close to 20 GB on the file
system.

HTH
-Greg


Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
Hm. Okay, so just to be perfectly clear, my database with all its INSERTs,
but no DELETEs or UPDATEs should not be VACUUMing anyway, so disabling
auto-vacuum is redundant (and possibly hazardous).

FWIW, I did notice a speed increase after disabling auto-vacuum on several
of my tables though that could have been a coincidence. Is there any way
that these tables could have been getting vacuumed (or some thing else)
despite the fact that they are not receiving updates or deletes? Or must
that have been a coincidence?

While we're on the subject, I welcome any pointers with regard to tuning a
database that is being used in this way. Any cache sizes I should be
messing with? Etc.

Thank you,
-Alessandro

On Thu, Feb 23, 2012 at 9:45 AM, Thom Brown t...@linux.com wrote:

 On 23 February 2012 17:35, Alessandro Gagliardi alessan...@path.com
 wrote:
  I should have been more clear. I virtually never delete or do updates,
 but I
  insert a lot. So the table does change quite a bit, but only in one
  direction.

 The same thing applies.  VACUUM cleans up dead tuples, which INSERTs
 don't create, only DELETE and UPDATEs do.

 --
 Thom



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Steve Crawford

On 02/23/2012 09:35 AM, Alessandro Gagliardi wrote:
I should have been more clear. I virtually never delete or do updates, 
but I insert /a lot/. So the table does change quite a bit, but only 
in one direction.


I was unable to disable autovacuum universally (due to the 
cant_change_runtime_param error) but I was able to disable it on 
individual tables. Still, I know this is heavy handed and sub-optimal. 
I tried set autovacuum_naptime='6min' but got the same 55P02 error. 
Should/can I set that per table?


I did look at autovacuum_vacuum_threshold 
and autovacuum_vacuum_scale_factor but couldn't make sense out of 
them. (Besides, I'd probably get the same 55P02 error if I tried to 
change them.)


See:
http://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html

The documentation has information like "This parameter can only be set 
in the postgresql.conf file or on the server command line." that will 
tell you in advance which settings will fail when you attempt to set 
them through SQL statements.


But autovacuum is pretty smart about not vacuuming tables until 
reasonably necessary. And beware that autovacuum is also controlling 
when to analyze a table. Mass inserts are probably changing the 
characteristics of your table such that it needs to be analyzed to allow 
the planner to properly optimize your queries.


Have you identified that vacuum is actually causing a problem? If not, 
I'd leave it alone. The system tables have a lot of information on table 
vacuuming and analyzing:


select
relname,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze,
vacuum_count,
autovacuum_count,
analyze_count,
autoanalyze_count
from
   pg_stat_user_tables;

Cheers,
Steve



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Shaun Thomas

On 02/23/2012 11:56 AM, Greg Spiegelberg wrote:


I know there are perils in using ctid but with the LOCK it should be
safe. This transaction took perhaps 30 minutes and removed 100k rows
and once the table was VACUUM'd afterward it freed up close to 20 GB
on the file system.


It took *30 minutes* to delete 100k rows? And 100k rows were using 20GB? 
Is that off by an order of magnitude?


Using the ctid is a cute trick, though. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@peak6.com




Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Andy Colson

On 2/23/2012 12:05 PM, Shaun Thomas wrote:

On 02/23/2012 11:56 AM, Greg Spiegelberg wrote:


I know there are perils in using ctid but with the LOCK it should be
safe. This transaction took perhaps 30 minutes and removed 100k rows
and once the table was VACUUM'd afterward it freed up close to 20 GB
on the file system.


It took *30 minutes* to delete 100k rows? And 100k rows were using 20GB?
Is that off by an order of magnitude?

Using the ctid is a cute trick, though. :)



And I'm not sure the LOCK is necessary; while googling for "delete from 
table limit 10" I ran across this thread:


http://archives.postgresql.org/pgsql-hackers/2010-11/msg02028.php

They use it without locks.

-Andy



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Greg Spiegelberg
On Thu, Feb 23, 2012 at 11:11 AM, Andy Colson a...@squeakycode.net wrote:

 On 2/23/2012 12:05 PM, Shaun Thomas wrote:

 On 02/23/2012 11:56 AM, Greg Spiegelberg wrote:

  I know there are perils in using ctid but with the LOCK it should be
 safe. This transaction took perhaps 30 minutes and removed 100k rows
 and once the table was VACUUM'd afterward it freed up close to 20 GB
 on the file system.


 It took *30 minutes* to delete 100k rows? And 100k rows were using 20GB?
 Is that off by an order of magnitude?

 Using the ctid is a cute trick, though. :)


 And I'm not sure the LOCK is necessary, while googling for delete from
 table limit 10 I ran across this thread:

 http://archives.postgresql.org/pgsql-hackers/2010-11/msg02028.php

 They use it without locks.


I used LOCK simply because if a VACUUM FULL x; slipped in between the
SELECT and the DELETE the ctid's could conceivably change.

-Greg


Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Tom Lane
Greg Spiegelberg gspiegelb...@gmail.com writes:
 I used LOCK simply because if a VACUUM FULL x; slipped in between the
 SELECT and the DELETE the ctid's could conceivably change.

VACUUM FULL can't slip in there, because you'd have AccessShareLock
just from the SELECT.  The real problem goes like this:

1. You SELECT some ctid and save it in the other table.
2. Somebody else updates or deletes that row.
3. Plain VACUUM comes along and frees the dead TID.
4. Somebody else (maybe not same somebody as #2) inserts a new
   row at that TID position.
5. You DELETE that TID.  Ooops.

So you might say okay, the point of the lock is to block plain vacuum,
not vacuum full.  I'm still a bit worried about whether the technique
is entirely safe, though, because of page pruning which can happen
anyway.  What this really boils down to is: how sure are you that no
other userland activity is going to update or delete any of the targeted
rows between the SELECT INTO and the DELETE?  If you're sure, then this
is safe without the extra lock.  Otherwise, I wouldn't trust it.

It might be worth having the SELECT that creates the temp table be a
SELECT FOR UPDATE on the target table, so as to ensure you've locked
down the targeted rows against anybody else.  This is not free though,
as it'll mean extra writes of all the modified tuples.
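
Roughly (untested, using the same x/y/ctid scheme as upthread):

BEGIN;
-- FOR UPDATE locks the targeted rows in x so nobody can update or delete
-- them between capturing the ctids and the DELETE
SELECT x.ctid AS target_ctid
  INTO TEMPORARY recs_to_delete
  FROM x
  JOIN y ON x.id = y.id
   FOR UPDATE OF x;

DELETE FROM x USING recs_to_delete r WHERE x.ctid = r.target_ctid;
COMMIT;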

regards, tom lane



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 10:01 AM, Steve Crawford 
scrawf...@pinpointresearch.com wrote:

  The documentation has information like "This parameter can only be set in
  the postgresql.conf file or on the server command line." that will tell
  you in advance which settings will fail when you attempt to set them
  through SQL statements.

 Ah. I missed that. Sorry for asking stupid questions.


 But autovacuum is pretty smart about not vacuuming tables until reasonably
 necessary. And beware that autovacuum is also controlling when to analyze a
 table. Mass inserts are probably changing the characteristics of your table
 such that it needs to be analyzed to allow the planner to properly optimize
 your queries.

 Okay, that makes more sense to me; because the stats would be changing
quickly and so while vacuuming may not be necessary, analyzing would be. At
the same time, I can't afford to analyze if it's causing my inserts to take
over 50 ms. Something else I should add: if my selects are slow, that's
annoying; but if my inserts are slow, that could be disastrous. Does
analyze increase the efficiency of inserts or just selects? (I assumed the
latter.) Obviously, I will need to analyze sometimes, but perhaps not
nearly as often as postgres would predict under the circumstances.


 Have you identified that vacuum is actually causing a problem? If not, I'd
 leave it alone. The system tables have a lot of information on table
 vacuuming and analyzing:

 Not indubitably, but circumstantially, I did notice that significantly
fewer of my commits were taking over 50 ms after I set
autovacuum_enabled=off on many of my tables. Unfortunately, it was not an
isolated experiment, so I can't really be sure. At the same time, I'm
hesitant to turn it back on until I'm sure it either didn't make a
difference or I've got a better plan for how to deal with this.


 select
 relname,
 last_vacuum,
 last_autovacuum,
 last_analyze,
 last_autoanalyze,
 vacuum_count,
 autovacuum_count,
 analyze_count,
 autoanalyze_count
 from
pg_stat_user_tables;

 Apparently the last four columns don't exist in my database. As for the
first four, that is somewhat illuminating. It looks like the
last_autovacuum that occurred on any of my tables was late Monday evening
(almost two days before I set autovacuum_enabled=off). The last_autoanalyze
on one of the tables where I set autovacuum_enabled=off was yesterday at
10:30, several hours before I disabled auto-vacuum. (I've had others since
then on tables where I didn't disable auto-vacuum.) It looks like
disabling auto-vacuum also disabled auto-analyze (did it?) but it also
looks like that might not have been the continuous problem I thought it was.

So if it's not auto-vacuuming that's making my inserts so slow, what is it?
I'm batching my inserts (that didn't seem to help at all actually, but
maybe cause I had already turned off synchronous_commit anyway). I've
gotten rid of a bunch of indices (especially those with low
cardinality–that I did around the same time as disabling auto-vacuum, so
that could account for the coincidental speed up). I'm not sure what else I
could be doing wrong. It's definitely better than it was a few days ago,
but I still see LOG: duration: 77.315 ms statement: COMMIT every minute
or two.

Thank you,
-Alessandro


Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Peter van Hardenberg
On Thu, Feb 23, 2012 at 10:38 AM, Alessandro Gagliardi
alessan...@path.com wrote:
 On Thu, Feb 23, 2012 at 10:01 AM, Steve Crawford
 scrawf...@pinpointresearch.com wrote:
 So if it's not auto-vacuuming that's making my inserts so slow, what is it?
 I'm batching my inserts (that didn't seem to help at all actually, but maybe
 cause I had already turned off synchronous_commit anyway). I've gotten rid
 of a bunch of indices (especially those with low cardinality–that I did
 around the same time as disabling auto-vacuum, so that could account for the
 coincidental speed up). I'm not sure what else I could be doing wrong. It's
 definitely better than it was a few days ago, but I still see LOG:
 duration: 77.315 ms statement: COMMIT every minute or two.


Have you considered that you may have lock contention? Sampling
pg_locks may be illuminating; based on your description the lock
contention would be intermittent, so I wouldn't trust an n=1 test.
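
A crude way to sample it; run this a few times while the slow COMMITs are 
happening, and rows with granted = false are the interesting ones:

-- Counts of lock types and modes currently held or awaited
SELECT locktype, mode, granted, count(*)
  FROM pg_locks
 GROUP BY locktype, mode, granted
 ORDER BY count(*) DESC;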

-p

-- 
Peter van Hardenberg
San Francisco, California
Everything was beautiful, and nothing hurt. -- Kurt Vonnegut



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Tom Lane
I just reread the original post and noted this:

Reuven M. Lerner reu...@lerner.co.il writes:
 (1) I tried to write this as a join, rather than a subselect.  But B has 
 an oid column that points to large objects, and on which we have a rule 
 that removes the associated large object when a row in B is removed. 

A rule?  Really?  That's probably bad enough in itself, but when you
write an overcomplicated join delete query, I bet the resulting plan
is spectacularly bad.  Have you looked at the EXPLAIN output for this?

I'd strongly recommend getting rid of the rule in favor of a trigger.
Also, as already noted, the extra join inside the IN sub-select is
probably hurting far more than it helps.
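
An untested sketch of what the trigger replacement might look like; "blob_oid" 
is a stand-in for whatever the oid column in B is actually called:

CREATE OR REPLACE FUNCTION b_unlink_lo() RETURNS trigger AS $$
BEGIN
    -- Remove the large object the deleted row pointed at
    IF OLD.blob_oid IS NOT NULL THEN
        PERFORM lo_unlink(OLD.blob_oid);
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER b_unlink_lo_trigger
    AFTER DELETE ON B
    FOR EACH ROW EXECUTE PROCEDURE b_unlink_lo();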

 (3) There are some foreign-key constraints on the B table.

If those are FK references *to* the B table, make sure the other end
(the referencing column) is indexed.  Postgres doesn't require an index
on a referencing column, but deletes in the referenced table will suck
if you haven't got one.

I don't think any of the fancy stuff being discussed in the thread is
worth worrying about until you've got these basic issues dealt with.

regards, tom lane



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Andy Colson

On 2/23/2012 12:38 PM, Alessandro Gagliardi wrote:


Does analyze increase the efficiency of inserts or just selects? (I
assumed the latter.) Obviously, I will need to analyze sometimes, but


That depends on if you have triggers that are doing selects.  But in 
general you are correct, analyze wont help inserts.


checkpoint_segments can help insert speed, what do you have that set to?

Also how you insert can make things faster too. (insert vs prepared vs COPY)

Also, if you have too many indexes on a table that can cause things to 
slow down.


Your IO layer needs to be fast too.  Have you watched vmstat and iostat?

Have you read up on synchronous_commit?

-Andy



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Steve Crawford

On 02/23/2012 10:38 AM, Alessandro Gagliardi wrote:
On Thu, Feb 23, 2012 at 10:01 AM, Steve Crawford 
scrawf...@pinpointresearch.com 
mailto:scrawf...@pinpointresearch.com wrote:


The documentation has information like This parameter can only be
set in the postgresql.conf file or on the server command line.
that will tell you in advance which settings will fail when you
attempt to set them through SQL statements.

Ah. I missed that. Sorry for asking stupid questions.
No problem and not stupid. With the manual running to hundreds of pages 
plus information on wikis and mailing-list histories spanning hundreds 
of thousands of messages sometimes knowing where to look is 90% of the 
battle.


But autovacuum is pretty smart about not vacuuming tables until
reasonably necessary. And beware that autovacuum is also
controlling when to analyze a table. Mass inserts are probably
changing the characteristics of your table such that it needs to
be analyzed to allow the planner to properly optimize your queries.

Okay, that makes more sense to me; because the stats would be changing 
quickly and so while vacuuming may not be necessary, analyzing would 
be. At the same time, I can't afford to analyze if it's causing my 
inserts to take over 50 ms. Something else I should add: if my selects 
are slow, that's annoying; but if my inserts are slow, that could 
be disastrous...


You need to rethink things a bit. Databases can fail in all sorts of 
ways and can slow down during bursts of activity, data dumps, etc. You 
may need to investigate some form of intermediate buffering.


...Apparently the last four columns don't exist in my database. As for 
the first four, that is somewhat illuminating
Then you are not running a current version of PostgreSQL so the first 
step to performance enhancement is to upgrade. (As a general rule - 
there are occasionally specific cases where performance decreases.)
So if it's not auto-vacuuming that's making my inserts so slow, what 
is it? I'm batching my inserts (that didn't seem to help at all 
actually, but maybe cause I had already turned off synchronous_commit 
anyway).
How are you batching them? Into a temp table that is copied to the main 
table? As a bunch of insert statements within a single connection (saves 
probably considerable time due to eliminating multiple connection 
setups)? With one PREPARE and multiple EXECUTE (to save repeated 
planning time - I'm not sure this will buy you much for simple inserts, 
though)? With COPY (much faster as many records are inserted in a single 
statement but if one fails, all fail)?
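
For illustration, the two bulk shapes look roughly like this (table and column 
names are made up; the COPY data is tab-separated and fed through psql):

-- Many rows per statement, one round trip (multi-row VALUES)
INSERT INTO events (user_id, created_at, body) VALUES
    (1, '2012-02-23 10:00:00', 'a'),
    (2, '2012-02-23 10:00:01', 'b'),
    (3, '2012-02-23 10:00:02', 'c');

-- COPY: the fastest bulk path, but the whole batch fails if any row is bad
COPY events (user_id, created_at, body) FROM STDIN;
1	2012-02-23 10:00:00	a
2	2012-02-23 10:00:01	b
\.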


And what is the 50ms limit? Is that an average? Since you are batching, 
it doesn't sound like you need every statement to complete in 50ms. 
There is always a tradeoff between overall maximum throughput and 
maximum allowed latency.


I've gotten rid of a bunch of indices (especially those with low 
cardinality–that I did around the same time as disabling auto-vacuum, 
so that could account for the coincidental speed up).
Yes, inserts require the indexes to be updated so they can slow down 
inserts and updates.


I'm not sure what else I could be doing wrong. It's definitely better 
than it was a few days ago, but I still see LOG: duration: 77.315 ms 
statement: COMMIT every minute or two.


That's a huge topic ranging from hardware (CPU speed, RAM, 
spindle-count, disk-type, battery-backed write caching), OS (you *are* 
running on some sort of *nix, right?), OS tuning, PG tuning, etc. 
Fortunately the biggest benefit comes from some basic tuning.


I recommend you abandon this thread as it presupposes a now seemingly 
incorrect cause of the problem and start a new one titled something like 
Tuning for high insert rate where you describe the problem you want to 
solve. See http://wiki.postgresql.org/wiki/Guide_to_reporting_problems 
for a good guide to the information that will be helpful in diagnosis.


Cheers,
Steve



Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Jesper Krogh

On 23/02/12 09:39, Reuven M. Lerner wrote:
Hi, everyone.  I'm maintaining an application that exists as a black 
box in manufacturing plants.  The system is based on Windows, .NET, 
and PostgreSQL 8.3.  I'm a Unix kind of guy myself, but the 
application layer and system administration are being handled by other 
people; I'm just the PostgreSQL guy.


Just thinking out loud. It looks like (just guessing)
that the application needs to store one month's worth of data, and
it was put into production under the assumption that it would
never fill up, or that deletion could easily be done during maintenance
windows. And that now turns out not to be the case.

I would stuff in a trigger function on the table that automatically
does the cleanup.  It could be a BEFORE INSERT OR UPDATE
TRIGGER that just tries to prune 2-3 rows of the table if they
have exceeded the keep-back time. Just installing that in the
maintenance window would allow the system to self-heal over time.

If the maintenance window allows for more cleanup, then manually
do some deletions. Now the black box is self-healing.
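
An untested sketch of that idea, pruning via the big table's ctid so each 
batch stays tiny (table and column names as in the earlier queries):

CREATE OR REPLACE FUNCTION prune_a_little() RETURNS trigger AS $$
BEGIN
    -- Quietly delete a handful of B rows whose parent R row has expired
    DELETE FROM B
     WHERE ctid IN (SELECT b.ctid
                      FROM B b
                      JOIN R r ON r.id = b.r_id
                     WHERE r.end_date < NOW() - interval '30 days'
                     LIMIT 3);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER b_self_heal
    BEFORE INSERT ON B
    FOR EACH ROW EXECUTE PROCEDURE prune_a_little();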

--
Jesper



[PERFORM] Disable-spinlocks while compiling postgres 9.1 for ARM Cortex A8

2012-02-23 Thread Jayashankar K B
Hi All,

I am trying to compile the Postgres source code for the ARM Cortex A8 architecture.
While compiling, I got an error message which read "selected processor does not 
support `swpb r4,r4,[r3]'".
One of the Postgres forums, at 
http://postgresql.1045698.n5.nabble.com/BUG-6331-Cross-compile-error-aborts-Works-if-disable-spinlock-is-used-td5068738.html,
mentioned that by using --disable-spinlocks, we can overcome the error at the 
cost of performance. I did the same and it compiled successfully.
But the INSTALL guide in the Postgres source code mentioned that I should inform 
the Postgres community in case I am forced to use --disable-spinlocks to let the 
code compile.
Hence this email. So please suggest what I should do now.  What sort of 
performance penalty will there be if I use this option? What actually is the 
significance of this parameter?
Please guide me.

This is the configure command I used
./configure CC=/opt/toolchain/bin/armv7l-timesys-linux-gnueabi-gcc 
--target=armv7l-timesys-linux-gnueabi 
--prefix=/home/jayashankar/WorkingDirectory/Postgres9.1_Cortex 
--host=x86_64-unknown-linux-gnu CFLAGS='-march=armv7-a -mtune=cortex-a8 
-mfpu=vfpv3 -mthumb' --disable-spinlocks

Thanks and Regards
Jayashankar



Larsen & Toubro Limited

www.larsentoubro.com



Re: [PERFORM] Disable-spinlocks while compiling postgres 9.1 for ARM Cortex A8

2012-02-23 Thread Tom Lane
Jayashankar K B jayashankar...@lnties.com writes:
 Hi All,
 I am trying to compile the Postgres source code for the ARM Cortex-A8 architecture.
 While compiling, I got an error message which read "selected processor does 
 not support `swpb r4,r4,[r3]'".
 One of the Postgres forum threads 
 (http://postgresql.1045698.n5.nabble.com/BUG-6331-Cross-compile-error-aborts-Works-if-disable-spinlock-is-used-td5068738.html) 
 mentioned that by using --disable-spinlocks we can overcome the error at the 
 cost of performance. I did the same and it compiled successfully.
 But the INSTALL guide in the Postgres source code mentioned that I should inform 
 the Postgres community in case I am forced to use --disable-spinlocks to let 
 the code compile.
 Hence this email. Please suggest what I should do now.

Try this patch:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=068e08eebbb2204f525647daad3fe15063b77820

BTW, please don't cross-post to multiple PG mailing lists; there's very
seldom a good reason to do that.

regards, tom lane



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
I'm unable to make sense of pg_locks. The vast majority are
locktype='transactionid', mode='ExclusiveLock', granted=t. There are some
'relation' locks with mode='RowExclusiveLock' and fewer with
'AccessShareLock'. I have no idea what I should be looking for here.
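
One thing that may narrow it down is to look only at locks that are not
granted, joined to the activity view. A minimal sketch against the 9.0-era
catalog columns (procpid and current_query were renamed in later releases):

-- Sketch: which sessions are waiting, and on what kind of lock.
SELECT a.procpid,
       a.waiting,
       a.current_query,
       l.locktype,
       l.mode
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
 WHERE NOT l.granted;

If sampling that repeatedly almost never returns rows, lock waits are
probably not the bottleneck.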

On Thu, Feb 23, 2012 at 10:42 AM, Peter van Hardenberg p...@pvh.ca wrote:

 On Thu, Feb 23, 2012 at 10:38 AM, Alessandro Gagliardi
 alessan...@path.com wrote:
  around the same time as disabling auto-vacuum, so that could account for
 the
  coincidental speed up). I'm not sure what else I could be doing wrong.
 It's
  definitely better than it was a few days ago, but I still see LOG:
  duration: 77.315 ms statement: COMMIT every minute or two.
 

 Have you considered that you may have lock contention? Sampling
 pg_locks may be illuminating; based on your description the lock
 contention would be intermittent, so I wouldn't trust an n=1 test.

 -p

 --
 Peter van Hardenberg
 San Francisco, California
 Everything was beautiful, and nothing hurt. -- Kurt Vonnegut



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 11:07 AM, Andy Colson a...@squeakycode.net wrote:

 That depends on whether you have triggers that are doing selects.  But in
 general you are correct, analyze won't help inserts.

 I do have some, actually. I have a couple trigger functions like:

CREATE OR REPLACE FUNCTION locations_quiet_unique_violation()
  RETURNS trigger AS
$BODY$
BEGIN
  IF EXISTS (SELECT 1 FROM public.locations WHERE geohash = NEW.geohash)
THEN
RETURN NULL;
  ELSE
RETURN NEW;
  END IF;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

that are triggered thusly:

CREATE TRIGGER locations_check_unique_violation
  BEFORE INSERT
  ON locations
  FOR EACH ROW
  EXECUTE PROCEDURE locations_quiet_unique_violation();

I left auto-vacuum enabled for those tables.

checkpoint_segments can help insert speed, what do you have that set to?

 40. Checking http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server it 
looks like setting that as high as 256 would not necessarily be
unreasonable. What do you think?


 Also how you insert can make things faster too. (insert vs prepared vs
 COPY)

 I'm doing this all with INSERT. Is COPY that much faster? I don't know
anything about prepared.


 Also, if you have too many indexes on a table that can cause things to
 slow down.

 Yeah, got that. I removed a bunch. I'd rather not remove what's left
unless I have to.


 Your IO layer needs to be fast too.  Have you watched vmstat and iostat?

 I don't know if I have access to vmstat and iostat. Heroku is hosting this
for me on AWS.


 Have you read up on synchronous_commit?

 Only a tiny bit. A couple people suggested disabling it since my database
is being hosted on AWS so I did that. It seems a bit risky but perhaps
worth it.


Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Andy Colson

On 2/23/2012 2:40 PM, Alessandro Gagliardi wrote:


checkpoint_segments can help insert speed, what do you have that set to?

40. Checking
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server it looks
like setting that as high as 256 would not necessarily be unreasonable.
What do you think?


I'd say go slow.  Try a little bit and see if it helps.  I don't 
actually have high insert rate problems, so I don't actually know from 
experience.




Also how you insert can make things faster too. (insert vs prepared
vs COPY)

I'm doing this all with INSERT. Is COPY that much faster? I don't know
anything about prepared.


If you can batch multiple records then COPY is the fastest method.  (Of 
course your triggers might be the cause for the slowness and not insert 
speed).
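
A sketch of the COPY form, again against a hypothetical events(id, payload)
table; when fed through psql, the data rows follow the statement and a
backslash-period line ends them:

-- Many rows carried by a single statement; fastest bulk path.
COPY events (id, payload) FROM STDIN WITH CSV;
1,hello
2,world
3,again
\.

-- Or from a server-readable file (placeholder path); psql's \copy does the
-- same thing with a client-side file.
COPY events (id, payload) FROM '/tmp/events.csv' WITH CSV;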


Depending on the language you are using to insert records, you can 
prepare a query and only send the arguments vs sending the entire sql 
statement every time.


In pseudo-perl code I'd:
my $q = $db->prepare('insert into table(col1, col2) values ($1, $2)');

$q->execute('one', 'two');
$q->execute('three', 'four');
$q->execute('five', 'six');

This is faster because the insert... is only sent over the wire and 
parsed once.  Then only the arguments are sent for each execute.


Speed wise, I think it'll go:
1) slowest: individual insert statements
2) prepared statements
3) fastest: COPY

Again.. assuming the triggers are not the bottleneck.

Have you run an insert by hand with 'EXPLAIN ANALYZE'?

-Andy




Have you read up on synchronous_commit?

Only a tiny bit. A couple people suggested disabling it since my
database is being hosted on AWS so I did that. It seems a bit risky but
perhaps worth it.



I would think they are running on battery-backed IO, with boxes on UPS, 
so I'd guess it's pretty safe.  It would also depend on your commit size. 
 If you are batching a million records into one commit, you might lose 
all of them.


-Andy



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 11:26 AM, Steve Crawford 
scrawf...@pinpointresearch.com wrote:

 You need to rethink things a bit. Databases can fail in all sorts of ways
 and can slow down during bursts of activity, data dumps, etc. You may need
 to investigate some form of intermediate buffering.

 Currently my buffer (such as it is) is Kestrel 
(http://robey.github.com/kestrel/), which
queues up INSERTs and then executes them one at a time. This keeps the rest
of the app from being held back, but it becomes a problem when the queue
fills up faster than it can drain. For one particularly heavy logger, I
tried writing it all to an unconstrained table with the idea that I would
copy that table (using INSERT ... SELECT ...) into another table with
constraints, reducing the data in the process (deduping and such). Problem
was, even my constraint-less table wasn't fast enough. Perhaps buffering to
a local file and then using COPY would do the trick.

  ...Apparently the last four columns don't exist in my database. As for
 the first four, that is somewhat illuminating

 Then you are not running a current version of PostgreSQL so the first step
 to performance enhancement is to upgrade. (As a general rule - there are
 occasionally specific cases where performance decreases.)

 We're using 9.0.6. Peter, how do you feel about upgrading? :)

How are you batching them? Into a temp table that is copied to the main
 table? As a bunch of insert statements within a single connection (saves
 probably considerable time due to eliminating multiple connection setups)?
 With one PREPARE and multiple EXECUTE (to save repeated planning time - I'm
 not sure this will buy you much for simple inserts, though)? With COPY
 (much faster as many records are inserted in a single statement but if one
 fails, all fail)?

 The second one (a bunch of insert statements within a single connection).
As I mentioned above, I was going to try the temp table thing, but that
wasn't fast enough. COPY might be my next attempt.


 And what is the 50ms limit? Is that an average? Since you are batching, it
 doesn't sound like you need every statement to complete in 50ms. There is
 always a tradeoff between overall maximum throughput and maximum allowed
 latency.

 No, not average. I want to be able to do 100-200 INSERTs per second (90%
of those would go to one of two tables, the other 10% would go to any of a
couple dozen tables). If 1% of my INSERTs take 100 ms, then the other 99%
must take no more than 9 ms to complete.
...actually, it occurs to me that since I'm now committing batches of 1000,
a 100ms latency per commit wouldn't be bad at all! I'll have to look into
that. (Either that or my batching isn't working like I thought it was.)


 I recommend you abandon this thread as it presupposes a now seemingly
 incorrect cause of the problem and start a new one titled something like
 "Tuning for high insert rate" where you describe the problem you want to
 solve. See http://wiki.postgresql.org/wiki/Guide_to_reporting_problems for a 
 good guide to the information that will be helpful in diagnosis.

 I'll leave the title as is since I think simply renaming this message
would cause more confusion than it would prevent. But this gives me
something to chew on and when I need to return to this topic, I'll do just
that.

Thanks,
-Alessandro


Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Steve Crawford

On 02/23/2012 01:07 PM, Alessandro Gagliardi wrote:

The second one (a bunch of insert statements within a single 
connection). As I mentioned above, I was going to try the temp table 
thing, but that wasn't fast enough. COPY might be my next attempt.

insert into...;
insert into...;
insert into...;
... is really (ignoring statement preparation time):
begin;
insert into...;
commit;
begin;
insert into...;
commit;
begin;
insert into...;
commit;

It's possible that you might get a nice boost by wrapping the inserts 
into a transaction:

begin;
insert into...;
insert into...;
insert into...;
...
commit;

This only requires all that disk-intensive stuff that protects your data 
once at the end instead of 1000 times for your batch of 1000.


COPY is even better. I just ran a quick test by restoring a table on my 
desktop hacking db (untuned, few years old PC, single SATA disk, modest 
RAM and lots of resource competition). The 22+ million rows restored in 
282 seconds, which is a rate somewhat north of 78,000 records/second, or 
about 0.013 ms/record.


You may want to eliminate that trigger, which only seems to exist to 
silence errors from uniqueness violations, and copy the incoming data 
into a temp table, then move the data with a variant of:
INSERT INTO main_table (SELECT ... FROM incoming_table WHERE NOT EXISTS 
(SELECT 1 FROM main_table WHERE ...));
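
A sketch of that flow using the locations/geohash names that appeared earlier
in this thread; the temp-table name, the file path, and the DISTINCT ON step
are illustrative assumptions:

-- Stage the raw batch with no constraints or triggers.
CREATE TEMP TABLE incoming_locations (LIKE public.locations);

-- Bulk-load the batch (placeholder path).
COPY incoming_locations FROM '/tmp/locations_batch.csv' WITH CSV;

-- Move only rows whose geohash is not already present;
-- DISTINCT ON also collapses duplicates within the batch itself.
INSERT INTO public.locations
SELECT DISTINCT ON (geohash) i.*
  FROM incoming_locations i
 WHERE NOT EXISTS (SELECT 1
                     FROM public.locations l
                    WHERE l.geohash = i.geohash);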


Cheers,
Steve



Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 1:37 PM, Steve Crawford 
scrawf...@pinpointresearch.com wrote:

 It's possible that you might get a nice boost by wrapping the inserts into
 a transaction:
 begin;
 insert into...;
 insert into...;
 insert into...;
 ...
 commit;

 This only requires all that disk-intensive stuff that protects your data
 once at the end instead of 1000 times for you batch of 1000.

 I think that is essentially what I am doing. I'm using psycopg2 in a
python script that runs continuously on a queue. It opens a connection and
creates a cursor when it begins. It then passes that cursor into a function
along with the data (read off the queue) that needs to be inserted. I
run cur.execute("SAVEPOINT insert_savepoint;") followed by cur.execute(q)
(where q is the insert statement). If there's an error I
run cur.execute("ROLLBACK TO SAVEPOINT insert_savepoint;") otherwise I
increment a counter. Once the counter exceeds 999, I run conn.commit() and
reset the counter. I believe that psycopg2 is essentially doing what you
are suggesting. The fact that the data does not appear in the database
until conn.commit() tells me that it's not committing anything until then.
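
In SQL terms, that loop should be sending something roughly like the following
over the one connection (a sketch with a hypothetical events table, not the
exact statements psycopg2 emits):

BEGIN;                                      -- opened implicitly by psycopg2
SAVEPOINT insert_savepoint;
INSERT INTO events (id, payload) VALUES (1, 'a');
SAVEPOINT insert_savepoint;
INSERT INTO events (id, payload) VALUES (2, 'b');
-- on a failed insert: ROLLBACK TO SAVEPOINT insert_savepoint;
-- ...repeat until the counter reaches 1000...
COMMIT;

The per-row SAVEPOINT adds a little overhead, but it keeps one bad row from
aborting the whole 1000-row transaction.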


 COPY is even better. I just ran a quick test by restoring a table on my
 desktop hacking db (untuned, few years old PC, single SATA disk, modest RAM
 and lots of resource competition). The 22+ million rows restored in 282
 seconds which is a rate somewhat north of 78,000 records/second or about
 0.013 ms/record.

 I'll try that. Of course, the fact that the database is stored in AWS
complicates matters. Regardless, it sounds like COPY should be considerably
faster.


Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 1:11 PM, Peter van Hardenberg p...@heroku.com wrote:

 My hunch is still that your issue is lock contention.

 How would I check that? I tried looking at pg_locks but I don't know what
to look for.


 We have many customers who do much more than this throughput, though
 I'm not sure what level of resourcing you're current at. You might
 consider experimenting with a larger system if you're having
 performance problems.

 Heh. I thought you might say that. :) It's definitely worth considering,
but as you might expect, I want to exhaust other options first. For
customers who do much more (or even comparable) throughput, can you tell me
how big of a system they require?

Also, as per Andy's suggestion, I'd like to try
doubling checkpoint_segments. However, it appears that that is one of those
variables that I cannot change from pgAdmin. I don't suppose there's any
way to change this without rebooting the database?


Re: [PERFORM] Very long deletion time on a 200 GB database

2012-02-23 Thread Reuven M. Lerner

Hi, everyone.

So it turns out that we're not using 25 GB of virtual memory.  (That's 
what I had been shown yesterday, and it was a bit surprising, to say the 
least...)


A few statistics that I managed to get from the Windows 
developers/system administrators:


- The machine has a total of 3.5 GB of RAM
- shared_buffers was set to 256 MB (yes, MB!)
- Virtual memory usage by our process is 3 MB (yes, MB)
- CPU is virtually idle when running the deletes, using about 1% of CPU
- No other processes are accessing the database when we're running the 
maintenance; there are a total of three server processes, but two are idle.


(I was a bit surprised, to say the least, by the low number on 
shared_buffers, given that I believe it's one of the first things I told 
them to increase about 18 months ago.)


As for Tom's point about rules, I know that rules are bad, and I'm not 
sure why the system is using a rule rather than a trigger. I'll see 
if I can change that to a trigger, but I have very indirect control over 
the machines, and every change requires (believe it or not) writing a 
.NET program that runs my changes, rather than just a textual script 
that deploys them.


The only foreign keys are from the B table (i.e., the table whose 
records I want to remove) to other tables.  There are no REFERENCES 
pointing to the B table.  That said, I hadn't realized that primary keys 
and indexes can also delay the DELETE.


For the latest round of testing, I quadrupled shared_buffers to 1 GB, 
turned off hash joins (as suggested by someone), and also simplified the 
query (based on everyone's suggestions).  In the tests on my own 
computer (with a somewhat random 1 GB snapshot of the 200 GB database), 
the simplified query was indeed much faster, so I'm optimistic.


Several people suggested that chunking the deletes might indeed help, 
which makes me feel a bit better.  Unfortunately, given the time that it 
takes to run the queries, it's hard to figure out the right chunk size. 
 Whoever suggested doing it in time slices had an interesting idea, but 
I'm not sure if it'll be implementable given our constraints.
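
For what it's worth, a sketch of the time-slice variant, assuming the parent
table r has an id key and a timestamp column end_date and the child table b
references it via r_id (placeholders if the real names differ); the dates and
slice width are also placeholders:

-- Delete the children of one day's worth of expired parent rows at a time.
DELETE FROM b
 WHERE r_id IN (SELECT id
                  FROM r
                 WHERE end_date >= DATE '2011-10-01'
                   AND end_date <  DATE '2011-10-02');
-- Repeat with the next day's range on the next pass, stopping when the
-- maintenance window runs out; each chunk commits on its own.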


Thanks again to everyone for your help.  I'll let you know what happens...

Reuven
