Re: [PERFORM] performance while importing a very large data set in to database

2009-12-06 Thread Kris Kewley
Does postgres have the concept of pinning procs, functions, etc. to cache?


As you mention, frequently used statements are typically cached, improving performance.


If waiting for the DBMS to do this is not an option, then pinning critical ones should improve performance immediately following start-up.


This is an approach I have used with Oracle to address this situation.
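
For reference, a minimal sketch of how that pinning is usually done on the Oracle side, via the DBMS_SHARED_POOL package (the object name below is hypothetical; postgres has no direct equivalent):

-- Oracle: pin a critical package/procedure in the shared pool right after
-- startup so it is not aged out ('P' = procedure/function/package)
BEGIN
   DBMS_SHARED_POOL.KEEP('APPSCHEMA.CRITICAL_PKG', 'P');
END;
/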

Kris

On 5-Dec-09, at 15:42, Scott Marlowe scott.marl...@gmail.com wrote:


On Wed, Dec 2, 2009 at 4:31 PM, Ashish Kumar Singh
ashishkumar.si...@altair.com wrote:

Hello Everyone,

I have a very big database, around 15 million in size, and the dump file is around 12 GB.

While importing this dump into the database I have noticed that initially query response time is very slow but it does improve with time.

Any suggestions to improve performance after the dump is imported into the database will be highly appreciated!


This is pretty normal.  When the db first starts up or right after a
load it has nothing in its buffers or the kernel cache.  As you access
more and more data the db and OS learn what is most commonly
accessed and start holding onto that data, throwing the less-used
stuff away to make room for it.  Our production dbs run at a load
factor of about 4 to 6, but when first started and put in the loop
they'll hit 25 or 30 and have slow queries for a minute or so.

Having a fast IO subsystem will help offset some of this, and
sometimes select * from bigtable might too.
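
A minimal sketch of that kind of manual warm-up right after the restore (bigtable is just a placeholder name; a full scan simply pulls the heap pages into shared buffers and the OS cache):

-- read the big table once so later queries start from a warm cache;
-- count(*) forces a full scan without shipping every row to the client
SELECT count(*) FROM bigtable;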



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] DELETE performance problem

2009-11-24 Thread Kris Kewley
Even though the column in question is not unique on t2, could you not index it? That should improve the performance of the inner query.


Are dates applicable in any way? In some cases adding a date field, partitioning or indexing on it, and adding a WHERE filter on that date (e.g. only the last x days) can be an effective way to limit the records searched.
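
A minimal sketch of both suggestions (the date column created_at and the 30-day window are hypothetical, used only to illustrate narrowing the delete):

-- index the non-unique join column on t2
CREATE INDEX t2_annotation_id_idx ON t2 (annotation_id);

-- if a date column exists, limit how many rows the delete has to consider
-- (created_at and the 30-day window are placeholders)
DELETE FROM t1
WHERE t1.created_at > now() - interval '30 days'
  AND t1.annotation_id IN (SELECT t2.annotation_id FROM t2);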


Kris

On 24-Nov-09, at 9:59, Jerry Champlin jchamp...@absolute-performance.com wrote:


You may want to consider using partitioning.  That way you can drop the appropriate partition and never have the overhead of a delete.
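
A rough sketch of that approach (the partition name is hypothetical; with 8.3/8.4-era inheritance partitioning a partition is just a child table):

-- drop the child table that holds the rows you would otherwise DELETE
DROP TABLE t1_2009_10;
-- or detach it but keep the data around
ALTER TABLE t1_2009_10 NO INHERIT t1;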

Jerry Champlin|Absolute Performance Inc.|Mobile:  303-588-2547

-Original Message-
From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Luca  
Tettamanti

Sent: Tuesday, November 24, 2009 6:37 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] DELETE performance problem

Hello,
I've run into a severe performance problem with the following statement:

DELETE FROM t1 WHERE t1.annotation_id IN (
   SELECT t2.annotation_id FROM t2)

t1 contains about 48M records (table size is 5.8GB), while t2 contains about 60M
records (total size 8.6GB). annotation_id is the PK in t1 but not in t2 (it's
not even unique; in fact there are duplicates - there are about 20M distinct
annotation_id values in this table). There are no FKs on either table.
I've killed the query after 14h(!) of runtime...

I've reproduced the problem using only the ids (extracted from the full tables) with the following schemas:

test2=# \d t1
      Table "public.t1"
    Column     |  Type  | Modifiers
---------------+--------+-----------
 annotation_id | bigint | not null
Indexes:
    "t1_pkey" PRIMARY KEY, btree (annotation_id)

test2=# \d t2
      Table "public.t2"
    Column     |  Type  | Modifiers
---------------+--------+-----------
 annotation_id | bigint |
Indexes:
    "t2_idx" btree (annotation_id)

The query above takes about 30 minutes to complete. The slowdown is not as
severe, but (IMHO) the behaviour is strange. On a win2k8 with 8.3.8, using
procexp I see the process churning the disk and using more memory until it hits
some limit (at about 1.8GB), then the IO slows down considerably. See this
screenshot[1].
This is exactly what happens with the full dataset.

This is the output of the explain:

test2=# explain analyze delete from t1 where annotation_id in (select annotation_id from t2);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1035767.26..2158065.55 rows=181605 width=6) (actual time=643395.565..1832056.588 rows=26185953 loops=1)
   Hash Cond: (t1.annotation_id = t2.annotation_id)
   ->  Seq Scan on t1  (cost=0.00..661734.12 rows=45874812 width=14) (actual time=0.291..179119.487 rows=45874812 loops=1)
   ->  Hash  (cost=1033497.20..1033497.20 rows=181605 width=8) (actual time=643393.742..643393.742 rows=26185953 loops=1)
         ->  HashAggregate  (cost=1031681.15..1033497.20 rows=181605 width=8) (actual time=571807.575..610178.552 rows=26185953 loops=1)
               ->  Seq Scan on t2  (cost=0.00..879289.12 rows=60956812 width=8) (actual time=2460.595..480446.581 rows=60956812 loops=1)
 Total runtime: 2271122.474 ms
(7 rows)

Time: 2274723,284 ms


An identical linux machine (with 8.4.1) shows the same issue; with strace I see a lot of seeks:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.37    0.155484          15     10601           read
  9.10    0.015649        5216         3           fadvise64
  0.39    0.000668           0      5499           write
  0.15    0.000253           0     10733           lseek
  0.00    0.000000           0         3           open
  0.00    0.000000           0         3           close
  0.00    0.000000           0         3           semop
------ ----------- ----------- --------- --------- ----------------
100.00    0.172054                 26845           total

(30s sample)

Before hitting the memory limit (AS on win2k8, unsure about Linux) the trace is the following:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.063862           0    321597           read
  0.00    0.000000           0         3           lseek
  0.00    0.000000           0        76           mmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.063862                321676           total


The machines have 8 cores (2 Xeon E5320) and 8GB of RAM. The Postgres data
directory is on a hardware (Dell PERC5) RAID mirror, with the log on a separate
array. One machine is running linux 64bit (Debian/stable), the other win2k8 (32 bit).

shared_buffers = 512MB
work_mem = 512MB
maintenance_work_mem = 1GB