Re: [PERFORM] Decreasing BLKSZ

2006-09-26 Thread Marc Morin
Yes, that is our application.   We have implemented both scenarios...

1- Partitions loaded without indexes on them, with the index built when the
partition is full.  Slow to drill down into incomplete partitions.
2- Partitions with the index in place as they are loaded.  Slow on insert
(the problem mentioned) but good for drill down.

So, I'd like to have my cake and eat it too... :-)

I'd like to have my indexes built as rows are inserted into the
partition, to help with the drill down...
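
For reference, scenario 1 roughly looks like this (the partition and index
names here are made up):

-- load the partition with no indexes on it...
COPY stats_20060926 FROM '/data/stats_20060926.dat';
-- ...then build the (key, logtime) index only once the partition is full
CREATE INDEX stats_20060926_key_logtime ON stats_20060926 (key, logtime);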

 -Original Message-
 From: Bucky Jordan [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, September 26, 2006 5:26 PM
 To: Marc Morin; Tom Lane
 Cc: Markus Schaber; pgsql-performance@postgresql.org
 Subject: RE: [PERFORM] Decreasing BLKSZ 
 
   The bottom line here is likely to be you need more RAM :-(
  
 Yup.  Just trying to get a handle on what I can do if I need more 
 than 16G of RAM... That's as much as I can put on the installed base 
 of servers, 100s of them.
  
  
   I wonder whether there is a way to use table partitioning to make 
   the insert pattern more localized?  We'd need to know a lot more 
   about your insertion patterns to guess how, though.
  
 regards, tom lane
  
  We're doing partitioning as well.
  
 I'm guessing that you basically have a data collection 
 application that sends in lots of records, and a reporting 
 application that wants summaries of the data? So, if I 
 understand the problem correctly, you don't have enough ram 
 (or may not in the future) to index the data as it comes in. 
 
 Not sure how much you can change the design, but what about 
 either updating a summary table(s) as the records come in 
 (trigger, part of the transaction, or do it in the 
 application), or indexing periodically? In other words, load a 
 partition (say a day's worth) then index that partition all 
 at once. That might not work so well if you're doing real-time 
 analysis, though, but the summary tables should. 
 
 I assume the application generates unique records on its own 
 due to the timestamp, so this isn't really about checking for 
 constraint violations? If so, you can probably do away with 
 the index on the tables that you're running the inserts on.
 
 - Bucky
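
A rough sketch of the trigger-maintained summary table idea above (the
summary table, column, and function names are assumptions, not taken from
the thread):

CREATE TABLE stats_hourly
(
    key int,
    hour timestamptz,
    stat1_sum bigint,
    PRIMARY KEY (key, hour)
);

CREATE FUNCTION stats_rollup() RETURNS trigger AS '
BEGIN
    -- fold the new detail row into the hourly summary
    UPDATE stats_hourly
       SET stat1_sum = stat1_sum + NEW.stat1
     WHERE key = NEW.key AND hour = date_trunc(''hour'', NEW.logtime);
    IF NOT FOUND THEN
        INSERT INTO stats_hourly (key, hour, stat1_sum)
        VALUES (NEW.key, date_trunc(''hour'', NEW.logtime), NEW.stat1);
    END IF;
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER stats_rollup_trg AFTER INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_rollup();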
 

---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] Decreasing BLKSZ

2006-09-25 Thread Marc Morin



Our application has a number of inserters posting rows of network statistics
into a database. This is occurring continuously. The following is an example
of a stats table (simplified, but it maintains the key concepts).

CREATE TABLE stats
(
    logtime timestamptz,
    key int,
    stat1 bigint,
    stat2 bigint,
    stat3 bigint,
    PRIMARY KEY (key, logtime)
);
CREATE INDEX x ON stats(logtime);

There are on the order of 1M unique values for "key", and a new row for each
key value will be inserted, say, every 15 minutes. These rows are divided up
between a number of different inserting elements, but that isn't relevant.

The problem is that the insert pattern has low correlation with the
(key, logtime) index. In this case, I would need 1M blocks in my
shared_buffers space to prevent a read-modify-write type of pattern from
happening during the inserts (given a large enough database).

I'm wondering about lowering the BLKSZ value so that the total working set of
blocks required can be maintained in my shared buffers. Our database server
only has 8G of memory and would likely need BLKSZ reduced to 512.

Any comments on other effects or gotchas with lowering the size of BLKSZ?
Currently, our database is thrashing its cache of blocks and we're getting
only ~100 inserts/second; every insert results in an evict-read-modify
operation.

Ideally, I'd like to keep the entire working set of blocks in memory across
insert periods so that the I/O looks more like writing full blocks.

Thanks
Marc




Re: [PERFORM] Decreasing BLKSZ

2006-09-25 Thread Marc Morin
 Would it be possible to change the primary key to 
 (logtime, key)? This could help keep the working window small.

No, the application accessing the data wants all the rows between start
and end time for a particular key value.  
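
The drill-down query is essentially of this shape (a sketch; the literal
values are placeholders):

SELECT logtime, stat1, stat2, stat3
FROM stats
WHERE key = 1234
  AND logtime BETWEEN '2006-09-26 00:00' AND '2006-09-26 12:00'
ORDER BY logtime;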

 
 Secondly, the real working set is smaller, as the rows are 
 all inserted at the end of the table, filling each page until 
 it's full, so only the last pages are accessed. There's no 
 relation between the index order, and the order of data on 
 disk, unless you CLUSTER.

I'm theorizing that my problem is in updating the index itself and not
the heap.  Insert order refers to the order in which the applications are
inserting the rows and, as such, the order in which the index is being
updated.  This, in turn, causes the b-tree to be traversed.  The problem
is that the working set of blocks at the bottom of the b-tree is too big
for my cache.

 
  Any comment on other effects or gotchas with lowering the size of 
  BLKSZ?  Currently, our database is thrashing its cache of blocks and 
  we're getting only ~100 inserts/second; every insert results in an 
  evict-read-modify operation.
 
 I'm not sure that's the correct diagnosis.
 
 Do you have one transaction per insert? Every transaction 
 means a forced sync to the disk, so you won't get more than 
 about 100-200 commits per second, depending on your actual 
 disk rotation speed.

No, an insert consists of roughly 10,000+ rows per transaction block.  

 
 To improve concurrency of the number of inserters running in 
 parallel, try to tweak the config variables commit_delay and 
 commit_siblings, so you get a higher overall throughput at the 
 cost of an increased delay per connection, and increase the 
 number of inserters. With sensible tweaking, the throughput 
 should scale nearly linearly with the number of backends. :-)

I don't think this will help us here due to large transactions already.
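
For reference, the settings Markus mentions can be set per session or in
postgresql.conf; the values below are placeholders, not recommendations:

SET commit_delay = 100;      -- microseconds to wait before flushing WAL at commit
SET commit_siblings = 5;     -- but only when at least this many other transactions are open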

 
 If feasible for your application, you can also bundle several 
 log entries into a single transaction. If you're CPU bound, 
 you can use COPY instead of INSERT or (if you can wait for 
 8.2) the new multi-row INSERT to further improve performance, 
 but I doubt that you're CPU bound.
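
Sketches of the two alternatives just mentioned, using the stats table from
earlier in the thread:

-- bulk load with COPY instead of individual INSERTs
COPY stats (logtime, key, stat1, stat2, stat3) FROM STDIN;

-- or, with 8.2's multi-row VALUES syntax
INSERT INTO stats (logtime, key, stat1, stat2, stat3)
VALUES (now(), 1, 10, 20, 30),
       (now(), 2, 11, 21, 31);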

 
 The only way to really get over the sync limit is to have 
 (at least) the WAL on a battery backed ram / SSD media that 
 has no spinning disk
 physical limit, or abandon crash safety by turning fsync off.

Again, the problem is not with WAL writing; it's already on its own RAID 1
disk pair.  The I/O pattern we see is about 1-2% load on the WAL array and
100% load on the array holding the indexes and tables. Throughput is very
low, something like 150K-200K bytes/second of real rows being deposited on
the disk.

The disks are busy seeking all over the platter to fetch a block, add a
single row, then seek to another spot and write back a previously dirty
buffer.

 
 Thanks,
 Markus.
 --
 Markus Schaber | Logical Tracking & Tracing International AG
 Dipl. Inf. | Software Development GIS
 
 Fight against software patents in Europe! www.ffii.org 
 www.nosoftwarepatents.org
 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Decreasing BLKSZ

2006-09-25 Thread Marc Morin
 
 
 The bottom line here is likely to be you need more RAM :-(

Yup.  Just trying to get a handle on what I can do if I need more than
16G of RAM... That's as much as I can put on the installed base of
servers, 100s of them.

 
 I wonder whether there is a way to use table partitioning to 
 make the insert pattern more localized?  We'd need to know a 
 lot more about your insertion patterns to guess how, though.
 
   regards, tom lane

We're doing partitioning as well.
 

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Query runs too long for indexed tables

2006-04-04 Thread Marc Morin
Wondering if

UPDATE firma1.rid SET toode=NULL
 WHERE toode IS NOT NULL AND
 NOT EXISTS (SELECT 1 FROM firma1.toode WHERE toode=rid.toode);

would be faster... The problem appears to be the seqscan inside the
seqscan... No?

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Scott Marlowe
 Sent: Tuesday, April 04, 2006 3:49 PM
 To: Andrus
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] Query runs too long for indexed tables
 
 On Tue, 2006-04-04 at 14:37, Andrus wrote:
  I have relatively small tables (toode and rid) on a fast server.
  Both tables are indexed on the toode field.
  
  The following query takes a long time to run.
  The toode field type is char(20). It is difficult to change 
  this field type.
  
  Any idea how to speed up this query ?
  
  UPDATE firma1.rid SET toode=NULL
 WHERE toode IS NOT NULL AND
 toode NOT IN (SELECT TOODE  FROM firma1.TOODE);
  
  Query returned successfully: 0 rows affected, 594813 ms 
 execution time.
  
  explain window shows:
  
  Seq Scan on rid  (cost=2581.07..20862553.77 rows=51848 width=1207)
Filter: ((toode IS NOT NULL) AND (NOT (subplan)))
SubPlan
      ->  Materialize  (cost=2581.07..2944.41 rows=14734 width=84)
            ->  Seq Scan on toode  (cost=0.00..2350.34 rows=14734 width=84)
 
 Let me guess, you've updated it a lot and aren't familiar with Vacuum?
 
 Run a vacuum full on your database.  Schedule a vacuum (a plain 
 one) to run every so often (hours or days are a good interval 
 for most folks).
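 
 For example (a sketch against the table from this thread):
 
 VACUUM FULL ANALYZE firma1.rid;   -- one-off, to reclaim the bloat
 VACUUM ANALYZE firma1.rid;        -- the routine, periodic variant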
 
 If that's NOT your problem, then please, let us know.  
 
 ---(end of 
 broadcast)---
 TIP 2: Don't 'kill -9' the postmaster
 

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Query runs too long for indexed tables

2006-04-04 Thread Marc Morin
Explain analyze would be nice ;-) 
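
i.e., something along the lines of:

EXPLAIN ANALYZE
UPDATE firma1.rid SET toode=NULL
 WHERE toode IS NOT NULL AND
 toode NOT IN (SELECT toode FROM firma1.TOODE);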

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Andrus
 Sent: Tuesday, April 04, 2006 3:37 PM
 To: pgsql-performance@postgresql.org
 Subject: [PERFORM] Query runs too long for indexed tables
 
 I have relatively small tables (toode and rid) on a fast server.
 Both tables are indexed on the toode field.
 
 The following query takes a long time to run.
 The toode field type is char(20). It is difficult to change this 
 field type.
 
 Any idea how to speed up this query ?
 
 UPDATE firma1.rid SET toode=NULL
WHERE toode IS NOT NULL AND
toode NOT IN (SELECT TOODE  FROM firma1.TOODE);
 
 Query returned successfully: 0 rows affected, 594813 ms 
 execution time.
 
 explain window shows:
 
 Seq Scan on rid  (cost=2581.07..20862553.77 rows=51848 width=1207)
   Filter: ((toode IS NOT NULL) AND (NOT (subplan)))
   SubPlan
     ->  Materialize  (cost=2581.07..2944.41 rows=14734 width=84)
           ->  Seq Scan on toode  (cost=0.00..2350.34 rows=14734 width=84)
 
 
 Andrus. 
 
 
 
 ---(end of 
 broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[PERFORM] Trouble managing planner for timestamptz columns

2006-03-10 Thread Marc Morin
 
We have large tables that hold statistics based on time. They are of the
form.

CREATE TABLE stats (
id serial primary key,
logtime timestamptz,
d1 int,
s1 bigint
);

CREATE INDEX idx on stats(logtime);

Some of these tables have new data inserted at a rate of 500,000+ rows /
hour.  The entire table will grow to 10's to 100's of millions of rows in
size.  (Yes, we are also partitioning these; it's the size of an individual
partition that we're talking about.)

We tend to analyze these tables every day or so, and this doesn't always
prove to be sufficient.

Our application is a reporting application and the end users typically
like to query the newest data the most. As such, the queries are of the
form...


select *
from stats
inner join dimension_d1 using (d1)
where logtime between X and Y and d1.something = value;

This usually results in a hash join (a good thing) where the dimension
table is loaded into the hash table and stats is index scanned using the
idx index.

The trouble starts when both X and Y are times after the last analyze.
This restriction clause is outside the range of values in the histogram
created by the last analyze.  Postgres's estimate of the number of rows
returned here is usually very low and incorrect, as you'd expect...

Trouble can occur when the planner flips its decision and decides
to hash join by loading the results of the index scan on idx into the
hash table instead of the dimension table.

Since the table is so large and the system is busy (the disk is not idle
at all), doing an analyze on this table in the production system can take
1/2 hour!  (Statistics target set to 100.)  We can't afford to
analyze more often...

It certainly would be nice if postgres could understand somehow that
some columns are dynamic and that its histogram could be stretched to
the maximal values, or some other technique for estimating rows to the
right of the range of values in the histogram...

Or have some concept of error bars on its planner decisions...

Suggestions? Comments?


Marc

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Trouble managing planner for timestamptz columns

2006-03-10 Thread Marc Morin
Well this analyze just took 12 minutes...  Stats target of 100.

# time psql xxx xxx -c analyze elem_trafficstats_1
ANALYZE

real    12m1.070s
user    0m0.001s
sys     0m0.015s


A large table, but by far, not the largest...  Have about 1 dozen or so
tables like this, so analyzing them will take 3-4 hours of time...  No
weird datatypes, just bigints for facts, timestamptz and ints for
dimensions.

My problem is not the analyze itself; it's the fact that our db is
really busy doing stuff and the analyze I/O is competing... I am random
I/O bound like crazy.

If I set the stats target to 10, I get

# time psql xxx -c set session default_statistics_target to 10; analyze elem_trafficstats_1
ANALYZE

real    2m15.733s
user    0m0.009s
sys     0m2.255s

Better, but not sure what side effect this would have.

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: Friday, March 10, 2006 1:31 PM
 To: Marc Morin
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] Trouble managing planner for 
 timestamptz columns 
 
 Marc Morin [EMAIL PROTECTED] writes:
  We tend to analyze these tables every day or so and this doesn't 
  always prove to be sufficient
 
 Seems to me you just stated your problem.  Instead of having 
 the planner make wild extrapolations, why not set up a cron 
 job to analyze these tables more often?  Or use autovacuum 
 which will do it for you.
 
  Since the table is so large and the system is busy (disk 
 not idle at 
  all), doing an analyze on this table in the production 
 system can take
  1/2 hour!  (statistics target set to 100).
 
 I'd believe that for vacuum analyze, but analyze alone should 
 be cheap.
 Have you perhaps got some weird datatypes in the table?  
 Maybe you should back off the stats target a bit?
 
 We do support analyzing selected columns, so you might try 
 something like a cron job analyzing only the timestamp 
 column, with a suitably low stats target for that column.  
 This would yield numbers far more reliable than any 
 extrapolation the planner could do.
 
   regards, tom lane
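
A sketch of the per-column approach described above, using the stats schema
from the first message in this thread:

ALTER TABLE stats ALTER COLUMN logtime SET STATISTICS 10;
ANALYZE stats (logtime);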
 

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] 10+hrs vs 15min because of just one index

2006-02-12 Thread Marc Morin
We've done a lot of testing on large DBs with a lot of inserts and
have a few comments.

The updates are treated as a large insert, as we all know, from pg's
point of view.

We've run into 2 classes of problems: excessive WAL checkpoints and the
effects of low correlation.

The WAL log writes the full 8K block for the first modification after a
checkpoint, then only the changes.  This can be the source of undesirable
behaviour during large batch inserts like this.

From your config, a checkpoint will be forced when

(checkpoint_segments * 16 MB) < rows * (8K/N*h + (1-h)*8K) * B

where h is the hit rate, or correlation between the update scan and the
index. Do you have a sense of what this is?  In the limits, we have 100%
correlation or 0% correlation.  N is the factor by which the WAL entry for
an already-logged block is cheaper than a full block; not sure what this
is, but it's small, say N=100.  B is the average number of blocks changed
per updated row (assume B=1.1 for your case; the heap and serial index
have very high correlation).

In the 0% correlation case, each updated row will cause the index update
to read/modify the block. The modified block will be entirely written to
the WAL log.  After (30 * 16M) / (8K) / 1.1 ~ 55k rows, a checkpoint
will be forced and all modified blocks in shared buffers will be written
out.

Try increasing checkpoint_segments to 300 and see if that makes a
difference. If so, the excessive WAL checkpoints are your issue. If
performance is exactly the same, then I would assume that you have close
to 0% correlation between the rows in the heap and index.

Can you increase shared_buffers? With a low-correlation index, the only
solution is to hold the working set of blocks in memory.  Also, make
sure that the checkpoint segments are big enough for you to modify the
blocks in place; you don't want checkpoints occurring mid-update.

Note that the more updates you do, the larger the tables/index become
and the worse the problem becomes.  Vacuuming the table is an answer
but unfortunately, it tends to decrease correlation from our
observations. :-(

From our observations, dropping indexes and rebuilding them is not always
practical; it depends on your application, and the table will be
exclusively locked during the transaction due to the drop index.

I haven't looked at pg's code for creating an index, but I seriously
suspect it does an external sort and then inserts into the index.  Such
operations would have 100% correlation from the index insert point of
view, and the sort could be in-memory or the tape variety (a more
efficient I/O pattern).

In summary: the number of indexes, index correlation, pg's multi-versioning,
shared_buffers and checkpoint_segments are interconnected in weird and
wonderful ways... We have seldom found simple solutions to performance
problems.

Marc


 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Aaron Turner
 Sent: Friday, February 10, 2006 3:17 AM
 To: pgsql-performance@postgresql.org
 Subject: [PERFORM] 10+hrs vs 15min because of just one index
 
 So I'm trying to figure out how to optimize my PG install 
 (8.0.3) to get better performance without dropping one of my indexes.
 
 Basically, I have a table of 5M records with 3 columns:
 
 pri_key (SERIAL)
 data char(48)
 groupid integer
 
 there is an additional unique index on the data column.
 
 The problem is that when I update the groupid column for all 
 the records, the query takes over 10hrs (after that I just 
 canceled the update).  Looking at iostat, top, vmstat shows 
 I'm horribly disk IO bound (for data not WAL, CPU 85-90% 
 iowait) and not swapping.
 
 Dropping the unique index on data (which isn't used in the 
 query), running the update and recreating the index  runs in 
 under 15 min. 
 Hence it's pretty clear to me that the index is the problem 
 and there's really nothing worth optimizing in my query.
 
 As I understand from #postgresql, doing an UPDATE on one 
 column causes all indexes for the affected row to have to be 
 updated, due to the way PG replaces the old row with a new one 
 for updates.  This seems to explain why dropping the unique 
 index on data solves the performance problem.
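 
 Concretely, the fast path is roughly (the table and index names below 
 are made up, since they aren't given here):
 
 DROP INDEX data_unique_idx;
 UPDATE mytable SET groupid = 42;
 CREATE UNIQUE INDEX data_unique_idx ON mytable (data);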
 
 interesting settings:
 shared_buffers = 32768
 maintenance_work_mem = 262144
 fsync = true
 wal_sync_method = open_sync
 wal_buffers = 512
 checkpoint_segments = 30
 effective_cache_size = 1
 work_mem = default (1024 i think?)
 
 box:
 Linux 2.6.9-11EL (CentOS 4.1)
 2x Xeon 3.4 HT
 2GB of RAM (but Apache and other services are running)
 4 disk raid 10 (74G Raptor) for data
 4 disk raid 10 (7200rpm) for WAL
 
 other then throwing more spindles at the problem, any suggestions?
 
 Thanks,
 Aaron
 
 --
 Aaron Turner
 http://synfin.net/
 
 ---(end of 
 broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] partitioning and locking problems

2006-02-02 Thread Marc Morin
Using a separate lock table is what we've decided to do in this
particular case to serialize #1 and #3.  Inserters don't take this lock
and as such will not be stalled. 
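
The shape of it (the lock table name and lock modes here are assumptions):

CREATE TABLE partition_maint_lock ();

-- long-running reports (#1) take a shared lock on the lock table...
BEGIN;
LOCK TABLE partition_maint_lock IN SHARE MODE;
-- ... run the report ...
COMMIT;

-- ...while the truncate / rule change (#3) takes an exclusive one, so it
-- waits for running reports to finish.  Inserters (#2) never touch this
-- table, so they are never stalled behind it.
BEGIN;
LOCK TABLE partition_maint_lock IN ACCESS EXCLUSIVE MODE;
-- ... truncate or change the insert rule ...
COMMIT;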

 -Original Message-
 From: Markus Schaber [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, February 02, 2006 7:44 AM
 To: Marc Morin
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] partitioning and locking problems
 
 Hi, Marc,
 
 Marc Morin wrote:
 
  1- long running report is running on view
  2- continuous inserters into view into a table via a rule
  3- truncate or rule change occurs, taking an exclusive lock.
  Must wait for #1 to finish.
  4- new reports and inserters must now wait for #3.
  5- now everyone is waiting for a single query in #1.   Results
  in loss of insert data granularity (important for our application).
 
 Apart from having two separate views (one for report, one for 
 insert) as Richard suggested:
 
 If you have fixed times for #3, don't start any #1 that won't 
 finish before it's time for #3.
 
 You could also use the LOCK command on an empty lock table at 
 the beginning of each #1 or #3 transaction to prevent #3 from 
 getting the view lock before #1 is finished.
 
 
 HTH,
 Markus
 
 --
 Markus Schaber | Logical Tracking & Tracing International AG
 Dipl. Inf. | Software Development GIS
 
 Fight against software patents in EU! www.ffii.org 
 www.nosoftwarepatents.org
 
 

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] partitioning and locking problems

2006-02-01 Thread Marc Morin
Tom,

Do you mean it would be impossible to change the code so that existing
selects continue to use the pre-truncated table until they commit? Or
just require a more extensive change?

The update/insert rule change appears to be more doable? No? 

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, February 01, 2006 12:50 AM
 To: Marc Morin
 Cc: pgsql-performance@postgresql.org
 Subject: Re: [PERFORM] partitioning and locking problems 
 
 Marc Morin [EMAIL PROTECTED] writes:
  Would like to understand the implications of changing postgres'
  code/locking for rule changes and truncate to not require 
 locking out 
  select statements?
 
 It won't work...
 
   regards, tom lane
 
 

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PERFORM] partitioning and locking problems

2006-01-31 Thread Marc Morin
We have a large database system designed around partitioning.  Our
application is characterized with
 
- terabytes of data
- billions of rows in dozens of base tables (and 100s of partitions)
- 24x7 insert load of new data that cannot be stopped, data is time
sensitive.
- periodic reports that can have long running queries with query times
measured in hours
 
We have 2 classes of maintenance activities that are causing us
problems:
- periodically we need to change an insert rule on a view to point to a
different partition.
- periodically we need to delete data that is no longer needed.
Performed via truncate.
 
Under both these circumstances (truncate and create / replace rule) the
locking behaviour of these commands can cause locking problems for us.
The scenario is best illustrated as a series of steps:
 

1- long running report is running on view
2- continuous inserters into view into a table via a rule
3- truncate or rule change occurs, taking an exclusive lock.
Must wait for #1 to finish.
4- new reports and inserters must now wait for #3.
5- now everyone is waiting for a single query in #1.   Results
in loss of insert data granularity (important for our application).

 
We would like to understand the implications of changing postgres'
code/locking so that rule changes and truncate do not require locking out
select statements.
 
The following is a simplified schema to help illustrate the problem.
 

create table a_1
(
pkey int primary key
);
create table a_2
(
pkey int primary key
);
 
create view a as select * from a_1 union all select * from a_2;
 
create function change_rule(int) returns void as
'
begin
execute ''create or replace rule insert as on insert to a do
instead insert into a_''||$1||''(pkey) values(NEW.pkey)'';
end;
' language plpgsql;
 
-- to change the rule, execute something like the following periodically
select change_rule(1);

 
We've looked at the code and the rule changes appear easy but we are
concerned about the required changes for truncate.
 
Thanks
Marc


---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] sort/limit across union all

2005-11-18 Thread Marc Morin



We have a large DB with partitioned tables in postgres. We have had trouble
with an ORDER/LIMIT type query.  The order and limit are not pushed down to
the sub-tables...

CREATE TABLE base
(
    foo int
);

CREATE TABLE bar_0
(
    extra int
) INHERITS (base);
ALTER TABLE bar_0 ADD PRIMARY KEY (foo);

-- repeated for bar_0 ... bar_40

SELECT foo FROM base ORDER BY foo LIMIT 10;

is real slow. What is required to make the query planner generate the
following instead... (a code change, I know, but how hard would it be?)

SELECT foo
FROM
(
    SELECT * FROM bar_0 ORDER BY foo LIMIT 10
    UNION ALL
    SELECT * FROM bar_1 ORDER BY foo LIMIT 10
) AS base
ORDER BY foo
LIMIT 10;




Re: [PERFORM] Sort performance on large tables

2005-11-08 Thread Marc Morin
I have run into this type of query problem as well.  I solved it in my
application by the following type of query.

SELECT tlid
FROM completechain AS o
WHERE not exists ( 
SELECT 1
FROM completechain
WHERE tlid=o.tlid and ogc_fid!=o.ogc_fid
);

Assumes of course that you have an index on tlid.
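
E.g. (a sketch; the index name is arbitrary):

CREATE INDEX completechain_tlid_idx ON completechain (tlid);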

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Charlie Savage
 Sent: Tuesday, November 08, 2005 2:05 AM
 To: pgsql-performance@postgresql.org
 Subject: [PERFORM] Sort performance on large tables
 
 Hi everyone,
 
 I have a question about the performance of sort.
 
 Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1 
 RC 1 with PostGIS, 1 built-in 80 GB IDE drive, 1 SATA Seagate 
 400GB drive.  The IDE drive has the OS and the WAL files, the 
 SATA drive the database. 
  From hdparm the max IO for the IDE drive is about 50Mb/s and 
 the SATA drive is about 65Mb/s.  Thus a very low-end machine 
 - but it used just for development (i.e., it is not a 
 production machine) and the only thing it does is run a 
 PostgresSQL database.
 
 I have a staging table called completechain that holds US 
 tiger data (i.e., streets and addresses for the US).  The 
 table is approximately 18GB.  It's big because there is a lot 
 of data, but also because the table is not normalized (it 
 comes that way).
 
 I want to extract data out of the file, with the most 
 important values being stored in a column called tlid.  The 
 tlid field is an integer, and the values are 98% unique.  
 There is a second column called ogc_fid which is unique (it 
 is a serial field).  I need to extract out unique TLID's 
 (doesn't matter which duplicate I get rid of).  To do this I 
 am running this query:
 
 SELECT tlid, min(ogc_fid)
 FROM completechain
 GROUP BY tlid;
 
 The results from explain analyze are:
 
 GroupAggregate  (cost=10400373.80..11361807.88 rows=48071704 width=8) (actual time=7311682.715..8315746.835 rows=47599910 loops=1)
   ->  Sort  (cost=10400373.80..10520553.06 rows=48071704 width=8) (actual time=7311682.682..7972304.777 rows=48199165 loops=1)
         Sort Key: tlid
         ->  Seq Scan on completechain  (cost=0.00..2228584.04 rows=48071704 width=8) (actual time=27.514..773245.046 rows=48199165 loops=1)
 Total runtime: 8486057.185 ms
   
 Doing a similar query produces the same results:
 
 SELECT DISTINCT ON (tlid) tlid, ogc_fid FROM completechain;
 
 Note it takes over 10 times longer to do the sort than the 
 full sequential scan.
 
 Should I expect results like this?  I realize that the 
 computer is quite low-end and is very IO bound for this 
 query, but I'm still surprised that the sort operation takes so long.
 
 Out of curiosity, I setup an Oracle database on the same 
 machine with the same data and ran the same query.  Oracle 
 was over an order of magnitude faster. Looking at its query 
 plan, it avoided the sort by using HASH GROUP BY.  Does 
 such a construct exist in PostgreSQL (I see only hash joins)?
 
 Also as an experiment I forced oracle to do a sort by running 
 this query:
 
 SELECT tlid, min(ogc_fid)
 FROM completechain
 GROUP BY tlid
 ORDER BY tlid;
 
 Even with this, it was more than a magnitude faster than Postgresql. 
 Which makes me think I have somehow misconfigured postgresql 
 (see the relevant parts of postgresql.conf below).
 
 Any idea/help appreciated.
 
 Thanks,
 
 Charlie
 
 
 ---
 
 #-
 --
 # RESOURCE USAGE (except WAL)
 #-
 --
 
 shared_buffers = 40000               # 40000 buffers * 8192 bytes/buffer = 327,680,000 bytes
 #shared_buffers = 1000# min 16 or 
 max_connections*2, 8KB each
 
 temp_buffers = 5000
 #temp_buffers = 1000  # min 100, 8KB each
 #max_prepared_transactions = 5# can be 0 or more
 # note: increasing max_prepared_transactions costs ~600 bytes 
 of shared memory # per transaction slot, plus lock space (see 
 max_locks_per_transaction).
 
 work_mem = 16384                      # in KB
 #work_mem = 1024                      # min 64, size in KB
 
 maintenance_work_mem = 262144         # in KB
 #maintenance_work_mem = 16384 # min 1024, size in KB
 #max_stack_depth = 2048   # min 100, size in KB
 
 # - Free Space Map -
 
 max_fsm_pages = 6 
 #max_fsm_pages = 2# min 
 max_fsm_relations*16, 6 bytes each
 
 #max_fsm_relations = 1000 # min 100, ~70 bytes each
 
 # - Kernel Resource Usage -
 
 #max_files_per_process = 1000 # min 25
 #preload_libraries = ''
 
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0# 0-1000 milliseconds
 #vacuum_cost_page_hit = 1 # 0-1 credits
 #vacuum_cost_page_miss = 10   # 0-1 credits