Re: [PERFORM] Selects query stats?

2006-05-24 Thread Robin Ericsson

On 5/23/06, Dan Gorman [EMAIL PROTECTED] wrote:


In any other DB (Oracle, MySQL) I can see how many queries (selects) per second
the database is executing. How do I get this number out of Postgres?


MySQL, AFAIR, only counts the total number of queries and then divides by the
seconds since startup to estimate queries per second. If your server is
hammered with queries for only one hour a day, that's not giving you a fair
result.
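
A rough sketch of the estimate being described, assuming the stock MySQL
status counters:

SHOW STATUS LIKE 'Questions';  -- cumulative statement count since server start
SHOW STATUS LIKE 'Uptime';     -- seconds since server start
-- The "queries per second avg" figure reported by tools like mysqladmin
-- is roughly Questions / Uptime, i.e. an average over the whole uptime,
-- not a measure of the current load.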


--
   regards,
   Robin



[PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Daniel J. Luke
I have a system that currently inserts ~ 250 million rows per day (I  
have about 10k more raw data than that, but I'm at the limit of my  
ability to get useful insert performance out of postgres).


Things I've already done that have made a big difference:
- modified postgresql.conf shared_buffers value
- converted to COPY from individual insert statements
- changed BLCKSZ to 32768

I currently get ~35k/sec inserts on a table with one index (~70k/sec  
inserts if I don't have any indexes).
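
For reference, a minimal sketch of the COPY conversion (the table and column
names here are made up for illustration):

-- Before: one INSERT per row, each parsed and planned separately
INSERT INTO events_20060524 (ts, src, val) VALUES (1148504400, 17, 42);

-- After: one COPY per batch, streamed through psql's standard input
COPY events_20060524 (ts, src, val) FROM STDIN;
1148504400	17	42
1148504401	23	7
\.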


The indexed field is basically a time_t (seconds since the epoch);
autovacuum is running (or postgres would stop choosing to use the
index). The other fields have relatively lower cardinality.


Each day's worth of data gets inserted into its own table so that I
can expire the data without too much effort (since DROP TABLE is much
faster than running a DELETE and then a VACUUM).
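
A sketch of that per-day layout (hypothetical schema; the real table
definition isn't shown in the thread):

-- One table per day; the batches for 2006-05-24 land here
CREATE TABLE events_20060524 (
    ts   integer NOT NULL,  -- time_t, seconds since the epoch
    src  integer,
    val  integer
);
CREATE INDEX events_20060524_ts_idx ON events_20060524 (ts);

-- Expiring old data is a cheap catalog operation rather than DELETE + VACUUM
DROP TABLE events_20060417;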


I would really like to have one (or two) more indexes on the table,
since a sequential scan of 250 million rows takes a while to complete,
but with the extra indexes CPU time goes way up.


In fact, it looks like I'm currently CPU-bound rather than I/O-bound. I
think some sort of lazy index generation (especially if it could be
parallelized to use the other processors/cores that currently sit
mostly idle) would be a solution. Is anyone working on something like
this? Any other ideas? Where should I look if I want to start thinking
about creating a new index that would work this way (or am I just
crazy)?


Thanks for any insight!

--
Daniel J. Luke






Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Dave Dutcher

If you can live with possible database corruption, you could try turning
fsync off.  For example, if you could just reinsert the data on the off
chance a hardware failure corrupts the database, you might get a decent
improvement.

Also, have you tried creating the index after you have inserted all your
data?  (Or maybe COPY already disables the indexes while inserting?)
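
A sketch of the load-then-index pattern being suggested (hypothetical names;
it only works if the table isn't needed for queries while it is being loaded):

BEGIN;
CREATE TABLE events_20060525 (ts integer NOT NULL, src integer, val integer);
COPY events_20060525 FROM '/data/batches/20060525.copy';      -- bulk load, no indexes yet
CREATE INDEX events_20060525_ts_idx ON events_20060525 (ts);  -- build the index once, afterwards
COMMIT;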




Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Daniel J. Luke

On May 24, 2006, at 4:02 PM, Dave Dutcher wrote:
> If you can live with possible database corruption, you could try turning
> fsync off.  For example, if you could just reinsert the data on the off
> chance a hardware failure corrupts the database, you might get a decent
> improvement.

I tried, but I didn't see much of an improvement (and it's not really
acceptable for this application).

> Also, have you tried creating the index after you have inserted all your
> data?  (Or maybe COPY already disables the indexes while inserting?)

The data gets inserted in batches every 5 minutes and I potentially
have people querying it constantly, so I can't remove and re-create
the index.


--
Daniel J. Luke






Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Daniel J. Luke

On May 24, 2006, at 4:03 PM, Steinar H. Gunderson wrote:
> Have you tried fiddling with the checkpointing settings? Check your logs --
> if you get a warning about checkpoints being too close together, that should
> give you quite some boost.

No warnings in the log (I did change the checkpoint settings when I
set up the database, but didn't notice an appreciable difference in
insert performance).

> Apart from that, you should have quite a bit to go on -- somebody on this
> list reported 2 billion rows/day earlier, but it might have been on beefier
> hardware, of course. :-)

Probably :) I'll keep searching the list archives and see if I find
anything else (I did some searching and didn't find anything that I
hadn't already tried).


Thanks!

--
Daniel J. Luke






Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Larry Rosenman
Daniel J. Luke wrote:
> On May 24, 2006, at 4:02 PM, Dave Dutcher wrote:
>> If you can live with possible database corruption, you could try
>> turning fsync off.  For example, if you could just reinsert the data
>> on the off chance a hardware failure corrupts the database, you
>> might get a decent improvement.
>
> I tried, but I didn't see much of an improvement (and it's not really
> acceptable for this application).
>
>> Also, have you tried creating the index after you have inserted all
>> your data?  (Or maybe COPY already disables the indexes while
>> inserting?)
>
> The data gets inserted in batches every 5 minutes and I potentially
> have people querying it constantly, so I can't remove and re-create
> the index.

Are the batches single inserts, or wrapped in a big transaction?

I.e., do the inserts look like:
INSERT
INSERT
INSERT

or

BEGIN
INSERT
INSERT
INSERT
COMMIT

If the former, the latter is a big win.

Also, what release(s) are you running?

LER

-- 
Larry Rosenman  
Database Support Engineer

PERVASIVE SOFTWARE, INC.
12365B RIATA TRACE PKWY
3015
AUSTIN TX  78727-6531 

Tel: 512.231.6173
Fax: 512.231.6597
Email: [EMAIL PROTECTED]
Web: www.pervasive.com 



Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Steinar H. Gunderson
On Wed, May 24, 2006 at 04:09:54PM -0400, Daniel J. Luke wrote:
 no warnings in the log (I did change the checkpoint settings when I  
 set up the database, but didn't notice an appreciable difference in  
 insert performance).

How about wal_buffers? Upping it might not help all that much if only one
thread is writing, but you might give it a try...
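
A sketch of that experiment (the value below is only an example, not a
recommendation):

-- In postgresql.conf (changing wal_buffers requires a server restart on 8.1):
--   wal_buffers = 64        -- up from the 8.1 default of 8 WAL blocks
-- After the restart, confirm the running value:
SHOW wal_buffers;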

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Mark Lewis
 The data gets inserted in batches every 5 minutes and I potentially  
 have people querying it constantly, so I can't remove and re-create  
 the index.

How live does your data need to be?  One possibility would be to use a
separate table for each batch instead of a separate table per day,
create the indexes after the import, and make the table available for
user queries only once the indexes are in place.

You'd be trading latency for throughput in that case.
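
A sketch of that approach, assuming readers can be pointed at a view (all
names here are hypothetical):

-- Load each 5-minute batch into its own table, with no indexes in the way
CREATE TABLE events_20060524_1640 (ts integer NOT NULL, src integer, val integer);
COPY events_20060524_1640 FROM '/data/batches/20060524_1640.copy';

-- Build the indexes only after the bulk load
CREATE INDEX events_20060524_1640_ts_idx ON events_20060524_1640 (ts);

-- Only then make the new batch visible to readers
CREATE OR REPLACE VIEW events_current AS
    SELECT * FROM events_20060524_1635
    UNION ALL
    SELECT * FROM events_20060524_1640;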

Also, you mentioned that you're CPU-bound, but that you have multiple
CPU's.  In that case, performing N concurrent imports (where N is the
number of processor cores available) might be a win over a single-
threaded import.

-- Mark Lewis



Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Daniel J. Luke

On May 24, 2006, at 4:13 PM, Steinar H. Gunderson wrote:
> On Wed, May 24, 2006 at 04:09:54PM -0400, Daniel J. Luke wrote:
>> no warnings in the log (I did change the checkpoint settings when I
>> set up the database, but didn't notice an appreciable difference in
>> insert performance).
>
> How about wal_buffers? Upping it might not help all that much if only one
> thread is writing, but you might give it a try...

I tried, but I didn't notice a difference.

I should probably emphasize that I appear to be CPU-bound (and I can
double my # of rows inserted per second by removing the index on the
table, or halve it by adding another index).


I really should run gprof just to verify.

--
Daniel J. Luke






Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Daniel J. Luke

On May 24, 2006, at 4:12 PM, Larry Rosenman wrote:
> Are the batches single inserts, or wrapped in a big transaction?
> If the former, the latter is a big win.

One big transaction every 5 minutes using 'COPY FROM' (instead of
inserts).

> Also, what release(s) are you running?

8.1.x (I think we're upgrading from 8.1.3 to 8.1.4 today).

--
Daniel J. Luke






Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Larry Rosenman
Daniel J. Luke wrote:
> On May 24, 2006, at 4:12 PM, Larry Rosenman wrote:
>> Are the batches single inserts, or wrapped in a big transaction?
>> If the former, the latter is a big win.
>
> One big transaction every 5 minutes using 'COPY FROM' (instead of
> inserts).
>
>> Also, what release(s) are you running?
>
> 8.1.x (I think we're upgrading from 8.1.3 to 8.1.4 today).

Had to ask :)

Also, is pg_xlog on the same or different spindles from the rest of the
PG Data directory?

LER




Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Daniel J. Luke

On May 24, 2006, at 4:24 PM, Larry Rosenman wrote:
> Also, is pg_xlog on the same or different spindles from the rest of the
> PG Data directory?

It's sitting on the same disk array (but I'm doing 1 transaction
every 5 minutes, and I'm not near the array's sustained write
capacity, so I don't think that's currently limiting performance).


--
Daniel J. Luke






Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Jim C. Nasby
On Wed, May 24, 2006 at 04:09:54PM -0400, Daniel J. Luke wrote:
> On May 24, 2006, at 4:03 PM, Steinar H. Gunderson wrote:
>> Have you tried fiddling with the checkpointing settings? Check your logs --
>> if you get a warning about checkpoints being too close together, that should
>> give you quite some boost.
>
> no warnings in the log (I did change the checkpoint settings when I
> set up the database, but didn't notice an appreciable difference in
> insert performance).

Keep in mind that the default warning time of 30 seconds is pretty
conservative; you'd want to bump that up to 300 seconds or so, probably.
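
For reference, a sketch of the settings in question (the numbers are
illustrative only):

-- In postgresql.conf:
--   checkpoint_segments = 64     -- default is 3; more segments = fewer forced checkpoints
--   checkpoint_timeout  = 900    -- seconds between timed checkpoints
--   checkpoint_warning  = 300    -- warn if checkpoints come closer than this (default 30s)
-- Reload, then watch the log for "checkpoints are occurring too frequently".
SHOW checkpoint_segments;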

As for the suggestion of multiple insert runs at a time, I suspect that
would just result in a lot of contention for some mutex/semaphore in the
index.

Your best bet really is to run gprof and post those results. It's also
possible that this is fixed by a recent patch to HEAD that reduces the
amount of traffic on the index metapage, something gprof would probably
confirm.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] Selects query stats?

2006-05-24 Thread Jim C. Nasby
On Wed, May 24, 2006 at 12:27:41PM +0200, Robin Ericsson wrote:
> On 5/23/06, Dan Gorman [EMAIL PROTECTED] wrote:
>> In any other DB (Oracle, MySQL) I can see how many queries (selects) per
>> second the database is executing. How do I get this number out of Postgres?
>
> MySQL, AFAIR, only counts the total number of queries and then divides by
> the seconds since startup to estimate queries per second. If your server is
> hammered with queries for only one hour a day, that's not giving you a fair
> result.

Somehow that doesn't surprise me...

In any case, if we at least provide a raw counter, it's not that hard to
turn that into selects per second over some period of time.
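
A sketch of that arithmetic against a counter that already exists today
(xact_commit counts committed transactions rather than SELECTs specifically,
so treat it as a stand-in):

-- Sample a cumulative counter twice, some interval apart
SELECT now() AS sampled_at, xact_commit
FROM pg_stat_database
WHERE datname = current_database();

-- ... wait N seconds and run the same query again; then
--   rate per second = (second xact_commit - first xact_commit) / N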
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



[PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread kynn




I want to optimize this simple join:

SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id )

huge_table has about 2.5 million records, can be assumed to be fixed, and
has the following index:

CREATE INDEX huge_table_index ON huge_table( UPPER( id ) );

...while tiny_table changes with each user request, and typically will
contain on the order of 100-1000 records.  For this analysis, I put
300 records in tiny_table, resulting in 505 records in the join.

I tried several approaches.  In order of increasing speed of
execution:

1. executed as shown above, with enable_seqscan on: about 100 s.

2. executed as shown above, with enable_seqscan off: about 10 s.

3. executed with a LIMIT 6000 clause added to the SELECT statement, and
   enable_seqscan on: about 5 s.

4. executed with a LIMIT 600 clause added to the SELECT statement, and
   enable_seqscan on: less than 1 s.



Clearly, using LIMIT is the way to go.  Unfortunately I *do* want all
the records that would have been produced without the LIMIT clause,
and I don't have a formula for the limit that will guarantee this.  I
could use a very large value (e.g. 20x the size of tiny_table, as in
approach 3 above) which would make the probability of hitting the
limit very small, but unfortunately, the query plan in this case is
different from the query plan when the limit is just above the
expected number of results (approach 4 above).

The query plan for the fastest approach is this:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=0.01..2338.75 rows=600 width=84)
   ->  Nested Loop  (cost=0.01..14766453.89 rows=3788315 width=84)
         ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
         ->  Index Scan using huge_table_index on huge_table h  (cost=0.01..48871.80 rows=12628 width=46)
               Index Cond: (upper(("outer".id)::text) = upper((h.id)::text))



How can I *force* this query plan even with a higher limit value?

I found, by dumb trial and error, that in this case the switch happens
at LIMIT 5432, which, FWIW, is about 0.2% of the size of huge_table.
Is there a simpler way to determine this limit (hopefully
programmatically)?
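
For what it's worth, one blunt session-level way to push the planner toward
the nested-loop/index plan without relying on LIMIT (a sketch only, not
something tested in this thread):

BEGIN;
SET LOCAL enable_hashjoin = off;  -- rule out the hash-join plan
SET LOCAL enable_seqscan  = off;  -- discourage the seq scan of huge_table
SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id );
COMMIT;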


Alternatively, I could compute the value for LIMIT as 2x the number of
records in tiny_table, and if the number of records found is *exactly*
this number, I would know that (most likely) some records were left
out.  In this case, I could use the fact that, according to the query
plan above, the scan of tiny_table is sequential to infer which
records in tiny_table were disregarded when the limit was reached, and
then repeat the query with only these left over records in tiny_table.

What's your opinion of this strategy?  Is there a good way to improve
it?

Many thanks in advance!

kj

PS:  FWIW, the query plan for the query with LIMIT 6000 is this:

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=19676.75..21327.99 rows=6000 width=84)
   ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
         Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
         ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
         ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
               ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)





Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread Joshua D. Drake



> kj
>
> PS:  FWIW, the query plan for the query with LIMIT 6000 is this:

What is the EXPLAIN ANALYZE output?







--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/





Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread Tom Lane
[EMAIL PROTECTED] writes:
>  Limit  (cost=19676.75..21327.99 rows=6000 width=84)
>    ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
>          Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
>          ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
>          ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
>                ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)

Um, if huge_table is so much bigger than tiny_table, why are the cost
estimates for seqscanning them only about 2.5x different?  There's
something wacko about your statistics, methinks.
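
A sketch of the usual first checks when a ~300-row table draws a seqscan cost
in the tens of thousands (stale statistics or a bloated table):

-- Refresh the planner's statistics
ANALYZE tiny_table;
ANALYZE huge_table;

-- See what the planner thinks the tables look like
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('tiny_table', 'huge_table');

-- If tiny_table still shows thousands of pages for ~300 rows, it is bloated;
-- VACUUM FULL tiny_table; (or CLUSTER) would shrink it.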

regards, tom lane



Re: [PERFORM] Getting even more insert performance (250m+rows/day)

2006-05-24 Thread Ian Westmacott
We were able to achieve 2B (small) rows per day sustained with
very little latency.  It is beefy hardware, but things that did
help include WAL on its own I/O channel, XFS, binary copy,
and tuning bgwriter and checkpoint settings for the application
and hardware.  Things that didn't help much were shared_buffers
and wal_buffers.  But our application has a single writer and a
small number of readers.

Although there is tons of great advice in this and other forums,
I think you just have to do a lot of experimentation with careful
measurement to find what's right for your application/environment.
i.e., YMMV.


