Re: [GENERAL] Slow query performance

2008-11-02 Thread Joris Dobbelsteen

Kevin Galligan wrote, On 29-10-08 23:35:

An example of a slow query is...

select count(*) from bigdatatable where age between 22 and 40 and state 
= 'NY';


explain analyze returned the following...

 Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual 
time=389529.895..389529.897 rows=1 loops=1)
   ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63 
rows=2795968 width=0) (actual time=6727.848..387159.175 
rows=2553273 loops=1)

 Recheck Cond: ((state)::text = 'NY'::text)
 Filter: ((age >= 22) AND (age <= 40))
 ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66 
rows=15425370 width=0) (actual time=6298.950..6298.950 
rows=16821828 loops=1)

   Index Cond: ((state)::text = 'NY'::text)
 Total runtime: 389544.088 ms

It looks like the index scans are around 6 seconds or so each, but then 
the bitmap heap scan and aggregate jump up to 6 minutes.


Indeed. The cause is that PostgreSQL must visit the table data itself to 
verify that each row is visible to your transaction. This means A LOT of 
data must be retrieved from disk.


The only real thing you can do is reduce the I/O load, by reducing the 
amount of data that must be traversed (or by ensuring the data is stored 
closely together, but that's really hard to get right). This requires 
optimizing your database design for that single goal.
This will not make it scale any better than it currently does, however. 
The query will still scale O(N) with the size of your table; you need 
other techniques to do better.
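
For pure counts over data that is read-only once loaded, one such 
technique is pre-aggregation. A sketch only (the summary table name is 
made up, the other names are taken from the query above):

CREATE TABLE state_age_counts AS
    SELECT state, age, count(*) AS n
    FROM bigdatatable
    GROUP BY state, age;

-- the original count then touches only a handful of summary rows
SELECT sum(n)
FROM state_age_counts
WHERE state = 'NY' AND age BETWEEN 22 AND 40;

Of course this only helps for predicates you can anticipate in advance.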


Another thing is spending extra money on hardware that can sustain 
higher I/O seek rates (more and/or faster spindles).


- Joris


More detail on the table design and other stuff in a bit...


On Wed, Oct 29, 2008 at 6:18 PM, Scott Marlowe <[EMAIL PROTECTED]> wrote:


On Wed, Oct 29, 2008 at 2:18 PM, Kevin Galligan <[EMAIL PROTECTED]> wrote:
 > I'm approaching the end of my rope here.  I have a large database.
 > 250 million rows (ish).  Each row has potentially about 500 pieces of
 > data, although most of the columns are sparsely populated.

A couple of notes here.  PostgreSQL stores null values as a single bit
in a bit field, making sparsely populated tables quite efficient as
long as you store the non-existent values as null and not '' or some
other real value.

Have you run explain analyze on your queries yet?  Pick a slow one,
run explain analyze on it and post it and we'll see what we can do.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slow query performance

2008-10-31 Thread Isak Hansen
On Wed, Oct 29, 2008 at 9:18 PM, Kevin Galligan <[EMAIL PROTECTED]> wrote:
> I'm approaching the end of my rope here.  I have a large database.
> 250 million rows (ish).  Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.
>
*snip*
>
> So, went the other direction completely.  I rebuilt the database with
> a much larger main table.  Any values with 5% or greater filled in
> rows were added to this table.  Maybe 130 columns.  Indexes applied to
> most of these.  Some limited testing with a smaller table seemed to
> indicate that queries on a single table without a join would work much
> faster.
>
> So, built that huge table.  Now query time is terrible.  Maybe a
> minute or more for simple queries.

Are indexes on sparsely populated columns already handled efficiently,
or could partial indexes with only non-null values improve things?
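
For what it's worth, a partial index over only the non-null values would 
look something like this (a sketch, borrowing one of the made-up column 
names from Kevin's first post):

CREATE INDEX idx_dirbtl_notnull ON bigdatatable (dirbtl)
    WHERE dirbtl IS NOT NULL;

-- a predicate like "dirbtl IS NOT NULL AND ..." can then use an index
-- that contains only the small fraction of rows that are populated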


Isak

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slow query performance

2008-10-31 Thread Nick Mellor
Hi Kevin,

I'm not deeply knowledgeable about PostgreSQL, but my guess is that 2 things
are doing you in:

(1) scanning all those nulls during SELECTs (even though PostgreSQL is
efficient at nulls, there are still tens or hundreds of billions of them)

(2) All those single-field indexes, and aggregations between them

Both (1) and (2) make it very difficult for PG to cache effectively.

You have the advantage that your data is read-only at query time, so I'd
suggest this (fairly lengthy) experiment:

Go back to a split-table format, where you have:

- one rectangular table containing those fields which are always, or nearly
always, filled (the "Full" table.) You talk about there being "some knowns".
Does that mean that you know some combinations of two or more fields will be
selected on very frequently? Optimise any of these combinations in the
"full" table fields using multi-field indexes across these combinations
(e.g. (state,age).) Put full single-field indexes on all fields in the
"Full" table.

- one or more tables (one initially, see below) containing the fields that
are mostly null (the "Sparse" table.) Store the sparse data in a "depivoted"
form. Explanation follows:


"Sparse" Table
---

Instead of (the original "sparse" table):

Id,field1,field2,field3,field4,...field500
Rec1,...
Rec2,...
...
RecM,...

store the sparse data as:

Id,fieldname,value
Rec1,field1name,field1value(Rec1)
Rec1,field2name,field2value(Rec1)
...
Rec1,field500name,field500value(Rec1)
Rec2,field1name,field1value(Rec2)
...
RecM,field500name,field500value(RecM)

I.e. one row per cell in the "sparse" table, and an Id to link to the "Full"
table.

For null values, don't store a depivoted record at all. I'd estimate this
would give you a few billion rows at most in this table.

(If anyone has a better name for this process than "depivoting", please pass
it along!)

In the depivoted table, put single-field indexes and multi-field indexes on
every combination of Id, fieldname, value in the depivoted data:
(Id)
(fieldname)
(value)
(Id,fieldname)
(Id,value)
(fieldname,value)
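
In SQL, a sketch of the table and index set just described (names and 
column sizes here are illustrative, not from the original post):

CREATE TABLE sparse (
    id        integer NOT NULL,   -- links back to the "Full" table
    fieldname varchar(32) NOT NULL,
    value     varchar(64)         -- all values as text for a first experiment (see below)
);

CREATE INDEX sparse_id          ON sparse (id);
CREATE INDEX sparse_field       ON sparse (fieldname);
CREATE INDEX sparse_value       ON sparse (value);
CREATE INDEX sparse_id_field    ON sparse (id, fieldname);
CREATE INDEX sparse_id_value    ON sparse (id, value);
CREATE INDEX sparse_field_value ON sparse (fieldname, value);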

You might eventually have to keep a different depivoted table for each type
of field value (boolean, integer, character varying etc) but you could do a
dirty experiment by converting all values to CHARACTER VARYING and having a
look at query performance using the new structure before doing further work.


Rationale
---

"Depivoting" makes sense to me because your data is so sparse. The huge
number of nulls may be causing severe aggregation bottlenecks and many cache
misses. All those indexes will absolutely kill the query performance of any
database in terms of hard disk seeks, data cache limits and aggregation
time. I'm not surprised that both MySQL and PostgreSQL struggle.

Too many indexes probably gave you a diminishing return on a table this big
because 15G RAM on your server is way too small a cache for so many fields
and indexes.

"Depivoting" eliminates the need to scan the huge number of nulls in the
dataset. Even if nulls are very efficiently handled in PostgreSQL, you're
talking about hundreds of billions of them, and hundreds of billions of
anything is never going to be quick. Better not to process them at all.

"De-pivoting" will (of course) eventually mean rewriting all your querying
code, and you'll probably need to "rebuild" the sparse data into a wide
table format at some point. But if the result set is small, this should be a
small price to pay for better SELECT query performance.
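
As a sketch of what that rewrite might look like, Kevin's example
predicate "dirbtl is not null and qqrq between 20 and 40" becomes a
self-join against the depivoted table above (names still illustrative,
and assuming the all-varchar experiment):

SELECT count(*)
FROM sparse d
JOIN sparse q ON q.id = d.id
WHERE d.fieldname = 'dirbtl'               -- "is not null" = the row simply exists
  AND q.fieldname = 'qqrq'
  AND q.value::integer BETWEEN 20 AND 40;  -- cast needed while values are varchar

In practice the numeric fields would want a properly typed column or
their own table, as noted above, rather than a cast.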

If you want to do a quick and dirty experiment, I have an MS Access app that
depivots arbitrarily wide tables. I'd be glad to pass it along, although
there is a risk it's too small a gun for the job. But if Access manages okay
with the depivoted table, it might be worth a try. Based on 500 fields, 250M
records, 2% filled it looks like it might depivot your table overnight, or
better. You'd finish with about 2.5 billion rows.

Best wishes,

Nick

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-general-
> [EMAIL PROTECTED] On Behalf Of Kevin Galligan
> Sent: Thursday, 30 October 2008 7:18 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Slow query performance
> 
> I'm approaching the end of my rope here.  I have a large database.
> 250 million rows (ish).  Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.
> 
> What I'm trying to do is, essentially, search for sub-sets of that
> data based on arbitrary queries of those data columns.  the queries
> would be relatively simple ("dirbtl is not null and qqrq between 20
> and 40").  After the database is built, it is read only.
> 
> So, I started with maybe 10-15 fields in 

Re: FW: [GENERAL] Slow query performance

2008-10-29 Thread Kevin Galligan
I agree with the concept.  The issue is the application is an open ended
query tool.  So, pretty much whatever they feel like entering is valid.  I
totally understand the indexes aren't very selective.  I guess I just don't
know what the next step is.  There aren't a lot of assumptions I can make
about how the queries are run.

I'm in the uniquely bad situation that there exists something with very
similar data that works, so it's impossible to tell them "You can't run a
state query by itself with some other data.  You always need an age bound"
or whatever.  The other application works reasonably fast.  I also don't
know how it's built, which makes my life that much more difficult.

All of my experience is with small or medium-sized databases.  The front end
itself is done, but I've had the client on the hook for a while now trying
to sort this out.  Not sure what the next step is.

I tried the other extreme.  Age in its own table.  Just 'account
integer' and 'fval smallint'.  fval is the age value.

explain analyze select count(*) from jage where fval between 22 and 33;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=1499316.66..1499316.67 rows=1 width=0) (actual
time=75985.403..75985.404 rows=1 loops=1)
   ->  Bitmap Heap Scan on jage  (cost=365374.78..1456268.39 rows=17219307
width=0) (actual time=7930.354..56879.811 rows=18016538 loops=1)
 Recheck Cond: ((fval >= 22) AND (fval <= 33))
 ->  Bitmap Index Scan on idx_tjage  (cost=0.00..361069.96
rows=17219307 width=0) (actual time=7084.535..7084.535 rows=18016538
loops=1)
   Index Cond: ((fval >= 22) AND (fval <= 33))
 Total runtime: 76015.215 ms


Still took over a minute to do the count.  It seems like the index scan
happens pretty fast, but the last steps go from 7 or 8 seconds to over a
minute.

On Wed, Oct 29, 2008 at 7:52 PM, Dann Corbit <[EMAIL PROTECTED]> wrote:

> From: Kevin Galligan [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 29, 2008 4:34 PM
> To: Dann Corbit
> Cc: pgsql-general@postgresql.org
> Subject: Re: FW: [GENERAL] Slow query performance
>
>
>
> Sorry for the lack of detail.  Index on both state and age.  Not a
> clustered index on both, as the queries are fairly arbitrary (that's the short
> answer.  The long answer is that, at least with those columns, something
> like that MAY be an option later but I don't know enough now).
>
> I don't have the gui admin set up, as I'm doing this over ssh.  Will get
> the full table definition in a bit.  The short answer is simple btree
> indexes on the columns being searched.  I was applying simple indexes on all
> the columns, as there will be queries like "where [col] is not null",
> although in retrospect, that's fairly pointless unless the column has very
> little data.  Even then, maybe.
>
> Anyway, looking at the output, the time goes from 6727.848 to 387159.175
> during the bitmap heap scan (I was reading it wrong about the Aggregate
> line).  Considering the size involved, is this something that PostgreSQL has
> decided is too big to be done in memory?  That would be my wild guess.
>
> Ran another query.  This one even simpler.  Still quite long...
>
> explain analyze select count(*) from bigdatatable where age between 22 and
> 23 and state = 'NY';
>                                  QUERY PLAN
> -------------------------------------------------------------------------------
>  Aggregate  (cost=37.41..37.42 rows=1 width=0) (actual
> time=217998.706..217998.707 rows=1 loops=1)
>->  Index Scan using idx_jage on bigdatatable  (cost=0.00..37.41 rows=1
> width=0) (actual time=247.209..217988.584 rows=10303 loops=1)
>  Index Cond: ((age >= 22) AND (age <= 23))
>  Filter: ((state)::text = 'NY'::text)
>  Total runtime: 217998.800 ms
>
> Abbreviated schema below.  The table is huge.  Originally I had a design
> with a main "anchor" table that had all records, and most of those columns
> were in other tables I would join for the search.  This didn't perform very
> well, so I decided to go the other way and see how it worked.  I did a count
> on all data, and anything with records in fewer than 5 percent of the rows,
> I put in their own table.  Everything else is in this big one.  The indexing
> strategy is a joke right now.  I just applied one to each.  This is still in
> the testing phase.
>
> I had set this up on mysql.  The joins on the full

Re: FW: [GENERAL] Slow query performance

2008-10-29 Thread Dann Corbit
From: Kevin Galligan [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 29, 2008 4:34 PM
To: Dann Corbit
Cc: pgsql-general@postgresql.org
Subject: Re: FW: [GENERAL] Slow query performance

 

Sorry for the lack of detail.  Index on both state and age.  Not a
clustered index on both, as the queries are fairly arbitrary (that's the short
answer.  The long answer is that, at least with those columns, something
like that MAY be an option later but I don't know enough now).

I don't have the gui admin set up, as I'm doing this over ssh.  Will get
the full table definition in a bit.  The short answer is simple btree
indexes on the columns being searched.  I was applying simple indexes on
all the columns, as there will be queries like "where [col] is not
null", although in retrospect, that's fairly pointless unless the column
has very little data.  Even then, maybe.

Anyway, looking at the output, the time goes from 6727.848 to 387159.175
during the bitmap heap scan (I was reading it wrong about the Aggregate
line).  Considering the size involved, is this something that PostgreSQL
has decided is too big to be done in memory?  That would be my wild
guess.

Ran another query.  This one even simpler.  Still quite long...

explain analyze select count(*) from bigdatatable where age between 22
and 23 and state = 'NY';
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=37.41..37.42 rows=1 width=0) (actual
time=217998.706..217998.707 rows=1 loops=1)
   ->  Index Scan using idx_jage on bigdatatable  (cost=0.00..37.41
rows=1 width=0) (actual time=247.209..217988.584 rows=10303 loops=1)
 Index Cond: ((age >= 22) AND (age <= 23))
 Filter: ((state)::text = 'NY'::text)
 Total runtime: 217998.800 ms

Abbreviated schema below.  The table is huge.  Originally I had a design
with a main "anchor" table that had all records, and most of those
columns were in other tables I would join for the search.  This didn't
perform very well, so I decided to go the other way and see how it
worked.  I did a count on all data, and anything with records in fewer
than 5 percent of the rows, I put in their own table.  Everything else
is in this big one.  The indexing strategy is a joke right now.  I just
applied one to each.  This is still in the testing phase.

I had set this up on mysql.  The joins on the full size db turned out to
be terrible.  I'm currently setting up the "one large table" design on
mysql to see how that works.

The obvious answer would be that the table is huge, so when the query is
running, it's grabbing all that data and not using much of it.  True.
However, I'm not sure how to approach the design now.  It's rarely going
to need data from anything other than a few columns, but joining across
tens or hundreds of millions of records didn't seem that much fun either.
Thoughts?

CREATE TABLE bigdatatable (
account integer,
city character varying(20),
zip character(5),
dincome character(1),
sex character(1),
mob boolean,
religion character(1),
groupcd character(1),
lastdata character varying(4),
countycd character varying(3),
state character varying(2),
dutype character varying(1),
orders integer,
countysz character varying(1),
language character varying(2),
cbsacode character varying(5),
cbsatype character varying(1),
age smallint,
dob date,
ccard boolean,
lor integer,
bankcard boolean,
lastord date,
total integer,
lmob boolean,
homeown character varying(1),
ord1st date,
ordlast date,
married boolean,
deptcard boolean,
ordtotm smallint,
ordlastm date,
ord1stm date,
orddolm smallint,
pcuser boolean,
homeval character varying(1),
mailresp boolean,
lhomepc boolean,
dirrspby boolean,
mgift boolean,
lebuyer boolean,
payother smallint,
lhomdecr boolean,
driver boolean,
ordtote smallint,
ord1ste date,
ordlaste date,
orddole smallint,
mhmdecor boolean,
oddsnend smallint,
aptot smallint,
apwk smallint,
apdol smallint,
payccrd smallint,
landval smallint,
mfapparl boolean,
mgengift boolean,
homeblt smallint,
homebydt date,
educate character varying(1),
children boolean,
payvisa smallint,
hmfr smallint,
maghlth smallint,
homebypx integer,
gfhol smallint,
mbeauty boolean,
apwmtot smallint,
apwmwk smallint,
apwmdol smallint,
travlseg integer,
lhealth boolean,
lcharity boolean,
moutdoor boolean,
occupatn character varying(4),
fundrais boolean,
msports boolean,
hg smallint,
magfam smallint,
melectrc boolean,
lelectrc boolean,
   

FW: [GENERAL] Slow query performance

2008-10-29 Thread Dann Corbit
From: Kevin Galligan [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 29, 2008 3:16 PM
To: Dann Corbit
Subject: Re: [GENERAL] Slow query performance

 

Columns are as follows:

account  | integer   |
city | character varying(20) |
zip  | character(5)  |
dincome  | character(1)  |
sex  | character(1)  |
mob  | boolean   |
religion | character(1)  |
groupcd  | character(1)  |
lastdata | character varying(4)  |
countycd | character varying(3)  |
state| character varying(2)  |
dutype   | character varying(1)  |
orders   | integer   |
countysz | character varying(1)  |
ethnic   | character varying(2)  |
language | character varying(2)  |
cbsacode | character varying(5)  |
cbsatype | character varying(1)  |
age  | smallint  |
dob  | date  |
ccard| boolean   |
lor  | integer   |
bankcard | boolean   |
lastord  | date  |
total| integer   |
lmob | boolean   |
homeown  | character varying(1)  |
ord1st   | date  |
ordlast  | date  |
married  | boolean   |
deptcard | boolean   |
>>

You did not show us the indexes.

If you have pgadmin III, go to the table and copy/paste the actual
definition, including indexes.

<<
From here it's about another 100 columns with either booleans or
smallints, mostly null values.

I eventually killed the vacuum.  I will run it again, but was just going
through the indexes.  All were of this format...

"INFO:  index "idx_jordlast" now contains 265658026 row versions in
728409 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.88s/0.13u sec elapsed 90.38 sec."

An example of a slow query is...

select count(*) from bigdatatable where age between 22 and 40 and state
= 'NY';
>>

Is there an index on state and age?

Is there an index on state?

Is there an index on age?

That is important missing information.

If there is no index on either column, then you will do a table scan.

If all of your slow queries look like the above, then create a clustered
index on state,age
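
In PostgreSQL the closest equivalent to a clustered index is a one-time
physical reorder with CLUSTER over a composite index. A minimal sketch
(the index name is made up):

CREATE INDEX idx_state_age ON bigdatatable (state, age);
CLUSTER bigdatatable USING idx_state_age;  -- 8.3+; older releases spell it CLUSTER idx_state_age ON bigdatatable
ANALYZE bigdatatable;

Since the data is read-only once loaded, the ordering would not decay.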

<<
I know count is not optimized on postgresql like it is on mysql due to
transaction isolation (at least that's what I've read.  Makes sense to
me).  I understand it'll take time to actually count the rows.  However,
here's the output of 'explain analyze select count(*) from bigdatatable
where age between 22 and 40 and state = 'NY';'

 Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual
time=389529.895..389529.897 rows=1 loops=1)
   ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63
rows=2795968 width=0) (actual time=6727.848..387159.175
rows=2553273 loops=1)
 Recheck Cond: ((state)::text = 'NY'::text)
 Filter: ((age >= 22) AND (age <= 40))
 ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66
rows=15425370 width=0) (actual time=6298.950..6298.950
rows=16821828 loops=1)
   Index Cond: ((state)::text = 'NY'::text)
 Total runtime: 389544.088 ms

It looks like the index scans are around 6 seconds or so each, which is
fine.  Then it looks like "Aggregate" suddenly jumps up to 6 minutes.

I know the database design is crude.  It's really just a big flat table.
I didn't put too much into weeding out which columns should be indexed
and which shouldn't (just slapped an index on each).  Happy to do that
work, but right now I'm in panic mode and just need to figure out which
way to start going.  I had a design on MySQL which worked pretty well at
10 to 20% of full size, but degraded quite a bit at full size.
Compounding this, there is another implementation we've seen that uses
the full size of similar data and returns actual results in seconds.  I
originally planned to use a 5% size db for estimated results, then the
full size for getting the actual data.  That plan was rejected :(

Any thoughts?  It seemed to work OK when I had a table with 10 cols but
about the same data length.  That may have been an artificial test,
though.

Again, this is read-only once the data is set up.  The client wants to run
pretty much arbitrary queries, so it's hard to isolate certain things for
optimization, although there are some "knowns".

Will start the full vacuum process again.

Thanks in advance,
-Kevin

On Wed, Oct 29, 2008 at 4:52 PM, Dann Corbit <[EMAIL PROTECTED]> wrote:
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:pgsql-general-
>> [EMAIL PROTECTED] On Behalf Of Kevin Galligan
>> Sent: Wednesday, October 29, 2008 1:18 PM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] Slow query perform

Re: [GENERAL] Slow query performance

2008-10-29 Thread Kevin Galligan
An example of a slow query is...

select count(*) from bigdatatable where age between 22 and 40 and state =
'NY';

explain analyze returned the following...

 Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual
time=389529.895..389529.897 rows=1 loops=1)
   ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63
rows=2795968 width=0) (actual time=6727.848..387159.175
rows=2553273 loops=1)
 Recheck Cond: ((state)::text = 'NY'::text)
 Filter: ((age >= 22) AND (age <= 40))
 ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66
rows=15425370 width=0) (actual time=6298.950..6298.950
rows=16821828 loops=1)
   Index Cond: ((state)::text = 'NY'::text)
 Total runtime: 389544.088 ms

It looks like the index scans are around 6 seconds or so each, but then the
bitmap heap scan and aggregate jump up to 6 minutes.

More detail on the table design and other stuff in a bit...


On Wed, Oct 29, 2008 at 6:18 PM, Scott Marlowe <[EMAIL PROTECTED]> wrote:

> On Wed, Oct 29, 2008 at 2:18 PM, Kevin Galligan <[EMAIL PROTECTED]>
> wrote:
> > I'm approaching the end of my rope here.  I have a large database.
> > 250 million rows (ish).  Each row has potentially about 500 pieces of
> > data, although most of the columns are sparsely populated.
>
> A couple of notes here.  PostgreSQL stores null values as a single bit
> in a bit field, making sparsely populated tables quite efficient as
> long as you store the non-existent values as null and not '' or some
> other real value.
>
> Have you run explain analyze on your queries yet?  Pick a slow one,
> run explain analyze on it and post it and we'll see what we can do.
>


Re: [GENERAL] Slow query performance

2008-10-29 Thread Scott Marlowe
On Wed, Oct 29, 2008 at 2:18 PM, Kevin Galligan <[EMAIL PROTECTED]> wrote:
> I'm approaching the end of my rope here.  I have a large database.
> 250 million rows (ish).  Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.

A couple of notes here.  PostgreSQL stores null values as a single bit
in a bit field, making sparsely populated tables quite efficient as
long as you store the non-existent values as null and not '' or some
other real value.

Have you run explain analyze on your queries yet?  Pick a slow one,
run explain analyze on it and post it and we'll see what we can do.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Slow query performance

2008-10-29 Thread Kevin Galligan
I'm approaching the end of my rope here.  I have a large database.
250 million rows (ish).  Each row has potentially about 500 pieces of
data, although most of the columns are sparsely populated.

What I'm trying to do is, essentially, search for sub-sets of that
data based on arbitrary queries of those data columns.  the queries
would be relatively simple ("dirbtl is not null and qqrq between 20
and 40").  After the database is built, it is read only.

So, I started with maybe 10-15 fields in a main table, as most records
have values for those fields.  Then had individual tables for the
other values.  The idea is that the percentage of rows with values
drops off significantly after those main tables.  That, and each
individual query looks at probably 3 or 4 fields in total.  The
performance of those queries was pretty bad.  It's got to join large
numbers of values, which didn't really work out well.

So, went the other direction completely.  I rebuilt the database with
a much larger main table.  Any values with 5% or greater filled in
rows were added to this table.  Maybe 130 columns.  Indexes applied to
most of these.  Some limited testing with a smaller table seemed to
indicate that queries on a single table without a join would work much
faster.

So, built that huge table.  Now query time is terrible.  Maybe a
minute or more for simple queries.

I'm running vacuum/analyze right now (which is also taking forever, BTW).

The box has 15 GB of RAM.  I set shared_buffers to 8 or 9 GB.  My first
question: what would be better to bump up to increase performance?  I
thought that was the setting to jack up to improve query time or index
caching, but I've read conflicting advice.  The full 15 GB of RAM is
available.

I originally had this in MySQL, also with bad performance.  I understand
how to optimize that much better, but it just wasn't cutting it.

Anyway, help with tuning the settings would be greatly appreciated.
Advice on how best to lay this out would also be helpful (I know it's
difficult without serious detail).

Thanks in advance,
-Kevin

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] slow query performance

2003-11-02 Thread Alvaro Herrera
On Sat, Nov 01, 2003 at 10:18:14AM +1300, Mark Kirkwood wrote:
> 
> Dave Weaver wrote:
> 
> >>- clustering the "obs" table on "station"
> >
> >Sorry, I don't understand what you mean by this - can you explain?

> Supposing obs_pkey is on (station, valid_time):
> 
> cluster obs_pkey on obs

Be aware that doing this on the 7.1.3 version you are running will drop
the other indexes on the table, and some other metadata about it (grants,
inheritance, foreign key relationships IIRC).

-- 
Alvaro Herrera ()
"Coge la flor que hoy nace alegre, ufana. ¿Quién sabe si nacera otra mañana?"

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [GENERAL] slow query performance

2003-10-31 Thread Mark Kirkwood
Dave Weaver wrote:

- clustering the "obs" table on "station"
   

Sorry, I don't understand what you mean by this - can you explain?

 

Supposing obs_pkey is on (station, valid_time):

cluster obs_pkey on obs

will re-order the rows in obs based on the index obs_pkey. (This is 
clustering on 'station' and 'valid_time', to do just station you could 
use an index on just 'station').

The downside is that the row ordering is slowly lost as rows are 
updated, so periodic running of the cluster command is needed - this is 
a pain, as it will take a while for a 13 million row table.

regards

Mark

---(end of broadcast)---
TIP 8: explain analyze is your friend


[GENERAL] slow query performance

2003-10-31 Thread Dave Weaver

I'm having severe performance issues with a conceptually simple
database.  The database has one table, containing weather observations.
The table currently has about 13.5 million rows, and is being updated
constantly. The database is running on a dual 550MHz PIII with 512MB RAM.

On the whole, queries are of the form:

SELECT ? FROM obs WHERE station = ?
AND valid_time < ? AND valid_time > ?
or:
SELECT ? FROM obs WHERE station IN (?, ?, ...)
AND valid_time < ? AND valid_time > ?

Queries like these are taking around 4 to 5 minutes each, which seems
excessively slow to me (or are my expectations far too optimistic?).

For instance:
SELECT station, air_temp FROM obs
 WHERE station = 'EGBB'
AND valid_time > '28/8/03 00:00'
AND valid_time < '28/10/03 00:00'

takes 4 mins 32 secs.

An EXPLAIN of the above query says:
  NOTICE:  QUERY PLAN:

  Index Scan using obs_pkey on obs  (cost=0.00..9.01 rows=1 width=20)

A simple "SELECT count(*) from obs" query takes around that sort of time
too.

I have run "vacuumdb --analyze obs", to little effect.

How can I speed this up? Where am I going wrong? Is there a problem with
the table structure, or the indexes? Does the continual updating of the
database (at the rate of somewhere between 1-3 entries per second) cause
problems?

The table and indexes are defined as follows:


   Table "obs"
 Attribute  |   Type   | Modifier 
+--+--
 valid_time | timestamp with time zone | 
 metar_air_temp | double precision | 
 relative_humidity  | double precision | 
 pressure_change| double precision | 
 ceiling| double precision | 
 metar_dew_point| double precision | 
 metar_gusts| double precision | 
 wet_bulb_temperature   | double precision | 
 past_weather   | text | 
 visibility | double precision | 
 metar_visibility   | double precision | 
 precipitation  | double precision | 
 station| character(10)| 
 pressure_msl   | double precision | 
 metar_min_temperature_6hr  | double precision | 
 precipitation_period   | double precision | 
 metar_wet_bulb | double precision | 
 saturation_mixing_ratio| double precision | 
 metar_pressure | double precision | 
 metar_sky_cover| text | 
 dew_point  | double precision | 
 wind_direction | double precision | 
 actual_time| timestamp with time zone | 
 gust_speed | double precision | 
 high_cloud_type| text | 
 precipitation_24hr | double precision | 
 metar_precipitation_24hr   | double precision | 
 pressure_tendency  | text | 
 metar_relative_humidity| double precision | 
 low_cloud_type | text | 
 metar_max_temperature_6hr  | double precision | 
 middle_cloud_type  | text | 
 air_temp   | double precision | 
 low_and_middle_cloud_cover | text | 
 metar_wind_dir | double precision | 
 metar_weather  | text | 
 snow_depth | double precision | 
 metar_snow_depth   | double precision | 
 min_temp_12hr  | double precision | 
 present_weather| text | 
 wind_speed | double precision | 
 snow_cover | text | 
 metar_wind_speed   | double precision | 
 metar_ceiling  | double precision | 
 max_temp_12hr  | double precision | 
 mixing_ratio   | double precision | 
 pressure_change_3hr| double precision | 
 total_cloud| integer  | 
 max_temp_24hr  | double precision | 
 min_temp_24hr  | double precision | 
 snow_amount_6hr| double precision | 
Indices: obs_pkey,
 obs_station,
 obs_valid_time

   Index "obs_pkey"
 Attribute  |   Type   
+--
 valid_time | timestamp with time zone
 station| character(10)
unique btree

Index "obs_station"
 Attribute | Type  
---+---
 station   | character(10)
btree

Index "obs_valid_time"
 Attribute  |   Type   

Re: [GENERAL] slow query performance

2003-10-30 Thread Tom Lane
"Dave Weaver" <[EMAIL PROTECTED]> writes:
> Is the upgrade likely to make a difference?

I'm not sure if it would help for this specific query, but in general
each major PG release has useful performance improvements over the
previous one.

What I'm wondering about is an index-bloat problem (see the
pgsql-performance archives for discussions).  Do you do a lot of updates
or deletes on this table, or is it just inserts?  What is the physical
size of the table and its index?  The output of VACUUM VERBOSE for this
table would be useful to show.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [GENERAL] slow query performance

2003-10-30 Thread scott.marlowe
On Thu, 30 Oct 2003, Dave Weaver wrote:

> Jeff wrote:
> > Dave Weaver wrote:
> > > For instance:
> > > SELECT station, air_temp FROM obs
> > >  WHERE station = 'EGBB'
> > > AND valid_time > '28/8/03 00:00'
> > >  AND valid_time < '28/10/03 00:00'
> > > 
> > > takes 4 mins 32 secs.
> > 
> > How many rows should that return?
> > [explain analyze will tell you that]
> 
> "explain analyze" doesn't seem to be part of this postgres version
> (or I misunderstood something).
> That particular query returned 24 rows.

Back then it was just explain.  explain analyze actually runs the query 
and tells you how long each thing took, etc.  I.e. it gives you the "I 
imagine I'll get this many rows back and it'll cost this much" part, then 
the cold hard facts of how many rows really came back, and how long it 
really took.  Quite a nice improvement.
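
For example, on a newer version Dave's query from the start of this
thread could be run as (a sketch; explain analyze only exists in 7.2 and
later, so it won't work on the 7.1.3 install in question):

EXPLAIN ANALYZE
SELECT station, air_temp FROM obs
 WHERE station = 'EGBB'
   AND valid_time > '28/8/03 00:00'
   AND valid_time < '28/10/03 00:00';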

> > and while that runs is your disk thrashing? vmstat's bi/bo columns will
> > tell you.
> 
> The machine's over the other side of the building, so I can't physically
> see if the disk is thrashing.
> I'm not sure how to interpret the vmstat output; running "vmstat 1" shows
> me bi/bo both at zero (mostly) until I start the query. Then bi shoots up
> to around 2500 (bo remains around zero) until the query finishes.

Your disk is likely thrashing.

Can you set sort_mem on that old version of pgsql to something higher?

set sort_mem = 32768;

or something similar?

> > 7.1 is quite old, but I do understand the pain of migrating to 7.3 [or
> > .4beta] with huge db's
> 
> Is the upgrade likely to make a difference?
> I'm still none the wiser whether the problem I have is due to:
> 1 Postgres version
> 2 Database size
> 3 Table structure
> 4 Configuration issues
> 5 Slow hardware
> 6 All of the above
> 7 None of the above
> 8 Something else

Yes, the upgrade is very likely to make a difference.  The average 
performance gain for each version since then has been, in my experience, 
anywhere from a few percentage points faster to many times faster, 
depending on what you were trying to do.

Why not download 7.4beta5 and see if you can get it to import the data 
from 7.1.3?  It's close to going production, and in my opinion, 7.4beta5 
is probably at least as stable as 7.1.3 considering the number of unfixed 
bugs likely to be hiding out there.  My guess is that you'll find your 
workstation running 7.4beta5 with one IDE hard drive outrunning your server 
with 7.1.3 on it.  Seriously.

We're running 7.2.4 where I work, and the change from 7.1 to 7.2 was huge 
for us, especially the non-full vacuums.


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [GENERAL] slow query performance

2003-10-30 Thread Jeff

> > For instance:
> > SELECT station, air_temp FROM obs
> >  WHERE station = 'EGBB'
> > AND valid_time > '28/8/03 00:00'
> >  AND valid_time < '28/10/03 00:00'
> > 
> > takes 4 mins 32 secs.

How many rows should that return?
[explain analyze will tell you that]

and while that runs is your disk thrashing? vmstat's bi/bo columns will
tell you.

7.1 is quite old, but I do understand the pain of migrating to 7.3 [or
.4beta] with huge db's

-- 
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [GENERAL] slow query performance

2003-10-30 Thread Dave Weaver
Jeff wrote:
> Dave Weaver wrote:
> > For instance:
> > SELECT station, air_temp FROM obs
> >  WHERE station = 'EGBB'
> > AND valid_time > '28/8/03 00:00'
> >  AND valid_time < '28/10/03 00:00'
> > 
> > takes 4 mins 32 secs.
> 
> How many rows should that return?
> [explain analyze will tell you that]

"explain analyze" doesn't seem to be part of this postgres version
(or I misunderstood something).
That particular query returned 24 rows.


> and while that runs is your disk thrashing? vmstat's bi/bo columns will
> tell you.

The machine's over the other side of the building, so I can't physically
see if the disk is thrashing.
I'm not sure how to interpret the vmstat output; running "vmstat 1" shows
me bi/bo both at zero (mostly) until I start the query. Then bi shoots up
to around 2500 (bo remains around zero) until the query finishes.


> 7.1 is quite old, but I do understand the pain of migrating to 7.3 [or
> .4beta] with huge db's

Is the upgrade likely to make a difference?
I'm still none the wiser whether the problem I have is due to:
1 Postgres version
2 Database size
3 Table structure
4 Configuration issues
5 Slow hardware
6 All of the above
7 None of the above
8 Something else

Thanks for the help,
Dave.



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org