Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-11 Thread Martin Gainty
Reuven--

A few years back this same scenario happened for me, working on a project with 
a large Oracle DB which held clinical data. We needed to implement a join 
condition, but the column we were using was not using an index scan. Since the 
cardinality of the results was extraordinary, we re-implemented the column to 
use EXISTS (for large result sets), following the omnipresent example used by 
Oracle books where IN is used for columns with low cardinality, and following 
the docs for EXISTS, where the EXISTS clause always has to return something.

In our case we were joining on a doctorid with IN (which made no sense, as 
there were millions of PatientIDs) to find all patients whose doctors were in 
PPOs. So, to increase performance, we changed the IN clause on the column with 
high cardinality (doctorid) to EXISTS:

select p.PATIENT_NAME
  from PATIENT p
 where p.doctorid in (select doctorid from ppo_table);

/* there are many doctorids in this ppo_table, so we change to EXISTS */

select p.PATIENT_NAME
  from PATIENT p
 where exists (select 0 from ppo_table ppo where p.doctorID = ppo.doctorID);

Shalom,
Martin
- Original Message - 
From: "Alvaro Herrera" <[EMAIL PROTECTED]>
To: "Reuven M. Lerner" <[EMAIL PROTECTED]>
Cc: "Webb Sprague" <[EMAIL PROTECTED]>; 
Sent: Tuesday, March 06, 2007 4:19 PM
Subject: Re: [GENERAL] Database slowness -- my design, hardware, or both?


> Reuven M. Lerner escribió:
>> Hi, Webb Sprague.  You wrote:
>> >... but I see two seq scans in your explain in a loop -- this is
>> >probably not good.  If you can find a way to rewrite the IN clause
>> >(either de-normalizing through triggers to save whatever you need on
>> >an insert and not have to deal with a set, or by using except in the
>> >query, or something else more creative)...
>> I would normally agree that an IN clause is a place to worry -- except 
>> that I'm using IN on a very small table, with about 4-5 rows.  That 
>> might indeed affect things, and I could certainly pull out these values 
>> into a Perl variable that I then interpolate literally into the SQL.  
>> However, I have to assume that this wouldn't affect things all that much.
> 
> Don't assume -- measure.  I had a query which ran orders of magnitude
> faster because I interpolated the constant list in the big query.  The
> table from which the interpolated values were being extracted had about
> 30 rows or so.
> 
> -- 
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
> 


Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-08 Thread Richard Broersma Jr

> LOG:  statement: UPDATE Transactions
>   SET previous_value = previous_value(id)
> WHERE new_value IS NOT NULL
>   AND new_value <> ''
>   AND node_id IN (SELECT node_id FROM NodeSegments)
> LOG:  duration: 16687993.067 ms

I hope I can pass on some suggestions that I gleaned after finishing my Celko 
book.  I don't know if the suggestions presented will help in your case.

From the reading, WHERE conditions such as <> '' or IS NOT NULL can be 
performance killers, as they may discourage the optimizer from using an index 
scan.  The suggestion was to replace them with:

 new_value > ''

This WHERE condition should find only non-NULL, non-empty strings.
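
A minimal sketch of a partial index that such a rewritten predicate could take
advantage of (table and column names are taken from this thread; whether the
planner actually uses it depends on the data and the rest of the query):

 CREATE INDEX transactions_nonempty_idx
     ON Transactions (node_id)
  WHERE new_value > '';
 -- indexes only rows with a non-empty new_value, so a query filtering
 -- on node_id AND new_value > '' can use this smaller index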

Also, IN is known as a performance killer, so the suggestion was to rewrite 
the query like so:


UPDATE Transactions
   SET previous_value = previous_value(id)
  FROM NodeSegments
 WHERE Transactions.Node_id = NodeSegments.Node_id
   AND Transactions.new_value > 'A';  -- assuming 'A' is the minimum value of interest
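
With UPDATE ... FROM, each Transactions row is updated at most once even if
NodeSegments repeats a node_id, but a de-duplicated subselect makes the
equivalence to the original IN form explicit.  A sketch, reusing the names
above:

 UPDATE Transactions
    SET previous_value = previous_value(id)
   FROM (SELECT DISTINCT node_id FROM NodeSegments) ns
  WHERE Transactions.node_id = ns.node_id
    AND Transactions.new_value > '';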


I hope this can help.

Regards,
Richard Broersma Jr.



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-08 Thread Reuven M. Lerner

Hi, Webb Sprague.  You wrote:

> Do you have new \timings?

Yup.  It just finished executing a little while ago.  With the 
explicitly interpolated array in place, I got the following:


   LOG:  statement: UPDATE Transactions
 SET previous_value = previous_value(id)
   WHERE new_value IS NOT NULL
 AND new_value <> ''
 AND node_id IN (351, 169, 664, 240)
   LOG:  duration: 16842710.469 ms

The previous version, which included lots of calls to RAISE NOTICE and 
also used a subselect, had the following timing:


   LOG:  statement: UPDATE Transactions
 SET previous_value = previous_value(id)
   WHERE new_value IS NOT NULL
 AND new_value <> ''
 AND node_id IN (SELECT node_id FROM NodeSegments)
   LOG:  duration: 16687993.067 ms

(I keep timing information in the logfile, rather than using \timing.)

So it looks like this didn't make much of a timing difference at all.  
And what little difference there was, was negative.  Bleah.

> What you or I think is a minor change isn't necessarily what the
> planner thinks is a minor change, especially when you change data from
> something that requires a query to something that is determinate.  I
> would suggest changing your function to remove as many such queries as
> possible too (I am thinking of the order by limit 1).  This would be a
> good move also in that you narrow down the number of moving parts to
> diagnose, and it just makes the whole thing cleaner.

Good idea.  I'll see if I can get the function to be a bit cleaner, 
although I'm not sure if it is, given the constraints of the problem.  
That's one of the reasons why I've been adding these "hints" to the 
database -- so that I can have many small queries, rather than one large 
one.

> I would also try amortizing the analysis with triggers, rather than
> building the table all at once; this may be better or worse, depending
> on the on-line character of the application (e.g., if they are waiting
> at an ATM, it might be a deal breaker to add two seconds to do an
> insert / update, but not if you are tailing off a log file that gets
> updated every minute or so.)

The data that I'm dealing with is not changing over time.  So I've been 
trying to build the transactions table (aka my data warehouse) slowly, 
adding one or two columns at a time with hints that will make extracting 
the data easier.  Unfortunately, building those hints has proven to be 
very slow going.


Reuven



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-07 Thread Webb Sprague

> OK, I modified things to use interpolation.  Here's the updated query:
>
>    explain  UPDATE Transactions
>       SET previous_value = previous_value(id)
>     WHERE new_value IS NOT NULL
>       AND new_value <> ''
>       AND node_id IN (351, 169, 664, 240);
>
> And here is the query plan produced by explain:
>
>                                QUERY PLAN
> ---------------------------------------------------------------------
>  Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
>    Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
>    Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
>    ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
>          Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
> (5 rows)
>
> I'm still a bit surprised by how different the query plan came out with
> what would seem like a minor change.


Do you have new \timings?

What you or I think is a minor change isn't necessarily what the
planner thinks is a minor change, especially when you change data from
something that requires a query to something that is determinate.  I
would suggest changing your function to remove as many such queries as
possible too (I am thinking of the order by limit 1).  This would be a
good move also in that you narrow down the number of moving parts to
diagnose, and it just makes the whole thing cleaner.

The meta-moral is that db optimization requires systematic
experimentation.  Use the database to store the results of the various
experiments!

In light of this, I would suggest you try removing the check clauses
and seeing if you get a difference too.  Just like Francis Bacon said
-- don't deduce from first principles, experiment!

I would also try amortizing the analysis with triggers, rather than
building the table all at once; this may be better or worse, depending
on the on-line character of the application (e.g., if they are waiting
at an ATM, it might be a deal breaker to add two seconds to do an
insert / update, but not if you are tailing off a log file that gets
updated every minute or so.)

W



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-06 Thread Reuven M. Lerner

Hi, Alvaro Herrera.  You wrote:

> Don't assume -- measure.  I had a query which ran orders of magnitude
> faster because I interpolated the constant list in the big query.  The
> table from which the interpolated values were being extracted had about
> 30 rows or so.

OK, I modified things to use interpolation.  Here's the updated query:

   explain  UPDATE Transactions
  SET previous_value = previous_value(id)
WHERE new_value IS NOT NULL
  AND new_value <> ''
  AND node_id IN (351, 169, 664, 240);

And here is the query plan produced by explain:

                               QUERY PLAN
   ---------------------------------------------------------------------
    Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
      Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
      Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
      ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
            Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
    (5 rows)

I'm still a bit surprised by how different the query plan came out with 
what would seem like a minor change. 


Reuven



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-06 Thread Alvaro Herrera
Reuven M. Lerner escribió:
> Hi, Webb Sprague.  You wrote:
> >... but I see two seq scans in your explain in a loop -- this is
> >probably not good.  If you can find a way to rewrite the IN clause
> >(either de-normalizing through triggers to save whatever you need on
> >an insert and not have to deal with a set, or by using except in the
> >query, or something else more creative)...
> I would normally agree that an IN clause is a place to worry -- except 
> that I'm using IN on a very small table, with about 4-5 rows.  That 
> might indeed affect things, and I could certainly pull out these values 
> into a Perl variable that I then interpolate literally into the SQL.  
> However, I have to assume that this wouldn't affect things all that much.

Don't assume -- measure.  I had a query which ran orders of magnitude
faster because I interpolated the constant list in the big query.  The
table from which the interpolated values were being extracted had about
30 rows or so.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-06 Thread Reuven M. Lerner

Hi, Webb Sprague.  You wrote:

> ... but I see two seq scans in your explain in a loop -- this is
> probably not good.  If you can find a way to rewrite the IN clause
> (either de-normalizing through triggers to save whatever you need on
> an insert and not have to deal with a set, or by using except in the
> query, or something else more creative)...

I would normally agree that an IN clause is a place to worry -- except 
that I'm using IN on a very small table, with about 4-5 rows.  That 
might indeed affect things, and I could certainly pull out these values 
into a Perl variable that I then interpolate literally into the SQL.  
However, I have to assume that this wouldn't affect things all that much.

> Also -- there is a good book on temporal databases by Snodgrass that
> might give some interesting ideas; maybe you have already seen it, but
> still.

Heh -- that's one of the first references that I looked at when I 
started this project.  Perhaps I'll look at it again; the design of my 
data warehouse took some of those suggestions into account, but I 
wouldn't be surprised if there's something more I could be doing.

> I am thinking you could increment a sequence variable to give
> you a "tick" integer with each action, rather than trying to use
> timestamps with all their overhead and inaccuracy (1 second is a long
> time, really).  Lamport also did work on clocks that might apply.

I agree that 1 second is a ridiculously long time.  The logfiles were 
created by people outside of my research group, and are a wonderful 
lesson in how to use XML poorly and inefficiently.  The designers of the 
logfile weren't expecting to get dozens or hundreds of values per 
second.  You can be sure that my research group, which is now looking to 
create logfiles of our own, will place a strong emphasis on 
high-resolution timestamps. 

Practically speaking, I've managed to get around this problem by using 
the "id" column, which comes from a sequence, and is thus guaranteed to 
be increasing.  We can thus be assured that a row with an ID of x will 
come before a row whose ID is x + 10, so long as their transaction types 
are the same.  So we can't guarantee that a node entry comes before a 
variable setting via the ID, but we can order two variable settings 
based on the ID.
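
As a sketch of what that ID-based ordering buys us, the "most recent setting
of the same variable" lookup can be phrased like this (variable_name and the
literal values are hypothetical stand-ins; the real schema isn't shown here):

    SELECT t2.new_value
      FROM Transactions t2
     WHERE t2.variable_name = 'foo'   -- hypothetical identifying column/value
       AND t2.id < 12345              -- the current row's id
     ORDER BY t2.id DESC              -- closest preceding assignment first
     LIMIT 1;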

> Also have you tried dropping all your fk and checks just to see if you
> get a difference in speed on an update?  It would be interesting,
> perhaps.

I sorta figured that because our UPDATE is taking place on a column 
without a foreign key, this wouldn't matter much.  Am I wrong?

> If you could get rid of the sorted limit 1 clause in your function,
> there would be less variability and it would be easier to understand; you
> probably need to denormalize somehow, perhaps using ticks; I am not
> sure.

Yeah, that's a good point.  I'm not sure how possible that'll be, 
though, given that at *some* point, I need to find the "most recent" 
setting of the same variable.  There might be dozens or hundreds of rows 
separating the assignment of a new value to a particular variable.


> Could a trigger set your previous_value on insert or update, rather
> than querying for it later?

Well, what I've been doing is INSERTing each row into the "transactions" 
table without previous_value.  Then I run the mass UPDATE for 
previous_value, doing nothing else at the same time.  Then, after 
previous_value is set, I do the query.


Reuven




Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-06 Thread Webb Sprague

I am not able to look as closely as it deserves ...

... but I see two seq scans in your explain in a loop -- this is
probably not good.  If you can find a way to rewrite the IN clause
(either de-normalizing through triggers to save whatever you need on
an insert and not have to deal with a set, or by using except in the
query, or something else more creative)...

Also -- there is a good book on temporal databases by Snodgrass that
might give some interesting ideas; maybe you have already seen it, but
still.  I am thinking you could increment a sequence variable to give
you a "tick" integer with each action, rather than trying to use
timestamps with all their overhead and inaccuracy (1 second is a long
time, really).  Lamport also did work on clocks that might apply.
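
A minimal sketch of the tick idea, using a plain sequence (the names are
illustrative, not from your schema):

   CREATE SEQUENCE action_tick;

   ALTER TABLE Transactions
       ADD COLUMN tick bigint DEFAULT nextval('action_tick');
   -- every inserted action gets a strictly increasing integer, giving a
   -- total order without timestamp resolution problems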

Also have you tried dropping all your fk and checks just to see if you
get a difference in speed on an update?  It would be interesting,
perhaps.

If you could get rid of the sorted limit 1 clause in your function,
there would be less variability and it would be easier to understand; you
probably need to denormalize somehow, perhaps using ticks; I am not
sure.

Could a trigger set your previous_value on insert or update, rather
than querying for it later?
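
A sketch of such a trigger, assuming your existing previous_value(id)
function only needs the new row's id to look back at earlier rows (if it
needs the new row itself to be visible, the logic would have to be inlined
here instead):

   CREATE OR REPLACE FUNCTION set_previous_value() RETURNS trigger AS $$
   BEGIN
       -- compute the hint up front, instead of in a later mass UPDATE
       NEW.previous_value := previous_value(NEW.id);
       RETURN NEW;
   END;
   $$ LANGUAGE plpgsql;

   CREATE TRIGGER transactions_previous_value
       BEFORE INSERT ON Transactions
       FOR EACH ROW EXECUTE PROCEDURE set_previous_value();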


> I'm now thinking of separating each activity into its own database, in
> the hopes that this will improve the system speed enough to do what I
> need.  But I'm far from convinced that this will really help.


Better to figure out the real problem -- more interesting, more scalable.

Hope my incoherent brain dump lights a spark.



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-06 Thread Reuven M. Lerner

Hi, Webb Sprague.  You wrote:

> How much concurrency is there on your database?

Almost none.  I'm generally running one client against the server.  I 
often have a second client connected simultaneously, just for the 
purpose of answering short questions. 

I'm now thinking of separating each activity into its own database, in 
the hopes that this will improve the system speed enough to do what I 
need.  But I'm far from convinced that this will really help.


Reuven




Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-05 Thread Webb Sprague


> Heh.  Sure thing.  I wasn't sure how much detail to give when initially
> posting.


Looks like enough to get the real experts on the list started :)

I will try to look again tommorrow, but I bet other folks have better
intuition than me.

How much concurrency is there on your database?



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-05 Thread Webb Sprague

> Well, I've tried to do massive UPDATEs as much as possible.  But the
> patterns that we're looking for are basically of the variety, "If the user
> clicks on X and then clicks on Y, but without Z between the two of them, and
> if these are all part of the same simulation run, then we tag action X as
> being of interest to us."  So it's oodles of keeping track of back-and-forth
> for each of the rows in the table, and looking forward and backward in the
> table.
>
> I agree that row-at-a-time thinking isn't the best way to work, but I
> didn't see a good alternative for our purposes.  I'm open to any and all
> suggestions.


Can you post at least some table schemas, indexes, queries, and
explain output?  I think of database optimization as a serious case of
devil in the details, and generalities (like -- make sure you index,
make sure your indexes help using explain, avoid row-at-a-time
thinking) won't get you far. So if we had something concrete to work
with, well, we would have something concrete to work with.



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-05 Thread Reuven M. Lerner

Hi, Tom.  You wrote:

>> Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't
>> failed me yet.  But I've been having some serious performance problems
>> on a database that I've been using in my grad-school research group, and
>> it's clear that I need help from some more experienced hands.
>
> What PG version are you using?

I've been using 8.0, 8.1, and 8.2 at various points, depending on which 
machine I've been using.  My main machine is currently using 8.2.0.  (I 
wish that I had control over which version was being used, but my 
sysadmin powers are inversely proportional to the computer power made 
available to me.)

> I'd try to think of a way to eliminate the function altogether in favor
> of a single UPDATE command.  In general, row-at-a-time thinking isn't
> the way to win in SQL.

Well, I've tried to do massive UPDATEs as much as possible.  But the 
patterns that we're looking for are basically of the variety, "If the 
user clicks on X and then clicks on Y, but without Z between the two of 
them, and if these are all part of the same simulation run, then we tag 
action X as being of interest to us."  So it's oodles of keeping track 
of back-and-forth for each of the rows in the table, and looking forward 
and backward in the table.


I agree that row-at-a-time thinking isn't the best way to work, but I 
didn't see a good alternative for our purposes.  I'm open to any and all 
suggestions.


Reuven



Re: [GENERAL] Database slowness -- my design, hardware, or both?

2007-03-05 Thread Tom Lane
"Reuven M. Lerner" <[EMAIL PROTECTED]> writes:
> Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't 
> failed me yet.  But I've been having some serious performance problems 
> on a database that I've been using in my grad-school research group, and 
> it's clear that I need help from some more experienced hands.

What PG version are you using?

> In theory, this would be the way to go.  In practice, every step has 
> become a performance nightmare, taking many more hours than I might have 
> hoped or expected.  For example, I've got a Pl/PgSQL function that goes 
> through each variable-assignment row, and UPDATEs its previous_value 
> column with whatever the previous value might have been.

I'd try to think of a way to eliminate the function altogether in favor
of a single UPDATE command.  In general, row-at-a-time thinking isn't
the way to win in SQL.
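
A sketch of the single-UPDATE shape, with the per-row function folded in as a
correlated subquery (node_id as the correlation column is a guess; the real
previous_value(id) logic isn't shown in this thread):

    UPDATE Transactions
       SET previous_value = (SELECT t2.new_value
                               FROM Transactions t2
                              WHERE t2.node_id = Transactions.node_id
                                AND t2.id < Transactions.id
                              ORDER BY t2.id DESC
                              LIMIT 1)
     WHERE new_value IS NOT NULL
       AND new_value <> '';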

regards, tom lane
