Re: [GENERAL] Database slowness -- my design, hardware, or both?
Reuven -- A few years back, the same scenario happened to me on a project with a large Oracle database holding clinical data.  We needed to implement a join condition, but the column we were using was not getting an index scan.  Since the cardinality of the results was very high, we re-implemented the query to use EXISTS for the large result set, following the omnipresent example in the Oracle books (where IN is used for columns with low cardinality) and the docs for EXISTS (where the EXISTS clause always has to return something).  In our case we were joining on doctorid with IN -- which made no sense, as there were millions of patient IDs -- to find all patients whose doctors were in PPOs.  To increase performance, we changed the IN clause on the high-cardinality column (doctorid) to EXISTS:

select p.PATIENT_NAME
  from PATIENT p
 where p.doctorid in (select doctorid from ppo_table);
-- there are many doctorids in this ppo table, so we will change to EXISTS

select p.PATIENT_NAME
  from PATIENT p
 where exists (select 0 from ppo_table ppo
                where p.doctorID = ppo.doctorID);

Shalom/ Martin--

----- Original Message -----
From: "Alvaro Herrera" <[EMAIL PROTECTED]>
To: "Reuven M. Lerner" <[EMAIL PROTECTED]>
Cc: "Webb Sprague" <[EMAIL PROTECTED]>
Sent: Tuesday, March 06, 2007 4:19 PM
Subject: Re: [GENERAL] Database slowness -- my design, hardware, or both?

> Reuven M. Lerner escribió:
>> Hi, Webb Sprague.  You wrote:
>> >... but I see two seq scans in your explain in a loop -- this is
>> >probably not good.  If you can find a way to rewrite the IN clause
>> >(either de-normalizing through triggers to save whatever you need on
>> >an insert and not have to deal with a set, or by using except in the
>> >query, or something else more creative)...
>> I would normally agree that an IN clause is a place to worry -- except
>> that I'm using IN on a very small table, with about 4-5 rows.  That
>> might indeed affect things, and I could certainly pull out these values
>> into a Perl variable that I then interpolate literally into the SQL.
>> However, I have to assume that this wouldn't affect things all that much.
>
> Don't assume -- measure.  I had a query which ran orders of magnitude
> faster because I interpolated the constant list in the big query.  The
> table from which the interpolated values were being extracted had about
> 30 rows or so.
>
> --
> Alvaro Herrera                        http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

       http://archives.postgresql.org/
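As a quick sanity check of Martin's rewrite (a sketch in SQLite via Python, with a made-up PATIENT/ppo_table dataset; the column names come from his message, the data is invented), the IN and EXISTS forms are logically equivalent -- the rewrite only changes which plan the optimizer is likely to pick:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_name TEXT, doctorid INTEGER);
    CREATE TABLE ppo_table (doctorid INTEGER);
    INSERT INTO patient VALUES ('Alice', 1), ('Bob', 2), ('Carol', 3);
    INSERT INTO ppo_table VALUES (1), (3);
""")

# The IN form from the message.
in_rows = conn.execute(
    "SELECT p.patient_name FROM patient p "
    "WHERE p.doctorid IN (SELECT doctorid FROM ppo_table) "
    "ORDER BY p.patient_name").fetchall()

# The EXISTS form from the message.
exists_rows = conn.execute(
    "SELECT p.patient_name FROM patient p "
    "WHERE EXISTS (SELECT 0 FROM ppo_table ppo "
    "              WHERE p.doctorid = ppo.doctorid) "
    "ORDER BY p.patient_name").fetchall()

print(in_rows == exists_rows)       # True -- same result set either way
print([r[0] for r in in_rows])      # ['Alice', 'Carol']
```

Which form is faster depends entirely on the planner and the cardinalities involved, which is the point of the anecdote.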
Re: [GENERAL] Database slowness -- my design, hardware, or both?
> LOG:  statement: UPDATE Transactions
>          SET previous_value = previous_value(id)
>        WHERE new_value IS NOT NULL
>          AND new_value <> ''
>          AND node_id IN (SELECT node_id FROM NodeSegments)
> LOG:  duration: 16687993.067 ms

I hope I can offer some suggestions that I gleaned after finishing my Celko book.  I don't know if they will help in your case.

From the reading, WHERE conditions such as <> '' or IS NOT NULL can be performance killers, as they may discourage the optimizer from using an index scan.  The suggestion was to replace them with:

new_value > ''

This WHERE condition should find only non-NULL, non-empty strings.

Also, IN is known as a performance killer, so the suggestion was to re-form the query like so:

UPDATE Transactions
   SET previous_value = previous_value(id)
  FROM NodeSegments
 WHERE Transactions.Node_id = NodeSegments.Node_id
   AND Transactions.new_value > 'A';  -- assuming A is a minimum value

I hope this can help.

Regards,
Richard Broersma Jr.
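Richard's claim about the predicate rewrite is easy to check for correctness (a sketch in SQLite via Python, with an invented four-row table standing in for Transactions; whether the rewrite actually enables an index scan in Postgres depends on the operator class and collation, which this does not test):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, new_value TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, 'x'), (2, ''), (3, None), (4, 'y')])

# The original two-condition form.
verbose = conn.execute(
    "SELECT id FROM transactions "
    "WHERE new_value IS NOT NULL AND new_value <> '' "
    "ORDER BY id").fetchall()

# The single range-predicate form: NULL comparisons yield unknown (row
# excluded), and '' sorts before every non-empty string, so both NULLs
# and empty strings drop out.
terse = conn.execute(
    "SELECT id FROM transactions WHERE new_value > '' "
    "ORDER BY id").fetchall()

print(verbose == terse)         # True -- same rows selected
print([r[0] for r in verbose])  # [1, 4]
```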
Re: [GENERAL] Database slowness -- my design, hardware, or both?
Hi, Webb Sprague.  You wrote:

> Do you have new \timings?

Yup.  It just finished executing a little while ago.  With the explicitly interpolated array in place, I got the following:

LOG:  statement: UPDATE Transactions
         SET previous_value = previous_value(id)
       WHERE new_value IS NOT NULL
         AND new_value <> ''
         AND node_id IN (351, 169, 664, 240)
LOG:  duration: 16842710.469 ms

The previous version, which included lots of calls to RAISE NOTICE and also used a subselect, had the following timing:

LOG:  statement: UPDATE Transactions
         SET previous_value = previous_value(id)
       WHERE new_value IS NOT NULL
         AND new_value <> ''
         AND node_id IN (SELECT node_id FROM NodeSegments)
LOG:  duration: 16687993.067 ms

(I keep timing information in the logfile, rather than using \timing.)  So it looks like this didn't make much of a timing difference at all.  And what little difference there was, was negative.  Bleah.

> What you or I think is a minor change isn't necessarily what the
> planner thinks is a minor change, especially when you change data from
> something that requires a query to something that is determinate.  I
> would suggest changing your function to remove as many such queries as
> possible too (I am thinking of the order by limit 1).  This would be a
> good move also in that you narrow down the amount of moving parts to
> diagnose and it just makes the whole thing cleaner.

Good idea.  I'll see if I can get the function to be a bit cleaner, although I'm not sure if it is, given the constraints of the problem.  That's one of the reasons why I've been adding these "hints" to the database -- so that I can have many small queries, rather than one large one.
> I would also try amortizing the analysis with triggers, rather than
> building the table all at once; this may be better or worse, depending
> on the on-line character of the application (e.g., if they are waiting
> at an ATM, it might be a deal breaker to add two seconds to do an
> insert / update, but not if you are tailing off a log file that gets
> updated every minute or so.)

The data that I'm dealing with is not changing over time.  So I've been trying to build the transactions table (aka my data warehouse) slowly, adding one or two columns at a time, with hints that will make extracting the data easier.  Unfortunately, building those hints has proven to be very slow going.

Reuven
Re: [GENERAL] Database slowness -- my design, hardware, or both?
> OK, I modified things to use interpolation.  Here's the updated query:
>
> explain UPDATE Transactions
>            SET previous_value = previous_value(id)
>          WHERE new_value IS NOT NULL
>            AND new_value <> ''
>            AND node_id IN (351, 169, 664, 240);
>
> And here is the query plan produced by explain:
>
>                                     QUERY PLAN
> --------------------------------------------------------------------------------------
>  Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
>    Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
>    Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
>    ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
>          Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
> (5 rows)
>
> I'm still a bit surprised by how different the query plan came out with
> what would seem like a minor change.

Do you have new \timings?

What you or I think is a minor change isn't necessarily what the planner thinks is a minor change, especially when you change data from something that requires a query to something that is determinate.  I would suggest changing your function to remove as many such queries as possible too (I am thinking of the order by limit 1).  This would be a good move also in that you narrow down the amount of moving parts to diagnose, and it just makes the whole thing cleaner.

The meta-moral is that db optimization requires systematic experimentation.  Use the database to store the results of the various experiments!  In light of this, I would suggest you try removing the check clauses and seeing if you get a difference too.  Just like Francis Bacon said -- don't deduce from first principles, experiment!

I would also try amortizing the analysis with triggers, rather than building the table all at once; this may be better or worse, depending on the on-line character of the application (e.g., if they are waiting at an ATM, it might be a deal breaker to add two seconds to do an insert / update, but not if you are tailing off a log file that gets updated every minute or so.)
W
Re: [GENERAL] Database slowness -- my design, hardware, or both?
Hi, Alvaro Herrera.  You wrote:

> Don't assume -- measure.  I had a query which ran orders of magnitude
> faster because I interpolated the constant list in the big query.  The
> table from which the interpolated values were being extracted had about
> 30 rows or so.

OK, I modified things to use interpolation.  Here's the updated query:

explain UPDATE Transactions
           SET previous_value = previous_value(id)
         WHERE new_value IS NOT NULL
           AND new_value <> ''
           AND node_id IN (351, 169, 664, 240);

And here is the query plan produced by explain:

                                    QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
   Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
   Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
   ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
         Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
(5 rows)

I'm still a bit surprised by how different the query plan came out with what would seem like a minor change.

Reuven
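The bitmap scans above are Postgres-specific, but the underlying effect -- an IN list of constants over an indexed column becomes an index search instead of a sequential scan -- can be sketched portably with SQLite's EXPLAIN QUERY PLAN from Python (toy schema modeled loosely on the thread; SQLite's planner and output format differ from Postgres, so this is an illustration of the concept only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                               node_id INTEGER,
                               new_value TEXT);
    CREATE INDEX node_id_idx ON transactions (node_id);
""")

# Ask the planner how it would run the IN-list query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM transactions WHERE node_id IN (351, 169, 664, 240)"
).fetchall()

# The last column of each plan row is the human-readable detail;
# with the index in place it reports a SEARCH using node_id_idx
# rather than a full-table SCAN.
detail = " ".join(row[-1] for row in plan)
print(detail)
```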
Re: [GENERAL] Database slowness -- my design, hardware, or both?
Reuven M. Lerner escribió:
> Hi, Webb Sprague.  You wrote:
> >... but I see two seq scans in your explain in a loop -- this is
> >probably not good.  If you can find a way to rewrite the IN clause
> >(either de-normalizing through triggers to save whatever you need on
> >an insert and not have to deal with a set, or by using except in the
> >query, or something else more creative)...
> I would normally agree that an IN clause is a place to worry -- except
> that I'm using IN on a very small table, with about 4-5 rows.  That
> might indeed affect things, and I could certainly pull out these values
> into a Perl variable that I then interpolate literally into the SQL.
> However, I have to assume that this wouldn't affect things all that much.

Don't assume -- measure.  I had a query which ran orders of magnitude faster because I interpolated the constant list in the big query.  The table from which the interpolated values were being extracted had about 30 rows or so.

--
Alvaro Herrera                        http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [GENERAL] Database slowness -- my design, hardware, or both?
Hi, Webb Sprague.  You wrote:

> ... but I see two seq scans in your explain in a loop -- this is
> probably not good.  If you can find a way to rewrite the IN clause
> (either de-normalizing through triggers to save whatever you need on
> an insert and not have to deal with a set, or by using except in the
> query, or something else more creative)...

I would normally agree that an IN clause is a place to worry -- except that I'm using IN on a very small table, with about 4-5 rows.  That might indeed affect things, and I could certainly pull out these values into a Perl variable that I then interpolate literally into the SQL.  However, I have to assume that this wouldn't affect things all that much.

> Also -- there is a good book on temporal databases by Snodgrass that
> might give some interesting ideas; maybe you have already seen it, but
> still.

Heh -- that's one of the first references that I looked at when I started this project.  Perhaps I'll look at it again; the design of my data warehouse took some of those suggestions into account, but I wouldn't be surprised if there's something more I could be doing.

> I am thinking you could increment a sequence variable to give you a
> "tick" integer with each action, rather than trying to use timestamps
> with all their overhead and inaccuracy (1 second is a long time,
> really).  Lamport also did work on clocks that might apply.

I agree that 1 second is a ridiculously long time.  The logfiles were created by people outside of my research group, and are a wonderful lesson in how to use XML poorly and inefficiently.  The designers of the logfile weren't expecting to get dozens or hundreds of values per second.  You can be sure that my research group, which is now looking to create logfiles of our own, will place a strong emphasis on high-resolution timestamps.

Practically speaking, I've managed to get around this problem by using the "id" column, which comes from a sequence, and is thus guaranteed to be increasing.  We can thus be assured that a row with an ID of x will come before a row whose ID is x + 10, so long as their transaction types are the same.  So we can't guarantee that a node entry comes before a variable setting via the ID, but we can order two variable settings based on the ID.

> Also have you tried dropping all your fk and checks just to see if you
> get a difference in speed on an update?  It would be interesting,
> perhaps.

I sorta figured that because our UPDATE is taking place on a column without a foreign key, this wouldn't matter much.  Am I wrong?

> If you could get rid of the sorted limit 1 clause in your function,
> there would be less variability and it would be easier to understand;
> you probably need to denormalize somehow, perhaps using ticks; I am
> not sure.

Yeah, that's a good point.  I'm not sure how possible that'll be, though, given that at *some* point, I need to find the "most recent" setting of the same variable.  There might be dozens or hundreds of rows separating the assignment of a new value to a particular variable.

> Could a trigger set your previous_value on insert or update, rather
> than querying for it later?

Well, what I've been doing is INSERTing each row into the "transactions" table without previous_value.  Then I run the mass UPDATE for previous_value, doing nothing else at the same time.  Then, after previous_value is set, I do the query.

Reuven
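One way to find the "most recent" setting of the same variable without a per-row "ORDER BY id DESC LIMIT 1" lookup is a window function: LAG() over a partition keyed by the variable and ordered by the sequence-generated id computes every previous value in a single pass.  A sketch in SQLite via Python (the real Transactions schema isn't shown in the thread, so the table here is a simplified, invented stand-in; Postgres gained window functions in 8.4, after this thread, so this was not available to the original posters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                               variable TEXT,
                               new_value TEXT,
                               previous_value TEXT);
    INSERT INTO transactions (id, variable, new_value) VALUES
        (1, 'x', 'a'), (2, 'y', 'q'), (3, 'x', 'b'), (4, 'x', 'c');
""")

# LAG() looks back one row within each variable's id-ordered partition,
# replacing the row-at-a-time function call with one set-based query.
rows = conn.execute("""
    SELECT id,
           LAG(new_value) OVER (PARTITION BY variable ORDER BY id)
      FROM transactions
     ORDER BY id
""").fetchall()
print(rows)   # [(1, None), (2, None), (3, 'a'), (4, 'b')]
```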
Re: [GENERAL] Database slowness -- my design, hardware, or both?
I am not able to look as closely as it deserves...

... but I see two seq scans in your explain in a loop -- this is probably not good.  If you can find a way to rewrite the IN clause (either de-normalizing through triggers to save whatever you need on an insert and not have to deal with a set, or by using except in the query, or something else more creative)...

Also -- there is a good book on temporal databases by Snodgrass that might give some interesting ideas; maybe you have already seen it, but still.  I am thinking you could increment a sequence variable to give you a "tick" integer with each action, rather than trying to use timestamps with all their overhead and inaccuracy (1 second is a long time, really).  Lamport also did work on clocks that might apply.

Also have you tried dropping all your fk and checks just to see if you get a difference in speed on an update?  It would be interesting, perhaps.

If you could get rid of the sorted limit 1 clause in your function, there would be less variability and it would be easier to understand; you probably need to denormalize somehow, perhaps using ticks; I am not sure.

Could a trigger set your previous_value on insert or update, rather than querying for it later?

> I'm now thinking of separating each activity into its own database, in
> the hopes that this will improve the system speed enough to do what I
> need.  But I'm far from convinced that this will really help.

Better to figure out the real problem -- more interesting, more scalable.

Hope my incoherent brain dump lights a spark.
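Webb's trigger idea -- pay for the previous_value lookup at insert time instead of in one huge UPDATE later -- can be sketched like this in SQLite via Python (the schema is an invented simplification; in Postgres this would be a PL/pgSQL trigger function attached with CREATE TRIGGER, not the SQLite syntax shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                               variable TEXT,
                               new_value TEXT,
                               previous_value TEXT);

    -- Amortize the work: each INSERT fills in its own previous_value,
    -- so no mass UPDATE over the whole table is needed afterward.
    CREATE TRIGGER set_prev AFTER INSERT ON transactions
    BEGIN
        UPDATE transactions
           SET previous_value = (SELECT t.new_value
                                   FROM transactions t
                                  WHERE t.variable = NEW.variable
                                    AND t.id < NEW.id
                                  ORDER BY t.id DESC LIMIT 1)
         WHERE id = NEW.id;
    END;

    INSERT INTO transactions (id, variable, new_value) VALUES (1, 'x', 'a');
    INSERT INTO transactions (id, variable, new_value) VALUES (2, 'x', 'b');
""")

prev = conn.execute(
    "SELECT previous_value FROM transactions WHERE id = 2").fetchone()[0]
print(prev)   # 'a' -- filled in by the trigger at insert time
```

Whether this wins depends, as Webb says, on the on-line character of the application: the lookup cost moves from one batch job onto every insert.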
Re: [GENERAL] Database slowness -- my design, hardware, or both?
Hi, Webb Sprague.  You wrote:

> How much concurrency is there on your database?

Almost none.  I'm generally running one client against the server.  I often have a second client connected simultaneously, just for the purpose of answering short questions.

I'm now thinking of separating each activity into its own database, in the hopes that this will improve the system speed enough to do what I need.  But I'm far from convinced that this will really help.

Reuven
Re: [GENERAL] Database slowness -- my design, hardware, or both?
> Heh.  Sure thing.  I wasn't sure how much detail to give when
> initially posting.

Looks like enough to get the real experts on the list started :)  I will try to look again tomorrow, but I bet other folks have better intuition than me.

How much concurrency is there on your database?
Re: [GENERAL] Database slowness -- my design, hardware, or both?
> Well, I've tried to do massive UPDATEs as much as possible.  But the
> patterns that we're looking for are basically of the variety, "If the
> user clicks on X and then clicks on Y, but without Z between the two of
> them, and if these are all part of the same simulation run, then we tag
> action X as being of interest to us."  So it's oodles of keeping track
> of back-and-forth for each of the rows in the table, and looking
> forward and backward in the table.  I agree that row-at-a-time thinking
> isn't the best way to work, but I didn't see a good alternative for our
> purposes.  I'm open to any and all suggestions.

Can you post at least some table schemas, indexes, queries, and explain output?  I think of database optimization as a serious case of the devil in the details, and generalities (like: make sure you index, make sure your indexes help using explain, avoid row-at-a-time thinking) won't get you far.  So if we had something concrete to work with, well, we would have something concrete to work with.
Re: [GENERAL] Database slowness -- my design, hardware, or both?
Hi, Tom.  You wrote:

>> Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't
>> failed me yet.  But I've been having some serious performance problems
>> on a database that I've been using in my grad-school research group,
>> and it's clear that I need help from some more experienced hands.
>
> What PG version are you using?

I've been using 8.0, 8.1, and 8.2 at various points, depending on which machine I've been using.  My main machine is currently using 8.2.0.  (I wish that I had control over which version was being used, but my sysadmin powers are inversely proportional to the computer power made available to me.)

> I'd try to think of a way to eliminate the function altogether in
> favor of a single UPDATE command.  In general, row-at-a-time thinking
> isn't the way to win in SQL.

Well, I've tried to do massive UPDATEs as much as possible.  But the patterns that we're looking for are basically of the variety, "If the user clicks on X and then clicks on Y, but without Z between the two of them, and if these are all part of the same simulation run, then we tag action X as being of interest to us."  So it's oodles of keeping track of back-and-forth for each of the rows in the table, and looking forward and backward in the table.  I agree that row-at-a-time thinking isn't the best way to work, but I didn't see a good alternative for our purposes.  I'm open to any and all suggestions.

Reuven
Re: [GENERAL] Database slowness -- my design, hardware, or both?
"Reuven M. Lerner" <[EMAIL PROTECTED]> writes:
> Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't
> failed me yet.  But I've been having some serious performance problems
> on a database that I've been using in my grad-school research group, and
> it's clear that I need help from some more experienced hands.

What PG version are you using?

> In theory, this would be the way to go.  In practice, every step has
> become a performance nightmare, taking many more hours than I might have
> hoped or expected.  For example, I've got a Pl/PgSQL function that goes
> through each variable-assignment row, and UPDATEs its previous_value
> column with whatever the previous value might have been.

I'd try to think of a way to eliminate the function altogether in favor of a single UPDATE command.  In general, row-at-a-time thinking isn't the way to win in SQL.

                        regards, tom lane
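Tom's suggestion -- one set-based UPDATE instead of a per-row function call -- can be sketched with a correlated subquery doing the "previous value of the same variable" lookup inline.  A sketch in SQLite via Python (invented simplified schema, since the real one isn't posted; the same UPDATE shape works in Postgres, where the planner then decides how to execute the subquery across all rows at once):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                               variable TEXT,
                               new_value TEXT,
                               previous_value TEXT);
    INSERT INTO transactions (id, variable, new_value) VALUES
        (1, 'x', 'a'), (2, 'y', 'q'), (3, 'x', 'b');
""")

# One statement: the correlated subquery replaces the PL/pgSQL
# function's per-row "ORDER BY id DESC LIMIT 1" query.
conn.execute("""
    UPDATE transactions
       SET previous_value = (SELECT t.new_value
                               FROM transactions t
                              WHERE t.variable = transactions.variable
                                AND t.id < transactions.id
                              ORDER BY t.id DESC LIMIT 1)
     WHERE new_value IS NOT NULL
       AND new_value <> ''
""")

rows = conn.execute(
    "SELECT id, previous_value FROM transactions ORDER BY id").fetchall()
print(rows)   # [(1, None), (2, None), (3, 'a')]
```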