Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
> >> This bothers me a bit, because in fact the effects if any of the
> >> tested query would have been rolled back. Not sure we have any
> >> choice though. If we expose the error then we'll have problems with
> >> clients not showing the EXPLAIN results.
>
> > I think we should leave it in top level, throw the error and fix the
> > clients.
> > As I understood, the idea was, that it only does that if you press ^C
> > or query timeout. In this case current clients would also not show
> > the plan.
>
> Not if the clients are implemented per protocol spec. A client cannot
> assume that sending QueryCancel will make the current query fail.

Sorry, I don't understand that comment. I did not say that it must fail;
the case I meant was when it is interrupted (and thus fails). You stated
that current clients won't show the EXPLAIN output if they get an error
response. (Does the protocol not allow both data and an error?) We would
need to teach clients to output the EXPLAIN result even if an error is
returned.

I stand by my comment: on ^C we should return the plan and return the
error. We should not misuse automatic subtransactions for this.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
> "MvO" == Martijn van Oosterhout writes:

MvO> What we want is just a monotonically increasing counter that can
MvO> be read quickly and consistently, we're not majorly fussed if it
MvO> doesn't match real time. This puts us back to CPU cycle counters,
MvO> but they have drawbacks of their own.

It is amazing how this discussion is mirroring discussions on the
linux-kernel list.

Applications have been using CPU cycle counters on Linux to avoid the
gettimeofday() overhead. With reasonably recent kernels, the overhead is
very low when the CPU cycle counters are usable, because gettimeofday()
never actually enters kernel space.

Unfortunately, fewer and fewer systems seem to have usable cycle
counters. As an example, a dual-core Athlon64/Opteron boots with the
cycle counters unsynced. The kernel can compensate for that. However,
they also lose sync whenever the clock frequency changes, and the kernel
has a really hard time compensating for that. On such systems the kernel
switches back to slower timers and gettimeofday() becomes a real system
call. Applications should not try to use cycle counters directly on such
systems. If the kernel developers manage a good workaround,
gettimeofday() becomes fast again, but applications which use cycle
counters directly most likely stay broken.

Basically, either gettimeofday() is fast, or the cycle counters are
useless -- unless you really care about counting CPU cycles and not real
time. Some CPUs, like Transmeta's, actually get the counter thing right
and count by 2 every tick when running at half speed, and so on.

/Benny

---(end of broadcast)---
TIP 6: explain analyze is your friend
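Benny's point about frequency scaling can be checked with a little arithmetic. The sketch below is a hypothetical illustration (the function names, the 2 GHz figure, and the split interval are invented for the example, not taken from any real hardware): converting a raw cycle count to elapsed time requires assuming a fixed frequency, so a naive conversion goes wrong when the CPU halves its clock mid-interval, while a Transmeta-style counter that increments by 2 per half-speed tick keeps matching wall time.

```python
# Hypothetical illustration: raw cycle counters break as time sources
# when CPU frequency changes, but a counter that scales its increment
# with frequency (as Transmeta CPUs reportedly did) keeps counting
# real time correctly.

NOMINAL_HZ = 2_000_000_000  # assume a nominal 2 GHz CPU (made-up figure)

def elapsed_from_raw_tsc(ticks):
    # Naive conversion: assumes the counter always runs at nominal speed.
    return ticks / NOMINAL_HZ

def simulate(seconds_full, seconds_half, scaled):
    """Count ticks over an interval spent partly at full speed and
    partly at half speed. A 'scaled' counter adds 2 per half-speed tick."""
    full_ticks = seconds_full * NOMINAL_HZ
    half_ticks = seconds_half * (NOMINAL_HZ // 2)
    half_ticks *= 2 if scaled else 1
    return full_ticks + half_ticks

# Real elapsed time: 1 s at full speed + 1 s at half speed = 2.0 s
raw = elapsed_from_raw_tsc(simulate(1, 1, scaled=False))
good = elapsed_from_raw_tsc(simulate(1, 1, scaled=True))

print(raw)   # 1.5 -- naive TSC undercounts the half-speed second
print(good)  # 2.0 -- frequency-scaled counter matches wall time
```

This is exactly why "either gettimeofday() is fast, or the cycle counters are useless": once frequency varies, only the kernel (or scaled hardware) can do the conversion correctly.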
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Greg Stark <[EMAIL PROTECTED]> writes: > There seem to be two types of overhead going on. There's the amount of time > spent in gettimeofday itself which is pretty consistent. That is a fact not in evidence. The impression I had from that linux-kernel discussion was that the problematic kernel code was looping until it got consistent results from successive hardware reads. I'm not at all sure that you can make the above assertion across all varieties of clock hardware, or even all common varieties. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes: > The interesting thing about this is that they obviously are gearing > gettimeofday() to be accurate, rather than just considering it a > counter that is somewhat close to real time. At the expense of speed. Not sure that that's an accurate description. What I think the kernel fuss is about is that they have to read the counter value as several separate "byte read" operations, and if the hardware doesn't latch the whole counter value when the first byte is pulled then they'll get bytes from several distinct states of the counter, leading to something that's not consistent or even monotonic. On non-latching hardware there's really not a lot of choice what to do. The patch is about not using that same very-slow code path on hardware that does latch. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
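Tom's description of torn multi-byte counter reads can be made concrete with a small simulation (a hypothetical Python sketch; `unlatched_read` and the 16-bit counter are illustrative, not any real clock chip's interface). If the hardware doesn't latch the counter, the two byte reads can straddle a carry and yield a value from no single state of the counter:

```python
def unlatched_read(before, after, low_first=True):
    """Simulate reading a 16-bit hardware counter one byte at a time,
    while the counter increments from `before` to `after` between the
    two byte reads (i.e. the hardware does not latch the full value)."""
    if low_first:
        low = before & 0xFF      # low byte from the old counter state
        high = after >> 8        # high byte from the new counter state
    else:
        high = before >> 8
        low = after & 0xFF
    return (high << 8) | low

# Counter ticks from 0x00FF to 0x0100 between the two byte reads:
print(hex(unlatched_read(0x00FF, 0x0100)))                   # 0x1ff
print(hex(unlatched_read(0x00FF, 0x0100, low_first=False)))  # 0x0
```

Reading low-byte-first jumps ahead to 0x01FF (so the next honest read appears to go backward); reading high-byte-first collapses to 0x0000. Either way the result is, as Tom says, not consistent or even monotonic, which is why the kernel's slow path re-reads until it gets a stable value.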
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes: > Right now I'm confused though. I was under the impression the changes > were going to be ripped out because it was decided to be unworkable. I > think improvements can be made but I'm unsure if there's any > interest... I've reverted the current patch because it clearly doesn't work well enough. There's nothing stopping you from having a better idea though. It's clear that on some platforms the cost of gettimeofday is high enough that some workaround would be good. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Fri, Jun 09, 2006 at 03:55:28PM -0400, Tom Lane wrote:
> Greg Stark <[EMAIL PROTECTED]> writes:
> > There seem to be two types of overhead going on. There's the amount
> > of time spent in gettimeofday itself which is pretty consistent.
>
> That is a fact not in evidence. The impression I had from that
> linux-kernel discussion was that the problematic kernel code was
> looping until it got consistent results from successive hardware
> reads. I'm not at all sure that you can make the above assertion
> across all varieties of clock hardware, or even all common varieties.

The interesting thing about this is that they obviously are gearing
gettimeofday() to be accurate, rather than just considering it a counter
that is somewhat close to real time. At the expense of speed.

What we want is just a monotonically increasing counter that can be read
quickly and consistently; we're not majorly fussed if it doesn't match
real time. This puts us back to CPU cycle counters, but they have
drawbacks of their own.

Have a nice day,
-- 
Martijn van Oosterhout   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
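What Martijn describes -- a fast, monotonically increasing counter that need not track wall-clock time -- is what POSIX later standardized as `clock_gettime(CLOCK_MONOTONIC)`. A minimal sketch of the distinction in Python, whose `time.monotonic()` wraps that facility on Linux (the `measure` helper is invented for illustration; none of this is PostgreSQL code):

```python
import time

# time.time() reports wall-clock time: it can jump backward if the
# system clock is stepped (e.g. by NTP), which is fatal for interval
# timing. time.monotonic() is only guaranteed to move forward -- exactly
# the property needed for per-node elapsed-time measurement.

def measure(fn, clock=time.monotonic):
    """Time a call using a clock that can never run backward."""
    start = clock()
    fn()
    return clock() - start

elapsed = measure(lambda: sum(range(100_000)))
assert elapsed >= 0.0  # a monotonic clock never yields a negative interval

# Successive reads never decrease, even across clock adjustments:
a = time.monotonic()
b = time.monotonic()
print(b >= a)  # True
```

With `time.time()` in place of `time.monotonic()`, the same interval could come out negative across a clock step, which is the failure mode a plain "somewhat close to real time" counter avoids.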
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane <[EMAIL PROTECTED]> writes:

> Greg Stark <[EMAIL PROTECTED]> writes:
> > To avoid user confusion it would be reasonable to print out a line
> > at the bottom:
> > Explain analyze profiling overhead removed: xxx ms
> > That also gives the user feedback on how precise their explain
> > analyze results are. If they see that the overhead being removed is
> > as large as the timing remaining then they can realize that the
> > results aren't especially precise. On the other hand if they see
> > that the overhead being removed is small then they can be pretty
> > confident in the precision of the results.
>
> I would prefer to leave the numbers unmanipulated, because frankly I'd
> have no confidence in the correction. Perhaps we could put a note at
> the bottom about "Estimated total profiling overhead: xxx ms".

Well, the whole reason the overhead is deceptive is that some nodes call
gettimeofday many more times proportionally to their running time than
others. It makes users suspect CPU-intensive nodes that process many
records when the real problem lies in an I/O-intensive node that
processes few records. Just showing the total overhead will give the
user the impression that the time is distributed evenly throughout the
nodes proportionally to their total time, which doesn't help correct the
distortion.

There seem to be two types of overhead going on. There's the amount of
time spent in gettimeofday itself, which is pretty consistent. Then
there are other effects that cause normal operation itself to take
longer (or potentially even less long). Postgres may as well remove the
consistent gettimeofday overhead it can identify and measure, even if
there's more it can't characterize perfectly.

-- 
greg

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Greg Stark <[EMAIL PROTECTED]> writes:
> To avoid user confusion it would be reasonable to print out a line at
> the bottom:
> Explain analyze profiling overhead removed: xxx ms
> That also gives the user feedback on how precise their explain analyze
> results are. If they see that the overhead being removed is as large
> as the timing remaining then they can realize that the results aren't
> especially precise. On the other hand if they see that the overhead
> being removed is small then they can be pretty confident in the
> precision of the results.

I would prefer to leave the numbers unmanipulated, because frankly I'd
have no confidence in the correction. Perhaps we could put a note at the
bottom about "Estimated total profiling overhead: xxx ms".

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane <[EMAIL PROTECTED]> writes:

> Greg Stark <[EMAIL PROTECTED]> writes:
> > Are we still going to subtract out the gettimeofday overhead?
> > I was always more excited about that than the sampling aspect. I've
> > run into queries where EXPLAIN ANALYZE results were deceptive due to
> > the gettimeofday overhead but I've never run into a query where
> > gettimeofday overhead made running the query under EXPLAIN ANALYZE
> > impractical.
>
> That would be deceptive in a different way, ie, make it look like
> there was a whole bunch of outside-the-plan overhead.

Well, that's exactly what there is.

To avoid user confusion it would be reasonable to print out a line at
the bottom:

Explain analyze profiling overhead removed: xxx ms

That also gives the user feedback on how precise their explain analyze
results are. If they see that the overhead being removed is as large as
the timing remaining then they can realize that the results aren't
especially precise. On the other hand, if they see that the overhead
being removed is small then they can be pretty confident in the
precision of the results.

-- 
greg

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Greg Stark <[EMAIL PROTECTED]> writes: > Are we still going to subtract out the gettimeofday overhead? > I was always more excited about that than the sampling aspect. I've run into > queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday > overhead but I've never run into a query where gettimeofday overhead made > running the query under EXPLAIN ANALYZE impractical. That would be deceptive in a different way, ie, make it look like there was a whole bunch of outside-the-plan overhead. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane <[EMAIL PROTECTED]> writes: > I was intending to push forward with the idea of being able to get > numbers out of a canceled EXPLAIN. That will allow you to get some > information even when the underlying query runs longer than you're > willing to tolerate. I still say that the number of queries where > avoiding gettimeofday overhead would transform an intolerable runtime > into a tolerable one is pretty limited. Are we still going to subtract out the gettimeofday overhead? I was always more excited about that than the sampling aspect. I've run into queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday overhead but I've never run into a query where gettimeofday overhead made running the query under EXPLAIN ANALYZE impractical. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Fri, Jun 09, 2006 at 11:22:11AM -0400, A.M. wrote:
> It would be nice to keep the gettimeofday()s wherever they are most
> useful on hardware/software where they are cheap. Perhaps a
> compile-time option?

I think making it optional at compile time is worse, because then at
support time you might have to recompile to get the option you want.

Right now I'm confused though. I was under the impression the changes
were going to be ripped out because it was decided to be unworkable. I
think improvements can be made, but I'm unsure if there's any
interest...

Have a nice day,
-- 
Martijn van Oosterhout   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
"Zeugswetter Andreas DCP SD" <[EMAIL PROTECTED]> writes:
>> This bothers me a bit, because in fact the effects if any of the
>> tested query would have been rolled back. Not sure we have any choice
>> though. If we expose the error then we'll have problems with clients
>> not showing the EXPLAIN results.

> I think we should leave it in top level, throw the error and fix the
> clients.

> As I understood, the idea was, that it only does that if you press ^C
> or query timeout. In this case current clients would also not show the
> plan.

Not if the clients are implemented per protocol spec. A client cannot
assume that sending QueryCancel will make the current query fail.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Fri, 2006-06-09 at 10:56 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > I propose we revert the sampling patch (sorry Martijn) > > yah ... > > > and go with the > > patch to have an explain_analyze_timing parameter (default=on). > > This I'm unexcited about. EXPLAIN output isn't all that transparent > anyway, and losing the extra cue of seeing where the time is really > going would make it extremely easy for people to misinterpret their > problems. As before, agreed, but it works and is available now. > I was intending to push forward with the idea of being able to get > numbers out of a canceled EXPLAIN. That will allow you to get some > information even when the underlying query runs longer than you're > willing to tolerate. I still say that the number of queries where > avoiding gettimeofday overhead would transform an intolerable runtime > into a tolerable one is pretty limited. That would be a good feature to have. > The other thing that I think would be worth investigating is > timer-driven sampling, although it's not yet clear whether we can > make that work usefully. Anyone want to take up that project? Not me, sorry. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes:
> On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
>> I had thought we were applying an order-of-ten-percent correction by
>> subtracting SampleOverhead, not an order-of-10x correction :-(

> Eh? The whole point is to call gettimeofday() much less often. If you
> call it 1000th as often, then the correction is only on the order of
> one hundredth of the normal query time...

No, because the correction calculation is

    totaltime += (average time per sampled execution - SampleOverhead)
                 * (number of unsampled executions)

If SampleOverhead is 90% of the average time per sampled execution, then
you are multiplying something with a large component of cancellation
error by a possibly-large number.

As an example, using the numbers I posted here for my old PC:
http://archives.postgresql.org/pgsql-hackers/2006-06/msg00407.php
the actual runtime is clearly about 1 usec per tuple but enabling timing
adds 8.5 usec per tuple. If we suppose we sampled 1 out of 1 million
rows, then we'd have

    raw totaltime = 1 * 9.5 usec = 95msec
    avg time/execution = 9.5 usec
    SampleOverhead = 8.5 usec
    number of unsampled executions = 99
    correction = 990msec

which means that a 10% error in estimating SampleOverhead would
contribute as much to the final estimate as the actual measurement did.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
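Tom's cancellation-error argument can be checked numerically. The sketch below applies the correction formula from his message with illustrative figures (1 µs of true work and 8.5 µs of timing overhead per sampled execution, one million executions of which 1,000 are sampled -- hypothetical round numbers, not exact values from the thread):

```python
# Correction formula from the sampling-patch discussion:
#   totaltime += (avg_sampled_time - SampleOverhead) * unsampled_count
# When SampleOverhead is ~90% of avg_sampled_time, even a small relative
# error in estimating it dominates the corrected total.

TRUE_PER_EXEC = 1.0e-6   # 1 usec of real work per execution (assumed)
OVERHEAD = 8.5e-6        # 8.5 usec added by timing per sampled execution
N_TOTAL = 1_000_000      # total executions (assumed)
N_SAMPLED = 1_000        # executions actually timed (assumed)

def corrected_total(overhead_estimate):
    avg_sampled = TRUE_PER_EXEC + OVERHEAD   # 9.5 usec measured per sample
    raw = N_SAMPLED * avg_sampled            # time actually measured
    unsampled = N_TOTAL - N_SAMPLED
    return raw + (avg_sampled - overhead_estimate) * unsampled

exact = corrected_total(OVERHEAD)            # perfect overhead estimate
off_by_10pct = corrected_total(OVERHEAD * 0.9)

print(exact)         # ~1.009 s: close to the ~1 s of true work
print(off_by_10pct)  # ~1.86 s: a 10% overhead error nearly doubles it
```

The subtraction `avg_sampled - overhead_estimate` cancels two nearly equal quantities and is then multiplied by a million, which is exactly Tom's point: the final number is only as good as the SampleOverhead estimate.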
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
It would be nice to keep the gettimeofday()s wherever they are most useful on hardware/software where they are cheap. Perhaps a compile-time option? On Fri, June 9, 2006 11:18 am, Martijn van Oosterhout wrote: > On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote: > >> To tell you the truth, this information makes me even less pleased with >> the sampling-gettimeofday patch than I was before. If gettimeofday() >> in itself increases the runtime of a node by a factor of 10, then just >> trying to subtract off that time is no solution. There's too much >> impact on surrounding nodes, and too much roundoff error anyhow. I had >> thought we were applying an order-of-ten-percent correction by >> subtracting SampleOverhead, not an order-of-10x correction :-( > > Eh? The whole point is to call gettimeofday() much less often. If you > call it 1000th as often, then the correction is only on the order of one > hundredth of the normal query time... > > Subtracting SampleOverhead is only a correction on the order of a few > percent, it's the reduced calling of gettimeofday() that provides the > benefit. > > Have a nice day, > -- > Martijn van Oosterhout http://svana.org/kleptog/ > >> From each according to his ability. To each according to his ability to >> litigate. > ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
> To tell you the truth, this information makes me even less pleased
> with the sampling-gettimeofday patch than I was before. If
> gettimeofday() in itself increases the runtime of a node by a factor
> of 10, then just trying to subtract off that time is no solution.
> There's too much impact on surrounding nodes, and too much roundoff
> error anyhow. I had thought we were applying an order-of-ten-percent
> correction by subtracting SampleOverhead, not an order-of-10x
> correction :-(

Eh? The whole point is to call gettimeofday() much less often. If you
call it a 1000th as often, then the correction is only on the order of
one hundredth of the normal query time...

Subtracting SampleOverhead is only a correction on the order of a few
percent; it's the reduced calling of gettimeofday() that provides the
benefit.

Have a nice day,
-- 
Martijn van Oosterhout   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > I propose we revert the sampling patch (sorry Martijn) yah ... > and go with the > patch to have an explain_analyze_timing parameter (default=on). This I'm unexcited about. EXPLAIN output isn't all that transparent anyway, and losing the extra cue of seeing where the time is really going would make it extremely easy for people to misinterpret their problems. I was intending to push forward with the idea of being able to get numbers out of a canceled EXPLAIN. That will allow you to get some information even when the underlying query runs longer than you're willing to tolerate. I still say that the number of queries where avoiding gettimeofday overhead would transform an intolerable runtime into a tolerable one is pretty limited. The other thing that I think would be worth investigating is timer-driven sampling, although it's not yet clear whether we can make that work usefully. Anyone want to take up that project? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
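Tom's "timer-driven sampling" idea -- charge time to whichever plan node is current when a periodic timer fires, instead of calling gettimeofday() around every node -- can be sketched with an interval timer. This is a toy, Unix-only illustration using Python's `signal.setitimer`; the `current_node` variable and `samples` counter are invented stand-ins for executor state, and nothing like this existed in PostgreSQL at the time:

```python
import signal
import time
from collections import Counter

samples = Counter()
current_node = "idle"  # stand-in for "which plan node is executing now"

def on_tick(signum, frame):
    # Each timer tick charges one sample to the currently running node.
    samples[current_node] += 1

signal.signal(signal.SIGALRM, on_tick)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10 ms

deadline = time.monotonic() + 0.12
current_node = "Seq Scan"
while time.monotonic() < deadline:
    pass  # busy "executor" work; timer ticks land while this runs

signal.setitimer(signal.ITIMER_REAL, 0.0)  # stop the timer

# A node's cost is then estimated as (sample count) x (timer interval),
# with no per-row clock calls at all.
print(samples["Seq Scan"] >= 1)  # True on a normally loaded machine
```

The trade-off is the one the thread hints at: overhead becomes nearly constant regardless of row counts, but short-lived nodes may receive few or no samples, which is presumably part of why it was "not yet clear whether we can make that work usefully".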
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Fri, 2006-06-09 at 10:00 -0400, Tom Lane wrote: > I think the bottom line here is that there are some machines out there > where gettimeofday() is fast enough for our purposes, and some where > it is nowhere near fast enough. And that kernel-level fixes may be > possible for some of the latter, but not all, and we shouldn't hold our > breath waiting for that to happen anyway. Agreed. The issue seems to be some systems are set to get exactly correct times and some are set to return a time (possibly imprecise) with low overhead. Even if fixes existed, OS packagers may not pick the right one of those two options for our purposes for EA. (We might prefer accuracy to speed for other parts of PostgreSQL anyway). I propose we revert the sampling patch (sorry Martijn) and go with the patch to have an explain_analyze_timing parameter (default=on). That way we'll have an option to turn off timing *if* we happen to be running on a OS/hw combo that sucks *and* trying to run large enough EAs that we care. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
"Larry Rosenman" writes: > Does my post from yesterday (39ms without explain analyze, 280+ with explain > analyze) on FreeBSD/amd64, Dual Xeon's in HTT mode help? Well, it confirms that FreeBSD is subject to the same problem ;-). I think the bottom line here is that there are some machines out there where gettimeofday() is fast enough for our purposes, and some where it is nowhere near fast enough. And that kernel-level fixes may be possible for some of the latter, but not all, and we shouldn't hold our breath waiting for that to happen anyway. To tell you the truth, this information makes me even less pleased with the sampling-gettimeofday patch than I was before. If gettimeofday() in itself increases the runtime of a node by a factor of 10, then just trying to subtract off that time is no solution. There's too much impact on surrounding nodes, and too much roundoff error anyhow. I had thought we were applying an order-of-ten-percent correction by subtracting SampleOverhead, not an order-of-10x correction :-( regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > Hmm, I take that back. I updated to the current CVS tip on this machine > > and now I don't see the problem. Without EXPLAIN ANALYZE, the times are > > around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. > > At the moment, CVS HEAD still has the sampling patch in it. You need to > be looking at 8.1, or HEAD from several weeks ago, to see the full > impact of gettimeofday-every-time. (I imagine Simon's strace count > is wrong for the same reason.) Ok, then I re-retract, because the test I posted previously was done with 8.1. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: >> Hmm, I take that back. I updated to the current CVS tip on this >> machine and now I don't see the problem. Without EXPLAIN ANALYZE, >> the times are around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. > > At the moment, CVS HEAD still has the sampling patch in it. You need > to be looking at 8.1, or HEAD from several weeks ago, to see the full > impact of gettimeofday-every-time. (I imagine Simon's strace count > is wrong for the same reason.) > Does my post from yesterday (39ms without explain analyze, 280+ with explain analyze) on FreeBSD/amd64, Dual Xeon's in HTT mode help? What can I do to further the investigation? It has 8.1.4 on it. LER -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 512-248-2683 E-Mail: ler@lerctr.org US Mail: 430 Valona Loop, Round Rock, TX 78681-3893 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Alvaro Herrera <[EMAIL PROTECTED]> writes: > Hmm, I take that back. I updated to the current CVS tip on this machine > and now I don't see the problem. Without EXPLAIN ANALYZE, the times are > around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. At the moment, CVS HEAD still has the sampling patch in it. You need to be looking at 8.1, or HEAD from several weeks ago, to see the full impact of gettimeofday-every-time. (I imagine Simon's strace count is wrong for the same reason.) regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs wrote:
> On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote:
> > Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms
> > > for the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
> > > The second machine, desktop Celeron 533, gives 197-200 ms for the
> > > SELECT and 788-790 for the EXPLAIN ANALYZE. I guess this is the
> > > reproduction you were looking for.
> >
> > Do you have oprofile installed on these? Comparing oprofile results
> > might give some more insight where the time is going.
>
> On my office PC the overhead from EA is about 20%.

Can you please run the test program in this thread, if you are running
Linux?
http://marc.theaimsgroup.com/?l=linux-kernel&m=114297656924494&w=2

The machine I reported to show the problem says

# ./a.out
0xe408 0
./a.out: Detect PM-Timer Bug

-- 
Alvaro Herrera   http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Alvaro Herrera wrote: > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > you were looking for. Hmm, I take that back. I updated to the current CVS tip on this machine and now I don't see the problem. Without EXPLAIN ANALYZE, the times are around 115-117 ms. With EXPLAIN ANALYZE, 149-152ms. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for > > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. > > > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > > you were looking for. > > Do you have oprofile installed on these? Comparing oprofile results > might give some more insight where the time is going. On my office PC the overhead from EA is about 20%. stracing a backend, I also notice that on a 1.5 million row table we make only 39158 calls to gettimeofday, on a table of 24591 blocks. I'm thinking that the instrumentation overhead has been reduced as a result of the page-at-a-time heap scans we now have? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
> This bothers me a bit, because in > fact the effects if any of the tested query would have been > rolled back. Not sure we have any choice though. If we > expose the error then we'll have problems with clients not > showing the EXPLAIN results. I think we should leave it in top level, throw the error and fix the clients. As I understood, the idea was, that it only does that if you press ^C or query timeout. In this case current clients would also not show the plan. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
It's worth noting that on Darwin (on Apple hardware) gettimeofday is
never a syscall, whereas on Linux (AFAIK) it always is.

On Jun 8, 2006, at 7:58 PM, Mark Kirkwood wrote:
> Tom Lane wrote:
> > Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > > Wow, that is slow. Maybe a problem in the kernel? Perhaps
> > > something similar to this:
> > > http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282
> >
> > Yeah, that's a pretty interesting thread. I came across something
> > similar on a Red Hat internal list. It seems there are three or four
> > different popular standards for clock hardware in the Intel world,
> > and some good implementations and some pretty bad implementations of
> > each. So the answer may well boil down to "if you're using cheap
> > junk PC hardware then gettimeofday will be slow".
>
> OS seems to matter as well - I've got two identical Supermicro P3TDER
> dual Intel boxes, one running FreeBSD 6.1, one running Gentoo Linux
> 2.6.16. Doing the 'select count(*)' vs 'explain analyze select
> count(*)' on a 10 row table gives:
>
> FreeBSD : select 108 ms, explain analyze 688 ms
> Linux   : select 100 ms, explain analyze 196 ms
>
> Both systems have ACPI enabled in BIOS (which means there is a better
> timecounter than 'i8254' available). FreeBSD says it's using
> 'ACPI-safe' - not sure how to check on Linux.

AgentM
[EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Have you searched our list archives?
http://archives.postgresql.org
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
> > Wow, that is slow. Maybe a problem in the kernel? Perhaps something
> > similar to this:
> > http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282
>
> Yeah, that's a pretty interesting thread. I came across something
> similar on a Red Hat internal list. It seems there are three or four
> different popular standards for clock hardware in the Intel world, and
> some good implementations and some pretty bad implementations of each.
> So the answer may well boil down to "if you're using cheap junk PC
> hardware then gettimeofday will be slow".

OS seems to matter as well - I've got two identical Supermicro P3TDER
dual Intel boxes, one running FreeBSD 6.1, one running Gentoo Linux
2.6.16. Doing the 'select count(*)' vs 'explain analyze select count(*)'
on a 10 row table gives:

FreeBSD : select 108 ms, explain analyze 688 ms
Linux   : select 100 ms, explain analyze 196 ms

Both systems have ACPI enabled in BIOS (which means there is a better
timecounter than 'i8254' available). FreeBSD says it's using 'ACPI-safe'
- not sure how to check on Linux.

Cheers
Mark

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Alvaro Herrera <[EMAIL PROTECTED]> writes: > Wow, that is slow. Maybe a problem in the kernel? Perhaps something > similar to this: > http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282 Yeah, that's a pretty interesting thread. I came across something similar on a Red Hat internal list. It seems there are three or four different popular standards for clock hardware in the Intel world, and some good implementations and some pretty bad implementations of each. So the answer may well boil down to "if you're using cheap junk PC hardware then gettimeofday will be slow". In fact, I just got around to trying this on my old x86 PC, and behold: regression=# select count(*) from tenk1; count --- 1 (1 row) Time: 9.670 ms regression=# explain analyze select count(*) from tenk1; QUERY PLAN Aggregate (cost=470.00..470.01 rows=1 width=0) (actual time=93.549..93.553 rows=1 loops=1) -> Seq Scan on tenk1 (cost=0.00..445.00 rows=1 width=0) (actual time=0.014..49.261 rows=1 loops=1) Total runtime: 93.663 ms (3 rows) Time: 94.431 ms regression=# So that says that this machine takes about 4 usec to do gettimeofday(), as compared to 3 usec on my nominally 4x slower HPPA machine. The new dual Xeon sitting next to it probably has a much less junky motherboard, with a better clock ... and I find it unsurprising that the HP and Apple machines I was trying aren't subject to such problems. I didn't get the impression from that linux-kernel thread that the proposed patch had actually gone in yet anyplace; anyone know how to track that? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Thu, Jun 08, 2006 at 05:32:36PM -0400, Tom Lane wrote: > "Jim C. Nasby" <[EMAIL PROTECTED]> writes: > > The powerbook tests were not very repeatable at 100,000 rows, so I > > bumped up to 1M. The results still aren't very repeatable... > > Hmm, I suspect you pushed it into the realm where it's doing actual > I/O ... which we don't want for this particular test. (Although it's > worth noting that if your query *is* doing actual I/O, the EXPLAIN > overhead certainly ought to be down in the noise.) Hmm... I guess it depends on how smart the OS cache is; the table is 36MB; 4406 pages. But shared buffers is 3000... Bumping shared_buffers up to 5k, there is a bigger difference, but these numbers are still all over the place, so I don't know that they're very trust-worthy. decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5398.625..5398.626 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.034..3967.927 rows=100 loops=1) Total runtime: 5398.871 ms (3 rows) Time: 5400.900 ms decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5626.671..5626.671 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.035..3875.641 rows=100 loops=1) Total runtime: 5626.904 ms (3 rows) Time: 5628.896 ms decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5137.549..5137.549 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.020..3440.034 rows=100 loops=1) Total runtime: 5137.707 ms (3 rows) Time: 5139.178 ms decibel=# select count(*) from i; count - 100 (1 row) Time: 4357.443 ms decibel=# select count(*) from i; count - 100 (1 row) Time: 4251.208 ms decibel=# select count(*) from i; count - 100 (1 row) Time: 4712.912 ms decibel=# select count(*) from i; count - 
100 (1 row) Time: 4479.278 ms decibel=# -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Just tested 2 Windows boxes, both running 8.1.3 and XP SP2. P4 2.8G desktop: 47ms / 297ms. Pentium M 2G laptop: 40ms / 240ms. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > The powerbook tests were not very repeatable at 100,000 rows, so I > bumped up to 1M. The results still aren't very repeatable... Hmm, I suspect you pushed it into the realm where it's doing actual I/O ... which we don't want for this particular test. (Although it's worth noting that if your query *is* doing actual I/O, the EXPLAIN overhead certainly ought to be down in the noise.) regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for > > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. > > > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > > you were looking for. > > Do you have oprofile installed on these? Comparing oprofile results > might give some more insight where the time is going. No, I don't. I'll have a look. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Thu, Jun 08, 2006 at 04:58:07PM -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > So the timing is clearly responsible for the additional time I'm > > personally experiencing and very likely to be that for others also. > > Well, that's certainly unsurprising, but the question is why it's such > a large overhead for you when it's not on other apparently-similar kit. > > If anyone else is still following this discussion, could you try the > test case Simon gave a bit earlier (select count(*) from 10-row > table, check \timing with and without explain analyze)? If we could get > a clearer picture of which platforms show the problem and which don't, > maybe we could make some progress. > > (BTW, I just tried it on my Apple Powerbook G4, and got results in line > with what I get elsewhere. So I've now tried it on three very different > CPU types and OSes, and not duplicated Simon's problem on any of them.) CPU: AMD Athlon(tm) 64 Processor 3500+ (2210.20-MHz K8-class CPU) 8.1.3 on amd64-portbld-freebsd6.0 25ms / 353ms CPU: AMD Opteron(tm) Processor 244 (1792.50-MHz K8-class CPU) (dual CPU) 8.1.4 on amd64-portbld-freebsd6.0 31ms / 295ms Powerbook G4 1.33GHz 8.1.4 on powerpc-apple-darwin8.6.0 5.1s / 5.8s * The powerbook tests were not very repeatable at 100,000 rows, so I bumped up to 1M. The results still aren't very repeatable... 
decibel=# select count(*) from i; count - 100 (1 row) Time: 4914.604 ms decibel=# select count(*) from i; count - 100 (1 row) Time: 5186.516 ms decibel=# select count(*) from i; count - 100 (1 row) Time: 5174.418 ms decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5729.623..5729.624 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.155..4039.317 rows=100 loops=1) Total runtime: 5729.907 ms (3 rows) Time: 5732.076 ms decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5916.025..5916.026 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.157..4246.123 rows=100 loops=1) Total runtime: 5916.261 ms (3 rows) Time: 5922.542 ms decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5800.788..5800.789 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.152..4126.781 rows=100 loops=1) Total runtime: 5801.026 ms (3 rows) Time: 5803.070 ms decibel=# explain analyze select count(*) from i; QUERY PLAN --- Aggregate (cost=16905.05..16905.06 rows=1 width=0) (actual time=5994.285..5994.286 rows=1 loops=1) -> Seq Scan on i (cost=0.00..14405.24 rows=24 width=0) (actual time=0.156..4341.463 rows=100 loops=1) Total runtime: 5994.520 ms (3 rows) Time: 5996.577 ms -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Alvaro Herrera <[EMAIL PROTECTED]> writes: > Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for > the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. > The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT > and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction > you were looking for. Do you have oprofile installed on these? Comparing oprofile results might give some more insight where the time is going. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane wrote: > If anyone else is still following this discussion, could you try the > test case Simon gave a bit earlier (select count(*) from 10-row > table, check \timing with and without explain analyze)? If we could get > a clearer picture of which platforms show the problem and which don't, > maybe we could make some progress. Tried on two machines. The first (Turion64 laptop) gives 44-45 ms for the SELECT, and 50-51 ms for the EXPLAIN ANALYZE. The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT and 788-790 for the EXPLAIN ANALYZE. I guess this is the reproduction you were looking for. All results are repeated multiple times, I'm reporting times other than the first query (which is slower). More specs on the Celeron machine follow: $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 6 model name : Celeron (Mendocino) stepping: 5 cpu MHz : 534.616 cache size : 128 KB fdiv_bug: no hlt_bug : no f00f_bug: no coma_bug: no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips: 1070.25 $ uname -a Linux drizzt 2.6.16-1-686 #2 Sat Apr 15 20:45:20 UTC 2006 i686 GNU/Linux -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: >> So the timing is clearly responsible for the additional time I'm >> personally experiencing and very likely to be that for others also. > > Well, that's certainly unsurprising, but the question is why it's such > a large overhead for you when it's not on other apparently-similar > kit. > > If anyone else is still following this discussion, could you try the > test case Simon gave a bit earlier (select count(*) from 10-row > table, check \timing with and without explain analyze)? If we could > get a clearer picture of which platforms show the problem and which > don't, maybe we could make some progress. > > (BTW, I just tried it on my Apple Powerbook G4, and got results in > line with what I get elsewhere. So I've now tried it on three very > different CPU types and OSes, and not duplicated Simon's problem on > any of them.) > $ psql Welcome to psql 8.1.4, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help with psql commands \g or terminate with semicolon to execute query \q to quit ler=# create table test1(id int); CREATE TABLE ler=# insert into test1 select * from generate_series(1,10); INSERT 0 10 ler=# \timing Timing is on. 
ler=# select count(*) from test1; count 10 (1 row) Time: 39.191 ms ler=# explain analyze select count(*) from test1; QUERY PLAN Aggregate (cost=1988.18..1988.19 rows=1 width=0) (actual time=282.899..282.901 rows=1 loops=1) -> Seq Scan on test1 (cost=0.00..1698.74 rows=115774 width=0) (actual time=0.007..147.845 rows=10 loops=1) Total runtime: 282.987 ms (3 rows) Time: 283.764 ms ler=# \q $ uname -a FreeBSD thebighonker.lerctr.org 6.1-STABLE FreeBSD 6.1-STABLE #59: Thu Jun 1 09:40:47 CDT 2006 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/THEBIGHONKER amd64 $ sysctl hw hw.machine: amd64 hw.model: Intel(R) Xeon(TM) CPU 3.00GHz hw.ncpu: 4 hw.byteorder: 1234 hw.physmem: 4286132224 hw.usermem: 4003151872 hw.pagesize: 4096 hw.floatingpoint: 1 hw.machine_arch: amd64 hw.realmem: 5368709120 [snip] the database is 8.1.4 -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 512-248-2683 E-Mail: ler@lerctr.org US Mail: 430 Valona Loop, Round Rock, TX 78681-3893 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs wrote: > postgres=# explain analyze select count(*) from accounts; >QUERY PLAN > > Aggregate (cost=2890.00..2890.01 rows=1 width=0) (actual > time=0.000..690.780 rows=1 loops=1) >-> Seq Scan on accounts (cost=0.00..2640.00 rows=10 width=0) > (actual time=0.000..2672.562 rows=10 loops=1) > Total runtime: 734.474 ms > (3 rows) > > Time: 891.822 ms Wow, that is slow. Maybe a problem in the kernel? Perhaps something similar to this: http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282 -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > So the timing is clearly responsible for the additional time I'm > personally experiencing and very likely to be that for others also. Well, that's certainly unsurprising, but the question is why it's such a large overhead for you when it's not on other apparently-similar kit. If anyone else is still following this discussion, could you try the test case Simon gave a bit earlier (select count(*) from 10-row table, check \timing with and without explain analyze)? If we could get a clearer picture of which platforms show the problem and which don't, maybe we could make some progress. (BTW, I just tried it on my Apple Powerbook G4, and got results in line with what I get elsewhere. So I've now tried it on three very different CPU types and OSes, and not duplicated Simon's problem on any of them.) regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Thu, 2006-06-08 at 12:56 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > So far we have myself, Kevin, Martijn and Luke all saying there is a > > distortion or a massive overhead caused by EXPLAIN ANALYZE. > > http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php > > http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php > > Given that we're seeing diametrically opposed results on the same OS > (FC5) and similar (at least all Intel) hardware, I think the prudent > thing is to find out what's really going on before leaping in with > proposed solutions. As the person who's *not* seeing the problem, > I'm not in a position to do that investigation... That seems reasonable. I've cut a patch to remove timing from the EA results. Output shown here: postgres=# set explain_analyze_timing = on; SET Time: 0.673 ms postgres=# explain analyze select count(*) from accounts; QUERY PLAN Aggregate (cost=2890.00..2890.01 rows=1 width=0) (actual time=0.000..690.780 rows=1 loops=1) -> Seq Scan on accounts (cost=0.00..2640.00 rows=10 width=0) (actual time=0.000..2672.562 rows=10 loops=1) Total runtime: 734.474 ms (3 rows) Time: 891.822 ms postgres=# set explain_analyze_timing = off; SET Time: 0.480 ms postgres=# explain analyze select count(*) from accounts; QUERY PLAN Aggregate (cost=2890.00..2890.01 rows=1 width=0) (rows=1 loops=1) -> Seq Scan on accounts (cost=0.00..2640.00 rows=10 width=0) (rows=10 loops=1) Total runtime: 133.674 ms (3 rows) Time: 134.565 ms postgres=# select count(*) from accounts; count 10 (1 row) Time: 130.528 ms So the timing is clearly responsible for the additional time I'm personally experiencing and very likely to be that for others also. As to why that should be the case, I'm not sure. The timing overhead seems fairly constant on particular hardware/OS, just different for each. 
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com Index: src/backend/commands/explain.c === RCS file: /projects/cvsroot/pgsql/src/backend/commands/explain.c,v retrieving revision 1.147 diff -c -r1.147 explain.c *** src/backend/commands/explain.c 8 Apr 2006 18:49:52 - 1.147 --- src/backend/commands/explain.c 8 Jun 2006 20:27:13 - *** *** 689,699 { double nloops = planstate->instrument->nloops; ! appendStringInfo(str, " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", 1000.0 * planstate->instrument->startup / nloops, 1000.0 * planstate->instrument->total / nloops, planstate->instrument->ntuples / nloops, planstate->instrument->nloops); } else if (es->printAnalyze) appendStringInfo(str, " (never executed)"); --- 689,704 { double nloops = planstate->instrument->nloops; ! if (planstate->instrument->actualtime) ! appendStringInfo(str, " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", 1000.0 * planstate->instrument->startup / nloops, 1000.0 * planstate->instrument->total / nloops, planstate->instrument->ntuples / nloops, planstate->instrument->nloops); + else + appendStringInfo(str, " (rows=%.0f loops=%.0f)", + planstate->instrument->ntuples / nloops, + planstate->instrument->nloops); } else if (es->printAnalyze) appendStringInfo(str, " (never executed)"); Index: src/backend/executor/instrument.c === RCS file: /projects/cvsroot/pgsql/src/backend/executor/instrument.c,v retrieving revision 1.17 diff -c -r1.17 instrument.c *** src/backend/executor/instrument.c 7 Jun 2006 18:49:03 - 1.17 --- src/backend/executor/instrument.c 8 Jun 2006 20:27:13 - *** *** 17,23 #include #include "executor/instrument.h" ! /* * As of PostgreSQL 8.2, we try to reduce the overhead of EXPLAIN ANALYZE --- 17,23 #include #include "executor/instrument.h" ! 
#include "utils/guc.h" /* * As of PostgreSQL 8.2, we try to reduce the overhead of EXPLAIN ANALYZE *** *** 115,120 --- 115,121 instr = palloc0(n * sizeof(Instrumentation)); /* we don't need to do any initialization except zero 'em */ + instr->actualtime = explain_analyze_timing; return instr; } *** *** 123,128 --- 124,132 void InstrStartNode(Instrumentation *instr) { + if (!instr->actualtime) + return; + if (INSTR_TIME_IS_ZERO(instr->starttime)) { /* *** *** 150,155 --- 154,166 { /* coun
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
Tom Lane wrote: > A possible objection to this is that running a query inside a > subtransaction might have different/worse performance than running it > at top level. I don't recall any severe bottlenecks of that kind but > that doesn't mean there aren't any (Alvaro, any comments?) Nope, nothing that I recall. Maybe make sure to generate the Xid of the subtransaction before starting the EXPLAIN itself (so that there is no pg_subtrans I/O pending, etc) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
Tom Lane <[EMAIL PROTECTED]> writes: > Gregory Stark <[EMAIL PROTECTED]> writes: > > Tom Lane <[EMAIL PROTECTED]> writes: > > Would it be possible to make a whole new protocol message for EXPLAIN > > results? > > I'm really unwilling to get into that. For one thing, that would > absolutely positively break *all* use of EXPLAIN from un-fixed clients, > whether you were trying to stop early or not. Well, I was picturing still having the manual EXPLAIN ANALYZE, which works as it does now (replacing the query results), in addition to a lower-level protocol request which generates a protocol response with the data. > > The reason I'm suggesting that is because it might make it easier to > > implement > > the SIGINFO handler that sends incremental EXPLAIN results on demand that I > > was describing earlier. > > Doesn't matter, because that's not happening ;-) SIGINFO isn't > portable, and even if it were, a signal handler couldn't possibly > generate EXPLAIN output (remember those catalog accesses). Well, the signal gets handled by psql. It would have to be a wire protocol message like Query Cancel that the backend watches for and handles the first time it can. -- greg ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
Gregory Stark <[EMAIL PROTECTED]> writes: > Tom Lane <[EMAIL PROTECTED]> writes: > Would it be possible to make a whole new protocol message for EXPLAIN results? I'm really unwilling to get into that. For one thing, that would absolutely positively break *all* use of EXPLAIN from un-fixed clients, whether you were trying to stop early or not. > The reason I'm suggesting that is because it might make it easier to implement > the SIGINFO handler that sends incremental EXPLAIN results on demand that I > was describing earlier. Doesn't matter, because that's not happening ;-) SIGINFO isn't portable, and even if it were, a signal handler couldn't possibly generate EXPLAIN output (remember those catalog accesses). regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
Tom Lane <[EMAIL PROTECTED]> writes: > After we've printed the results, we have a bit of a problem: if > ExplainCancelPending is set, we now need to abort the transaction. It > would not do at all to allow an incompletely executed UPDATE to commit. > I experimented with throwing an elog() at the bottom of ExplainQuery() > after end_tup_output(), but this does not work: psql only prints the > error and not the data, because libpq throws away the query result > upon seeing the error. We could probably hack psql to print the results > before noting the error, but I'm concerned about making a change that > would change the behavior for other error-at-end-of-statement cases. > Also, it's not clear what might have to happen to get non-psql clients > to play along. Would it be possible to make a whole new protocol message for EXPLAIN results? The reason I'm suggesting that is because it might make it easier to implement the SIGINFO handler that sends incremental EXPLAIN results on demand that I was describing earlier. Then libpq would have a separate api call to check for EXPLAIN results. If a non-psql client doesn't check it then it doesn't get the EXPLAIN results but the behaviour is correct. If it does then it can get the EXPLAIN results. This might also let you run with EXPLAIN ANALYZE instrumentation but still get the regular query results. Since the explain analyze results would still be available out of band. -- Gregory Stark T + 514 938 2456 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)
I wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: >> A full EXPLAIN ANALYZE is always desirable - we agree on that. The >> question is what we do when one is not available. > The least bad alternative I've heard is to let EXPLAIN ANALYZE print > out stats-so-far if the query is canceled by control-C or statement > timeout. The objection to this is you may mistake startup transients > for full query behavior ... but at least the numbers will be good as > far as they go. I thought some more about this, and it seems doable but tricky (ie, there are many wrong ways to do it). Here are my conclusions: We can't just use the normal QueryCancel logic that throws an elog(ERROR) from someplace down inside the query. This would leave the backend in an unclean state from which we could only certainly recover by doing AbortTransaction. And once we've aborted the transaction we can't do catalog accesses, which gets in the way of producing the EXPLAIN printout. Running the test query inside a subtransaction would fix that, but aborting the subtransaction would throw away the executor state, including the Instrumentation nodes we need. So it seems we need a way to stop the query without raising an error per se. What I'm thinking is that while EXPLAIN ANALYZE is running, timeout or SIGINT should not set QueryCancelPending, but instead set a separate flag "ExplainCancelPending", which we then test in ExecProcNode(), say:

    if (node->instrument)
    {
+       if (ExplainCancelPending)
+           return NULL;
        InstrStartNode(node->instrument);
    }

There might be one or two other places to check it, but basically we'd only notice the flag in very limited circumstances where it's definitely safe to force early termination of ExecutorRun. When control comes back to explain.c, we just print the results as normal (but probably adding a line explicitly noting that the query was abandoned before completion). 
Note that we won't have any "running" Instrumentation nodes to contend with, since the change doesn't cause nodes to drop out after they've started timing. So the data is good, just incomplete. After we've printed the results, we have a bit of a problem: if ExplainCancelPending is set, we now need to abort the transaction. It would not do at all to allow an incompletely executed UPDATE to commit. I experimented with throwing an elog() at the bottom of ExplainQuery() after end_tup_output(), but this does not work: psql only prints the error and not the data, because libpq throws away the query result upon seeing the error. We could probably hack psql to print the results before noting the error, but I'm concerned about making a change that would change the behavior for other error-at-end-of-statement cases. Also, it's not clear what might have to happen to get non-psql clients to play along. It seems like the best solution is to establish a subtransaction around the entire EXPLAIN command (not just the test query), which we can abort after we've printed the results. A possible objection to this is that running a query inside a subtransaction might have different/worse performance than running it at top level. I don't recall any severe bottlenecks of that kind but that doesn't mean there aren't any (Alvaro, any comments?) Note that this would mean that ending an EXPLAIN ANALYZE early, via either control-C or statement timeout, would be treated as a non-error situation from the point of view of the outer transaction. This bothers me a bit, because in fact the effects if any of the tested query would have been rolled back. Not sure we have any choice though. If we expose the error then we'll have problems with clients not showing the EXPLAIN results. Thoughts? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > So far we have myself, Kevin, Martijn and Luke all saying there is a > distortion or a massive overhead caused by EXPLAIN ANALYZE. > http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php > http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php The first of these shows the exact opposite of what you are claiming, and the second is without details of any sort that might help determine what the actual problem is. Given that we're seeing diametrically opposed results on the same OS (FC5) and similar (at least all Intel) hardware, I think the prudent thing is to find out what's really going on before leaping in with proposed solutions. As the person who's *not* seeing the problem, I'm not in a position to do that investigation... regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Thu, 2006-06-08 at 10:27 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote: > >> The overhead seems to be on the order of a couple tens of percent usually. > >> I don't see how that makes the difference between an EXPLAIN ANALYZE you > >> can run and one you can't. > > > Well, that's not my experience and doesn't match others posted on > > -hackers. > > > A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE > > to be consistently above 500% (or more than +400%, depending upon how > > you style those numbers). > > I think we ought to find out why your machine is so broken. > I'm too lazy to pull up any of my other machines right now, but this is > generally consistent with my experience ever since EXPLAIN ANALYZE was > written. Great. Well it isn't consistent with mine, or others who've posted to this list. > So: what's your platform exactly? FC5, Intel laptop running cvstip, new in January. But that's irrelevant. I'm not a user; I solve others' problems, as you know. Hence my interest in a usable tool to do that. So far we have myself, Kevin, Martijn and Luke all saying there is a distortion or a massive overhead caused by EXPLAIN ANALYZE. http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php It's real. I won't press the point further. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote: >> The overhead seems to be on the order of a couple tens of percent usually. >> I don't see how that makes the difference between an EXPLAIN ANALYZE you >> can run and one you can't. > Well, thats not my experience and doesn't match others posted on > -hackers. > A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE > to be consistently above 500% (or more than +400%, depending upon how > you style those numbers). I think we ought to find out why your machine is so broken. Even in this pretty-much-worst-case scenario (a seqscan does about as little real work per plan node call as possible, especially if the table is already fully cached), I don't see more than about a 2X degradation. On queries that are complicated enough to actually need EXPLAIN ANALYZE, it's not nearly that bad. Old slow HPUX/HPPA machine, PG 8.1 branch tip: bench=# \timing Timing is on. bench=# select count(*) from accounts; count 10 (1 row) Time: 543.565 ms -- do it again to ensure fully cached bench=# select count(*) from accounts; count 10 (1 row) Time: 492.667 ms bench=# explain analyze select count(*) from accounts; QUERY PLAN --- Aggregate (cost=2975.00..2975.01 rows=1 width=0) (actual time=1172.856..1172.8 60 rows=1 loops=1) -> Seq Scan on accounts (cost=0.00..2725.00 rows=10 width=0) (actual ti me=0.175..720.741 rows=10 loops=1) Total runtime: 1173.290 ms (3 rows) Time: 1176.293 ms bench=# Spiffy new Fedora 5/dual Xeon machine, PG 8.1 branch tip: bench=# \timing Timing is on. 
bench=# select count(*) from accounts;
 count
 10
(1 row)

Time: 61.737 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
 10
(1 row)

Time: 53.941 ms
bench=# explain analyze select count(*) from accounts;
 QUERY PLAN
--
 Aggregate (cost=2975.00..2975.01 rows=1 width=0) (actual time=117.881..117.882 rows=1 loops=1)
   -> Seq Scan on accounts (cost=0.00..2725.00 rows=10 width=0) (actual time=0.041..77.628 rows=10 loops=1)
 Total runtime: 117.936 ms
(3 rows)

Time: 118.510 ms
bench=#

I'm too lazy to pull up any of my other machines right now, but this is generally consistent with my experience ever since EXPLAIN ANALYZE was written.

So: what's your platform exactly?

regards, tom lane

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
> >> Certainly the removal of timing
> >> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
> >> acceptable one;
> > I disagree, as have others.
> The overhead seems to be on the order of a couple tens of percent usually.
> I don't see how that makes the difference between an EXPLAIN ANALYZE you
> can run and one you can't.

Well, that's not my experience and doesn't match others posted on -hackers.

A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE to be consistently above 500% (or more than +400%, depending upon how you style those numbers). This is for in-shared-buffers data, following cache priming. Test shown below is a typical result from 5 tests:

> postgres=# select count(*) from accounts;
> count
> 10
> (1 row)
>
> Time: 267.008 ms
> postgres=# explain analyze select count(*) from accounts;
> QUERY PLAN
> ---
> Aggregate (cost=10945.00..10945.01 rows=1 width=0) (actual time=1497.830..1497.837 rows=1 loops=1)
>   -> Seq Scan on accounts (cost=0.00..9998.20 rows=378720 width=0) (actual time=0.078..828.455 rows=10 loops=1)
> Total runtime: 1497.954 ms
> (3 rows)
>
> Time: 1498.983 ms

Other timings were: 1493 1498 1707 1814 1827

EXPLAIN ANALYZE is designed to be run *when* you have unacceptable run times and need to find out why. So making the execution time even more unacceptable makes the utility infeasible at the time you need it most. The additional run-time occurs on people's production systems, so they generally aren't happy running long tasks.

BTW I think EA is great - so good in fact I want to run it more often.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, Jun 07, 2006 at 05:05:23PM -0400, Tom Lane wrote:
> The other thing that was bothering me was whether disk I/O might be
> undercounted by an interrupt-driven method. I kinda doubt that any
> kernel will save up N interrupts that occur while the process is blocked
> on a slow read() ... you'll probably get only one. So the whole idea
> may be unworkable.

True, but if you get to the point where you're waiting on I/O, I would think you could afford a gettimeofday() call. Even if the block we need is in the OS cache, it's possible that the overhead of getting it from there is high enough that the gettimeofday() call won't matter.

FWIW, it looks like running a much larger setting for shared_buffers (like, 50% of memory) is a good way to boost performance, and that configuration would make it less likely that you'd do a gettimeofday just to pull data out of the OS cache.

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED]
Pervasive Software http://pervasive.com  work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote: >> Certainly the removal of timing >> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an >> acceptable one; > I disagree, as have others. The overhead seems to be on the order of a couple tens of percent usually. I don't see how that makes the difference between an EXPLAIN ANALYZE you can run and one you can't. > A full EXPLAIN ANALYZE is always desirable - we agree on that. The > question is what we do when one is not available. The least bad alternative I've heard is to let EXPLAIN ANALYZE print out stats-so-far if the query is canceled by control-C or statement timeout. The objection to this is you may mistake startup transients for full query behavior ... but at least the numbers will be good as far as they go. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
> >> I propose we revert this patch and think about an interrupt-driven
> >> sampling method instead.
> > I don't have much more faith in crazy scheme No.2 either. (Mine or
> > yours...)
> > Can we just have an option to avoid the timing altogether, please? I
> > don't want to have long discussions about instrumentation, I just want a
> > reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
> > that we never, ever have to doubt whether the sampling works correctly
> > on a Miasmic-367 with HyperKooling.
> Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine.

I respect everybody's opinion, yours doubly so, as you know. But with this current opinion, on this occasion, I believe you are in the minority of those who have spent some time asking for EXPLAIN ANALYZEs and have not received them because the run time is unacceptable.

> People have been hyperventilating about the timing overhead but I think
> that it's perfectly acceptable as-is.

The phrase "hyperventilating" is subjective. From my side, I could use the phrase "explaining-the-obvious" and potentially cause rancor also. We should be able to discuss things without that.

> Certainly the removal of timing
> is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
> acceptable one;

I disagree, as have others.

> what it *is* likely to do is let you be misled about
> which part of the query is the problem.

A full EXPLAIN ANALYZE is always desirable - we agree on that. The question is what we do when one is not available. Guessing leaves you much more open to being misled. I'd like an option that we can take when a full EXPLAIN ANALYZE is not possible, one that doesn't give different results on different CPUs/OSes. We need this to help people who are in difficulty.
This isn't a vanity feature or a they-do-it-so-we-should-also. It's for support. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, Jun 07, 2006 at 04:04:33PM -0400, Tom Lane wrote:
> Martijn van Oosterhout writes:
> > Anyway, as a test, if you take the approach that the measurement at
> > item X only applies to the tuples immediately preceding it, for the
> > data you posted you get a result of 0.681148 seconds. How long did that
> > query run that produced that data?
> The above idea won't fix it anyway, only move the failure cases around.

Well, if we're trying to make a system that never ever produces strange-looking results, then yes, we have to drop it. No matter how you sample, there's going to be a failure mode somewhere.

If you are prepared to live with a certain margin (it will be within X% of the real value Y% of the time) then I think that's an achievable goal (I'm not saying that's necessarily what we have now). You could always give people the choice of disabling sampling if it looks weird, but that just gives people more knobs to twiddle and get upset about.

Have a nice day,
-- Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.

signature.asc Description: Digital signature
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote: >> timer interrupt routine does this once every few milliseconds: > One issue is that on very fast queries, you'd get absolutely no data > this way. Yeah. Of course, on "very fast" queries you don't get very good data from the existing method either --- with a machine fast enough to have sub-microsecond plan node execution times, the accuracy of gettimeofday really isn't good enough. The other thing that was bothering me was whether disk I/O might be undercounted by an interrupt-driven method. I kinda doubt that any kernel will save up N interrupts that occur while the process is blocked on a slow read() ... you'll probably get only one. So the whole idea may be unworkable. At the moment I think we need to revert the recent patch and go back to the drawing board. Improving on the original implementation of EXPLAIN ANALYZE is clearly not as easy as it looks. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote: >> I propose we revert this patch and think about an interrupt-driven >> sampling method instead. > I don't have much more faith in crazy scheme No.2 either. (Mine or > yours...) > Can we just have an option to avoid the timing altogether, please? I > don't want to have long discussions about instrumentation, I just want a > reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one > that we never, ever have to doubt whether the sampling works correctly > on a Miasmic-367 with HyperKooling. Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine. People have been hyperventilating about the timing overhead but I think that it's perfectly acceptable as-is. Certainly the removal of timing is not going to convert an intolerable EXPLAIN ANALYZE runtime into an acceptable one; what it *is* likely to do is let you be misled about which part of the query is the problem. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
> Ah-hah, I've sussed it.
> so I'm
> afraid this approach to sampling EXPLAIN ANALYZE is a failure.

Hmmm, I thought we already got that bit... sorry to sound negative.

> I propose we revert this patch and think about an interrupt-driven
> sampling method instead.

I don't have much more faith in crazy scheme No.2 either. (Mine or yours...)

Can we just have an option to avoid the timing altogether, please? I don't want to have long discussions about instrumentation, I just want a reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one that we never, ever have to doubt whether the sampling works correctly on a Miasmic-367 with HyperKooling.

You could lose a month on -perform going into the details of this for everybody - this was supposed to be a simple additional feature. If you're set on the sampling, great, but can we have the option to avoid it completely also?

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote:
> timer interrupt routine does this once every few milliseconds:

One issue is that on very fast queries, you'd get absolutely no data this way. A possible solution would be to start with an extremely short timer interval, and ramp it up if the query runs longer. The downside is that you'd need to somehow re-scale iterations every time the timer interval changed. Of course another option is to start off using the gettimeofday() method and switch to sampling after X seconds, but it'd be nice if both code paths weren't needed.

> The bubble-up of sample counts to parent nodes could perhaps be done
> while printing the results instead of on-the-fly as sketched above, but
> the above seems simpler.

It'd be nice if there were an option to not aggregate child runtimes into their parents at all, since it'd make spotting hot spots much easier.

-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED]
Pervasive Software http://pervasive.com  work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf  cell: 512-569-9461

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes:
> Anyway, as a test, if you take the approach that the measurement at
> item X only applies to the tuples immediately preceding it, for the
> data you posted you get a result of 0.681148 seconds. How long did that
> query run that produced that data?

I didn't save the corresponding printout unfortunately, but it was probably pretty similar to this:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 QUERY PLAN
--
 Aggregate (cost=2609.00..2609.01 rows=1 width=0) (actual time=869.395..869.399 rows=1 loops=1)
   -> Limit (cost=825.00..2484.00 rows=1 width=488) (actual time=248.640..3368.313 rows=1 loops=1)
      -> Hash Join (cost=825.00..2484.00 rows=1 width=488) (actual time=248.609..2983.528 rows=1 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         -> Seq Scan on tenk1 a (cost=0.00..458.00 rows=1 width=244) (actual time=0.084..21.525 rows=1 loops=1)
         -> Hash (cost=458.00..458.00 rows=1 width=244) (actual time=248.269..248.269 rows=1 loops=1)
            -> Seq Scan on tenk1 b (cost=0.00..458.00 rows=1 width=244) (actual time=0.025..22.760 rows=1 loops=1)
 Total runtime: 877.265 ms
(8 rows)

Time: 888.469 ms
regression=#

The above idea won't fix it anyway, only move the failure cases around.

regards, tom lane

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, Jun 07, 2006 at 03:32:35PM -0400, Tom Lane wrote:
> On reflection it's easy to imagine other cases where some iterations
> take much longer than others in a not-very-predictable way. For
> instance, a join where only a subset of the outer tuples have a match
> is going to act that way. I don't think there's any good way that we
> can be sure we have a representative sample of executions, and so I'm
> afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I don't think we ever assumed it would never be a problem. We just assumed that the sampling would cancel the effect out to give a decent average. Thing is, I never expected to get a three-order-of-magnitude difference between samples. That's just far too much to be corrected in any way. The random sampling should counter most effects, and I didn't consider the one tuple in a million that takes much longer to be a particularly common case.

Anyway, as a test, if you take the approach that the measurement at item X only applies to the tuples immediately preceding it, for the data you posted you get a result of 0.681148 seconds. How long did that query run that produced that data?

(The bit of perl I used is:
cat data | perl -lne 'BEGIN { $last=0; $sum=0 } /: (\d+) usec in iter (\d+)/ && do { $sum += ($2-$last)*$1; $last=$2 }; END { print "$sum\n" }' )

> I propose we revert this patch and think about an interrupt-driven
> sampling method instead.

That's another possibility of course...

-- Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.

signature.asc Description: Digital signature
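For readers who don't speak perl, Martijn's one-liner can be transcribed into a few lines of Python. This is only a sketch of the same back-attribution arithmetic: the log format matches Tom's instrumentation dump (`<node> <level>: <usec> usec in iter <n>`), but the function name and the sample data below are invented for illustration.

```python
import re

def estimate_runtime(lines):
    """Sum sampled iteration times, crediting each sample's duration to
    every iteration skipped since the previous sample (the back-attribution
    scheme from the perl one-liner). Returns total estimated usec."""
    total_usec = 0
    last_iter = 0
    for line in lines:
        m = re.search(r": (\d+) usec in iter (\d+)", line)
        if m:
            usec, it = int(m.group(1)), int(m.group(2))
            total_usec += usec * (it - last_iter)  # ($2-$last)*$1
            last_iter = it
    return total_usec

# Invented excerpt in the dump's format; iter 4 was not sampled, so the
# 99 usec measured at iter 5 is credited to both iters 4 and 5.
samples = [
    "401489a0 1: 331616 usec in iter 1",
    "401489a0 1: 110338 usec in iter 2",
    "401489a0 1: 54 usec in iter 3",
    "401489a0 1: 99 usec in iter 5",
]
print(estimate_runtime(samples))  # 331616 + 110338 + 54 + 99*2 = 442206
```

Note how the huge iteration-2 measurement still dominates the estimate: back-attribution only helps when the outliers are *not* among the sampled iterations' extrapolation base.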
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Ah-hah, I've sussed it. The faulty assumption can actually be stated as: "all the executions, except maybe the first, will take approximately the same amount of time". The failing test case I've been looking at is one where the system decides to use a "batched" hash join, and in this plan type some iterations take much longer than others. (The apparent dependence on targetlist width is now easy to understand, because that affects the estimated hashtable size and hence the decision whether batching is needed. I'm not sure why I don't see the effect when running the identical case on my other machine, but since the other one is a 64-bit machine its space calculations are probably a bit different.) I added some printf's to instrument.c to print out the actual time measurements for each sample, as well as the calculations in InstrEndLoop. Attached are the printouts that occurred for the HashJoin node. The thing that is killing the extrapolation is of course the very large time for iteration 2, which the extrapolation includes in its average. But there's well over 10:1 variation in the later samples as well. On reflection it's easy to imagine other cases where some iterations take much longer than others in a not-very-predictable way. For instance, a join where only a subset of the outer tuples have a match is going to act that way. I don't think there's any good way that we can be sure we have a representative sample of executions, and so I'm afraid this approach to sampling EXPLAIN ANALYZE is a failure. I propose we revert this patch and think about an interrupt-driven sampling method instead. 
regards, tom lane 401489a0 1: 331616 usec in iter 1 401489a0 1: 110338 usec in iter 2 401489a0 1: 54 usec in iter 3 401489a0 1: 99 usec in iter 4 401489a0 1: 77 usec in iter 5 401489a0 1: 145 usec in iter 6 401489a0 1: 117 usec in iter 7 401489a0 1: 33 usec in iter 8 401489a0 1: 97 usec in iter 9 401489a0 1: 98 usec in iter 10 401489a0 1: 52 usec in iter 11 401489a0 1: 33 usec in iter 12 401489a0 1: 51 usec in iter 13 401489a0 1: 83 usec in iter 14 401489a0 1: 153 usec in iter 15 401489a0 1: 115 usec in iter 16 401489a0 1: 52 usec in iter 17 401489a0 1: 242 usec in iter 18 401489a0 1: 48 usec in iter 19 401489a0 1: 87 usec in iter 20 401489a0 1: 23 usec in iter 21 401489a0 1: 80 usec in iter 22 401489a0 1: 57 usec in iter 23 401489a0 1: 17 usec in iter 24 401489a0 1: 51 usec in iter 25 401489a0 1: 18 usec in iter 26 401489a0 1: 16 usec in iter 27 401489a0 1: 100 usec in iter 28 401489a0 1: 45 usec in iter 29 401489a0 1: 174 usec in iter 30 401489a0 1: 131 usec in iter 31 401489a0 1: 17 usec in iter 32 401489a0 1: 45 usec in iter 33 401489a0 1: 16 usec in iter 34 401489a0 1: 120 usec in iter 35 401489a0 1: 15 usec in iter 36 401489a0 1: 17 usec in iter 37 401489a0 1: 15 usec in iter 38 401489a0 1: 48 usec in iter 39 401489a0 1: 127 usec in iter 40 401489a0 1: 36 usec in iter 41 401489a0 1: 41 usec in iter 42 401489a0 1: 69 usec in iter 43 401489a0 1: 50 usec in iter 44 401489a0 1: 104 usec in iter 45 401489a0 1: 22 usec in iter 46 401489a0 1: 50 usec in iter 47 401489a0 1: 17 usec in iter 48 401489a0 1: 47 usec in iter 49 401489a0 1: 54 usec in iter 50 401489a0 1: 46 usec in iter 51 401489a0 1: 20 usec in iter 54 401489a0 1: 38 usec in iter 55 401489a0 1: 68 usec in iter 56 401489a0 1: 17 usec in iter 60 401489a0 1: 16 usec in iter 61 401489a0 1: 15 usec in iter 67 401489a0 1: 31 usec in iter 68 401489a0 1: 15 usec in iter 70 401489a0 1: 61 usec in iter 78 401489a0 1: 143 usec in iter 85 401489a0 1: 21 usec in iter 89 401489a0 1: 14 usec in iter 96 401489a0 1: 21 
usec in iter 104 401489a0 1: 21 usec in iter 107 401489a0 1: 16 usec in iter 116 401489a0 1: 194 usec in iter 118 401489a0 1: 136 usec in iter 122 401489a0 1: 34 usec in iter 127 401489a0 1: 46 usec in iter 131 401489a0 1: 15 usec in iter 133 401489a0 1: 15 usec in iter 135 401489a0 1: 34 usec in iter 137 401489a0 1: 54 usec in iter 142 401489a0 1: 206 usec in iter 151 401489a0 1: 75 usec in iter 162 401489a0 1: 20 usec in iter 172 401489a0 1: 66 usec in iter 177 401489a0 1: 21 usec in iter 181 401489a0 1: 69 usec in iter 186 401489a0 1: 16 usec in iter 193 401489a0 1: 46 usec in iter 201 401489a0 1: 33 usec in iter 210 401489a0 1: 50 usec in iter 216 401489a0 1: 21 usec in iter 222 401489a0 1: 18 usec in iter 224 401489a0 1: 33 usec in iter 229 401489a0 1: 20 usec in iter 232 401489a0 1: 44 usec in iter 236 401489a0 1: 29 usec in iter 239 401489a0 1: 34 usec in iter 240 401489a0 1: 31 usec in iter 241 401489a0 1: 27 usec in iter 254 401489a0 1: 45 usec in iter 257 401489a0 1: 147 usec in iter 259 401489a0 1: 15 usec in iter 269 401489a0 1: 16 usec in iter 278 401489a0 1: 14 usec in iter 279 401489a0 1: 58 usec in iter 290 401489a0 1: 15 usec in iter 291 401489a0 1: 53 usec in iter 295 401489a0 1: 15 usec in iter 306 401489a0 1: 16 usec in iter 318 401489a0 1: 34 usec in iter 328 401489a0 1: 37 usec in iter 339 401489a0 1:
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Greg Stark <[EMAIL PROTECTED]> writes: > Tom Lane <[EMAIL PROTECTED]> writes: >> Nope, that is certainly not the explanation, because the hash table is >> loaded in the (single) call of the Hash node at the start of the query. >> It is static all through the sampled-and-not executions of the Hash Join >> node, which is where our problem is. > At the risk of asking a stupid question, it's not perchance including that > hash build in the first sample it takes of the hash join node? Sure. Which is one of the reasons why the first tuple is excluded from the extrapolation... regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane <[EMAIL PROTECTED]> writes: > Nope, that is certainly not the explanation, because the hash table is > loaded in the (single) call of the Hash node at the start of the query. > It is static all through the sampled-and-not executions of the Hash Join > node, which is where our problem is. At the risk of asking a stupid question, it's not perchance including that hash build in the first sample it takes of the hash join node? -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Greg Stark <[EMAIL PROTECTED]> writes: > Martijn van Oosterhout writes: >> This would mean that we wouldn't be assuming that tuples near the end >> take as long as tuples near the beginning. Except we're now dealing >> will smaller numbers, so I'm worried about error accumlation. > Hm, that would explain why Hash joins suffer from this especially. Even when > functioning properly hashes get slower as the buckets fill up and there are > longer lists to traverse. Nope, that is certainly not the explanation, because the hash table is loaded in the (single) call of the Hash node at the start of the query. It is static all through the sampled-and-not executions of the Hash Join node, which is where our problem is. I don't see that Martijn's idea responds to the problem anyway, if it is some sort of TLB-related issue. The assumption we are making is not "tuples near the end take as long as tuples near the beginning", it is "tuples we sample take as long as tuples we don't" (both statements of course meaning "on the average"). If the act of sampling incurs overhead beyond the gettimeofday() call itself, then we are screwed, and playing around with which iterations we sample and how we do the extrapolation won't make the slightest bit of difference. I'm unsure about the TLB-flush theory because I see no evidence of any such overhead in the 8.1 timings; but on the other hand it's hard to see what else could explain the apparent dependence on targetlist width. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes:
> This would mean that we wouldn't be assuming that tuples near the end
> take as long as tuples near the beginning. Except we're now dealing
> with smaller numbers, so I'm worried about error accumulation.

Hm, that would explain why Hash joins suffer from this especially. Even when functioning properly hashes get slower as the buckets fill up and there are longer lists to traverse.

Perhaps the hashes are suffering inordinately from collisions though. Some of the data type hash functions looked kind of suspect when I peeked at them a while back.

-- greg

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes:
> On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
>> If we do have to revert, I'd propose that we pursue the notion of
>> interrupt-driven sampling like gprof uses.
> How would that work? You could then estimate how much time was spent in
> each node, but you no longer have any idea about when they were entered
> or left.

Hm? It would work the same way gprof works. I'm imagining something along the lines of

global variable:

	volatile Instrumentation *current_instr = NULL;

also add an "Instrumentation *parent_instr" field to Instrumentation

InstrStartNode does:

	instr->parent_instr = current_instr;
	current_instr = instr;

InstrStopNode restores the previous value of the global:

	current_instr = instr->parent_instr;

timer interrupt routine does this once every few milliseconds:

	total_samples++;
	for (instr = current_instr; instr; instr = instr->parent_instr)
		instr->samples++;

You still count executions and tuples in InstrStartNode/InstrStopNode, but you never call gettimeofday there. You *do* call gettimeofday at beginning and end of the whole query to measure the total runtime, and then you blame a fraction samples/total_samples of that on each node.

The bubble-up of sample counts to parent nodes could perhaps be done while printing the results instead of on-the-fly as sketched above, but the above seems simpler.

regards, tom lane

---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
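The sketch above maps directly onto the Unix profiling timer. Purely as an illustration (this is not PostgreSQL code), here is a toy Python rendering of the same idea: a chain of instrumentation nodes with parent links, a SIGPROF handler standing in for the timer interrupt, and the measured wall-clock runtime apportioned by sample counts. All names (`Instr`, `tick`, the 2 ms period) are invented for the sketch; a real implementation would live in instrument.c and would have to worry about signal safety.

```python
import signal
import time

class Instr:
    """One plan node's instrumentation: a sample count plus a parent link."""
    def __init__(self, name, parent=None):
        self.name, self.parent, self.samples = name, parent, 0

current_instr = None   # innermost node currently executing
total_samples = 0

def tick(signum, frame):
    # "timer interrupt routine": credit one sample to the active node and,
    # bubbling up on the fly, to each of its ancestors
    global total_samples
    total_samples += 1
    node = current_instr
    while node is not None:
        node.samples += 1
        node = node.parent

root = Instr("Aggregate")
scan = Instr("Seq Scan", parent=root)

signal.signal(signal.SIGPROF, tick)
current_instr = root                                 # "InstrStartNode" on the root
signal.setitimer(signal.ITIMER_PROF, 0.002, 0.002)   # fire every ~2 ms of CPU time

start = time.perf_counter()
for _ in range(300):
    current_instr = scan                  # enter the child node
    sum(i * i for i in range(20_000))     # stand-in for per-tuple work
    current_instr = root                  # leave the child node
elapsed = time.perf_counter() - start
signal.setitimer(signal.ITIMER_PROF, 0)  # disarm the timer

# blame a fraction samples/total_samples of the runtime on each node
for node in (root, scan):
    share = node.samples / total_samples if total_samples else 0.0
    print(f"{node.name}: {node.samples} samples, ~{share * elapsed * 1000:.1f} ms")
```

Because every sample bubbles up to the root, the root's share equals the whole runtime, and the child's share approximates time spent inside it, with no per-tuple clock reads at all.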
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > Do we agree the idea can't go further? What next?
> It still needs investigation; I'm no longer convinced that the TLB-flush
> theory is correct. See rest of thread. We may well have to revert the
> current patch, but I'd like to be sure we understand why.

One thing I'm thinking of trying is, instead of assuming we can estimate the durations of all the nodes by taking the total time divided by samples, to assume that the duration of tuple X is similar to that of tuple X+1 but not necessarily the same as all other tuples. This moves the calculation from EndLoop to StopInstr. Basically in StopInstr you do the steps:

if( sampling )
{
  x = get time for this tuple
  n = number of tuples skipped
  cumulativetime += x*n
}

This would mean that we wouldn't be assuming that tuples near the end take as long as tuples near the beginning. Except we're now dealing with smaller numbers, so I'm worried about error accumulation.

> If we do have to revert, I'd propose that we pursue the notion of
> interrupt-driven sampling like gprof uses.

How would that work? You could then estimate how much time was spent in each node, but you no longer have any idea about when they were entered or left.

Have a nice day,
-- Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.

signature.asc Description: Digital signature
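To see what that change buys, here is a small synthetic comparison. All the numbers are invented, loosely echoing Tom's dump (one ~110 ms iteration among thousands of ~50 usec ones, with the early iterations always sampled as in the real instrumentation). It contrasts extrapolating the sampled mean to every iteration with the `x*n` back-attribution proposed above; this is a sketch of the two estimators, not the actual instrument.c arithmetic.

```python
def back_attribute(times, sampled):
    """Credit each sampled duration to the iterations since the last sample
    (the proposed cumulativetime += x*n scheme)."""
    cumulative, last = 0, 0
    for i in sorted(sampled):
        cumulative += times[i - 1] * (i - last)  # x * n
        last = i
    return cumulative

def extrapolate(times, sampled):
    """Mean of the sampled durations scaled up to the full iteration count."""
    mean = sum(times[i - 1] for i in sampled) / len(sampled)
    return mean * len(times)

# 1000 iterations of 50 usec each, with a 110 ms outlier at iteration 2;
# iterations 1-2 are always sampled, then every 10th iteration after that
times = [50] * 1000
times[1] = 110_000
sampled = {1, 2} | set(range(10, 1001, 10))

print("actual      :", sum(times))                    # 159950 usec
print("extrapolated:", round(extrapolate(times, sampled)))
print("back-attrib :", back_attribute(times, sampled))
```

With these numbers the mean-based extrapolation blames the iteration-2 outlier on all 1000 iterations and overshoots by roughly 7x, while the back-attribution lands on the true total, since the outlier is charged only to the single iteration where it was seen.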
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Simon Riggs <[EMAIL PROTECTED]> writes: > Do we agree the idea can't go further? What next? It still needs investigation; I'm no longer convinced that the TLB-flush theory is correct. See rest of thread. We may well have to revert the current patch, but I'd like to be sure we understand why. If we do have to revert, I'd propose that we pursue the notion of interrupt-driven sampling like gprof uses. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Tue, 2006-06-06 at 16:50 -0400, Tom Lane wrote:
> I have a theory about this, and it's not pleasant at all. What I
> think is that we have a Heisenberg problem here: the act of invoking
> gettimeofday() actually changes what is measured.
> If this theory is correct, then the entire notion of EXPLAIN ANALYZE
> sampling has just crashed and burned. We can't ship a measurement
> tool that is accurate on some platforms and not others.

Regrettably, I would agree, and so conclude that we shouldn't pursue the sampling idea further. Heisenbugs suck time like no other. Interesting, though.

That leaves us with a number of possibilities:

0. Do Nothing
1. Option to skip timing altogether on an EXPLAIN ANALYZE
2. Option to produce a partial execution only, to locate problem areas.

Any others?

Option 2 would be harder to interpret, but still useful - originally discussed in a current thread on -perform. Option 1 wouldn't be as useful as the original sampling idea, but if that's not on the table any longer I'd revert back to Option 1 as being the best choice for further work.

Do we agree the idea can't go further? What next?

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Greg Stark <[EMAIL PROTECTED]> writes: > Tom Lane <[EMAIL PROTECTED]> writes: >> And two, that upper plan nodes seem much more affected than lower >> ones. That makes sense because the execution cycle of an upper node >> will involve touching more userspace data than a lower node, and >> therefore more of the flushed TLB entries will need to be reloaded. > I would have expected the opposite effect. If you only execute one instruction > then the cache miss can make it take many times longer than normal. But as the > number of instructions grows the cache gets repopulated and the overhead > levels off and becomes negligible relative to the total time. Well, none of our plan nodes are in the "one instruction" regime ;-). I was thinking that the total volume of data accessed was the critical factor. Right at the moment I'm disillusioned with the TLB-access theory though. Something I'm noticing right now is that it seems like only hash joins are really seriously misestimated --- nest and merge joins have some small issues but only the hash is way out there. What's going on?? Can anyone else reproduce this? > The other option aside from gprof-like profiling would be to > investigate those cpu timing instructions again. I know some of them > are unsafe on multi-cpu systems but surely there's a solution out > there. It's not like there aren't a million games, music playing, and > other kewl kid toys that depend on accurate low overhead timing these > days. Yeah, and they all work only on Windoze and Intel chips :-( regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Tom Lane <[EMAIL PROTECTED]> writes:
> And two, that upper plan nodes seem much more affected than lower
> ones. That makes sense because the execution cycle of an upper node
> will involve touching more userspace data than a lower node, and
> therefore more of the flushed TLB entries will need to be reloaded.

I would have expected the opposite effect. If you only execute one instruction then the cache miss can make it take many times longer than normal. But as the number of instructions grows the cache gets repopulated and the overhead levels off and becomes negligible relative to the total time.

The other option aside from gprof-like profiling would be to investigate those cpu timing instructions again. I know some of them are unsafe on multi-cpu systems but surely there's a solution out there. It's not like there aren't a million games, music playing, and other kewl kid toys that depend on accurate low overhead timing these days.

-- greg

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes:
> This doesn't make any sense at all. How can a sampling run that only
> sampled 7% of the actual tuples end up with an actual measured time
> that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact taking longer than supposedly-equivalent unsampled ones, and by a pretty good percentage too. I'm growing unsure again about the mechanism responsible for that, however.

> Can you get an estimate of the actual overhead (ie compare wall clock
> time of EXPLAIN ANALYZE vs actual query)? The cost of gettimeofday() is
> on the order of a few hundred cycles; I'm not sure about the cost of
> TLB flushes (that's probably highly architecture dependent).

Here are some examples. Keep in mind I've already determined that gettimeofday() takes about 3 usec on this hardware ...

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate (cost=2609.00..2609.01 rows=1 width=0) (actual time=797.412..797.416 rows=1 loops=1)
   -> Limit (cost=825.00..2484.00 rows=1 width=488) (actual time=208.208..2576.528 rows=1 loops=1)
         -> Hash Join (cost=825.00..2484.00 rows=1 width=488) (actual time=208.190..2082.577 rows=1 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               -> Seq Scan on tenk1 a (cost=0.00..458.00 rows=1 width=244) (actual time=0.082..3.718 rows=1 loops=1)
               -> Hash (cost=458.00..458.00 rows=1 width=244) (actual time=207.933..207.933 rows=1 loops=1)
                     -> Seq Scan on tenk1 b (cost=0.00..458.00 rows=1 width=244) (actual time=0.017..3.583 rows=1 loops=1)
 Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms

regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 count
-------
     1
(1 row)

Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite a bit, probably because of the random variation we're using in sampling interval. 
This particular combination was unusually close. But in any case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small here; the problem is that we're incorrectly extrapolating the measured runtime to the unmeasured executions.

What's especially interesting is that the excess time doesn't seem to show up if I form the query in a way that doesn't require pushing as much data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1341.00..1341.01 rows=1 width=0) (actual time=212.313..212.317 rows=1 loops=1)
   -> Hash Join (cost=483.00..1316.00 rows=1 width=0) (actual time=88.061..160.886 rows=1 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         -> Seq Scan on tenk1 a (cost=0.00..458.00 rows=1 width=4) (actual time=0.071..4.068 rows=1 loops=1)
         -> Hash (cost=458.00..458.00 rows=1 width=4) (actual time=87.862..87.862 rows=1 loops=1)
               -> Seq Scan on tenk1 b (cost=0.00..458.00 rows=1 width=4) (actual time=0.031..4.780 rows=1 loops=1)
 Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms

regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
 count
-------
     1
(1 row)

Time: 202.531 ms
regression=#

(Without the OFFSET 0, the planner flattens the subquery and discovers that it doesn't actually need to fetch any of the non-join-key table columns.)

Note the only plan nodes showing whacked-out timings are the ones returning wide tuples (large "width" values). I'm not entirely sure what to make of this. It could be interpreted as evidence for my theory about TLB reloads during userspace data access being the problem. But I'm getting a bit disenchanted with that theory after running the same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate (cost=2609.00..2609.01 rows=1 width=0) (actual time=1033.866..1033.870 rows=1 loo
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
> But that still leaves me with a problem because my machine is clearly
> overestimating the correction needed. I added some printouts and got
>
> raw totaltime = 0.370937
> per_iter = 0.000156913, SampleOverhead = 3.28e-06
> adj totaltime = 1.82976
> sampling = 0
> starttime = 0/00
> counter = 0/370937
> firsttuple = 0.258321
> tuplecount = 1
> itercount = 10001
> samplecount = 704
> nextsample = 10011
> startup = 0.258321
> total = 1.82976
> ntuples = 1
> nloops = 1
>
> on a run with an actual elapsed time near 750 msec. Clearly the
> sampling adjustment is wrong, but why?

This doesn't make any sense at all. How can a sampling run that only sampled 7% of the actual tuples end up with an actual measured time that's more than 50% of the actual final runtime?

Can you get an estimate of the actual overhead (ie compare wall clock time of EXPLAIN ANALYZE vs actual query)? The cost of gettimeofday() is on the order of a few hundred cycles; I'm not sure about the cost of TLB flushes (that's probably highly architecture dependent).

To be honest, I wonder about caching effects, but for the disks. The first few cycles of any plan (like an index scan) are going to incur costs that won't happen later on. Because we sample much more heavily at the beginning than at the end, this will bias towards higher numbers. You should be able to see this by seeing if running queries that don't require disk access fare better. That would suggest a much more careful correction method that works for non-linear timing patterns...

Have a nice day,

-- Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > If that's the case, then maybe a more sopdisticated method of measuring > the overhead would work. I think you missed the point: the time spent in gettimeofday() itself is not the major overhead of EXPLAIN ANALYZE. At least it appears that this is the case on my machine. I'm thinking that interrupt-driven sampling might work OK though. My previous worries were based on trying to use the ITIMER_PROF timer, which might not be portable and would conflict with gprof anyway. But EXPLAIN ANALYZE has always been interested in real time rather than CPU time, so the correct interrupt to use is ITIMER_REAL. That means we only have to share with our own existing usages of that interrupt, which turns it from a portability issue into just a Small Matter Of Programming. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
> I have a theory about this, and it's not pleasant at all. What I
> think is that we have a Heisenberg problem here: the act of invoking
> gettimeofday() actually changes what is measured. That is, the
> runtime of the "second part" of ExecProcNode is actually longer when
> we sample than when we don't, not merely due to the extra time spent
> in gettimeofday(). It's not very hard to guess at reasons why, either.
> The kernel entry is probably flushing some part of the CPU's state,
> such as virtual/physical address mapping for the userland address
> space. After returning from the kernel call, the time to reload
> that state shows up as more execution time within the "second part".
>
> This theory explains two observations that otherwise are hard to
> explain. One, that the effect is platform-specific: your machine
> may avoid flushing as much state during a kernel call as mine does.
> And two, that upper plan nodes seem much more affected than lower
> ones. That makes sense because the execution cycle of an upper node
> will involve touching more userspace data than a lower node, and
> therefore more of the flushed TLB entries will need to be reloaded.

If that's the case, then maybe a more sophisticated method of measuring the overhead would work. My thought is that on the second call to pull a tuple from a node (second because the first probably has some anomalies due to startup), we measure the overhead for that node. This would probably mean doing the following:

    get start time    # I'm not referring to this as gettimeofday to avoid
                      # confusion
    gettimeofday()    # this is the gettimeofday call that will happen during
                      # normal operation
    get end time

Hopefully, there's no caching effect that would come into play from not actually touching any of the data structures after the gettimeofday() call. If that's not the case, it makes measuring the overhead more complex, but I think it should still be doable... 
-- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED]
Pervasive Software http://pervasive.com
work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
I wrote:
> The problem looks to be an underestimation of SampleOverhead, and on
> reflection it's clear why: what CalculateSampleOverhead is measuring
> isn't the total overhead, but the time between the two gettimeofday
> calls. Which is probably about half the true overhead.

On further thought, I take that back: the "true overhead" is not the point here. The time elapsed during a plan node execution when sampling can be broken down into three phases:

    time before first gettimeofday call
    time between gettimeofday calls
    time after second gettimeofday call

Only the second part is actually measured by the instrumentation system; the other parts are overhead that has never been counted by EXPLAIN ANALYZE, sampling version or no. Moreover, most of the runtime of InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion of the second-part runtime that doesn't occur when sampling = false. Assuming that gettimeofday() has consistent runtime and the actual time reported is measured at a consistent instant within that runtime, I believe that we should take the SampleOverhead as just equal to the runtime of a single gettimeofday() call. The added or removed second-part time within InstrStartNode is just the tail time of gettimeofday, and the added or removed second-part time within InstrStopNode is basically just the head time of gettimeofday. (To make this as true as possible, we need to change the order of operations so that gettimeofday is invoked *immediately* after the "if (sampling)" test, but that's easy.)

So this line of thought leads to the conclusion that CalculateSampleOverhead is actually overestimating SampleOverhead a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT(). But that still leaves me with a problem because my machine is clearly overestimating the correction needed. 
I added some printouts and got

    raw totaltime = 0.370937
    per_iter = 0.000156913, SampleOverhead = 3.28e-06
    adj totaltime = 1.82976
    sampling = 0
    starttime = 0/00
    counter = 0/370937
    firsttuple = 0.258321
    tuplecount = 1
    itercount = 10001
    samplecount = 704
    nextsample = 10011
    startup = 0.258321
    total = 1.82976
    ntuples = 1
    nloops = 1

on a run with an actual elapsed time near 750 msec. Clearly the sampling adjustment is wrong, but why?

I have a theory about this, and it's not pleasant at all. What I think is that we have a Heisenberg problem here: the act of invoking gettimeofday() actually changes what is measured. That is, the runtime of the "second part" of ExecProcNode is actually longer when we sample than when we don't, not merely due to the extra time spent in gettimeofday(). It's not very hard to guess at reasons why, either. The kernel entry is probably flushing some part of the CPU's state, such as virtual/physical address mapping for the userland address space. After returning from the kernel call, the time to reload that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to explain. One, that the effect is platform-specific: your machine may avoid flushing as much state during a kernel call as mine does. And two, that upper plan nodes seem much more affected than lower ones. That makes sense because the execution cycle of an upper node will involve touching more userspace data than a lower node, and therefore more of the flushed TLB entries will need to be reloaded.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE sampling has just crashed and burned. We can't ship a measurement tool that is accurate on some platforms and not others.

I'm wondering if it's at all practical to go over to a gprof-like measurement method for taking EXPLAIN ANALYZE runtime measurements; that is, take an interrupt every so often and bump the count for the currently active plan node. 
This would spread the TLB-flush overhead more evenly and thus avoid introducing that bias. There may be too much platform dependency in this idea, though, and I also wonder if it'd break the ability to do normal gprof profiling of the backend.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
Martijn van Oosterhout writes:
> Weird. Can you get the output of *instr in each call of
> InstrEndLoop? Preferably after it does the calculation but before it
> clears the values. So we get an idea of number of samples and what it
> guesses. SampleOverhead would be good too.

The problem looks to be an underestimation of SampleOverhead, and on reflection it's clear why: what CalculateSampleOverhead is measuring isn't the total overhead, but the time between the two gettimeofday calls. Which is probably about half the true overhead. What we ought to do is iterate InstrStartNode/InstrStopNode N times, and *separately* measure the total elapsed time spent.

It occurs to me that what we really want to know is not so much the total time spent in InstrStartNode/InstrStopNode, as the difference in the time spent when sampling is on vs when it is off. I'm not quite sure if the time spent when it's off is negligible. Off to do some measuring ...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work
On Mon, Jun 05, 2006 at 11:02:33PM -0400, Tom Lane wrote:
> Just got this rather surprising result:
> The "Total runtime" is correct AFAICT, which puts the top node's "actual
> time" rather far out in left field. This is pretty repeatable on my old
> slow HPPA machine. A new Xeon shows less of a discrepancy, but it's
> still claiming top node actual > total, which is not right.

Weird. Can you get the output of *instr in each call of InstrEndLoop? Preferably after it does the calculation but before it clears the values. So we get an idea of the number of samples and what it guesses. SampleOverhead would be good too.

I know my version produced sensible results on my machine and the handful of people testing, so I'll try it again with your changes, see how it looks...

Have a nice day,

-- Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.