Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-12 Thread Zeugswetter Andreas DCP SD

  This bothers me a bit, because in
  fact the effects if any of the tested query would have been rolled
  back.  Not sure we have any choice though.  If we expose the error
  then we'll have problems with clients not showing the EXPLAIN
  results.
 
  I think we should leave it in top level, throw the error and fix the
  clients.
  As I understood, the idea was, that it only does that if you press ^C
  or query timeout. In this case current clients would also not show
  the plan.
 
 Not if the clients are implemented per protocol spec.  A
 client cannot assume that sending QueryCancel will make the
 current query fail.

Sorry, I don't understand that comment. I did not say that it must fail;
the case I meant was when it *is* interrupted (and thus fails).

You stated that current clients won't show the EXPLAIN output if they
get an error response. (Does the protocol not allow both data and an
error?)

We would need to teach clients to output the EXPLAIN result even if an
error is returned.

I stand by my comment: on ^C we should return the plan and then return
the error. We should not misuse automatic subtransactions for this.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-10 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Right now I'm confused though. I was under the impression the changes
 were going to be ripped out because it was decided to be unworkable. I
 think improvements can be made but I'm unsure if there's any
 interest...

I've reverted the current patch because it clearly doesn't work well
enough.  There's nothing stopping you from having a better idea though.
It's clear that on some platforms the cost of gettimeofday is high
enough that some workaround would be good.
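The per-call cost in question can be estimated with a micro-benchmark. A minimal sketch (Python purely for illustration, with time.perf_counter() standing in for C's gettimeofday(); the absolute numbers vary wildly by platform, which is exactly the point of this thread):

```python
import time

def timer_call_overhead(calls=100_000):
    """Estimate the per-call cost of reading the clock by timing a tight
    loop of clock reads (time.perf_counter() stands in for gettimeofday())."""
    start = time.perf_counter()
    for _ in range(calls):
        time.perf_counter()
    elapsed = time.perf_counter() - start
    return elapsed / calls  # seconds per clock read

per_call = timer_call_overhead()
print(f"~{per_call * 1e9:.0f} ns per clock read")
```

On hardware where this lands in the microsecond range rather than tens of nanoseconds, per-tuple instrumentation dominates the work being measured.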

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-10 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 The interesting thing about this is that they obviously are gearing
 gettimeofday() to be accurate, rather than just considering it a
 counter that is somewhat close to real time. At the expense of speed.

Not sure that that's an accurate description.  What I think the kernel
fuss is about is that they have to read the counter value as several
separate byte read operations, and if the hardware doesn't latch the
whole counter value when the first byte is pulled then they'll get bytes
from several distinct states of the counter, leading to something that's
not consistent or even monotonic.  On non-latching hardware there's
really not a lot of choice what to do.  The patch is about not using
that same very-slow code path on hardware that does latch.

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-10 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 There seem to be two types of overhead going on. There's the amount of time
 spent in gettimeofday itself which is pretty consistent.

That is a fact not in evidence.  The impression I had from that
linux-kernel discussion was that the problematic kernel code was looping
until it got consistent results from successive hardware reads.  I'm
not at all sure that you can make the above assertion across all
varieties of clock hardware, or even all common varieties.

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-10 Thread Benny Amorsen
 MvO == Martijn van Oosterhout kleptog@svana.org writes:

MvO What we want is just a monotonically increasing counter that can
MvO be read quickly and consistently, we're not majorly fussed if it
MvO doesn't match real time. This puts us back to CPU cycle counters,
MvO but they have drawbacks of their own.

It is amazing how this discussion is mirroring discussions on the
linux-kernel list.

Applications have been using CPU cycle counters on Linux to avoid the
gettimeofday() overhead. With reasonably recent kernels, the overhead
is very low when the CPU cycle counters are usable, because
gettimeofday() never actually enters kernel space.

Unfortunately fewer and fewer systems seem to have usable cycle
counters. As an example, dual core Athlon64/Opteron boots with the
cycle counters unsynced. The kernel can compensate for that. However
they also lose sync whenever clock frequency changes, and the kernel
has a really hard time compensating for it. On such systems the kernel
switches back to slower timers and gettimeofday() becomes a real
system call. Applications should not try to use cycle counters
directly on such systems. If the kernel developers manage a good
workaround, gettimeofday() becomes fast again, but applications which
use cycle counters most likely stay broken.

Basically either gettimeofday() is fast, or the cycle counters are
useless -- unless you really care about counting CPU cycles rather than
real time. Some CPUs, like Transmeta's, actually get the counter thing
right: they count by 2 every tick when running at half speed, and so on.
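The property Martijn asked for, and that Benny says the cycle counters often fail to deliver, is monotonicity rather than wall-clock accuracy. A small sketch of the distinction (Python's time.monotonic_ns() as the stand-in counter; it is kernel-backed, not a raw CPU cycle counter):

```python
import time

def monotonic_is_nondecreasing(samples=10_000):
    """A profiling counter only has to never run backwards; it need not
    track wall-clock time.  time.monotonic_ns() is Python's analogue of
    such a counter."""
    prev = time.monotonic_ns()
    for _ in range(samples):
        now = time.monotonic_ns()
        if now < prev:
            return False
        prev = now
    return True

ok = monotonic_is_nondecreasing()
print(ok)  # wall-clock time.time() offers no such guarantee (NTP steps, etc.)
```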


/Benny





Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-09 Thread Zeugswetter Andreas DCP SD
 This bothers me a bit, because in 
 fact the effects if any of the tested query would have been 
 rolled back.  Not sure we have any choice though.  If we 
 expose the error then we'll have problems with clients not 
 showing the EXPLAIN results.

I think we should leave it in top level, throw the error and fix the
clients.
As I understood, the idea was, that it only does that if you 
press ^C or query timeout. In this case current clients would also
not show the plan.

Andreas



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Simon Riggs
On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
  the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
 
  The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
  and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
  you were looking for.
 
 Do you have oprofile installed on these?  Comparing oprofile results
 might give some more insight where the time is going.

On my office PC the overhead from EA is about 20%.

stracing a backend, I also notice that on a 1.5 million row table we
make only 39158 calls to gettimeofday, on a table of 24591 blocks.

I'm thinking that the instrumentation overhead has been reduced as a
result of the page-at-a-time heap scans we now have?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Alvaro Herrera
Alvaro Herrera wrote:

 The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
 and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
 you were looking for.

Hmm, I take that back.  I updated to the current CVS tip on this machine
and now I don't see the problem.  Without EXPLAIN ANALYZE, the times are
around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Alvaro Herrera
Simon Riggs wrote:
 On Thu, 2006-06-08 at 17:21 -0400, Tom Lane wrote:
  Alvaro Herrera [EMAIL PROTECTED] writes:
   Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
   the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
  
   The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
   and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
   you were looking for.
  
  Do you have oprofile installed on these?  Comparing oprofile results
  might give some more insight where the time is going.
 
 On my office PC the overhead from EA is about 20%.

Can you please run the test program in this thread, if you are running
Linux?

http://marc.theaimsgroup.com/?l=linux-kernelm=114297656924494w=2

The machine I reported to show the problem says

# ./a.out 0xe408
0
./a.out: Detect PM-Timer Bug

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Hmm, I take that back.  I updated to the current CVS tip on this machine
 and now I don't see the problem.  Without EXPLAIN ANALYZE, the times are
 around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.

At the moment, CVS HEAD still has the sampling patch in it.  You need to
be looking at 8.1, or HEAD from several weeks ago, to see the full
impact of gettimeofday-every-time.  (I imagine Simon's strace count
is wrong for the same reason.)

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Larry Rosenman
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
 Hmm, I take that back.  I updated to the current CVS tip on this
 machine and now I don't see the problem.  Without EXPLAIN ANALYZE,
 the times are around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.
 
 At the moment, CVS HEAD still has the sampling patch in it.  You need
 to be looking at 8.1, or HEAD from several weeks ago, to see the full
 impact of gettimeofday-every-time.  (I imagine Simon's strace count
 is wrong for the same reason.)
 
Does my post from yesterday (39ms without explain analyze, 280+ with
explain analyze) on FreeBSD/amd64, dual Xeons in HTT mode help?

What can I do to further the investigation?

It has 8.1.4 on it.

LER



-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Hmm, I take that back.  I updated to the current CVS tip on this machine
  and now I don't see the problem.  Without EXPLAIN ANALYZE, the times are
  around 115-117 ms.  With EXPLAIN ANALYZE, 149-152ms.
 
 At the moment, CVS HEAD still has the sampling patch in it.  You need to
 be looking at 8.1, or HEAD from several weeks ago, to see the full
 impact of gettimeofday-every-time.  (I imagine Simon's strace count
 is wrong for the same reason.)

Ok, then I re-retract, because the test I posted previously was done
with 8.1.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Tom Lane
Larry Rosenman ler@lerctr.org writes:
 Does my post from yesterday (39ms without explain analyze, 280+ with explain
 analyze) on FreeBSD/amd64, Dual Xeon's in HTT mode help?

Well, it confirms that FreeBSD is subject to the same problem ;-).

I think the bottom line here is that there are some machines out there
where gettimeofday() is fast enough for our purposes, and some where
it is nowhere near fast enough.  And that kernel-level fixes may be
possible for some of the latter, but not all, and we shouldn't hold our
breath waiting for that to happen anyway.

To tell you the truth, this information makes me even less pleased with
the sampling-gettimeofday patch than I was before.  If gettimeofday() in
itself increases the runtime of a node by a factor of 10, then just
trying to subtract off that time is no solution.  There's too much
impact on surrounding nodes, and too much roundoff error anyhow.
I had thought we were applying an order-of-ten-percent correction by
subtracting SampleOverhead, not an order-of-10x correction :-(

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Simon Riggs
On Fri, 2006-06-09 at 10:00 -0400, Tom Lane wrote:

 I think the bottom line here is that there are some machines out there
 where gettimeofday() is fast enough for our purposes, and some where
 it is nowhere near fast enough.  And that kernel-level fixes may be
 possible for some of the latter, but not all, and we shouldn't hold our
 breath waiting for that to happen anyway.

Agreed.

The issue seems to be that some systems are tuned to return exactly
correct times and some to return a (possibly imprecise) time with low
overhead. Even if fixes existed, OS packagers may not pick the right one
of those two options for our purposes for EA. (We might prefer accuracy
to speed for other parts of PostgreSQL anyway.)

I propose we revert the sampling patch (sorry Martijn) and go with the
patch to have an explain_analyze_timing parameter (default=on). That way
we'll have an option to turn off timing *if* we happen to be running on
an OS/hw combo that sucks *and* trying to run large enough EAs that we
care.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I propose we revert the sampling patch (sorry Martijn)

yah ...

 and go with the
 patch to have an explain_analyze_timing parameter (default=on).

This I'm unexcited about.  EXPLAIN output isn't all that transparent
anyway, and losing the extra cue of seeing where the time is really
going would make it extremely easy for people to misinterpret their
problems.

I was intending to push forward with the idea of being able to get
numbers out of a canceled EXPLAIN.  That will allow you to get some
information even when the underlying query runs longer than you're
willing to tolerate.  I still say that the number of queries where
avoiding gettimeofday overhead would transform an intolerable runtime
into a tolerable one is pretty limited.

The other thing that I think would be worth investigating is
timer-driven sampling, although it's not yet clear whether we can
make that work usefully.  Anyone want to take up that project?
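Timer-driven sampling here would mean recording, at each periodic tick, which plan node is currently executing, and estimating each node's share of runtime from its share of ticks. A deterministic sketch of the idea (all node names and figures are made up; times in integer microseconds so the arithmetic is exact):

```python
from collections import Counter

def sample_trace(trace, tick_interval):
    """Simulate timer-driven sampling: given a time-ordered list of
    (node, duration) segments, record which node is running at each
    periodic tick."""
    ticks = Counter()
    t, next_tick = 0, tick_interval
    for node, duration in trace:
        end = t + duration
        while next_tick <= end:       # every tick lands in some segment
            ticks[node] += 1
            next_tick += tick_interval
        t = end
    return ticks

# synthetic plan execution: SeqScan runs 3000 us in total, Sort 1000 us
trace = [("SeqScan", 1500), ("Sort", 1000), ("SeqScan", 1500)]
ticks = sample_trace(trace, tick_interval=100)
total = sum(ticks.values())
shares = {node: count / total for node, count in ticks.items()}
print(shares)  # SeqScan 0.75, Sort 0.25 -- matching the true 3:1 split
```

The open question the thread leaves is whether a real SIGALRM-style tick can attribute samples to nodes cheaply enough, and how coarse the estimates get for short-running nodes.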

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Martijn van Oosterhout
On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
 To tell you the truth, this information makes me even less pleased with
 the sampling-gettimeofday patch than I was before.  If gettimeofday() in
 itself increases the runtime of a node by a factor of 10, then just
 trying to subtract off that time is no solution.  There's too much
 impact on surrounding nodes, and too much roundoff error anyhow.
 I had thought we were applying an order-of-ten-percent correction by
 subtracting SampleOverhead, not an order-of-10x correction :-(

Eh? The whole point is to call gettimeofday() much less often. If you
call it a 1000th as often, then the correction is only on the order of
one hundredth of the normal query time...

Subtracting SampleOverhead is only a correction on the order of a few
percent, it's the reduced calling of gettimeofday() that provides the
benefit.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread A.M.
It would be nice to keep the gettimeofday()s wherever they are most useful
on hardware/software where they are cheap. Perhaps a compile-time option?

On Fri, June 9, 2006 11:18 am, Martijn van Oosterhout wrote:
 On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:

 To tell you the truth, this information makes me even less pleased with
  the sampling-gettimeofday patch than I was before.  If gettimeofday()
 in itself increases the runtime of a node by a factor of 10, then just
 trying to subtract off that time is no solution.  There's too much
 impact on surrounding nodes, and too much roundoff error anyhow. I had
 thought we were applying an order-of-ten-percent correction by
 subtracting SampleOverhead, not an order-of-10x correction :-(

 Eh? The whole point is to call gettimeofday() much less often. If you
 call it 1000th as often, then the correction is only on the order of one
 hundredth of the normal query time...

 Subtracting SampleOverhead is only a correction on the order of a few
 percent, it's the reduced calling of gettimeofday() that provides the
 benefit.

 Have a nice day,
 --
 Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/

 From each according to his ability. To each according to his ability to
 litigate.






Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 On Fri, Jun 09, 2006 at 10:00:20AM -0400, Tom Lane wrote:
 I had thought we were applying an order-of-ten-percent correction by
 subtracting SampleOverhead, not an order-of-10x correction :-(

 Eh? The whole point is to call gettimeofday() much less often. If you
 call it 1000th as often, then the correction is only on the order of
 one hundredth of the normal query time...

No, because the correction calculation is

    totaltime += (average time per sampled execution - SampleOverhead)
                 * (number of unsampled executions)

If SampleOverhead is 90% of the average time per sampled execution,
then you are multiplying something with a large component of
cancellation error by a possibly-large number.

As an example, using the numbers I posted here for my old PC:
http://archives.postgresql.org/pgsql-hackers/2006-06/msg00407.php
the actual runtime is clearly about 1 usec per tuple but enabling
timing adds 8.5 usec per tuple.  If we suppose we sampled
10000 out of 1 million executions, then we'd have

raw totaltime = 10000 * 9.5 usec = 95 msec
avg time/execution = 9.5 usec
SampleOverhead = 8.5 usec
number of unsampled executions = 990000
correction = 990 msec

which means that a 10% error in estimating SampleOverhead would
contribute as much to the final estimate as the actual measurement did.
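Tom's arithmetic can be reproduced directly (a sketch using his figures, 10000 sampled executions out of 1 million at 9.5 usec average with 8.5 usec overhead; the names mirror the discussion, not actual backend symbols):

```python
def correction_terms(avg_sampled_us, overhead_us, sampled, unsampled):
    """Tom's correction: time actually measured, plus an extrapolation
    for the executions that were not timed."""
    raw = sampled * avg_sampled_us                       # usec actually measured
    correction = (avg_sampled_us - overhead_us) * unsampled
    return raw, correction

raw, correction = correction_terms(9.5, 8.5, 10_000, 990_000)
print(raw / 1000, correction / 1000)   # 95.0 msec measured, 990.0 msec correction

# if SampleOverhead were misestimated by 10%, the correction shifts by:
_, corr_off = correction_terms(9.5, 8.5 * 1.1, 10_000, 990_000)
delta = correction - corr_off
print(delta / 1000)                    # ~841 msec -- dwarfing the measurement
```

The cancellation error is visible in the second result: a small relative error in the overhead estimate produces a shift an order of magnitude larger than the 95 msec that was actually measured.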

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Simon Riggs
On Fri, 2006-06-09 at 10:56 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I propose we revert the sampling patch (sorry Martijn)
 
 yah ...
 
  and go with the
  patch to have an explain_analyze_timing parameter (default=on).
 
 This I'm unexcited about.  EXPLAIN output isn't all that transparent
 anyway, and losing the extra cue of seeing where the time is really
 going would make it extremely easy for people to misinterpret their
 problems.

As before, agreed, but it works and is available now.

 I was intending to push forward with the idea of being able to get
 numbers out of a canceled EXPLAIN.  That will allow you to get some
 information even when the underlying query runs longer than you're
 willing to tolerate.  I still say that the number of queries where
 avoiding gettimeofday overhead would transform an intolerable runtime
 into a tolerable one is pretty limited.

That would be a good feature to have.

 The other thing that I think would be worth investigating is
 timer-driven sampling, although it's not yet clear whether we can
 make that work usefully.  Anyone want to take up that project?

Not me, sorry.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-09 Thread Tom Lane
Zeugswetter Andreas DCP SD [EMAIL PROTECTED] writes:
 This bothers me a bit, because in 
 fact the effects if any of the tested query would have been 
 rolled back.  Not sure we have any choice though.  If we 
 expose the error then we'll have problems with clients not 
 showing the EXPLAIN results.

 I think we should leave it in top level, throw the error and fix the
 clients.
 As I understood, the idea was, that it only does that if you 
 press ^C or query timeout. In this case current clients would also
 not show the plan.

Not if the clients are implemented per protocol spec.  A client cannot
assume that sending QueryCancel will make the current query fail.

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Martijn van Oosterhout
On Fri, Jun 09, 2006 at 11:22:11AM -0400, A.M. wrote:
 It would be nice to keep the gettimeofday()s wherever they are most useful
 on hardware/software where they are cheap. Perhaps a compile-time option?

I think making it optional at compile time is worse, because then at
support time you might have to recompile to get the option you want.

Right now I'm confused though. I was under the impression the changes
were going to be ripped out because it was decided to be unworkable. I
think improvements can be made but I'm unsure if there's any
interest...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 I was intending to push forward with the idea of being able to get
 numbers out of a canceled EXPLAIN.  That will allow you to get some
 information even when the underlying query runs longer than you're
 willing to tolerate.  I still say that the number of queries where
 avoiding gettimeofday overhead would transform an intolerable runtime
 into a tolerable one is pretty limited.

Are we still going to subtract out the gettimeofday overhead?

I was always more excited about that than the sampling aspect. I've run into
queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday
overhead but I've never run into a query where gettimeofday overhead made
running the query under EXPLAIN ANALYZE impractical.


-- 
greg




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Are we still going to subtract out the gettimeofday overhead?

 I was always more excited about that than the sampling aspect. I've run into
 queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday
 overhead but I've never run into a query where gettimeofday overhead made
 running the query under EXPLAIN ANALYZE impractical.

That would be deceptive in a different way, ie, make it look like there
was a whole bunch of outside-the-plan overhead.

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  Are we still going to subtract out the gettimeofday overhead?
 
  I was always more excited about that than the sampling aspect. I've run into
  queries where EXPLAIN ANALYZE results were deceptive due to the gettimeofday
  overhead but I've never run into a query where gettimeofday overhead made
  running the query under EXPLAIN ANALYZE impractical.
 
 That would be deceptive in a different way, ie, make it look like there
 was a whole bunch of outside-the-plan overhead.

Well that's exactly what there is.

To avoid user confusion it would be reasonable to print out a line at the bottom:

  Explain analyze profiling overhead removed: xxx ms

That also gives the user feedback on how precise their explain analyze results
are. If they see that the overhead being removed is as large as the timing
remaining then they can realize that the results aren't especially precise. On
the other hand if they see that the overhead being removed is small then they
can be pretty confident in the precision of the results.

-- 
greg




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 To avoid user confusion it would be reasonable to print out a line at the bottom:

   Explain analyze profiling overhead removed: xxx ms

 That also gives the user feedback on how precise their explain analyze results
 are. If they see that the overhead being removed is as large as the timing
 remaining then they can realize that the results aren't especially precise. On
 the other hand if they see that the overhead being removed is small then they
 can be pretty confident in the precision of the results.

I would prefer to leave the numbers unmanipulated, because frankly I'd
have no confidence in the correction.  Perhaps we could put a note at the
bottom like "Estimated total profiling overhead: xxx ms".

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  To avoid user confusion it would be reasonable to print out a line at
  the bottom:
 
    Explain analyze profiling overhead removed: xxx ms
 
  That also gives the user feedback on how precise their explain analyze
  results are. If they see that the overhead being removed is as large as
  the timing remaining then they can realize that the results aren't
  especially precise. On the other hand if they see that the overhead
  being removed is small then they can be pretty confident in the
  precision of the results.
 
 I would prefer to leave the numbers unmanipulated, because frankly I'd
 have no confidence in the correction.  Perhaps we could put a note at the
 bottom about Estimated total profiling overhead: xxx ms.

Well the whole reason the overhead is deceptive is that some nodes call
gettimeofday many more times proportionally to their running time than others.
It makes users suspect cpu-intensive nodes that process many records when the
real problem lies in an i/o-intensive node that processes few records. Just
showing the total overhead will give the user the impression that time is
distributed evenly throughout the nodes proportionally to their total time
which doesn't help correct the distortion.

There seem to be two types of overhead going on. There's the amount of time
spent in gettimeofday itself, which is pretty consistent. Then there are
other effects that cause normal operation itself to take longer (or
potentially even less long).

Postgres may as well remove the consistent gettimeofday overhead it can
identify and measure even if there's more it can't characterize perfectly.
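Greg's distortion argument can be put in numbers: the overhead scales with a node's timer-call count, not its runtime, so a cheap node that touches many rows looks expensive. A sketch with made-up figures (8.5 usec of assumed instrumentation cost per row processed; node names are hypothetical):

```python
OVERHEAD_PER_ROW_US = 8.5  # assumed instrumentation cost per row processed

nodes = {
    # node name: (true runtime in usec, rows processed) -- made-up figures
    "cpu_node": (100_000, 50_000),   # many cheap rows
    "io_node":  (300_000, 1_000),    # few expensive rows
}

measured = {name: t + rows * OVERHEAD_PER_ROW_US
            for name, (t, rows) in nodes.items()}
corrected = {name: m - nodes[name][1] * OVERHEAD_PER_ROW_US
             for name, m in measured.items()}

print(measured)   # cpu_node appears dominant: 525000.0 vs 308500.0
print(corrected)  # the i/o node was the real bottleneck all along
```

A single "total overhead" line cannot undo this per-node distortion, which is Greg's point for subtracting the consistent component node by node.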

-- 
greg




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-09 Thread Martijn van Oosterhout
On Fri, Jun 09, 2006 at 03:55:28PM -0400, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  There seem to be two types of overhead going on. There's the amount of time
  spent in gettimeofday itself which is pretty consistent.
 
 That is a fact not in evidence.  The impression I had from that
 linux-kernel discussion was that the problematic kernel code was looping
 until it got consistent results from successive hardware reads.  I'm
 not at all sure that you can make the above assertion across all
 varieties of clock hardware, or even all common varieties.

The interesting thing about this is that they obviously are gearing
gettimeofday() to be accurate, rather than just considering it a
counter that is somewhat close to real time. At the expense of speed.

What we want is just a monotonically increasing counter that can be
read quickly and consistently, we're not majorly fussed if it doesn't
match real time. This puts us back to CPU cycle counters, but they have
drawbacks of their own.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Simon Riggs
On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
  Certainly the removal of timing
  is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
  acceptable one; 
 
  I disagree, as have others.
 
 The overhead seems to be on the order of a couple tens of percent usually.
 I don't see how that makes the difference between an EXPLAIN ANALYZE you
 can run and one you can't.

Well, that's not my experience, and it doesn't match other reports posted
on -hackers.

A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
to be consistently above 500% (or more than +400%, depending upon how
you style those numbers). This is for in-shared-buffers data, following
cache priming. Test shown below is typical result from 5 tests:

 postgres=# select count(*) from accounts;
   count
 ---------
  1000000
 (1 row)
 
 Time: 267.008 ms
 postgres=# explain analyze select count(*) from accounts;
                                                         QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=10945.00..10945.01 rows=1 width=0) (actual time=1497.830..1497.837 rows=1 loops=1)
    ->  Seq Scan on accounts  (cost=0.00..9998.20 rows=378720 width=0) (actual time=0.078..828.455 rows=1000000 loops=1)
  Total runtime: 1497.954 ms
 (3 rows)
 
 Time: 1498.983 ms

Other timings were: 1493 1498 1707 1814 1827

EXPLAIN ANALYZE is designed to be run *when* you have unacceptable run
times and need to find out why. So making the execution time even more
unacceptable makes the utility infeasible at the time you need it most.
The additional run-time occurs on people's production systems, so they
generally aren't happy running long tasks.

BTW, I think EA is great; so good, in fact, that I want to run it more often.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
 The overhead seems to be on the order of a couple tens of percent usually.
 I don't see how that makes the difference between an EXPLAIN ANALYZE you
 can run and one you can't.

 Well, that's not my experience, and it doesn't match what others have posted
 on -hackers.

 A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
 to be consistently above 500% (or more than +400%, depending upon how
 you style those numbers).

I think we ought to find out why your machine is so broken.

Even in this pretty-much-worst-case scenario (a seqscan does about as
little real work per plan node call as possible, especially if the table
is already fully cached), I don't see more than about a 2X degradation.
On queries that are complicated enough to actually need EXPLAIN ANALYZE,
it's not nearly that bad.

Old slow HPUX/HPPA machine, PG 8.1 branch tip:

bench=# \timing
Timing is on.
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 543.565 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 492.667 ms
bench=# explain analyze select count(*) from accounts;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2975.00..2975.01 rows=1 width=0) (actual time=1172.856..1172.860 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2725.00 rows=100000 width=0) (actual time=0.175..720.741 rows=100000 loops=1)
 Total runtime: 1173.290 ms
(3 rows)

Time: 1176.293 ms
bench=#

Spiffy new Fedora 5/dual Xeon machine, PG 8.1 branch tip:

bench=# \timing
Timing is on.
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 61.737 ms
-- do it again to ensure fully cached
bench=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 53.941 ms
bench=# explain analyze select count(*) from accounts;
                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2975.00..2975.01 rows=1 width=0) (actual time=117.881..117.882 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2725.00 rows=100000 width=0) (actual time=0.041..77.628 rows=100000 loops=1)
 Total runtime: 117.936 ms
(3 rows)

Time: 118.510 ms
bench=#

I'm too lazy to pull up any of my other machines right now, but this is
generally consistent with my experience ever since EXPLAIN ANALYZE was
written.

So: what's your platform exactly?

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Simon Riggs
On Thu, 2006-06-08 at 10:27 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Wed, 2006-06-07 at 17:28 -0400, Tom Lane wrote:
  The overhead seems to be on the order of a couple tens of percent usually.
  I don't see how that makes the difference between an EXPLAIN ANALYZE you
  can run and one you can't.
 
  Well, that's not my experience, and it doesn't match what others have posted
  on -hackers.
 
  A simple test with pgbench shows the timing overhead of EXPLAIN ANALYZE
  to be consistently above 500% (or more than +400%, depending upon how
  you style those numbers).
 
 I think we ought to find out why your machine is so broken.

 I'm too lazy to pull up any of my other machines right now, but this is
 generally consistent with my experience ever since EXPLAIN ANALYZE was
 written.

Great. Well, it isn't consistent with mine, or with what others have posted
to this list.

 So: what's your platform exactly?

FC5, Intel laptop running cvstip, new in January.

But that's irrelevant. I'm not a user; I solve others' problems, as you
know. Hence my interest in a usable tool to do that.

So far we have myself, Kevin, Martijn and Luke all saying there is a
distortion or a massive overhead caused by EXPLAIN ANALYZE.
http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php

It's real. I won't press the point further.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 So far we have myself, Kevin, Martijn and Luke all saying there is a
 distortion or a massive overhead caused by EXPLAIN ANALYZE.
 http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
 http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php

The first of these shows the exact opposite of what you are claiming,
and the second is without details of any sort that might help determine
what the actual problem is.

Given that we're seeing diametrically opposed results on the same OS
(FC5) and similar (at least all Intel) hardware, I think the prudent
thing is to find out what's really going on before leaping in with
proposed solutions.  As the person who's *not* seeing the problem,
I'm not in a position to do that investigation...

regards, tom lane



Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-08 Thread Tom Lane
I wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 A full EXPLAIN ANALYZE is always desirable - we agree on that. The
 question is what we do when one is not available.

 The least bad alternative I've heard is to let EXPLAIN ANALYZE print
 out stats-so-far if the query is canceled by control-C or statement
 timeout.  The objection to this is you may mistake startup transients
 for full query behavior ... but at least the numbers will be good as
 far as they go.

I thought some more about this, and it seems doable but tricky (ie,
there are many wrong ways to do it).  Here are my conclusions:

We can't just use the normal QueryCancel logic that throws an
elog(ERROR) from someplace down inside the query.  This would leave
the backend in an unclean state from which we could only certainly
recover by doing AbortTransaction.  And once we've aborted the
transaction we can't do catalog accesses, which gets in the way of
producing the EXPLAIN printout.

Running the test query inside a subtransaction would fix that,
but aborting the subtransaction would throw away the executor state,
including the Instrumentation nodes we need.

So it seems we need a way to stop the query without raising an error
per se.  What I'm thinking is that while EXPLAIN ANALYZE is running,
timeout or SIGINT should not set QueryCancelPending, but instead
set a separate flag ExplainCancelPending, which we then test in
ExecProcNode(), say

    if (node->instrument)
    {
+       if (ExplainCancelPending)
+           return NULL;
        InstrStartNode(node->instrument);
    }

There might be one or two other places to check it, but basically
we'd only notice the flag in very limited circumstances where
it's definitely safe to force early termination of ExecutorRun.

When control comes back to explain.c, we just print the results as
normal (but probably adding a line explicitly noting that the query
was abandoned before completion).  Note that we won't have any running
Instrumentation nodes to contend with, since the change doesn't cause
nodes to drop out after they've started timing.  So the data is good,
just incomplete.
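The flag-based early termination described above is the standard cooperative-cancellation pattern; a self-contained analogue (illustrative only, not PostgreSQL code; all names here are hypothetical stand-ins for ExplainCancelPending and ExecProcNode):

```c
#include <stdio.h>
#include <stdbool.h>

/* Analogue of ExplainCancelPending: set asynchronously (e.g. by a
 * signal handler), checked only at safe points. */
static volatile bool cancel_pending = false;

/* Analogue of ExecProcNode(): returns the next row, or -1 for "done".
 * Checking the flag here means we stop cleanly between rows, so any
 * statistics gathered so far remain valid, just incomplete. */
static int next_row(int *state)
{
    if (cancel_pending)
        return -1;
    if (*state >= 5)
        return -1;
    return (*state)++;
}

int main(void)
{
    int state = 0, row, count = 0;

    while ((row = next_row(&state)) != -1)
    {
        count++;
        if (count == 3)
            cancel_pending = true;   /* simulate ^C arriving mid-query */
    }
    printf("processed %d rows before cancel\n", count);
    return 0;
}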

After we've printed the results, we have a bit of a problem: if
ExplainCancelPending is set, we now need to abort the transaction.  It
would not do at all to allow an incompletely executed UPDATE to commit.
I experimented with throwing an elog() at the bottom of ExplainQuery()
after end_tup_output(), but this does not work: psql only prints the
error and not the data, because libpq throws away the query result
upon seeing the error.  We could probably hack psql to print the results
before noting the error, but I'm concerned about making a change that
would change the behavior for other error-at-end-of-statement cases.
Also, it's not clear what might have to happen to get non-psql clients
to play along.

It seems like the best solution is to establish a subtransaction around
the entire EXPLAIN command (not just the test query), which we can abort
after we've printed the results.

A possible objection to this is that running a query inside a
subtransaction might have different/worse performance than running it
at top level.  I don't recall any severe bottlenecks of that kind but
that doesn't mean there aren't any (Alvaro, any comments?)

Note that this would mean that ending an EXPLAIN ANALYZE early, via
either control-C or statement timeout, would be treated as a non-error
situation from the point of view of the outer transaction.  This bothers
me a bit, because in fact the effects, if any, of the tested query would
have been rolled back.  Not sure we have any choice though.  If we
expose the error then we'll have problems with clients not showing the
EXPLAIN results.

Thoughts?

regards, tom lane



Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-08 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 After we've printed the results, we have a bit of a problem: if
 ExplainCancelPending is set, we now need to abort the transaction.  It
 would not do at all to allow an incompletely executed UPDATE to commit.
 I experimented with throwing an elog() at the bottom of ExplainQuery()
 after end_tup_output(), but this does not work: psql only prints the
 error and not the data, because libpq throws away the query result
 upon seeing the error.  We could probably hack psql to print the results
 before noting the error, but I'm concerned about making a change that
 would change the behavior for other error-at-end-of-statement cases.
 Also, it's not clear what might have to happen to get non-psql clients
 to play along.

Would it be possible to make a whole new protocol message for EXPLAIN results?

The reason I'm suggesting that is because it might make it easier to implement
the SIGINFO handler that sends incremental EXPLAIN results on demand that I
was describing earlier.

Then libpq would have a separate api call to check for EXPLAIN results. If a
non-psql client doesn't check it then it doesn't get the EXPLAIN results but
the behaviour is correct. If it does then it can get the EXPLAIN results.

This might also let you run with EXPLAIN ANALYZE instrumentation but still get
the regular query results, since the EXPLAIN ANALYZE results would be
available out of band.

-- 
Gregory Stark
T + 514 938 2456




Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-08 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Would it be possible to make a whole new protocol message for EXPLAIN results?

I'm really unwilling to get into that.  For one thing, that would
absolutely positively break *all* use of EXPLAIN from un-fixed clients,
whether you were trying to stop early or not.

 The reason I'm suggesting that is because it might make it easier to implement
 the SIGINFO handler that sends incremental EXPLAIN results on demand that I
 was describing earlier.

Doesn't matter, because that's not happening ;-)  SIGINFO isn't
portable, and even if it were, a signal handler couldn't possibly
generate EXPLAIN output (remember those catalog accesses).

regards, tom lane



Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-08 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Gregory Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  Would it be possible to make a whole new protocol message for EXPLAIN 
  results?
 
 I'm really unwilling to get into that.  For one thing, that would
 absolutely positively break *all* use of EXPLAIN from un-fixed clients,
 whether you were trying to stop early or not.

Well, I was picturing still having the manual EXPLAIN ANALYZE, which works as
it does now (replacing the query results), in addition to a lower-level
protocol request that generates a protocol response carrying the data.

  The reason I'm suggesting that is because it might make it easier to 
  implement
  the SIGINFO handler that sends incremental EXPLAIN results on demand that I
  was describing earlier.
 
 Doesn't matter, because that's not happening ;-)  SIGINFO isn't
 portable, and even if it were, a signal handler couldn't possibly
 generate EXPLAIN output (remember those catalog accesses).

Well the signal gets handled by psql. It would have to be a wire protocol
message like Query Cancel that the backend watches for and handles the first
time it can.

-- 
greg




Re: Ending EXPLAIN ANALYZE early (was Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work)

2006-06-08 Thread Alvaro Herrera
Tom Lane wrote:

 A possible objection to this is that running a query inside a
 subtransaction might have different/worse performance than running it
 at top level.  I don't recall any severe bottlenecks of that kind but
 that doesn't mean there aren't any (Alvaro, any comments?)

Nope, nothing that I recall.  Maybe make sure to generate the Xid of the
subtransaction before starting the EXPLAIN itself (so that there is no
pg_subtrans I/O pending, etc)

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Simon Riggs
On Thu, 2006-06-08 at 12:56 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  So far we have myself, Kevin, Martijn and Luke all saying there is a
  distortion or a massive overhead caused by EXPLAIN ANALYZE.
  http://archives.postgresql.org/pgsql-hackers/2006-03/msg00954.php
  http://archives.postgresql.org/pgsql-patches/2006-05/msg00168.php
 
 Given that we're seeing diametrically opposed results on the same OS
 (FC5) and similar (at least all Intel) hardware, I think the prudent
 thing is to find out what's really going on before leaping in with
 proposed solutions.  As the person who's *not* seeing the problem,
 I'm not in a position to do that investigation...

That seems reasonable.

I've cut a patch to remove timing from the EA results.
Output shown here:

postgres=# set explain_analyze_timing = on;
SET
Time: 0.673 ms
postgres=# explain analyze select count(*) from accounts;
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (actual time=0.000..690.780 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (actual time=0.000..2672.562 rows=100000 loops=1)
 Total runtime: 734.474 ms
(3 rows)

Time: 891.822 ms
postgres=# set explain_analyze_timing = off;
SET
Time: 0.480 ms
postgres=# explain analyze select count(*) from accounts;
                              QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (rows=100000 loops=1)
 Total runtime: 133.674 ms
(3 rows)

Time: 134.565 ms
postgres=# select count(*) from accounts;
 count
--------
 100000
(1 row)

Time: 130.528 ms

So the timing is clearly responsible for the additional time I'm
personally experiencing, and it is very likely the cause for others also.

As to why that should be the case, I'm not sure. The timing overhead
seems fairly constant on particular hardware/OS, just different for
each. 

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com
Index: src/backend/commands/explain.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/explain.c,v
retrieving revision 1.147
diff -c -r1.147 explain.c
*** src/backend/commands/explain.c	8 Apr 2006 18:49:52 -0000	1.147
--- src/backend/commands/explain.c	8 Jun 2006 20:27:13 -0000
***************
*** 689,699 ****
  	{
  		double		nloops = planstate->instrument->nloops;
  
! 		appendStringInfo(str, " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
  						 1000.0 * planstate->instrument->startup / nloops,
  						 1000.0 * planstate->instrument->total / nloops,
  						 planstate->instrument->ntuples / nloops,
  						 planstate->instrument->nloops);
  	}
  	else if (es->printAnalyze)
  		appendStringInfo(str, " (never executed)");
--- 689,704 ----
  	{
  		double		nloops = planstate->instrument->nloops;
  
!		if (planstate->instrument->actualtime)
! 		appendStringInfo(str, " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
  						 1000.0 * planstate->instrument->startup / nloops,
  						 1000.0 * planstate->instrument->total / nloops,
  						 planstate->instrument->ntuples / nloops,
  						 planstate->instrument->nloops);
+		else
+ 		appendStringInfo(str, " (rows=%.0f loops=%.0f)",
+ 						 planstate->instrument->ntuples / nloops,
+ 						 planstate->instrument->nloops);
  	}
  	else if (es->printAnalyze)
  		appendStringInfo(str, " (never executed)");
Index: src/backend/executor/instrument.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/instrument.c,v
retrieving revision 1.17
diff -c -r1.17 instrument.c
*** src/backend/executor/instrument.c	7 Jun 2006 18:49:03 -0000	1.17
--- src/backend/executor/instrument.c	8 Jun 2006 20:27:13 -0000
***************
*** 17,23 ****
  #include <unistd.h>
  
  #include "executor/instrument.h"
! 
  
  /*
   * As of PostgreSQL 8.2, we try to reduce the overhead of EXPLAIN ANALYZE
--- 17,23 ----
  #include <unistd.h>
  
  #include "executor/instrument.h"
! #include "utils/guc.h"
  
  /*
   * As of PostgreSQL 8.2, we try to reduce the overhead of EXPLAIN ANALYZE
***************
*** 115,120 ****
--- 115,121 ----
  	instr = palloc0(n * sizeof(Instrumentation));
  
  	/* we don't need to do any initialization except zero 'em */
+	instr->actualtime = explain_analyze_timing;
  
  	return instr;
  }
***************
*** 123,128 ****
--- 124,132 ----
  void
  InstrStartNode(Instrumentation *instr)
  {
+	if (!instr->actualtime)
+		return;
+ 
  	if (INSTR_TIME_IS_ZERO(instr->starttime))
  	{
  		/*
***************
*** 150,155 ****
--- 154,166 ----
  {
  	/* count the returned tuples and iterations */
  	

Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 So the timing is clearly responsible for the additional time I'm
 personally experiencing and very likely to be that for others also.

Well, that's certainly unsurprising, but the question is why it's such
a large overhead for you when it's not on other apparently-similar kit.

If anyone else is still following this discussion, could you try the
test case Simon gave a bit earlier (select count(*) from 100,000-row
table, check \timing with and without explain analyze)?  If we could get
a clearer picture of which platforms show the problem and which don't,
maybe we could make some progress.

(BTW, I just tried it on my Apple Powerbook G4, and got results in line
with what I get elsewhere.  So I've now tried it on three very different
CPU types and OSes, and not duplicated Simon's problem on any of them.)

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Alvaro Herrera
Simon Riggs wrote:

 postgres=# explain analyze select count(*) from accounts;
                                                      QUERY PLAN
 -----------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=2890.00..2890.01 rows=1 width=0) (actual time=0.000..690.780 rows=1 loops=1)
    ->  Seq Scan on accounts  (cost=0.00..2640.00 rows=100000 width=0) (actual time=0.000..2672.562 rows=100000 loops=1)
  Total runtime: 734.474 ms
 (3 rows)
 
 Time: 891.822 ms

Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
similar to this:

http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Larry Rosenman
Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 So the timing is clearly responsible for the additional time I'm
 personally experiencing and very likely to be that for others also.
 
 Well, that's certainly unsurprising, but the question is why it's such
 a large overhead for you when it's not on other apparently-similar
 kit. 
 
 If anyone else is still following this discussion, could you try the
 test case Simon gave a bit earlier (select count(*) from 100,000-row
 table, check \timing with and without explain analyze)?  If we could
 get a clearer picture of which platforms show the problem and which
 don't, maybe we could make some progress.
 
 (BTW, I just tried it on my Apple Powerbook G4, and got results in
 line with what I get elsewhere.  So I've now tried it on three very
 different CPU types and OSes, and not duplicated Simon's problem on
 any of them.) 
 

$ psql
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
   \h for help with SQL commands
   \? for help with psql commands
   \g or terminate with semicolon to execute query
   \q to quit

ler=# create table test1(id int);
CREATE TABLE
ler=# insert into test1 select * from generate_series(1,100000);
INSERT 0 100000
ler=# \timing
Timing is on.
ler=# select count(*) from test1;
 count
--------
 100000
(1 row)

Time: 39.191 ms
ler=# explain analyze select count(*) from test1;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1988.18..1988.19 rows=1 width=0) (actual time=282.899..282.901 rows=1 loops=1)
   ->  Seq Scan on test1  (cost=0.00..1698.74 rows=115774 width=0) (actual time=0.007..147.845 rows=100000 loops=1)
 Total runtime: 282.987 ms
(3 rows)

Time: 283.764 ms
ler=# \q
$ uname -a
FreeBSD thebighonker.lerctr.org 6.1-STABLE FreeBSD 6.1-STABLE #59: Thu Jun
1 09:40:47 CDT 2006
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/THEBIGHONKER  amd64
$ sysctl hw
hw.machine: amd64
hw.model: Intel(R) Xeon(TM) CPU 3.00GHz
hw.ncpu: 4
hw.byteorder: 1234
hw.physmem: 4286132224
hw.usermem: 4003151872
hw.pagesize: 4096
hw.floatingpoint: 1
hw.machine_arch: amd64
hw.realmem: 5368709120

[snip]

the database is 8.1.4
-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Alvaro Herrera
Tom Lane wrote:

 If anyone else is still following this discussion, could you try the
  test case Simon gave a bit earlier (select count(*) from 100,000-row
 table, check \timing with and without explain analyze)?  If we could get
 a clearer picture of which platforms show the problem and which don't,
 maybe we could make some progress.

Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.

The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
you were looking for.

All results are repeated multiple times, I'm reporting times other than
the first query (which is slower).

More specs on the Celeron machine follow:

$ cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 6
model name  : Celeron (Mendocino)
stepping: 5
cpu MHz : 534.616
cache size  : 128 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat 
pse36 mmx fxsr
bogomips: 1070.25

$ uname -a
Linux drizzt 2.6.16-1-686 #2 Sat Apr 15 20:45:20 UTC 2006 i686 GNU/Linux

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
 the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.

 The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
 and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
 you were looking for.

Do you have oprofile installed on these?  Comparing oprofile results
might give some more insight where the time is going.

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Jim C. Nasby
On Thu, Jun 08, 2006 at 04:58:07PM -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  So the timing is clearly responsible for the additional time I'm
  personally experiencing and very likely to be that for others also.
 
 Well, that's certainly unsurprising, but the question is why it's such
 a large overhead for you when it's not on other apparently-similar kit.
 
 If anyone else is still following this discussion, could you try the
  test case Simon gave a bit earlier (select count(*) from 100,000-row
 table, check \timing with and without explain analyze)?  If we could get
 a clearer picture of which platforms show the problem and which don't,
 maybe we could make some progress.
 
 (BTW, I just tried it on my Apple Powerbook G4, and got results in line
 with what I get elsewhere.  So I've now tried it on three very different
 CPU types and OSes, and not duplicated Simon's problem on any of them.)

CPU: AMD Athlon(tm) 64 Processor 3500+ (2210.20-MHz K8-class CPU)
8.1.3 on amd64-portbld-freebsd6.0    25ms    353ms

CPU: AMD Opteron(tm) Processor 244 (1792.50-MHz K8-class CPU) (dual CPU)
8.1.4 on amd64-portbld-freebsd6.0    31ms    295ms

Powerbook G4 1.33GHz
8.1.4 on powerpc-apple-darwin8.6.0   5.1s    5.8s*

The powerbook tests were not very repeatable at 100,000 rows, so I
bumped up to 1M. The results still aren't very repeatable...
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 4914.604 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 5186.516 ms
decibel=# select count(*) from i;
  count
---------
 1000000
(1 row)

Time: 5174.418 ms
decibel=# explain analyze select count(*) from i;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5729.623..5729.624 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=1000024 width=0) (actual time=0.155..4039.317 rows=1000000 loops=1)
 Total runtime: 5729.907 ms
(3 rows)

Time: 5732.076 ms
decibel=# explain analyze select count(*) from i;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5916.025..5916.026 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=1000024 width=0) (actual time=0.157..4246.123 rows=1000000 loops=1)
 Total runtime: 5916.261 ms
(3 rows)

Time: 5922.542 ms
decibel=# explain analyze select count(*) from i;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5800.788..5800.789 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=1000024 width=0) (actual time=0.152..4126.781 rows=1000000 loops=1)
 Total runtime: 5801.026 ms
(3 rows)

Time: 5803.070 ms
decibel=# explain analyze select count(*) from i;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual time=5994.285..5994.286 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=1000024 width=0) (actual time=0.156..4341.463 rows=1000000 loops=1)
 Total runtime: 5994.520 ms
(3 rows)

Time: 5996.577 ms
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Tried on two machines.  The first (Turion64 laptop) gives 44-45 ms for
  the SELECT, and 50-51 ms for the EXPLAIN ANALYZE.
 
  The second machine, desktop Celeron 533, gives 197-200 ms for the SELECT
  and 788-790 for the EXPLAIN ANALYZE.  I guess this is the reproduction
  you were looking for.
 
 Do you have oprofile installed on these?  Comparing oprofile results
 might give some more insight where the time is going.

No, I don't.  I'll have a look.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 The powerbook tests were not very repeatable at 100,000 rows, so I
 bumped up to 1M. The results still aren't very repeatable...

Hmm, I suspect you pushed it into the realm where it's doing actual
I/O ... which we don't want for this particular test.  (Although it's
worth noting that if your query *is* doing actual I/O, the EXPLAIN
overhead certainly ought to be down in the noise.)

regards, tom lane



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Jim C. Nasby
Just tested 2 windows boxes, both running 8.1.3 and XP SP2.

P4 2.8G desktop        47ms    297ms
Pentium M 2G laptop    40ms    240ms
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Jim C. Nasby
On Thu, Jun 08, 2006 at 05:32:36PM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  The powerbook tests were not very repeatable at 100,000 rows, so I
  bumped up to 1M. The results still aren't very repeatable...
 
 Hmm, I suspect you pushed it into the realm where it's doing actual
 I/O ... which we don't want for this particular test.  (Although it's
 worth noting that if your query *is* doing actual I/O, the EXPLAIN
 overhead certainly ought to be down in the noise.)

Hmm... I guess it depends on how smart the OS cache is; the table is
36MB (4406 pages), but shared_buffers is 3000...

Bumping shared_buffers up to 5k, there is a bigger difference, but these
numbers are still all over the place, so I don't know that they're very
trustworthy.

decibel=# explain analyze select count(*) from i;
QUERY PLAN  
   
---
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual 
time=5398.625..5398.626 rows=1 loops=1)
   -  Seq Scan on i  (cost=0.00..14405.24 rows=24 width=0) (actual 
time=0.034..3967.927 rows=100 loops=1)
 Total runtime: 5398.871 ms
(3 rows)

Time: 5400.900 ms
decibel=# explain analyze select count(*) from i;
QUERY PLAN  
   
---
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual 
time=5626.671..5626.671 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=24 width=0) (actual 
time=0.035..3875.641 rows=100 loops=1)
 Total runtime: 5626.904 ms
(3 rows)

Time: 5628.896 ms
decibel=# explain analyze select count(*) from i;
QUERY PLAN  
   
---
 Aggregate  (cost=16905.05..16905.06 rows=1 width=0) (actual 
time=5137.549..5137.549 rows=1 loops=1)
   ->  Seq Scan on i  (cost=0.00..14405.24 rows=24 width=0) (actual 
time=0.020..3440.034 rows=100 loops=1)
 Total runtime: 5137.707 ms
(3 rows)

Time: 5139.178 ms
decibel=# select count(*) from i;
  count  
-
 100
(1 row)

Time: 4357.443 ms
decibel=# select count(*) from i;
  count  
-
 100
(1 row)

Time: 4251.208 ms
decibel=# select count(*) from i;
  count  
-
 100
(1 row)

Time: 4712.912 ms
decibel=# select count(*) from i;
  count  
-
 100
(1 row)

Time: 4479.278 ms
decibel=# 
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
 similar to this:
 http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282

Yeah, that's a pretty interesting thread.  I came across something
similar on a Red Hat internal list.  It seems there are three or four
different popular standards for clock hardware in the Intel world,
and some good implementations and some pretty bad implementations
of each.  So the answer may well boil down to "if you're using cheap
junk PC hardware then gettimeofday will be slow".

In fact, I just got around to trying this on my old x86 PC, and
behold:

regression=# select count(*) from tenk1;
 count 
---
 1
(1 row)

Time: 9.670 ms
regression=# explain analyze select count(*) from tenk1;
   QUERY PLAN   


 Aggregate  (cost=470.00..470.01 rows=1 width=0) (actual time=93.549..93.553 
rows=1 loops=1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=1 width=0) (actual 
time=0.014..49.261 rows=1 loops=1)
 Total runtime: 93.663 ms
(3 rows)

Time: 94.431 ms
regression=# 

So that says that this machine takes about 4 usec to do gettimeofday(),
as compared to 3 usec on my nominally 4x slower HPPA machine.
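The per-call cost estimated above can be reproduced with a simple
micro-benchmark. This is a hypothetical sketch, shown here with Python's
`time.perf_counter()` standing in for gettimeofday(); a C loop around
gettimeofday() itself would measure the same thing more directly.

```python
import time

def timer_overhead(calls=100_000):
    """Average cost, in microseconds, of one clock call."""
    start = time.perf_counter()
    for _ in range(calls):
        time.perf_counter()   # the call being measured
    elapsed = time.perf_counter() - start
    return elapsed / calls * 1_000_000

print(f"{timer_overhead():.3f} usec per call")
```

On hardware with a slow clock source, the number this prints is exactly the
overhead EXPLAIN ANALYZE pays twice per node per row.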

The new dual Xeon sitting next to it probably has a much less junky
motherboard, with a better clock ... and I find it unsurprising that
the HP and Apple machines I was trying aren't subject to such problems.

I didn't get the impression from that linux-kernel thread that the
proposed patch had actually gone in yet anyplace; anyone know how to
track that?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Mark Kirkwood

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
similar to this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282


Yeah, that's a pretty interesting thread.  I came across something
similar on a Red Hat internal list.  It seems there are three or four
different popular standards for clock hardware in the Intel world,
and some good implementations and some pretty bad implementations
of each.  So the answer may well boil down to "if you're using cheap
junk PC hardware then gettimeofday will be slow".



OS seems to matter as well - I've got two identical Supermicro P3TDER 
dual intel boxes. 1 running FreeBSD 6.1, one running Gentoo Linux 2.6.16.


Doing the 'select count(*) vs explain analyze select count(*) on 10 
row table gives:


Freebsd : select 108 ms  explain analyze 688 ms
Linux   : select 100 ms  explain analyze 196 ms

Both systems have ACPI enabled in BIOS, which means there is a better 
timecounter than 'i8254' available (FreeBSD says it's using 'ACPI-safe' - 
not sure how to check on Linux).


Cheers

Mark


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-08 Thread Agent M
It's worth noting that on Darwin (on Apple hardware) gettimeofday is 
never a syscall whereas on Linux (AFAIK), it always is.


On Jun 8, 2006, at 7:58 PM, Mark Kirkwood wrote:


Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

Wow, that is slow.  Maybe a problem in the kernel?  Perhaps something
similar to this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/index.html#1282

Yeah, that's a pretty interesting thread.  I came across something
similar on a Red Hat internal list.  It seems there are three or four
different popular standards for clock hardware in the Intel world,
and some good implementations and some pretty bad implementations
of each.  So the answer may well boil down to "if you're using cheap
junk PC hardware then gettimeofday will be slow".


OS seems to matter as well - I've got two identical Supermicro P3TDER 
dual intel boxes. 1 running FreeBSD 6.1, one running Gentoo Linux 
2.6.16.


Doing the 'select count(*) vs explain analyze select count(*) on 
10 row table gives:


Freebsd : select 108 ms  explain analyze 688 ms
Linux   : select 100 ms  explain analyze 196 ms

Both systems have ACPI enabled in BIOS, which means there is a better 
timecounter than 'i8254' available (FreeBSD says it's using 'ACPI-safe' 
- not sure how to check on Linux).


AgentM
[EMAIL PROTECTED]


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Simon Riggs
On Tue, 2006-06-06 at 16:50 -0400, Tom Lane wrote:

 I have a theory about this, and it's not pleasant at all.  What I
 think is that we have a Heisenberg problem here: the act of invoking
 gettimeofday() actually changes what is measured. 

 If this theory is correct, then the entire notion of EXPLAIN ANALYZE
 sampling has just crashed and burned.  We can't ship a measurement
 tool that is accurate on some platforms and not others.

Regrettably, I would agree and so conclude that we shouldn't pursue the
sampling idea further. Heisenbugs suck time like no other. Interesting,
though.

That leaves us with a number of possibilities:
0. Do Nothing
1. Option to skip timing altogether on an EXPLAIN ANALYZE
2. Option to produce a partial execution only, to locate problem areas.

Any others?

Option 2 would be harder to interpret, but still useful - originally
discussed in a current thread on -perform.
Option 1 wouldn't be as useful as the original sampling idea, but if it's
not on the table any longer...

I'd revert back to Option 1 as being the best choice for further work.

Do we agree the idea can't go further? What next?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Do we agree the idea can't go further? What next?

It still needs investigation; I'm no longer convinced that the TLB-flush
theory is correct.  See rest of thread.  We may well have to revert the
current patch, but I'd like to be sure we understand why.

If we do have to revert, I'd propose that we pursue the notion of
interrupt-driven sampling like gprof uses.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Martijn van Oosterhout
On Wed, Jun 07, 2006 at 09:53:32AM -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Do we agree the idea can't go further? What next?
 
 It still needs investigation; I'm no longer convinced that the TLB-flush
 theory is correct.  See rest of thread.  We may well have to revert the
 current patch, but I'd like to be sure we understand why.

One thing I'm thinking of trying is to, instead of assuming we can
estimate the durations of all the nodes by taking the total time
divided by the number of samples, assume that the duration of tuple X
is similar to that of tuple X+1 but not necessarily the same as all
other tuples.

This moves the calculation from EndLoop to StopInstr. Basically in
StopInstr you do the steps:

if( sampling )
{
  x = get time for this tuple
  n = number of tuples skipped

  cumulativetime += x*n
}

This would mean that we wouldn't be assuming that tuples near the end
take as long as tuples near the beginning. Except we're now dealing
with smaller numbers, so I'm worried about error accumulation.
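The StopInstr weighting sketched above can be written out concretely. This
is a hypothetical minimal sketch (the function name and the sample format
are invented for illustration, not the instrument.c API): each sampled
duration is charged for the unsampled iterations since the previous sample.

```python
def estimate_total(samples):
    """Estimate total node time from sparse samples.

    samples: list of (iteration_number, usec) pairs, with 1-based,
    strictly increasing iteration numbers.  A sample's duration is
    assumed to represent the gap of iterations preceding it.
    """
    total = 0
    last_iter = 0
    for it, usec in samples:
        total += usec * (it - last_iter)   # sample covers its gap
        last_iter = it
    return total

# Ten iterations sampled three times: iterations 1-3 cost 100 usec
# each, 4-7 cost 10 each, 8-10 cost 1 each.
print(estimate_total([(3, 100), (7, 10), (10, 1)]))  # 343
```

With per-iteration costs that drift slowly, this tracks the true total well;
the worry stated above is that the many small per-gap products accumulate
error instead of averaging it out.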

 If we do have to revert, I'd propose that we pursue the notion of
 interrupt-driven sampling like gprof uses.

How would that work? You could then estimate how much time was spent in
each node, but you no longer have any idea about when they were entered
or left.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Greg Stark

Martijn van Oosterhout kleptog@svana.org writes:

 This would mean that we wouldn't be assuming that tuples near the end
 take as long as tuples near the beginning. Except we're now dealing
 with smaller numbers, so I'm worried about error accumulation.

Hm, that would explain why Hash joins suffer from this especially. Even when
functioning properly hashes get slower as the buckets fill up and there are
longer lists to traverse. Perhaps the hashes are suffering inordinately from
collisions though. Some of the data type hash functions looked kind of suspect
when I peeked at them a while back.

-- 
greg


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Martijn van Oosterhout kleptog@svana.org writes:
 This would mean that we wouldn't be assuming that tuples near the end
 take as long as tuples near the beginning. Except we're now dealing
 with smaller numbers, so I'm worried about error accumulation.

 Hm, that would explain why Hash joins suffer from this especially. Even when
 functioning properly hashes get slower as the buckets fill up and there are
 longer lists to traverse.

Nope, that is certainly not the explanation, because the hash table is
loaded in the (single) call of the Hash node at the start of the query.
It is static all through the sampled-and-not executions of the Hash Join
node, which is where our problem is.

I don't see that Martijn's idea responds to the problem anyway, if it is
some sort of TLB-related issue.  The assumption we are making is not
"tuples near the end take as long as tuples near the beginning", it is
"tuples we sample take as long as tuples we don't" (both statements of
course meaning "on the average").  If the act of sampling incurs
overhead beyond the gettimeofday() call itself, then we are screwed,
and playing around with which iterations we sample and how we do the
extrapolation won't make the slightest bit of difference.

I'm unsure about the TLB-flush theory because I see no evidence of any
such overhead in the 8.1 timings; but on the other hand it's hard to see
what else could explain the apparent dependence on targetlist width.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Nope, that is certainly not the explanation, because the hash table is
 loaded in the (single) call of the Hash node at the start of the query.
 It is static all through the sampled-and-not executions of the Hash Join
 node, which is where our problem is.

At the risk of asking a stupid question, it's not perchance including that
hash build in the first sample it takes of the hash join node?

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Nope, that is certainly not the explanation, because the hash table is
 loaded in the (single) call of the Hash node at the start of the query.
 It is static all through the sampled-and-not executions of the Hash Join
 node, which is where our problem is.

 At the risk of asking a stupid question, it's not perchance including that
 hash build in the first sample it takes of the hash join node?

Sure.  Which is one of the reasons why the first tuple is excluded from
the extrapolation...

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Ah-hah, I've sussed it.  The faulty assumption can actually be stated
as: all the executions, except maybe the first, will take approximately
the same amount of time.  The failing test case I've been looking at
is one where the system decides to use a batched hash join, and in
this plan type some iterations take much longer than others.  (The
apparent dependence on targetlist width is now easy to understand,
because that affects the estimated hashtable size and hence the decision
whether batching is needed.  I'm not sure why I don't see the effect
when running the identical case on my other machine, but since the
other one is a 64-bit machine its space calculations are probably a
bit different.)

I added some printf's to instrument.c to print out the actual time
measurements for each sample, as well as the calculations in
InstrEndLoop.  Attached are the printouts that occurred for the HashJoin
node.  The thing that is killing the extrapolation is of course the very
large time for iteration 2, which the extrapolation includes in its
average.  But there's well over 10:1 variation in the later samples
as well.

On reflection it's easy to imagine other cases where some iterations
take much longer than others in a not-very-predictable way.  For
instance, a join where only a subset of the outer tuples have a match
is going to act that way.  I don't think there's any good way that we
can be sure we have a representative sample of executions, and so I'm
afraid this approach to sampling EXPLAIN ANALYZE is a failure.
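The failure mode just described can be illustrated with toy numbers
(hypothetical, not the actual instrument.c extrapolation): when per-iteration
costs are highly skewed, the mean of whichever iterations happen to be
sampled can miss the true total by orders of magnitude.

```python
def extrapolate(samples, total_iters):
    """Naive extrapolation: average sampled cost times iteration count."""
    return sum(usec for _, usec in samples) / len(samples) * total_iters

# 100 iterations: iteration 2 takes 110000 usec (a batch load, say),
# the remaining 99 take 50 usec each.
costs = {i: 50 for i in range(1, 101)}
costs[2] = 110000
true_total = sum(costs.values())                       # 114950 usec

# If sampling happens to include expensive iteration 2:
est = extrapolate([(2, costs[2]), (50, 50), (90, 50)], 100)
print(true_total, est)   # 114950 vs 3670000.0 -- wildly overestimated
```

No choice of which iterations to sample fixes this in general, since the
expensive iterations are not predictable in advance.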

I propose we revert this patch and think about an interrupt-driven
sampling method instead.

regards, tom lane


401489a0 1: 331616 usec in iter 1
401489a0 1: 110338 usec in iter 2
401489a0 1: 54 usec in iter 3
401489a0 1: 99 usec in iter 4
401489a0 1: 77 usec in iter 5
401489a0 1: 145 usec in iter 6
401489a0 1: 117 usec in iter 7
401489a0 1: 33 usec in iter 8
401489a0 1: 97 usec in iter 9
401489a0 1: 98 usec in iter 10
401489a0 1: 52 usec in iter 11
401489a0 1: 33 usec in iter 12
401489a0 1: 51 usec in iter 13
401489a0 1: 83 usec in iter 14
401489a0 1: 153 usec in iter 15
401489a0 1: 115 usec in iter 16
401489a0 1: 52 usec in iter 17
401489a0 1: 242 usec in iter 18
401489a0 1: 48 usec in iter 19
401489a0 1: 87 usec in iter 20
401489a0 1: 23 usec in iter 21
401489a0 1: 80 usec in iter 22
401489a0 1: 57 usec in iter 23
401489a0 1: 17 usec in iter 24
401489a0 1: 51 usec in iter 25
401489a0 1: 18 usec in iter 26
401489a0 1: 16 usec in iter 27
401489a0 1: 100 usec in iter 28
401489a0 1: 45 usec in iter 29
401489a0 1: 174 usec in iter 30
401489a0 1: 131 usec in iter 31
401489a0 1: 17 usec in iter 32
401489a0 1: 45 usec in iter 33
401489a0 1: 16 usec in iter 34
401489a0 1: 120 usec in iter 35
401489a0 1: 15 usec in iter 36
401489a0 1: 17 usec in iter 37
401489a0 1: 15 usec in iter 38
401489a0 1: 48 usec in iter 39
401489a0 1: 127 usec in iter 40
401489a0 1: 36 usec in iter 41
401489a0 1: 41 usec in iter 42
401489a0 1: 69 usec in iter 43
401489a0 1: 50 usec in iter 44
401489a0 1: 104 usec in iter 45
401489a0 1: 22 usec in iter 46
401489a0 1: 50 usec in iter 47
401489a0 1: 17 usec in iter 48
401489a0 1: 47 usec in iter 49
401489a0 1: 54 usec in iter 50
401489a0 1: 46 usec in iter 51
401489a0 1: 20 usec in iter 54
401489a0 1: 38 usec in iter 55
401489a0 1: 68 usec in iter 56
401489a0 1: 17 usec in iter 60
401489a0 1: 16 usec in iter 61
401489a0 1: 15 usec in iter 67
401489a0 1: 31 usec in iter 68
401489a0 1: 15 usec in iter 70
401489a0 1: 61 usec in iter 78
401489a0 1: 143 usec in iter 85
401489a0 1: 21 usec in iter 89
401489a0 1: 14 usec in iter 96
401489a0 1: 21 usec in iter 104
401489a0 1: 21 usec in iter 107
401489a0 1: 16 usec in iter 116
401489a0 1: 194 usec in iter 118
401489a0 1: 136 usec in iter 122
401489a0 1: 34 usec in iter 127
401489a0 1: 46 usec in iter 131
401489a0 1: 15 usec in iter 133
401489a0 1: 15 usec in iter 135
401489a0 1: 34 usec in iter 137
401489a0 1: 54 usec in iter 142
401489a0 1: 206 usec in iter 151
401489a0 1: 75 usec in iter 162
401489a0 1: 20 usec in iter 172
401489a0 1: 66 usec in iter 177
401489a0 1: 21 usec in iter 181
401489a0 1: 69 usec in iter 186
401489a0 1: 16 usec in iter 193
401489a0 1: 46 usec in iter 201
401489a0 1: 33 usec in iter 210
401489a0 1: 50 usec in iter 216
401489a0 1: 21 usec in iter 222
401489a0 1: 18 usec in iter 224
401489a0 1: 33 usec in iter 229
401489a0 1: 20 usec in iter 232
401489a0 1: 44 usec in iter 236
401489a0 1: 29 usec in iter 239
401489a0 1: 34 usec in iter 240
401489a0 1: 31 usec in iter 241
401489a0 1: 27 usec in iter 254
401489a0 1: 45 usec in iter 257
401489a0 1: 147 usec in iter 259
401489a0 1: 15 usec in iter 269
401489a0 1: 16 usec in iter 278
401489a0 1: 14 usec in iter 279
401489a0 1: 58 usec in iter 290
401489a0 1: 15 usec in iter 291
401489a0 1: 53 usec in iter 295
401489a0 1: 15 usec in iter 306
401489a0 1: 16 usec in iter 318
401489a0 1: 34 usec in iter 328
401489a0 1: 37 usec in iter 339
401489a0 1: 28 

Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Martijn van Oosterhout
On Wed, Jun 07, 2006 at 03:32:35PM -0400, Tom Lane wrote:
 On reflection it's easy to imagine other cases where some iterations
 take much longer than others in a not-very-predictable way.  For
 instance, a join where only a subset of the outer tuples have a match
 is going to act that way.  I don't think there's any good way that we
 can be sure we have a representative sample of executions, and so I'm
 afraid this approach to sampling EXPLAIN ANALYZE is a failure.

I don't think we ever assumed it would never be a problem. We just
assumed that the sampling would cancel the effect out to give a decent
average.

Thing is, I never expected to get a three-orders-of-magnitude difference
between samples. That's just far too much to be corrected in any way.
The random sampling should counter most effects, and I didn't consider
the one tuple in a million that takes much longer to be a particularly
common case.

Anyway, as a test, if you take the approach that the measurement at
item X only applies to the tuples immediately preceding it, for the
data you posted you get a result of 0.681148 seconds. How long did that
query run that produced that data?

(The bit of perl I used is:

cat data | perl -lne 'BEGIN { $last=0; $sum=0 }
/: (\d+) usec in iter (\d+)/ && do { $sum += ($2-$last)*$1; $last=$2 };
END { print "$sum\n" }' )

 I propose we revert this patch and think about an interrupt-driven
 sampling method instead.

That's another possibility ofcourse...
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Anyway, as a test, if you take the approach that the measurement at
 item X only applies to the tuples immediately preceding it, for the
 data you posted you get a result of 0.681148 seconds. How long did that
 query run that produced that data?

I didn't save the corresponding printout unfortunately, but it was
probably pretty similar to this:

regression=# explain analyze select count(*) from (select * from tenk1 a join 
tenk1 b on a.unique1 = b.unique2 offset 0) ss;
  QUERY PLAN
--
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual 
time=869.395..869.399 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=1 width=488) (actual 
time=248.640..3368.313 rows=1 loops=1)
 ->  Hash Join  (cost=825.00..2484.00 rows=1 width=488) (actual 
time=248.609..2983.528 rows=1 loops=1)
   Hash Cond: (a.unique1 = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=1 
width=244) (actual time=0.084..21.525 rows=1 loops=1)
   ->  Hash  (cost=458.00..458.00 rows=1 width=244) (actual 
time=248.269..248.269 rows=1 loops=1)
 ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=1 
width=244) (actual time=0.025..22.760 rows=1 loops=1)
 Total runtime: 877.265 ms
(8 rows)

Time: 888.469 ms
regression=#

The above idea won't fix it anyway, only move the failure cases around.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Jim C. Nasby
On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote:
 timer interrupt routine does this once every few milliseconds:

One issue is that on very fast queries, you'd get absolutely no data
this way. A possible solution would be to start with an extremely short
timer interval, and ramp it up if the query runs longer. The downside is
that you'd need to somehow re-scale iterations every time the timer
interval changed. Of course another option is to start off using the
gettimeofday() method and switch to sampling after X seconds, but it'd
be nice if both code paths weren't needed.
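The interrupt-sampling bookkeeping under discussion can be sketched with a
deterministic simulation (hypothetical structure, not PostgreSQL source):
each timer tick charges one sample to whichever plan node is currently
executing, and a node's time is then estimated as its share of ticks times
the exactly measured wall-clock total.

```python
# Timeline of which node is running during each 1-ms tick of a query.
timeline = ["SeqScan"] * 40 + ["HashJoin"] * 50 + ["Aggregate"] * 10

def sample_counts(timeline, every=5):
    """Count samples per node, taking one sample every `every` ticks."""
    counts = {}
    for tick in range(0, len(timeline), every):
        node = timeline[tick]
        counts[node] = counts.get(node, 0) + 1
    return counts

counts = sample_counts(timeline)
total_ms = len(timeline)          # wall-clock total is measured exactly
n = sum(counts.values())
estimates = {node: c * total_ms / n for node, c in counts.items()}
print(estimates)   # SeqScan ~40 ms, HashJoin ~50 ms, Aggregate ~10 ms
```

This shows both the appeal (no per-row clock call) and the weakness raised
above: a query shorter than a few ticks collects too few samples to say
anything, which is why ramping the interval or falling back to per-row
timing comes up.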

 The bubble-up of sample counts to parent nodes could perhaps be done
 while printing the results instead of on-the-fly as sketched above, but
 the above seems simpler.

It'd be nice if there was an option to not aggregate child runtimes to
their parents at all, since it'd make spotting hot spots much easier.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Simon Riggs
On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
 Ah-hah, I've sussed it. 

 so I'm
 afraid this approach to sampling EXPLAIN ANALYZE is a failure.

Hmmm, I thought we already got that bit...sorry to sound negative.

 I propose we revert this patch and think about an interrupt-driven
 sampling method instead.

I don't have much more faith in crazy scheme No.2 either. (Mine or
yours...)

Can we just have an option to avoid the timing altogether, please? I
don't want to have long discussions about instrumentation, I just want a
reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
that we never, ever have to doubt whether the sampling works correctly
on a Miasmic-367 with HyperKooling. You could lose a month on -perform
going into the details of this for everybody - this was supposed to be a
simple additional feature.

If you're set on the sampling, great, but can we have the option to
avoid it completely also?

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
 I propose we revert this patch and think about an interrupt-driven
 sampling method instead.

 I don't have much more faith in crazy scheme No.2 either. (Mine or
 yours...)

 Can we just have an option to avoid the timing altogether, please? I
 don't want to have long discussions about instrumentation, I just want a
 reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
 that we never, ever have to doubt whether the sampling works correctly
 on a Miasmic-367 with HyperKooling.

Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine.
People have been hyperventilating about the timing overhead but I think
that it's perfectly acceptable as-is.  Certainly the removal of timing
is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
acceptable one; what it *is* likely to do is let you be misled about
which part of the query is the problem.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Wed, Jun 07, 2006 at 11:34:30AM -0400, Tom Lane wrote:
 timer interrupt routine does this once every few milliseconds:

 One issue is that on very fast queries, you'd get absolutely no data
 this way.

Yeah.  Of course, on very fast queries you don't get very good data
from the existing method either --- with a machine fast enough to have
sub-microsecond plan node execution times, the accuracy of gettimeofday
really isn't good enough.

The other thing that was bothering me was whether disk I/O might be
undercounted by an interrupt-driven method.  I kinda doubt that any
kernel will save up N interrupts that occur while the process is blocked
on a slow read() ... you'll probably get only one.  So the whole idea
may be unworkable.

At the moment I think we need to revert the recent patch and go back
to the drawing board.  Improving on the original implementation of 
EXPLAIN ANALYZE is clearly not as easy as it looks.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Martijn van Oosterhout
On Wed, Jun 07, 2006 at 04:04:33PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Anyway, as a test, if you take the approach that the measurement at
  item X only applies to the tuples immediately preceding it, for the
  data you posted you get a result of 0.681148 seconds. How long did that
  query run that produced that data?

[snip]

 The above idea won't fix it anyway, only move the failure cases around.

Well, if we're trying to make a system that never ever produces
strange-looking results, then yes, we have to drop it. No matter how you
sample, there's going to be a failure mode somewhere.

If you are prepared to live with a certain margin (it will be within X%
of the real value Y% of the time) then I think that's an achievable
goal (I'm not saying that's necessarily what we have now). You could
always give people the choice of disabling sampling if it looks weird,
but that just gives people more knobs to twiddle and get upset about.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Simon Riggs
On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Wed, 2006-06-07 at 15:32 -0400, Tom Lane wrote:
  I propose we revert this patch and think about an interrupt-driven
  sampling method instead.
 
  I don't have much more faith in crazy scheme No.2 either. (Mine or
  yours...)
 
  Can we just have an option to avoid the timing altogether, please? I
  don't want to have long discussions about instrumentation, I just want a
  reasonably useful EXPLAIN ANALYZE in a reasonable amount of time - one
  that we never, ever have to doubt whether the sampling works correctly
  on a Miasmic-367 with HyperKooling.
 
 Frankly, I think the pre-existing version of EXPLAIN ANALYZE is fine.

I respect everybody's opinion, yours doubly so, as you know. But with
this current opinion, on this occasion, I believe you are in the
minority of those who have spent some time asking for EXPLAIN ANALYSEs
and have not received them because the run time is unacceptable.

 People have been hyperventilating about the timing overhead but I think
 that it's perfectly acceptable as-is.  

The phrase "hyperventilating" is subjective. From my side, I could use the
phrase "explaining-the-obvious" and potentially cause rankle also. We
should be able to discuss things without that. 

 Certainly the removal of timing
 is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
 acceptable one; 

I disagree, as have others.

 what it *is* likely to do is let you be misled about
 which part of the query is the problem.

A full EXPLAIN ANALYZE is always desirable - we agree on that. The
question is what we do when one is not available. Guessing leaves you
much more open to being misled. I'd like an option that we can take when
a full EXPLAIN ANALYZE is not possible, one that doesn't give different
results on different CPUs/OS.

We need this to help people who are in difficulty. This isn't a vanity
feature or a they-do-it-so-we-should-also. It's for support.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Wed, 2006-06-07 at 16:56 -0400, Tom Lane wrote:
 Certainly the removal of timing
 is not going to convert an intolerable EXPLAIN ANALYZE runtime into an
 acceptable one; 

 I disagree, as have others.

The overhead seems to be on the order of a couple tens of percent usually.
I don't see how that makes the difference between an EXPLAIN ANALYZE you
can run and one you can't.

 A full EXPLAIN ANALYZE is always desirable - we agree on that. The
 question is what we do when one is not available.

The least bad alternative I've heard is to let EXPLAIN ANALYZE print
out stats-so-far if the query is canceled by control-C or statement
timeout.  The objection to this is you may mistake startup transients
for full query behavior ... but at least the numbers will be good as
far as they go.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-07 Thread Jim C. Nasby
On Wed, Jun 07, 2006 at 05:05:23PM -0400, Tom Lane wrote:
 The other thing that was bothering me was whether disk I/O might be
 undercounted by an interrupt-driven method.  I kinda doubt that any
 kernel will save up N interrupts that occur while the process is blocked
 on a slow read() ... you'll probably get only one.  So the whole idea
 may be unworkable.

True, but if you get to the point where you're waiting on I/O, I would
think you could afford a gettimeofday() call. Even if the block we need
is in the OS cache, it's possible that the overhead of getting it from
there is high enough that the gettimeofday() call won't matter. FWIW, it
looks like running a much larger setting for shared_buffers (like, 50%
of memory) is a good way to boost performance, and that configuration
would make it less likely that you'd do a gettimeofday just to pull data
out of the OS cache.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Martijn van Oosterhout
On Mon, Jun 05, 2006 at 11:02:33PM -0400, Tom Lane wrote:
 Just got this rather surprising result:

snip bogus explain output

 The Total runtime is correct AFAICT, which puts the top node's actual
 time rather far out in left field.  This is pretty repeatable on my old
 slow HPPA machine.  A new Xeon shows less of a discrepancy, but it's
 still claiming top node actual > total, which is not right.

Weird. Can you get the output of *instr in each call of
InstrEndLoop? Preferably after it does the calculation but before it
clears the values. So we get an idea of number of samples and what it
guesses. SampleOverhead would be good too.

I know my version produced sensible results on my machine and the
handful of people testing, so I'll try it again with your changes, see
how it looks...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Weird. Can you get the output of *instr in each call of
 InstrEndLoop? Preferably after it does the calculation but before it
 clears the values. So we get an idea of number of samples and what it
 guesses. SampleOverhead would be good too.

The problem looks to be an underestimation of SampleOverhead, and on
reflection it's clear why: what CalculateSampleOverhead is measuring
isn't the total overhead, but the time between the two gettimeofday
calls.  Which is probably about half the true overhead.  What we
ought to do is iterate InstrStartNode/InstrStopNode N times, and
*separately* measure the total elapsed time spent.

It occurs to me that what we really want to know is not so much the
total time spent in InstrStartNode/InstrStopNode, as the difference in
the time spent when sampling is on vs when it is off.  I'm not quite
sure if the time spent when it's off is negligible.  Off to do some
measuring ...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Tom Lane
I wrote:
 The problem looks to be an underestimation of SampleOverhead, and on
 reflection it's clear why: what CalculateSampleOverhead is measuring
 isn't the total overhead, but the time between the two gettimeofday
 calls.  Which is probably about half the true overhead.

On further thought, I take that back: the true overhead is not the
point here.  The time elapsed during a plan node execution when
sampling can be broken down into three phases:
time before first gettimeofday call
time between gettimeofday calls
time after second gettimeofday call
Only the second part is actually measured by the instrumentation system;
the other parts are overhead that has never been counted by EXPLAIN
ANALYZE, sampling version or no.  Moreover, most of the runtime of
InstrStartNode and InstrStopNode falls into the first or third parts.

What we would actually like to set SampleOverhead to is the portion
of the second-part runtime that doesn't occur when sampling = false.
Assuming that gettimeofday() has consistent runtime and the actual
time reported is measured at a consistent instant within that runtime,
I believe that we should take the SampleOverhead as just equal to
the runtime of a single gettimeofday() call.  The added or removed
second-part time within InstrStartNode is just the tail time of
gettimeofday, and the added or removed second-part time within
InstrStopNode is basically just the head time of gettimeofday.  (To make
this as true as possible, we need to change the order of operations so
that gettimeofday is invoked *immediately* after the if (sampling)
test, but that's easy.)

So this line of thought leads to the conclusion that
CalculateSampleOverhead is actually overestimating SampleOverhead
a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().

But that still leaves me with a problem because my machine is clearly
overestimating the correction needed.  I added some printouts and got

raw totaltime = 0.370937
per_iter = 0.000156913, SampleOverhead = 3.28e-06
adj totaltime = 1.82976
sampling = 0
starttime = 0/00
counter = 0/370937
firsttuple = 0.258321
tuplecount = 1
itercount = 10001
samplecount = 704
nextsample = 10011
startup = 0.258321
total = 1.82976
ntuples = 1
nloops = 1

on a run with an actual elapsed time near 750 msec.  Clearly the
sampling adjustment is wrong, but why?

I have a theory about this, and it's not pleasant at all.  What I
think is that we have a Heisenberg problem here: the act of invoking
gettimeofday() actually changes what is measured.  That is, the
runtime of the second part of ExecProcNode is actually longer when
we sample than when we don't, not merely due to the extra time spent
in gettimeofday().  It's not very hard to guess at reasons why, either.
The kernel entry is probably flushing some part of the CPU's state,
such as virtual/physical address mapping for the userland address
space.  After returning from the kernel call, the time to reload
that state shows up as more execution time within the second part.

This theory explains two observations that otherwise are hard to
explain.  One, that the effect is platform-specific: your machine
may avoid flushing as much state during a kernel call as mine does.
And two, that upper plan nodes seem much more affected than lower
ones.  That makes sense because the execution cycle of an upper node
will involve touching more userspace data than a lower node, and
therefore more of the flushed TLB entries will need to be reloaded.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE
sampling has just crashed and burned.  We can't ship a measurement
tool that is accurate on some platforms and not others.

I'm wondering if it's at all practical to go over to a gprof-like
measurement method for taking EXPLAIN ANALYZE runtime measurements;
that is, take an interrupt every so often and bump the count for the
currently active plan node.  This would spread the TLB-flush overhead
more evenly and thus avoid introducing that bias.  There may be too much
platform dependency in this idea, though, and I also wonder if it'd
break the ability to do normal gprof profiling of the backend.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Jim C. Nasby
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
 I have a theory about this, and it's not pleasant at all.  What I
 think is that we have a Heisenberg problem here: the act of invoking
 gettimeofday() actually changes what is measured.  That is, the
 runtime of the second part of ExecProcNode is actually longer when
 we sample than when we don't, not merely due to the extra time spent
 in gettimeofday().  It's not very hard to guess at reasons why, either.
 The kernel entry is probably flushing some part of the CPU's state,
 such as virtual/physical address mapping for the userland address
 space.  After returning from the kernel call, the time to reload
 that state shows up as more execution time within the second part.
 
 This theory explains two observations that otherwise are hard to
 explain.  One, that the effect is platform-specific: your machine
 may avoid flushing as much state during a kernel call as mine does.
 And two, that upper plan nodes seem much more affected than lower
 ones.  That makes sense because the execution cycle of an upper node
 will involve touching more userspace data than a lower node, and
 therefore more of the flushed TLB entries will need to be reloaded.

If that's the case, then maybe a more sophisticated method of measuring
the overhead would work. My thought is that on the second call to pull a
tuple from a node (second because the first probably has some anomalies
due to startup), we measure the overhead for that node. This would
probably mean doing the following:
get start time # I'm not referring to this as gettimeofday to avoid
   # confusion
gettimeofday() # this is the gettimeofday call that will happen during
   # normal operation
get end time

Hopefully, there's no caching effect that would come into play from not
actually touching any of the data structures after the gettimeofday()
call. If that's not the case, it makes measuring the overhead more
complex, but I think it should still be doable...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 If that's the case, then maybe a more sopdisticated method of measuring
 the overhead would work.

I think you missed the point: the time spent in gettimeofday() itself
is not the major overhead of EXPLAIN ANALYZE.  At least it appears that
this is the case on my machine.

I'm thinking that interrupt-driven sampling might work OK though.
My previous worries were based on trying to use the ITIMER_PROF
timer, which might not be portable and would conflict with gprof
anyway.  But EXPLAIN ANALYZE has always been interested in real time
rather than CPU time, so the correct interrupt to use is ITIMER_REAL.
That means we only have to share with our own existing usages of that
interrupt, which turns it from a portability issue into just a Small
Matter Of Programming.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Martijn van Oosterhout
On Tue, Jun 06, 2006 at 04:50:28PM -0400, Tom Lane wrote:
 But that still leaves me with a problem because my machine is clearly
 overestimating the correction needed.  I added some printouts and got
 
 raw totaltime = 0.370937
 per_iter = 0.000156913, SampleOverhead = 3.28e-06
 adj totaltime = 1.82976
 sampling = 0
 starttime = 0/00
 counter = 0/370937
 firsttuple = 0.258321
 tuplecount = 1
 itercount = 10001
 samplecount = 704
 nextsample = 10011
 startup = 0.258321
 total = 1.82976
 ntuples = 1
 nloops = 1
 
 on a run with an actual elapsed time near 750 msec.  Clearly the
 sampling adjustment is wrong, but why?

This doesn't make any sense at all. How can a sampling run that only
sampled 7% of the actual tuples, end up with an actual measured time
that's more than 50% of the actual final runtime?

Can you get an estimate of the actual overhead (ie compare wall clock
time of EXPLAIN ANALYZE vs actual query). The cost of gettimeofday() is
on the order of a few hundred cycles, I'm not sure about the cost of
TLB flushes (that's probably highly architecture dependent).

To be honest, I wonder about caching effects, but for the disks. The
first few cycles of any plan (like an index scan) is going to incur
costs that won't happen later on. Because we sample much more heavily
at the beginning rather than the end, this will bias towards higher
numbers. You should be able to see this by seeing if running queries
that don't require disk access fare better.

That would suggest a much more careful correction method that works for
non-linear timing patterns...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 This doesn't make any sense at all. How can a sampling run that only
 sampled 7% of the actual tuples, end up with an actual measured time
 that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact
taking longer than supposedly-equivalent unsampled ones, and by a pretty
good percentage too.  I'm growing unsure again about the mechanism
responsible for that, however.

 Can you get an estimate of the actual overhead (ie compare wall clock
 time of EXPLAIN ANALYZE vs actual query). The cost of gettimeofday() is
 on the order of a few hundred cycles, I'm not sure about the cost of
 TLB flushes (that's probably highly architecture dependent).

Here's some examples.  Keep in mind I've already determined that
gettimeofday() takes about 3 usec on this hardware ...

regression=# explain analyze select count(*) from (select * from tenk1 a join 
tenk1 b on a.unique1 = b.unique2 offset 0) ss;
 QUERY PLAN 
 
-
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual 
time=797.412..797.416 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=1 width=488) (actual 
time=208.208..2576.528 rows=1 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=1 width=488) (actual 
time=208.190..2082.577 rows=1 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=1 
width=244) (actual time=0.082..3.718 rows=1 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=1 width=244) (actual 
time=207.933..207.933 rows=1 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=1 
width=244) (actual time=0.017..3.583 rows=1 loops=1)
 Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on 
a.unique1 = b.unique2 offset 0) ss;
 count 
---
 1
(1 row)

Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite
a bit, probably because of the random variation we're using in sampling
interval.  This particular combination was unusually close.  But in any
case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small
here; the problem is that we're incorrectly extrapolating the measured
runtime to the unmeasured executions.

What's especially interesting is that the excess time doesn't seem to
show up if I form the query in a way that doesn't require pushing as
much data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join 
tenk1 b on a.unique1 = b.unique2) ss;
 QUERY PLAN 
 
-
 Aggregate  (cost=1341.00..1341.01 rows=1 width=0) (actual 
time=212.313..212.317 rows=1 loops=1)
   ->  Hash Join  (cost=483.00..1316.00 rows=1 width=0) (actual 
time=88.061..160.886 rows=1 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=1 width=4) 
(actual time=0.071..4.068 rows=1 loops=1)
         ->  Hash  (cost=458.00..458.00 rows=1 width=4) (actual 
time=87.862..87.862 rows=1 loops=1)
               ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=1 width=4) 
(actual time=0.031..4.780 rows=1 loops=1)
 Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on 
a.unique1 = b.unique2) ss;
 count 
---
 1
(1 row)

Time: 202.531 ms
regression=# 

(Without the OFFSET 0, the planner flattens the subquery and discovers
that it doesn't actually need to fetch any of the non-join-key table
columns.)  Note the only plan nodes showing whacked-out timings are the
ones returning wide tuples (large width values).  I'm not entirely
sure what to make of this.  It could be interpreted as evidence for my
theory about TLB reloads during userspace data access being the problem.
But I'm getting a bit disenchanted with that theory after running the
same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join 
tenk1 b on a.unique1 = b.unique2 offset 0) ss;
  QUERY PLAN
  
--
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual 
time=1033.866..1033.870 rows=1 

Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 And two, that upper plan nodes seem much more affected than lower
 ones.  That makes sense because the execution cycle of an upper node
 will involve touching more userspace data than a lower node, and
 therefore more of the flushed TLB entries will need to be reloaded.

I would have expected the opposite effect. If you only execute one instruction
then the cache miss can make it take many times longer than normal. But as the
number of instructions grows the cache gets repopulated and the overhead
levels off and becomes negligible relative to the total time.


The other option aside from gprof-like profiling would be to investigate those
cpu timing instructions again. I know some of them are unsafe on multi-cpu
systems but surely there's a solution out there. It's not like there aren't a
million games, music playing, and other kewl kid toys that depend on accurate
low overhead timing these days.


-- 
greg


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-06 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 And two, that upper plan nodes seem much more affected than lower
 ones.  That makes sense because the execution cycle of an upper node
 will involve touching more userspace data than a lower node, and
 therefore more of the flushed TLB entries will need to be reloaded.

 I would have expected the opposite effect. If you only execute one instruction
 then the cache miss can make it take many times longer than normal. But as the
 number of instructions grows the cache gets repopulated and the overhead
 levels off and becomes negligible relative to the total time.

Well, none of our plan nodes are in the one instruction regime ;-).
I was thinking that the total volume of data accessed was the critical
factor.  Right at the moment I'm disillusioned with the TLB-access
theory though.

Something I'm noticing right now is that it seems like only hash joins
are really seriously misestimated --- nest and merge joins have some
small issues but only the hash is way out there.  What's going on??
Can anyone else reproduce this?

 The other option aside from gprof-like profiling would be to
 investigate those cpu timing instructions again. I know some of them
 are unsafe on multi-cpu systems but surely there's a solution out
 there. It's not like there aren't a million games, music playing, and
 other kewl kid toys that depend on accurate low overhead timing these
 days.

Yeah, and they all work only on Windoze and Intel chips :-(

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[HACKERS] That EXPLAIN ANALYZE patch still needs work

2006-06-05 Thread Tom Lane
Just got this rather surprising result:

regression=# \timing
Timing is on.
regression=# explain analyze select * from tenk1 a join tenk1 b on a.unique1 = 
b.unique2;
QUERY PLAN  
  
--
 Hash Join  (cost=825.00..2484.00 rows=1 width=488) (actual 
time=250.510..2557.129 rows=1 loops=1)
   Hash Cond: (a.unique1 = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.084..30.070 rows=1 loops=1)
   ->  Hash  (cost=458.00..458.00 rows=1 width=244) (actual 
time=250.242..250.242 rows=1 loops=1)
         ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=1 width=244) 
(actual time=0.019..23.317 rows=1 loops=1)
 Total runtime: 847.855 ms
(6 rows)

Time: 856.179 ms
regression=# 

The Total runtime is correct AFAICT, which puts the top node's actual
time rather far out in left field.  This is pretty repeatable on my old
slow HPPA machine.  A new Xeon shows less of a discrepancy, but it's
still claiming top node actual > total, which is not right.

I realize that the sampling code can't be expected to be exactly right,
but a factor-of-three error is not acceptable.  One of us broke this :-(

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org