Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Greg Smith

Michael Nolan wrote:
Based on last year's discussion of this TODO item, it seems thoughts 
have been focused on estimating how much data is
being satisfied from PG's shared buffers.  However, I think that's 
only part of the problem.  


Sure, but neither it nor what you're talking about are the real gating 
factor on making an improvement here.  Figuring out how to expose all 
this information to the optimizer so it can use it when planning is the 
hard part.  Collecting a read time profile is just one of the many ways 
you can estimate what's in cache, and each of the possible methods has 
good and bad trade-offs.  I've been suggesting that people assume that's 
a solved problem--I'm pretty sure what you're proposing was done by Greg 
Stark once and a prototype built even--and instead ask what you're going 
to do next if you had this data.


This data would probably need to be kept separately for each table or 
index, as some tables or indexes
may be mostly or fully in cache or on faster physical media than 
others, although in the absence of other
data about a specific table or index, data about other relations in 
the same tablespace might be of some use. 


This is the important part.  Model how the data needs to get stored such 
that the optimizer can make decisions using it, and I consider it easy 
to figure out how it will get populated later.  There are actually 
multiple ways to do it, and it may end up being something people plug-in 
an implementation that fits their workload into, rather than just having 
one available.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:17 AM, Greg Smith g...@2ndquadrant.com wrote:
 This data would probably need to be kept separately for each table or
 index, as some tables or indexes
 may be mostly or fully in cache or on faster physical media than others,
 although in the absence of other
 data about a specific table or index, data about other relations in the
 same tablespace might be of some use.

 This is the important part.  Model how the data needs to get stored such
 that the optimizer can make decisions using it, and I consider it easy to
 figure out how it will get populated later.

I basically agree.  There have been several recent discussions of this
topic on both -hackers and -performance; it is likely that the TODO
needs to be updated with some more recent links.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Michael Nolan
On Wed, May 25, 2011 at 11:18 AM, Robert Haas robertmh...@gmail.com wrote:


 I basically agree.  There have been several recent discussions of this
 topic on both -hackers and -performance; it is likely that the TODO
 needs to be updated with some more recent links.


Anything to help the NKOTB to get up to speed would be appreciated, though I
still think it is not just a 'caching' issue.

The question I hesitated to ask in Ottawa was:  So, what information would
you like and what would you do with it?
--
Mike Nolan


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 8:37 PM, Michael Nolan htf...@gmail.com wrote:
 On Wed, May 25, 2011 at 11:18 AM, Robert Haas robertmh...@gmail.com wrote:

 I basically agree.  There have been several recent discussions of this
 topic on both -hackers and -performance; it is likely that the TODO
 needs to be updated with some more recent links.

 Anything to help the NKOTB to get up to speed would be appreciated, though I
 still think it is not just a 'caching' issue.

 The question I hesitated to ask in Ottawa was:  So, what information would
 you like and what would you do with it?

*scratches head*

I'm not sure I understand your question.  Can you elaborate?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-24 Thread Michael Nolan
In the TODO list is this item:

*Modify the planner to better estimate caching effects
*
Tom mentioned this in his presentation at PGCON, and I also chatted with Tom
about it briefly afterwards.

Based on last year's discussion of this TODO item, it seems thoughts have
been focused on estimating how much data is
being satisfied from PG's shared buffers.  However, I think that's only part
of the problem.

Specifically, read performance is going to be affected by:

1.  Reads fulfilled from shared buffers.
2.  Reads fulfilled from system cache.
3.  Reads fulfilled from disk controller cache.
4.  Reads from physical media.

#4 is further complicated by the type of physical media for that specific
block.  For example, reads that can
be fulfilled from a SSD are going to be much faster than ones that access
hard drives (or even slower types of media.)

System load is going to impact all of these as well.

Therefore, I suggest that an alternative to the above TODO may be to gather
performance data without knowing
(or more importantly without needing to know) which of the above sources
fulfilled the read.

This data would probably need to be kept separately for each table or index,
as some tables or indexes
may be mostly or fully in cache or on faster physical media than others,
although in the absence of other
data about a specific table or index, data about other relations in the same
tablespace might be of some use.

Tom mentioned that the cost of doing multiple system time-of-day calls for
each block read might be
prohibitive, it may also be that the data may also be too coarse on some
systems to be truly useful
(eg, the epoch time in seconds.)

If this data were available, that could mean that successive plans for the
same query could have
significantly different plans (and thus actual performance), based on what
has happened recently,
so these statistics would have to be relatively short term and updated
frequently, but without becoming
computational bottlenecks.

The problem is one I'm interested in working on.
--
Mike Nolan