Re: [PERFORM] [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-11-01 Thread Simon Riggs
On Wed, 2004-10-27 at 01:39, Josh Berkus wrote:
 Thomas,
 
  As a result, I was intending to inflate the value of
  effective_cache_size to closer to the amount of unused RAM on some of
  the machines I admin (once I've verified that they all have a unified
  buffer cache). Is that correct?
 
 Currently, yes.  

I now believe the answer to that is no, that is not fully correct,
following investigation into how to set that parameter correctly.

 Right now, e_c_s is used just to inform the planner and make 
 index vs. table scan and join order decisions.

Yes, I agree that is what e_c_s is used for.

...let's go deeper:

effective_cache_size is used to calculate the number of I/Os required to
index scan a table, which varies according to the size of the available
cache (whether this be OS cache or shared_buffers). The reason for doing
this is that whether a table is in cache can make a very great
difference to access times; *small* tables tend to be the ones that vary
most significantly. PostgreSQL currently uses the Mackert and Lohman
[1989] equation to assess how much of a table is in cache in a blocked
DBMS with a finite cache. 
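
For reference, my reading of the comments in costsize.c is that the
estimate comes out roughly as follows (a paraphrase, not the actual
source; b here is effective_cache_size in pages):

/*
 * Rough paraphrase of the Mackert-Lohman pages-fetched estimate as the
 * planner applies it.  T = pages in table, N = tuples in table,
 * s = selectivity, b = effective_cache_size (in pages).
 */
static double
ml_pages_fetched(double T, double N, double s, double b)
{
    double Ns = N * s;                          /* expected tuple fetches */
    double pf = 2.0 * T * Ns / (2.0 * T + Ns);  /* the M&L core term */

    if (T <= b)
        return pf < T ? pf : T;
    else if (Ns <= 2.0 * T * b / (2.0 * T - b))
        return pf;
    else
        return b + (Ns - 2.0 * T * b / (2.0 * T - b)) * (T - b) / T;
}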

The Mackert and Lohman equation is accurate, as long as the parameter b
is reasonably accurately set. [I'm discussing only the current behaviour
here, not what it can or should or could be] If it is incorrectly set,
then the equation will give the wrong answer for small tables. The same
answer (i.e. same asymptotic behaviour) is returned for very large
tables, but they are the ones we didn't worry about anyway. Getting the
equation wrong means you will choose sub-optimal plans, potentially
reducing your performance considerably.

As I read it, effective_cache_size is equivalent to the parameter b,
defined as (p.3) the "minimum buffer size dedicated to a given scan". M&L
also point out (p.3) that "We...do not consider interactions of
multiple users sharing the buffer for multiple file accesses."

Either way, M&L aren't talking about the total size of the cache,
which we would interpret to mean shared_buffers + OS cache, in our
effort to not forget the beneficial effect of the OS cache. They use the
phrase "dedicated to a given scan".

AFAICS effective_cache_size should be set to a value that reflects how
many other users of the cache there might be. If you know for certain
you're the only user, set it according to the existing advice. If you
know you aren't, then set it an appropriate factor lower. Setting that
accurately on a system-wide basis will clearly be difficult, and setting
it high will often be inappropriate.
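
For example (numbers purely illustrative): on a box with roughly 6GB of
cache available beyond shared_buffers, but where you expect something
like four similarly active scans to be competing for it at any one time,
a value nearer 6GB / 4 = 1.5GB (about 196608 8KB pages) is closer to the
b that M&L describe than the full 6GB would be.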

The manual is not clear as to how to set effective_cache_size. Other
advice misses out the effect of the many scans/many tables issue and
will give the wrong answer for many calculations, and thus produce
incorrect plans for 8.0 (and earlier releases also).

This is something that needs documentation rather than a bug fix.
It's a complex one, so I'll await all of your objections before I write
a new doc patch.

[Anyway, I do hope I've missed something somewhere in all that, though
I've read their paper twice now. Fairly accessible, but requires
interpretation to the PostgreSQL case. Mackert and Lohman [1989], "Index
Scans using a finite LRU buffer: A validated I/O model".]

 The problem which Simon is bringing up is part of a discussion about doing 
*more* with the information supplied by e_c_s. He points out that it's not 
 really related to the *real* probability of any particular table being 
 cached.   At least, if I'm reading him right.

Yes, that was how Jan originally meant to discuss it, but not what I meant.

Best regards,

Simon Riggs



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-30 Thread Jan Wieck
On 10/26/2004 1:53 AM, Tom Lane wrote:
Greg Stark [EMAIL PROTECTED] writes:
Tom Lane [EMAIL PROTECTED] writes:
Another issue is what we do with the effective_cache_size value once we
have a number we trust.  We can't readily change the size of the ARC
lists on the fly.

Huh? I thought effective_cache_size was just used as a factor in the cost
estimation equation.
Today, that is true.  Jan is speculating about using it as a parameter
of the ARC cache management algorithm ... and that worries me.
If we need another config option, it's not that we are running out of 
possible names, is it?

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-30 Thread Tom Lane
Jan Wieck [EMAIL PROTECTED] writes:
 On 10/26/2004 1:53 AM, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Another issue is what we do with the effective_cache_size value once we
 have a number we trust.  We can't readily change the size of the ARC
 lists on the fly.
 
 Huh? I thought effective_cache_size was just used as a factor in the cost
 estimation equation.
 
 Today, that is true.  Jan is speculating about using it as a parameter
 of the ARC cache management algorithm ... and that worries me.

 If we need another config option, it's not that we are running out of 
 possible names, is it?

No, the point is that the value is not very trustworthy at the moment.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-27 Thread Kenneth Marshall
On Mon, Oct 25, 2004 at 05:53:25PM -0400, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  So I would suggest using something like 100us as the threshold for
  determining whether a buffer fetch came from cache.
 
 I see no reason to hardwire such a number.  On any hardware, the
 distribution is going to be double-humped, and it will be pretty easy to
 determine a cutoff after minimal accumulation of data.  The real question
 is whether we can afford a pair of gettimeofday() calls per read().
 This isn't a big issue if the read actually results in I/O, but if it
 doesn't, the percentage overhead could be significant.
 
How invasive would reading the CPU counter be, if it is available?
A read operation should avoid flushing a cache line and we can throw
out the obvious outliers since we only need an estimate and not the
actual value.
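
Just to make the idea concrete, something along these lines (x86/GCC
only, which is of course the portability catch):

/* Hypothetical sketch: read the x86 time-stamp counter directly,
 * avoiding a syscall.  Not portable, and the counter is per-CPU. */
static inline unsigned long long
read_tsc(void)
{
    unsigned int lo, hi;

    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long) hi << 32) | lo;
}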

--Ken


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-27 Thread Kevin Brown
Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  So I would suggest using something like 100us as the threshold for
  determining whether a buffer fetch came from cache.
 
 I see no reason to hardwire such a number.  On any hardware, the
 distribution is going to be double-humped, and it will be pretty easy to
 determine a cutoff after minimal accumulation of data.  The real question
 is whether we can afford a pair of gettimeofday() calls per read().
 This isn't a big issue if the read actually results in I/O, but if it
 doesn't, the percentage overhead could be significant.
 
 If we assume that the effective_cache_size value isn't changing very
 fast, maybe it would be good enough to instrument only every N'th read
 (I'm imagining N on the order of 100) for this purpose.  Or maybe we
 need only instrument reads that are of blocks that are close to where
 the ARC algorithm thinks the cache edge is.

If it's decided to instrument reads, then perhaps an even better use
of it would be to tune random_page_cost.  If the storage manager knows
the difference between a sequential scan and a random scan, then it
should easily be able to measure the actual performance it gets for
each and calculate random_page_cost based on the results.

While the ARC lists can't be tuned on the fly, random_page_cost can.
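
Something like this, purely as a sketch (the accounting inputs are
invented for illustration; the backend doesn't collect them today):

/*
 * Hypothetical: if the storage manager accumulated total wall-clock time
 * and counts for sequential vs. random block reads, the ratio of the two
 * averages is essentially random_page_cost.
 */
static double
suggest_random_page_cost(double seq_us, long seq_reads,
                         double rand_us, long rand_reads)
{
    double avg_seq  = seq_us  / seq_reads;
    double avg_rand = rand_us / rand_reads;

    return avg_rand / avg_seq;   /* cost relative to a sequential fetch */
}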

 One small problem is that the time measurement gives you only a lower
 bound on the time the read() actually took.  In a heavily loaded system
 you might not get the CPU back for long enough to fool you about whether
 the block came from cache or not.

True, but that's information that you'd want to factor into the
performance measurements anyway.  The database needs to know how much
wall clock time it takes for it to fetch a page under various
circumstances from disk via the OS.  For determining whether or not
the read() hit the disk instead of just OS cache, what would matter is
the average difference between the two.  That's admittedly a problem
if the difference is less than the noise, but at the same time
that would imply that given the circumstances it really doesn't matter
whether or not the page was fetched from disk: the difference is small
enough that you could consider them equivalent.


You don't need 100% accuracy for this stuff, just statistically
significant accuracy.


 Another issue is what we do with the effective_cache_size value once
 we have a number we trust.  We can't readily change the size of the
 ARC lists on the fly.

Compare it with the current value, and notify the DBA if the values
are significantly different?  Perhaps write the computed value to a
file so the DBA can look at it later?

Same with other values that are computed on the fly.  In fact, it
might make sense to store them in a table that gets periodically
updated and to load their values from that table; the values in
postgresql.conf or on the command line would then be the default used
if there's nothing in the table.  (And if you really want fine-grained
control of this process, you could stick a boolean column in the table
to indicate whether or not to load the value from the table at startup
time.)


-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Curt Sampson
On Tue, 26 Oct 2004, Greg Stark wrote:

 I see mmap or O_DIRECT being the only viable long-term stable states. My
 natural inclination was the former but after the latest thread on the subject
 I suspect it'll be forever out of reach. That makes O_DIRECT and a Postgres
 managed cache the only real choice. Having both caches is just a waste of
 memory and a waste of cpu cycles.

I don't see why mmap is any more out of reach than O_DIRECT; it's not
all that much harder to implement, and mmap (and madvise!) is more
widely available.

But if using two caches is only costing us 1% in performance, there's
not really much point

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Simon Riggs
On Mon, 2004-10-25 at 16:34, Jan Wieck wrote: 
 The problem is, with a too small directory ARC cannot guesstimate what 
 might be in the kernel buffers. Nor can it guesstimate what recently was 
 in the kernel buffers and got pushed out from there. That results in a 
 way too small B1 list, and therefore we don't get B1 hits when in fact 
 the data was found in memory. B1 hits is what increases the T1target, 
 and since we are missing them with a too small directory size, our 
 implementation of ARC is probably using a T2 size larger than the 
 working set. That is not optimal.

I think I have seen that the T1 list shrinks too much, but need more
tests...with some good test results

The effectiveness of ARC relies upon the balance between the often
conflicting requirements of recency and frequency. It seems
possible, even likely, that pgsql's version of ARC may need some subtle
changes to rebalance it - if we are unlucky enough to find cases where
it genuinely is out of balance. Many performance tests are required,
together with a few ideas on extra parameters to include...hence my
support of Jan's ideas.

That's also why I called the B1+B2 hit ratio "turbulence", because it
relates to how much oscillation is happening between T1 and T2. In
physical systems, we expect the oscillations to be damped, but there is
no guarantee that we have a nearly critically damped oscillator. (Note
that the absence of turbulence doesn't imply that T1+T2 is optimally
sized, just that it is balanced.)

[...and although the discussion has wandered away from my original
patch...would anybody like to commit, or decline the patch?]

 If we would replace the dynamic T1 buffers with a max_backends*2 area of 
 shared buffers, use a C value representing the effective cache size and 
 limit the T1target on the lower bound to effective cache size - shared 
 buffers, then we basically moved the T1 cache into the OS buffers.

Limiting the minimum size of T1len to be 2* maxbackends sounds like an
easy way to prevent overbalancing of T2, but I would like to follow up
on ways to have T1 naturally stay larger. I'll do a patch with this idea
in, for testing. I'll call this T1 minimum size so we can discuss it.

Any other patches are welcome...

It could be that B1 is too small and so we could use a larger value of C
to keep track of more blocks. I think what is being suggested is two
GUCs: shared_buffers (as is), plus another one, larger, which would
allow us to track what is in shared_buffers and what is in OS cache. 

I have comments on effective cache size below

On Mon, 2004-10-25 at 17:03, Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  This all only holds water, if the OS is allowed to swap out shared 
  memory. And that was my initial question, how likely is it to find this 
  to be true these days?
 
 I think it's more likely than not that the OS will consider shared
 memory to be potentially swappable.  On some platforms there is a shmctl
 call you can make to lock your shmem in memory, but (a) we don't use it
 and (b) it may well require privileges we haven't got anyway.

Are you saying we shouldn't, or we don't yet? I simply assumed that we
did use that function - surely it must be at least an option? RHEL
supports this, at least.

It may well be that we don't have those privileges, in which case we
turn off the option. Often, we (or I?) will want to install a dedicated
server, so we should have all the permissions we need, in which case...

 This has always been one of the arguments against making shared_buffers
 really large, of course --- if the buffers aren't all heavily used, and
 the OS decides to swap them to disk, you are worse off than you would
 have been with a smaller shared_buffers setting.

Not really, just an argument against making them *too* large. Large
*and* utilised is OK, so we need ways of judging optimal sizing.

 However, I'm still really nervous about the idea of using
 effective_cache_size to control the ARC algorithm.  That number is
 usually entirely bogus.  Right now it is only a second-order influence
 on certain planner estimates, and I am afraid to rely on it any more
 heavily than that.

...ah yes, effective_cache_size.

The manual describes effective_cache_size as if it had something to do
with the OS, and some of this discussion has picked up on that.

effective_cache_size is used in only two places in the code (both in the
planner), as an estimate for calculating the cost of a) nonsequential
access and b) index access, mainly as a way of avoiding overestimates of
access costs for small tables.

There is absolutely no implication in the code that effective_cache_size
measures anything in the OS; what it gives is an estimate of the number
of blocks that will be available from *somewhere* in memory (i.e. in
shared_buffers OR OS cache) for one particular table (the one currently
being considered by the planner).

Crucially, the size referred to is the size of the *estimate*, not the
size 

Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Simon Riggs
On Tue, 2004-10-26 at 06:53, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  Another issue is what we do with the effective_cache_size value once we
  have a number we trust.  We can't readily change the size of the ARC
  lists on the fly.
 
  Huh? I thought effective_cache_size was just used as a factor in the cost
  estimation equation.
 
 Today, that is true.  Jan is speculating about using it as a parameter
 of the ARC cache management algorithm ... and that worries me.
 

ISTM that we should be optimizing the use of shared_buffers, not what's
outside. Didn't you (Tom) already say that?

BTW, very good ideas on how to proceed, but why bother?

For me, if the sysadmin didn't give shared_buffers to PostgreSQL, it's
because the memory is intended for use by something else and so not
available at all. At least not dependably. The argument against large
shared_buffers because of swapping applies to that assumption also...the
OS cache is too volatile to attempt to gauge sensibly.

There's an argument for improving performance for people that haven't
set their parameters correctly, but that's got to be a secondary
consideration anyhow.

-- 
Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Simon Riggs
On Tue, 2004-10-26 at 09:49, Simon Riggs wrote:
 On Mon, 2004-10-25 at 16:34, Jan Wieck wrote: 
  The problem is, with a too small directory ARC cannot guesstimate what 
  might be in the kernel buffers. Nor can it guesstimate what recently was 
  in the kernel buffers and got pushed out from there. That results in a 
  way too small B1 list, and therefore we don't get B1 hits when in fact 
  the data was found in memory. B1 hits is what increases the T1target, 
  and since we are missing them with a too small directory size, our 
  implementation of ARC is probably using a T2 size larger than the 
  working set. That is not optimal.
 
 I think I have seen that the T1 list shrinks too much, but need more
 tests...with some good test results
 
  If we would replace the dynamic T1 buffers with a max_backends*2 area of 
  shared buffers, use a C value representing the effective cache size and 
  limit the T1target on the lower bound to effective cache size - shared 
  buffers, then we basically moved the T1 cache into the OS buffers.
 
 Limiting the minimum size of T1len to be 2* maxbackends sounds like an
 easy way to prevent overbalancing of T2, but I would like to follow up
 on ways to have T1 naturally stay larger. I'll do a patch with this idea
 in, for testing. I'll call this T1 minimum size so we can discuss it.
 

Don't know whether you've seen this latest update on the ARC idea:
Sorav Bansal and Dharmendra S. Modha, 
CAR: Clock with Adaptive Replacement,
in Proceedings of the USENIX Conference on File and Storage Technologies
(FAST), pages 187--200, March 2004.
[I picked up the .pdf here http://citeseer.ist.psu.edu/bansal04car.html]

In that paper Bansal and Modha introduce an update to ARC called CART
which they say is more appropriate for databases. Their idea is to
introduce a temporal locality window as a way of making sure that
blocks called twice within a short period don't fall out of T1, though
don't make it into T2 either. Strangely enough the temporal locality
window is made by increasing the size of T1... in an adpative way, of
course.

If we were going to put a limit on the minimum size of T1, then this
would put a minimal temporal locality window in place...rather than
the increased complexity they go to in order to make T1 larger. I note
test results from both the ARC and CAR papers that show that T2 usually
represents most of C, so the observation that T1 is very small is not
atypical. That implies that the cost of managing the temporal locality
window in CART is usually wasted, even though it does show up as an
overall benefit: the results show that CART is better than ARC over the
whole range of cache sizes tested (16MB to 4GB) and workloads (apart
from 1 out of 22).

If we were to implement a minimum size of T1, related as suggested to
number of users, then this would provide a reasonable approximation of
the temporal locality window. This wouldn't prevent the adaptation of T1
to be higher than this when required.

Jan has already optimised ARC for PostgreSQL by the addition of a
special lookup on transactionId required to optimise for the double
cache lookup of select/update that occurs on a T1 hit. That seems likely
to be able to be removed as a result of having a larger T1.

I'd suggest limiting T1 to be a value of:
shared_buffers <= 1000  T1limit = max_backends * 0.75
shared_buffers <= 2000  T1limit = max_backends
shared_buffers <= 5000  T1limit = max_backends * 1.5
shared_buffers >  5000  T1limit = max_backends * 2
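
(As a throwaway sketch, that rule is just:)

static int
t1_minimum(int shared_buffers, int max_backends)
{
    /* mirrors the thresholds proposed in the table above */
    if (shared_buffers <= 1000)
        return (max_backends * 3) / 4;
    else if (shared_buffers <= 2000)
        return max_backends;
    else if (shared_buffers <= 5000)
        return (max_backends * 3) / 2;
    else
        return max_backends * 2;
}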

I'll try some tests with both
- minimum size of T1
- update optimisation removed

Thoughts?

-- 
Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Greg Stark

Curt Sampson [EMAIL PROTECTED] writes:

 On Tue, 26 Oct 2004, Greg Stark wrote:
 
  I see mmap or O_DIRECT being the only viable long-term stable states. My
  natural inclination was the former but after the latest thread on the subject
  I suspect it'll be forever out of reach. That makes O_DIRECT and a Postgres
  managed cache the only real choice. Having both caches is just a waste of
  memory and a waste of cpu cycles.
 
 I don't see why mmap is any more out of reach than O_DIRECT; it's not
 all that much harder to implement, and mmap (and madvise!) is more
 widely available.

Because there's no way to prevent a write-out from occurring and no way to be
notified by mmap before a write-out occurs, and Postgres wants to do its WAL
logging at that time if it hasn't already happened.

 But if using two caches is only costing us 1% in performance, there's
 not really much point

Well firstly it depends on the work profile. It can probably get much higher
than we saw in that profile if your work load is causing more fresh buffers to
be fetched.

Secondly it also reduces the amount of cache available. If you have 256M of
ram with about 200M free, and 40Mb of ram set aside for Postgres's buffer
cache then you really only get 160Mb. It's costing you 20% of your cache, and
reducing the cache hit rate accordingly.

Thirdly the kernel doesn't know as much as Postgres about the load. Postgres
could optimize its use of cache based on whether it knows the data is being
loaded by a vacuum or sequential scan rather than an index lookup. In practice
Postgres has gone with ARC which I suppose a kernel could implement anyways,
but afaik neither linux nor BSD choose to do anything like it.

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Josh Berkus
Thomas,

 As a result, I was intending to inflate the value of
 effective_cache_size to closer to the amount of unused RAM on some of
 the machines I admin (once I've verified that they all have a unified
 buffer cache). Is that correct?

Currently, yes.  Right now, e_c_s is used just to inform the planner and make 
index vs. table scan and join order decisions.

The problem which Simon is bringing up is part of a discussion about doing 
*more* with the information supplied by e_c_s. He points out that it's not 
really related to the *real* probability of any particular table being 
cached.   At least, if I'm reading him right.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Curt Sampson
On Wed, 26 Oct 2004, Greg Stark wrote:

  I don't see why mmap is any more out of reach than O_DIRECT; it's not
  all that much harder to implement, and mmap (and madvise!) is more
  widely available.

 Because there's no way to prevent a write-out from occurring and no way to be
 notified by mmap before a write-out occurs, and Postgres wants to do its WAL
 logging at that time if it hasn't already happened.

I already described a solution to that problem in a post earlier in this
thread (a write queue on the block). I may even have described it on
this list a couple of years ago, that being about the time I thought
it up. (The mmap idea just won't die, but at least I wasn't the one to
bring it up this time. :-))

 Well firstly it depends on the work profile. It can probably get much higher
 than we saw in that profile

True, but 1% is much, much lower than I'd expected. That tells me
that my intuitive idea of the performance model is wrong, which means,
for me at least, it's time to shut up or put up some benchmarks.

 Secondly it also reduces the amount of cache available. If you have 256M of
 ram with about 200M free, and 40Mb of ram set aside for Postgres's buffer
 cache then you really only get 160Mb. It's costing you 20% of your cache, and
 reducing the cache hit rate accordingly.

Yeah, no question about that.

 Thirdly the kernel doesn't know as much as Postgres about the load. Postgres
 could optimize its use of cache based on whether it knows the data is being
 loaded by a vacuum or sequential scan rather than an index lookup. In practice
 Postgres has gone with ARC which I suppose a kernel could implement anyways,
 but afaik neither linux nor BSD choose to do anything like it.

madvise().

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-26 Thread Scott Marlowe
On Mon, 2004-10-25 at 23:53, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  Another issue is what we do with the effective_cache_size value once we
  have a number we trust.  We can't readily change the size of the ARC
  lists on the fly.
 
  Huh? I thought effective_cache_size was just used as a factor in the cost
  estimation equation.
 
 Today, that is true.  Jan is speculating about using it as a parameter
 of the ARC cache management algorithm ... and that worries me.

Because it's so often set wrong, I take it.  But if it's set right, and
it makes the database faster to pay attention to it, then I'd be in
favor of it.  Or at least having a switch to turn on the ARC buffer's
ability to look at it.

Or is it some other issue, having to do with whether knowing the
effective cache size would have a positive effect overall on the ARC
algorithm?


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-25 Thread Tom Lane
Jan Wieck [EMAIL PROTECTED] writes:
 This all only holds water, if the OS is allowed to swap out shared 
 memory. And that was my initial question, how likely is it to find this 
 to be true these days?

I think it's more likely than not that the OS will consider shared
memory to be potentially swappable.  On some platforms there is a shmctl
call you can make to lock your shmem in memory, but (a) we don't use it
and (b) it may well require privileges we haven't got anyway.

This has always been one of the arguments against making shared_buffers
really large, of course --- if the buffers aren't all heavily used, and
the OS decides to swap them to disk, you are worse off than you would
have been with a smaller shared_buffers setting.


However, I'm still really nervous about the idea of using
effective_cache_size to control the ARC algorithm.  That number is
usually entirely bogus.  Right now it is only a second-order influence
on certain planner estimates, and I am afraid to rely on it any more
heavily than that.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-25 Thread Greg Stark
Greg Stark [EMAIL PROTECTED] writes:

 However I wonder about another approach entirely. If postgres timed how long
 reads took it shouldn't find it very hard to distinguish between a cached
 buffer being copied and an actual i/o operation. It should be able to track
 the percentage of time that buffers requested are in the kernel's cache and
 use that directly instead of the estimated cache size.

I tested this with a program that times seeking to random locations in a file.
It's pretty easy to spot the break point. There are very few fetches that take
between 50us and 1700us, probably they come from the drive's onboard cache.

The 1700us bound probably would be lower for high end server equipment with
10k RPM drives and RAID arrays. But I doubt it will ever come close to the
100us edge, not without features like cache ram that Postgres would be better
off considering to be part of effective_cache anyways.

So I would suggest using something like 100us as the threshold for determining
whether a buffer fetch came from cache.

Here are two graphs, one showing a nice curve showing how disk seek times are
distributed. It's neat to look at for that alone:

[inline image: plot1.png]
This is the 1000 fastest data points zoomed to the range under 1800us:

[inline image: plot2.png]
This is the program I wrote to test this:

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) 
{
  int rep = atoi(argv[1]);      /* number of random reads to time */
  int i;
  char *filename;
  int fd;
  struct stat statbuf;
  off_t filesize;
  unsigned blocksize;
  void *blockbuf;

  filename = argv[2];
  fd = open(filename, O_RDONLY);

  fstat(fd, &statbuf);
  filesize = statbuf.st_size;
  blocksize = statbuf.st_blksize;
  blockbuf = malloc(blocksize);

  srandom(getpid()^clock());

  for (i=0;i<rep;i++) {
    struct timeval timeval1,timeval2;
    /* pick a random block-aligned offset within the file */
    off_t offset = random()%filesize / blocksize * blocksize;

    gettimeofday(&timeval1,NULL);
    lseek(fd, offset, SEEK_SET);
    read(fd, blockbuf, blocksize);
    gettimeofday(&timeval2,NULL);

    /* print the elapsed time for this read in microseconds */
    /*printf("Read (%d at %ld) took %ld us\n", blocksize, offset, timeval2.tv_usec-timeval1.tv_usec + (timeval2.tv_sec-timeval1.tv_sec)*1000000);*/
    printf("%ld\n", timeval2.tv_usec-timeval1.tv_usec + (timeval2.tv_sec-timeval1.tv_sec)*1000000);
  }

  return 0;
}



Here are the commands I used to generate the graphs:

$ dd bs=1M count=1024 if=/dev/urandom of=/tmp/big
$ ./a.out 1 /tmp/big > /tmp/l
$ gnuplot
gnuplot> set terminal png
gnuplot> set output "/tmp/plot1.png"
gnuplot> plot '/tmp/l2' with points pointtype 1 pointsize 1
gnuplot> set output "/tmp/plot2.png"
gnuplot> plot [0:2000] [0:1000] '/tmp/l2' with points pointtype 1 pointsize 1



-- 
greg

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-25 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 So I would suggest using something like 100us as the threshold for
 determining whether a buffer fetch came from cache.

I see no reason to hardwire such a number.  On any hardware, the
distribution is going to be double-humped, and it will be pretty easy to
determine a cutoff after minimal accumulation of data.  The real question
is whether we can afford a pair of gettimeofday() calls per read().
This isn't a big issue if the read actually results in I/O, but if it
doesn't, the percentage overhead could be significant.
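
Determining that cutoff needn't be anything fancy; an untested sketch of
one way to do it, once a batch of samples has been collected:

/* Untested sketch: with a double-humped distribution, the widest gap
 * between consecutive sorted timings marks the valley between the
 * "cache" hump and the "disk" hump; use its midpoint as the cutoff. */
#include <stdlib.h>

static int
cmp_long(const void *a, const void *b)
{
    long la = *(const long *) a, lb = *(const long *) b;

    return (la > lb) - (la < lb);
}

static long
pick_cutoff_us(long *samples, int n)
{
    long best_gap = 0, cutoff = 0;
    int i;

    qsort(samples, n, sizeof(long), cmp_long);
    for (i = 1; i < n; i++)
    {
        long gap = samples[i] - samples[i - 1];

        if (gap > best_gap)
        {
            best_gap = gap;
            cutoff = samples[i - 1] + gap / 2;
        }
    }
    return cutoff;   /* reads faster than this likely came from cache */
}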

If we assume that the effective_cache_size value isn't changing very
fast, maybe it would be good enough to instrument only every N'th read
(I'm imagining N on the order of 100) for this purpose.  Or maybe we
need only instrument reads that are of blocks that are close to where
the ARC algorithm thinks the cache edge is.

One small problem is that the time measurement gives you only a lower
bound on the time the read() actually took.  In a heavily loaded system
you might not get the CPU back for long enough to fool you about whether
the block came from cache or not.

Another issue is what we do with the effective_cache_size value once we
have a number we trust.  We can't readily change the size of the ARC
lists on the fly.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-25 Thread Tom Lane
Kenneth Marshall [EMAIL PROTECTED] writes:
 How invasive would reading the CPU counter be, if it is available?

Invasive or not, this is out of the question; too unportable.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-25 Thread Greg Stark

Is something broken with the list software? I'm receiving other emails from
the list but I haven't received any of the mails in this thread. I'm only able
to follow the thread based on the emails people are cc'ing to me directly.

I think I've caught this behaviour in the past as well. Is it a misguided list
software feature trying to avoid duplicates or something like that? It makes
it really hard to follow threads in MUAs with good filtering since they're
fragmented between two mailboxes.

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-25 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Another issue is what we do with the effective_cache_size value once we
 have a number we trust.  We can't readily change the size of the ARC
 lists on the fly.

 Huh? I thought effective_cache_size was just used as a factor in the cost
 estimation equation.

Today, that is true.  Jan is speculating about using it as a parameter
of the ARC cache management algorithm ... and that worries me.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PATCHES] [HACKERS] ARC Memory Usage analysis

2004-10-22 Thread Simon Riggs
On Fri, 2004-10-22 at 21:45, Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  What do you think about my other theory to make C actually 2x effective 
  cache size and NOT to keep T1 in shared buffers but to assume T1 lives 
  in the OS buffer cache?
 
 What will you do when initially fetching a page?  It's not supposed to
 go directly into T2 on first use, but we're going to have some
 difficulty accessing a page that's not in shared buffers.  I don't think
 you can equate the T1/T2 dichotomy to is in shared buffers or not.
 

Yes, there are issues there. I want Jan to follow his thoughts through.
This is important enough that it's worth it - there's only a few even
attempting this.

 You could maybe have a T3 list of pages that aren't in shared buffers
 anymore but we think are still in OS buffer cache, but what would be
 the point?  It'd be a sufficiently bad model of reality as to be pretty
 much useless for stats gathering, I'd think.
 

The OS cache is in many ways a wild horse, I agree. Jan is trying to
think of ways to harness it, whereas I had mostly ignored it - but it's
there. Raw disk usage never allowed this opportunity.

For high performance systems, we can assume that the OS cache is ours to
play with - what will we do with it? We need to use it for some
purposes, yet would like to ignore it for others.

-- 
Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]