[HACKERS] Lazy Snapshots

2009-08-18 Thread si...@2ndquadrant.com



One of the problems with Hot Standby is that a long running query on the standby
can conflict with VACUUMed rows on the primary, causing queries to be cancelled.


I've been looking at this problem for about a year now from various angles. Jeff
Jane's recent thoughts on procarray scalability have led me down an interesting
path, described here. Taken together, this has led me to rethink completely the
strategy used for avoiding conflicts in the Hot Standby patch.


Currently, we take eager snapshots, meaning we take a snapshot at the start of
each statement whether or not it is necessary. Snapshots exist to disambiguate
the running state of recent transactions, so if a statement never sees data
written by recent transactions then we will never actually use the snapshot.


Another way of doing this is to utilize lazy snapshots: do not take a snapshot
when a statement starts and only take one at the point that we need one. No
other changes to the MVCC mechanisms are proposed.


Is that possible?


The time the snapshot is taken is the time of the consistent viewpoint from
which all data access during a statement is judged. Taking the snapshot later,
at an undefined point in the future means that the consistent viewpoint is
actually floating. When we execute the statement we won't actually know which
viewpoint will be used to derive the answers to a query.


A floating, yet consistent viewpoint is in my opinion a good thing, since it
includes a more recent database state in the answer to a query than we would
otherwise have used. Consider the case where a very large table has a "select
count(*)" executed on it. The scan begins at block 0 and continues through the
table until the end, which for purposes of an example we will say takes 1 hour.
Rows are added to the table at a constant rate of 100/sec and immediately
committed. So by the time the scan has finished it will deliver an answer that
is wrong by 36. Using a lazy snapshot would give us an answer almost exactly
correct, though of course Heisenbuggers may dispute the existence of a "correct"
answer in this case.


So let's look at some theory details:


* Scan begins, no snapshot set. A row is inserted and transaction commits. Scan
progresses until it sees a recent row. Scan takes snapshot; the row is now
visible to it and progresses. Another row is inserted and transaction commits.
When we later come to second new row, we already have a snapshot, so that row is
invisible to us. Results of query are consistent to the point we took the
snapshot, which happened when we saw the first row. Are the results consistent
only to end of transaction that created that row? No, other transactions can
also have committed after it and yet before we take snapshot. The recent
transaction is the catalyst for us to take a snapshot, though the snapshot is
not dependent upon the xid of the new row we have seen.


* Scan begins, no snapshot set. Ahead of scan a row that would have been visible
at start of scan is deleted, commits and removed by VACUUM/HOT. The scan has no
evidence that a removal has taken place, never sees contention and thus never
takes a snapshot. This isn't a problem; the row removal created an implicit xmin
for our scan. If we later took a snapshot the xmin of the snapshot would be
equal or later than our previous implicit xmin and so MVCC would be working.
This shows that it is wrong to presume that taking no snapshot at all means that
the time consistent point on the scan was at the start of a statement, it may
not be.


* We open a cursor, then start issuing updates where current of. Does the cursor
need a snapshot? I don't think it does, since we have special visibility rules
for rows produced by our own transaction and we do not need a snapshot to
disambiguate them. ISTM there may be a corner case where we need cursors to take
snapshots, but I haven't seen it yet.


Does that cover all the cases? Some main ones, but let's see if other problems
emerge? Anyone?


OK, in theory it seems to work, so how will it work in practice and does that
cause other difficulties?


* We will hold a new global variable LastGlobalXmin, which is maintained by
GetSnapshotData(). We can access it atomically, without locks.


* In XidInMVCCSnapshot() if our snapshot is NULL then we update RecentXmin from
LastGlobalXmin and test using that because we don't have a snapshot xmin. If
this is sufficient to return false then that's all we do. Otherwise, we now get
a full snapshot and then continue as normal. (There may be some API rework to
allow this to happen, so I think I papered over a few difficulties here, but in
broad terms, this appears to work).


Lazy snapshots mean that some things normally updated during snapshot taking
will fall behind somewhat. This has a couple of effects that we can mitigate in
various ways


* In TransactionIdIsInProgress() if xid < RecentXmin we update RecentXmin from
globalxmin and retry the test.


* We probably need to do something wi

Re: [HACKERS] XLogFlush

2009-08-21 Thread si...@2ndquadrant.com

On 21 August 2009 at 10:18 Jeff Janes  wrote:

> The effect of this seems to be that when WALInsertLock is busy, group
> commits are suppressed.


Agreed, but its not a place to look at just yet since this is changing as part
of sync rep patch.


We do need to change this to make group commit work better.


Glad to see someone willing to get involved.


Best Regards, Simon Riggs

Re: [HACKERS] Lazy Snapshots

2009-08-21 Thread si...@2ndquadrant.com

On 19 August 2009 at 00:09 Josh Berkus  wrote:
 
> When we discussed Hot Standby at pgCon 2008, we discussed the
> possibility of stopping the replications stream whenever a VACUUM came
> through until any queries currently running on the slave completed.  Did
> that approach turn out to be completely unworkable?  Why?


It will be an option, in two forms and not the part I was removing. Neither of
those do all the things I want to be able to do, so I had been searching for
another way to do replay and uninterrupted queries.


Best Regards, Simon Riggs

Re: [HACKERS] Lazy Snapshots

2009-08-21 Thread si...@2ndquadrant.com

On 18 August 2009 at 16:18 Tom Lane  wrote:

> Simon, this concept is completely broken, as far as I can tell. 


Thanks for the precise example.


Yes, I agree it is not going to work when there could be more than one row
version in the relation. Which doesn't leave useful wriggle room, so not
pursuing this further.


What a shame.


As discussed on -perform, it does seem as if we might have retrospectively
calculated snapshots if we had the right data structure. But that's a different
set of thoughts for another day.


Best Regards, Simon Riggs