Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis <pg...@j-davis.com> wrote:
> On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote:
>>> 3. Limited shared memory space to hold information about committed
>>> transactions that are still interesting. It's a challenging
>>> problem, however, and the current solution is less than ideal.
>>
>> I'd go further than that and say that it clearly needs to be fixed.
>
> OK, this will remain an open issue then.

This seems to me to be by far the biggest problem with the patch. I've been working through various ideas, and I think I see the light at the end of the tunnel. I'm posting the ideas for a reality check. I'm hoping for comments.

It seems clear to me that we need to attack this from two directions:

(1) Mitigation, by more aggressively identifying transactions which are no longer interesting so they can be cleaned up.

(2) Graceful degradation, by somehow summarizing information as we approach the hard limit, so that we incrementally increase the probability of false positives rather than resorting to either of the two simple solutions (refusing new serializable transactions or canceling the oldest running serializable transactions).

(Suggestions for other ways to attack this are welcome.)

It seemed to me likely that investigating (1) would help to clarify how to do (2), so here's what I've got in the mitigation department:

(1A) A committed transaction TX which hasn't written data can be cleaned up when there is no overlapping non-read-only transaction which is active and which overlaps a committed transaction which wrote data and committed too soon to overlap TX.

(1B) Predicate locks and information about rw-conflicts *in* for a committed transaction can be released when there are no overlapping non-read-only transactions active.
Except for transactions described in (1A), the fact that the transaction was a serializable transaction with a rw-conflict *out* is significant, but nothing else about the transaction matters, and we don't need to know details of the conflict or maintain a reverse pointer.

(1C) A committing transaction which has written data can clear its conflict out pointer if it points to an active transaction which does not overlap a committed transaction which wrote data. (Obviously, the corresponding pointer in the other direction can also be cleared.)

(1D) A committed transaction with no rw-conflict out cannot become a pivot in a dangerous structure, because the transaction on the out side of a pivot must commit first for there to be a problem. (Obviously, two committed transactions cannot develop a new conflict.) Since a read-only transaction can only participate in a dangerous structure through a conflict with a pivot, transactions in this state can be ignored when a read-only transaction is checking for conflicts.

That's all I've been able to come up with so far. Let's see how that plays with the SSI worst case -- a long-running transaction concurrent with many faster transactions.

(2A) If the long-running transaction is read-only, it looks pretty good. We can clear concurrent transactions on a reasonable schedule and just maintain a list of committed serializable transactions which wrote data, have rw-conflicts out, and have xids above the serializable global xmin. We can probably make room for such a list of xids somehow -- I could even see potentially using the SLRU mechanism for this without killing performance -- the long-running read-only transaction would just need to look up a particular xid in the list whenever it reads a non-visible tuple. If the xid exists in the list, the long-running transaction must roll back with a serialization failure. Normally the list should be short enough to stay in RAM.

(2B) It gets more complicated if the long-running transaction can also write.
This is both because the predicate lock information must be maintained and associated with something, and because the long-running transaction can become a pivot with an old short-lived transaction on the rw-conflict *in* side. The rw-conflicts *out* can be handled the same as for a long-running read-only transaction (described above). I think that, tempered by the above, the performance impact will be minimal if we apply the technique Heikki described here to the oldest committed transactions when the list becomes full:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01477.php

I don't think this change will affect predicate.h or any of the code which uses its API. The only thing which uses the internal structures outside of predicate.c is the code to show the SIReadLock entries in the pg_locks view. I'm not sure how we should show locks for which we no longer have an associated virtual transaction ID. Does anyone have thoughts on that? Just leave virtualtransaction NULL, and leave the rest as-is?

-Kevin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
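[Editorial sketch: the xid lookup described in (2A) could look roughly like the following toy model. This is illustrative Python, not patch code; the function name and the sorted-list representation are invented for the example.]

```python
from bisect import bisect_left

def must_fail(xid, conflict_out_xids):
    """Return True if the long-running read-only transaction must roll
    back with a serialization failure: the non-visible tuple it just
    read was written by a committed serializable transaction that had a
    rw-conflict out.

    conflict_out_xids is the sorted list of xids (above the serializable
    global xmin) of committed serializable transactions which wrote data
    and had rw-conflicts out -- the list maintained per (2A).
    """
    i = bisect_left(conflict_out_xids, xid)
    return i < len(conflict_out_xids) and conflict_out_xids[i] == xid
```

Since the list is normally short, a RAM-resident sorted array (or an SLRU page, as suggested above) with a binary search per non-visible-tuple read would keep the overhead small.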
Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis <pg...@j-davis.com> wrote:

> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

> The SELECTs only look at the root and the predicate doesn't match. So
> each SELECT sets an SIReadLock on block 0 and exits the search. Looks
> good so far.
>
> T1 then inserts, and it has to modify page 0, so it does
> FlagRWConflict(). That sets writer->inConflict = reader and
> reader->outConflict = writer (where writer is T1 and reader is T2);
> and T1->outConflict and T2->inConflict remain NULL.
>
> Then T2 inserts, and I didn't catch that part in as much detail in
> gdb, but it apparently has no effect on that state, so we still have
> T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict = NULL,
> and T2->outConflict = T1.

I now see where the wheels fall off. The GiST query initially stops at a high level, so predicate locks only go that deep, and the *first* insert of a conflicting row must ripple up and modify a locked page; but *subsequent* inserts may only need to modify the leaf level.

Even though your particular example doesn't involve a cycle and therefore doesn't require a rollback for correctness (although it might tend to generate a false positive if index page locks were working correctly), you've exposed a flaw in the GiST AM implementation of predicate locks.

On first glance, it appears that we would need to check for conflicts as we move down through the index to find the right spot for an insert, not as we modify pages for the insert. I hope there's some more subtle technique or some way to qualify it; otherwise a search which stops at the root page would generate a conflict out to just about any index insertion from a concurrent transaction.

I will add this to my list of issues to fix based on your review, unless it's something you would like to tackle -- I'm not going to chase away anyone who wants to help with this.
:-)

-Kevin
Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis <pg...@j-davis.com> wrote:

> That looks like a reasonable state to me, but I'm not sure exactly
> what the design calls for. I am guessing that the real problem is in
> PreCommit_CheckForSerializationFailure(), where there are 6
> conditions that must be met for an error to be thrown. T2 falls out
> right away at condition 1. T1 falls out on condition 4.
>
> I don't really understand condition 4 at all -- can you explain it?
> And can you explain conditions 5 and 6 too?

Since most transactions are rolled back on a conflict detection during a read or write attempt, there are only a few very specific conditions which can slip through to where they need to be detected on commit. Here's the code with the six conditions:

    if (MySerializableXact->inConflict != InvalidSerializableXact
        && MySerializableXact->inConflict != MySerializableXact
        && !(MySerializableXact->inConflict->rolledBack)
        && MySerializableXact->inConflict->inConflict != InvalidSerializableXact
        && !SxactIsCommitted(MySerializableXact->inConflict)
        && !SxactIsCommitted(MySerializableXact->inConflict->inConflict))

Condition 4 is testing whether MySerializableXact is on the out side of a pivot -- in the parlance of most examples, is MySerializableXact TN? Conditions 5 and 6 confirm that neither T0 nor T1 has committed first; we can only have a problem if TN commits first.

Basically, when we already have a pivot, but no transaction has yet committed, we wait to see if TN commits first. If so, we have a problem; if not, we don't. There's probably some room for improving performance by cancelling T0 or T1 instead of TN, at least some of the time; but in this pass we are always cancelling the transaction in whose process we detect the need to cancel something.

-Kevin
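[Editorial sketch: the six conditions and the state Jeff described can be modeled outside the server. This is a toy Python model, not patch code; SERIALIZABLEXACT is reduced to a few fields and None stands in for InvalidSerializableXact.]

```python
class Sxact:
    """Toy stand-in for SERIALIZABLEXACT."""
    def __init__(self, name):
        self.name = name
        self.in_conflict = None    # transaction that read my writes
        self.out_conflict = None   # transaction whose writes I read
        self.rolled_back = False
        self.committed = False

def precommit_would_fail(me):
    """Mirror of the six &&-ed conditions in
    PreCommit_CheckForSerializationFailure()."""
    return (me.in_conflict is not None                      # 1
            and me.in_conflict is not me                    # 2
            and not me.in_conflict.rolled_back              # 3
            and me.in_conflict.in_conflict is not None      # 4: am I TN?
            and not me.in_conflict.committed                # 5
            and not me.in_conflict.in_conflict.committed)   # 6

# The state from the GiST example: T1->inConflict = T2,
# T2->outConflict = T1, everything else NULL.
t1, t2 = Sxact("T1"), Sxact("T2")
t1.in_conflict = t2
t2.out_conflict = t1
```

Running the check on that state shows exactly what Jeff observed: T2 falls out at condition 1 (its inConflict is NULL) and T1 falls out at condition 4 (T2's inConflict is NULL, so T1 is not seen as the out side of a pivot), so neither commit raises an error.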
Re: [HACKERS] Serializable snapshot isolation patch
On Thu, 2010-10-21 at 10:29 -0500, Kevin Grittner wrote:

> Basically, when we already have a pivot, but no transaction has yet
> committed, we wait to see if TN commits first. If so, we have a
> problem; if not, we don't. There's probably some room for improving
> performance by cancelling T0 or T1 instead of TN, at least some of
> the time; but in this pass we are always cancelling the transaction
> in whose process we detect the need to cancel something.

Well, in this case we do clearly have a problem, because the result is not equal to the serial execution of the transactions in either order. So the question is: at what point is the logic wrong? It's either:

1. PreCommit_CheckForSerializationFailure() is missing a failure case.

2. The state prior to entering that function (which I believe I sufficiently described) is wrong.

If it's (2), then what should the state look like, and how is the GiST code supposed to result in that state? I know some of these questions are answered in the relevant research, but I'd like to discuss this concrete example specifically.

Regards,
	Jeff Davis
Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis <pg...@j-davis.com> wrote:

> in this case we do clearly have a problem, because the result is not
> equal to the serial execution of the transactions in either order.

Yeah, you're right. I misread that example -- newbie with the PERIOD type.

> So the question is: at what point is the logic wrong? It's either:
>
> 1. PreCommit_CheckForSerializationFailure() is missing a failure
> case.
>
> 2. The state prior to entering that function (which I believe I
> sufficiently described) is wrong.

It's (2), for the reasons I described in my previous email. Even though I misread the specifics of your example, I was close enough to see where the problem was. :-/

> If it's (2), then what should the state look like, and how is the
> GiST code supposed to result in that state?

The second insert should create conflicts similar to what the first did, but in the other direction -- simple write skew. How GiST is supposed to catch this is the big question. My logic that a conflicting insert will modify a page read by the other transaction only holds until someone inserts a conflicting entry. That's why it wasn't reliably failing until you had an example where both transactions accessed the same leaf page. In your example, session 1's insert creates the leaf entry and propagates entries up to the root. When session 2 inserts, it can just modify the leaf, so the conflict is missed.

As I said, the most obvious way to fix this is to look for conflicts while descending to the leaf for an insert. I'm almost sure we can do better than that, but I haven't finished thinking it through. A rough idea might be that when we find a conflict on an insert, we acquire additional predicate locks on everything between the lowest point of conflict and the leaf; the rest of the logic would remain as-is. I haven't finished mulling that over, but it seems likely to work. If we did that, session 2 would detect the conflict on the insert to the leaf, and all would be well.
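[Editorial sketch: the miss and the descent-time fix can be made concrete with a toy model. Illustrative Python, not patch code; the two-page index and all names are invented. It shows why flagging conflicts only on the pages an insert physically modifies misses the second, leaf-only insert, while checking every page on the descent path catches it.]

```python
# Pages: "root" and "leaf". Both sessions' searches stopped at the
# root, so only the root carries SIRead locks.
siread_locks = {"root": {"T1", "T2"}, "leaf": set()}

def conflicts(me, pages):
    """Transactions (other than me) holding SIRead locks on any of
    the given pages."""
    return {t for p in pages for t in siread_locks[p]} - {me}

# Current behavior: flag conflicts only on pages the insert modifies.
# T1 inserts first; the new leaf entry ripples up and modifies the
# root, so T1's insert touches a locked page.
first_insert_modifies = ["leaf", "root"]
# T2's insert fits in the now-existing leaf; only the leaf is modified,
# and the leaf carries no SIRead locks.
second_insert_modifies = ["leaf"]

# Proposed fix: also check every page visited while descending to find
# the insert position.
descent_path = ["root", "leaf"]
```

With the current behavior the first insert sees the conflict and the second does not; checking the descent path makes the second insert see it too.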
-Kevin
Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis <pg...@j-davis.com> wrote:

> When using locks in an unconventional way, it would be helpful to
> describe the invalid schedules that you're preventing. Perhaps an
> example if you think it would be reasonably simple? Also some
> indication of how another process is intended to modify the list
> without walking it.

I've just pushed some comment changes intended to address this. Did I hit the mark?

-Kevin

P.S. Sorry for the delay in responding to such simple requests -- I've been tied up with a family medical crisis; I hope to crank through much of what you've raised this weekend.
Re: [HACKERS] Serializable snapshot isolation patch
Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 6:28 PM, Kevin Grittner
> <kevin.gritt...@wicourts.gov> wrote:
>> One thing that would work, but I really don't think I like it, is
>> that a request for a snapshot for such a transaction would not only
>> block until it could get a clean snapshot (no overlapping
>> serializable non-read-only transactions which overlap serializable
>> transactions which wrote data and then committed in time to be
>> visible to the snapshot being acquired), but it would *also* block
>> *other* serializable transactions, if they were non-read-only, on an
>> attempt to acquire a snapshot.
>
> This seems pretty close to guaranteeing serializability by running
> transactions one at a time (i.e. I don't think it's likely to be
> acceptable from a performance standpoint).

It absolutely makes no sense except for long-running read-only transactions, and would only be used when explicitly requested; and like I said, I really don't think I like it even on that basis -- just putting it out there as the only alternative I've found so far to either tolerating possible serialization anomalies in pg_dump output (albeit only when compared to the state the database reached after the dump's snapshot) or waiting indefinitely for a clean snapshot to become available.

FWIW, from a brainstorming perspective, while waiting for problem transactions to clear so we could get a clean snapshot for the dump, I think it would work even better to block the *commit* of serializable transactions which *had done* writes than to block snapshot acquisition for serializable transactions which were not read-only. Still pretty icky, though. I am loath to compromise the "no new blocking" promise of SSI.

[thinks]

Actually, maybe we can reduce the probability of needing to retry at each iteration of the non-blocking alternative by checking the conflict information for the problem transactions after they commit.
Any transaction which didn't *actually* generate a read-write conflict out to a transaction visible to the dump's candidate snapshot could not cause an anomaly. If none of the problem transactions actually generates a rw-conflict, we can use the candidate snapshot. Adding that logic to the non-blocking alternative might actually work pretty well. There might be some workloads where conflicts would be repeatedly generated, but there would be a lot where they wouldn't.

If we add a switch to pg_dump to allow users to choose, I think this algorithm works. It never affects a transaction unless it has explicitly requested SERIALIZABLE READ ONLY DEFERRABLE, and the only impact is that startup may be deferred until a snapshot can be acquired which ensures serializable behavior without worrying about SIRead locks.

-Kevin
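[Editorial sketch: the non-blocking retry loop being described might look something like the following. Hypothetical Python, just to pin down the idea; take_snapshot and the transaction interface are invented, and each problem transaction is reduced to a flag saying whether it actually generated a rw-conflict out by the time it finished.]

```python
def acquire_deferrable_snapshot(take_snapshot, max_tries=100):
    """Keep taking candidate snapshots until one is provably safe.

    take_snapshot() returns (snapshot, problem_txns), where the problem
    transactions are the overlapping serializable non-read-only
    transactions that could invalidate the candidate. We wait for them
    all to finish -- blocking only this DEFERRABLE transaction, never
    anyone else -- and if none of them actually generated a rw-conflict
    out to something visible to the candidate, the candidate is safe
    and the transaction can run without taking SIRead locks.
    """
    for _ in range(max_tries):
        snapshot, problem_txns = take_snapshot()
        for txn in problem_txns:
            txn.wait_for_finish()
        if not any(txn.had_rw_conflict_out for txn in problem_txns):
            return snapshot          # clean: no SIRead locking needed
    raise RuntimeError("could not obtain a safe snapshot")

class FakeTxn:
    """Stand-in for a concurrent serializable transaction."""
    def __init__(self, had_rw_conflict_out):
        self.had_rw_conflict_out = had_rw_conflict_out
    def wait_for_finish(self):
        pass  # already finished in this toy model

# First candidate overlaps a writer that did generate a rw-conflict,
# so it must be retried; the second candidate comes back clean.
attempts = iter([
    ("snap1", [FakeTxn(True)]),
    ("snap2", [FakeTxn(False)]),
])
```

The starvation concern raised earlier in the thread shows up here as the max_tries bound: on a workload that keeps generating conflicts, the loop may never find a clean candidate.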
Re: [HACKERS] Serializable snapshot isolation patch
On Sun, 2010-10-17 at 22:53 -0700, Jeff Davis wrote:

> 2. I think there's a GiST bug (illustrating with PERIOD type):
>
>   create table foo(p period);
>   create index foo_idx on foo using gist (p);
>   insert into foo select period(
>     '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
>     '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
>   from generate_series(1,200) g;
>
>   Session1:
>     begin isolation level serializable;
>     select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
>     insert into foo values('[2009-01-01, 2009-01-01]'::period);
>
>   Session2:
>     begin isolation level serializable;
>     select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
>     insert into foo values('[2009-01-01, 2009-01-01]'::period);
>     commit;
>
>   Session1:
>     commit;
>
> In pg_locks (didn't paste here due to formatting), it looks like the
> SIRead locks are holding locks on different pages.
>
> Can you clarify your design for GiST and the interaction with
> page-level locks? It looks like you're making some assumption about
> which pages will be visited when searching for conflicting values
> which doesn't hold true. However, that seems odd, because even if the
> value is actually inserted in one transaction, the other doesn't seem
> to find the conflict. Perhaps the bug is simpler than that? Or
> perhaps I have some kind of odd bug in PERIOD's gist implementation?
> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

I have more information on this failure. Everything in GiST actually looks fine. I modified the example slightly:

  T1: begin isolation level serializable;
  T2: begin isolation level serializable;
  T1: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
  T2: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
  T2: commit;
  T1: commit;

The SELECTs only look at the root and the predicate doesn't match. So each SELECT sets an SIReadLock on block 0 and exits the search. Looks good so far.
T1 then inserts, and it has to modify page 0, so it does FlagRWConflict(). That sets writer->inConflict = reader and reader->outConflict = writer (where writer is T1 and reader is T2); and T1->outConflict and T2->inConflict remain NULL.

Then T2 inserts, and I didn't catch that part in as much detail in gdb, but it apparently has no effect on that state, so we still have T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict = NULL, and T2->outConflict = T1.

That looks like a reasonable state to me, but I'm not sure exactly what the design calls for. I am guessing that the real problem is in PreCommit_CheckForSerializationFailure(), where there are 6 conditions that must be met for an error to be thrown. T2 falls out right away at condition 1. T1 falls out on condition 4.

I don't really understand condition 4 at all -- can you explain it? And can you explain conditions 5 and 6 too?

Regards,
	Jeff Davis
Re: [HACKERS] Serializable snapshot isolation patch
On Mon, 2010-10-18 at 22:12 -0500, Kevin Grittner wrote:

> Hmmm... When Joe was looking at the patch he exposed an intermittent
> problem with btree indexes which turned out to be related to improper
> handling of the predicate locks during index page clean-up caused by
> a vacuum. Easy to fix once found, but it didn't always happen, even
> with identical runs. (I'm guessing that was due to the randomness in
> picking a page to split when inserting into a group of identical
> keys.) Perhaps a similar bug lurks in the GiST predicate locking.

I briefly looked into this when I woke up this morning, and I think I'm close. I can reproduce it every time, so I should be able to fix this as soon as I can find some free time (tomorrow night, probably). I might also be able to help with the 2PC issue, but it will be at least a week or two before I have the free time to dig into that one.

>> Eventually, we may need to keep statistics about the number of
>> conflicts happening, and start to rate-limit new serializable
>> transactions (effectively reducing parallelism) to ensure that
>> reasonable progress can be made (hopefully faster than serial
>> execution).
>
> Ah, you've exposed just how self-serving my interest in an admission
> control policy mechanism is! ;-)
> http://archives.postgresql.org/pgsql-hackers/2009-12/msg02189.php

Cool!

>>> I also *really* hope to add the SERIALIZABLE READ ONLY DEFERRABLE
>>> mode so that pg_dump and other read-only transactions don't push
>>> things into a state where the rollback rate spikes:
>>> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php
>>
>> One potential issue there is starvation. I don't see how you would
>> guarantee that there will ever be a point to grab a safe snapshot,
>> even if all of the other transactions are completing.
>
> After mulling it over in greater detail than previously, I see your
> point. I'll think about it some more, but this particular idea might
> be a dead end.

I didn't quite mean that it's a dead-end.
It still seems tempting to at least try to get a safe snapshot, especially for a transaction that you know will be long-running. Maybe a timeout or something?

Regards,
	Jeff Davis
Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis <pg...@j-davis.com> wrote:

> I briefly looked into this when I woke up this morning, and I think
> I'm close. I can reproduce it every time, so I should be able to fix
> this as soon as I can find some free time (tomorrow night, probably).

OK, I'll focus on other areas.

> I might also be able to help with the 2PC issue, but it will be at
> least a week or two before I have the free time to dig into that one.

If I get through the other points you raised, I'll start reading up on 2PC.

> I didn't quite mean that it's a dead-end. It still seems tempting to
> at least try to get a safe snapshot, especially for a transaction
> that you know will be long-running. Maybe a timeout or something?

But what would you do when you hit the timeout? One thing that would work, but I really don't think I like it, is that a request for a snapshot for such a transaction would not only block until it could get a clean snapshot (no overlapping serializable non-read-only transactions which overlap serializable transactions which wrote data and then committed in time to be visible to the snapshot being acquired), but it would *also* block *other* serializable transactions, if they were non-read-only, on an attempt to acquire a snapshot. That would at least guarantee that the serializable read only deferrable transaction could get its snapshot as soon as the initial set of problem overlapping transactions completed, but it would be an exception to the "SSI introduces no new blocking" guarantee. :-( I was OK with that for the particular transaction where DEFERRABLE was requested, but to have it block other serializable transactions seems pretty iffy to me.

Short of that, I think you would just have to wait for completion of all known problem transactions and then try again. On a server with a heavy write load, particularly if the length of the writing transactions was highly variable, you could go through a lot of iterations before getting that clean snapshot.
-Kevin
Re: [HACKERS] Serializable snapshot isolation patch
On Tue, Oct 19, 2010 at 6:28 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:

> One thing that would work, but I really don't think I like it, is
> that a request for a snapshot for such a transaction would not only
> block until it could get a clean snapshot (no overlapping
> serializable non-read-only transactions which overlap serializable
> transactions which wrote data and then committed in time to be
> visible to the snapshot being acquired), but it would *also* block
> *other* serializable transactions, if they were non-read-only, on an
> attempt to acquire a snapshot.

This seems pretty close to guaranteeing serializability by running transactions one at a time (i.e. I don't think it's likely to be acceptable from a performance standpoint).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Serializable snapshot isolation patch
First off, thanks for the review! I know that it's a lot of work, and I do appreciate it.

Jeff Davis <pg...@j-davis.com> wrote:

> * Trivial stuff:
>
> I get a compiler warning:
>
>   indexfsm.c: In function 'RecordFreeIndexPage':
>   indexfsm.c:55: warning: implicit declaration of function
>   'PageIsPredicateLocked'

Oops. Fixed on my git repo now.

> * Open issues, as I see it:
>
> 1. 2PC and SSI don't mix (this may be a known issue, because there's
> not really any code in the current implementation to deal with 2PC):
> Looks like we need to track information about prepared transactions
> in shared memory. I think you'll need to keep the information in the
> 2PC state file as well, so that it can be rebuilt after a crash or
> restart. It all looks solvable at first glance, but it looks like it
> might be some work.

Argh! I didn't anticipate an interaction with prepared transactions, probably because I've never used them or taken a close look at them. I'll take a look.

> 2. I think there's a GiST bug (illustrating with PERIOD type):
> Can you clarify your design for GiST and the interaction with
> page-level locks? It looks like you're making some assumption about
> which pages will be visited when searching for conflicting values
> which doesn't hold true. However, that seems odd, because even if the
> value is actually inserted in one transaction, the other doesn't seem
> to find the conflict. Perhaps the bug is simpler than that? Or
> perhaps I have some kind of odd bug in PERIOD's gist implementation?

My assumptions for GiST were that:

(1) A search for a matching value could bail out at any level in the tree; there is no requirement for the search to proceed to the leaf level to get a negative index probe.

(2) An index insertion which would cause a row to be found if an earlier search was repeated must modify some page which was read by the earlier search.
(3) Because of the above, it is necessary and sufficient to acquire an SIRead lock on all pages visited in a search of a GiST index, and flag a conflict on insertion into a locked page at any level of the index.

That logic still seems sound to me, so if someone else sees a flaw in it, please point it out. Assuming that logic is sound, I'll poke around to see where the flaw in implementation may be. If you have a full self-contained test case to demonstrate the failure here, could you send it to me?

> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

Yeah, I have tested this area without seeing the failure, so that self-contained example would be a big help.

> 3. Limited shared memory space to hold information about committed
> transactions that are still interesting. Relevant thread:
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
> It's a challenging problem, however, and the current solution is less
> than ideal.

I'd go further than that and say that it clearly needs to be fixed. The scale of the issue was somewhat disguised in my testing when I was using the hash table for managing acquisition of shared memory for what was essentially an unordered list. Now that I've made it a proper list with a hard limit on entries, the problem is pretty easy to provoke, and just reserving space for A Very Large Number Of Entries is clearly not an adequate solution. :-(

> Idle transactions can mean that all new serializable transactions
> fail until the idle transactions start to terminate. I don't like
> that very much, because people expect to have to retry serializable
> transactions, but retrying here has no real hope (except that some
> time has elapsed, and maybe the other transaction decided to commit).

Agreed.

> Does it make sense to kill the existing transactions that are holding
> everything up, rather than the new transaction? Or would that just
> confuse matters more?
> This does not necessarily guarantee that progress can be made,
> either, but intuitively it seems more likely.

Canceling an old transaction to allow new transactions to begin *might* be better than the current situation, but I think we can and should do better. The best I have been able to come up with is in this post:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php

There's a fair amount of work there, so I was hoping for some confirmation that this combination of steps was both sane and had some chance of being considered sufficient and acceptable before diving in to code it. I also *really* hope to add the SERIALIZABLE READ ONLY DEFERRABLE mode so that pg_dump and other read-only transactions don't push things into a state where the rollback rate spikes:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php

> 4. A few final details:
>
> a. We should probably have a compatibility GUC that makes
> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this should
> be only for compatibility, and should default to off (i.e. SSI code
> enabled)
Re: [HACKERS] Serializable snapshot isolation patch
On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote:

>> 2. I think there's a GiST bug (illustrating with PERIOD type):
>
> My assumptions for GiST were that:
>
> (1) A search for a matching value could bail out at any level in the
> tree; there is no requirement for the search to proceed to the leaf
> level to get a negative index probe.
>
> (2) An index insertion which would cause a row to be found if an
> earlier search was repeated must modify some page which was read by
> the earlier search.
>
> (3) Because of the above, it is necessary and sufficient to acquire
> an SIRead lock on all pages visited in a search of a GiST index, and
> flag a conflict on insertion into a locked page at any level of the
> index.
>
> That logic still seems sound to me, so if someone else sees a flaw in
> it, please point it out.

Looks sound to me, as well.

>> Also, it appears to be non-deterministic, to a degree at least, so
>> you may not observe the problem in the exact way that I do.
>
> Yeah, I have tested this area without seeing the failure, so that
> self-contained example would be a big help.

I assume here that you mean that you _did_ see the failure (serialization error) and therefore did not see the problem? Also, are you sure it was using the GiST index for the searches and didn't just get a full table lock or full index lock?

I'll try to narrow it down. It could be a problem with PERIOD, or maybe it wasn't using the GiST index for a search when I thought it was (I didn't run EXPLAIN at every point, so I'll double-check). Clearly, a problem exists somewhere though.

>> 3. Limited shared memory space to hold information about committed
>> transactions that are still interesting. Relevant thread:
>> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
>> It's a challenging problem, however, and the current solution is
>> less than ideal.
>
> I'd go further than that and say that it clearly needs to be fixed.

OK, this will remain an open issue then.
> Canceling an old transaction to allow new transactions to begin
> *might* be better than the current situation, but I think we can and
> should do better. The best I have been able to come up with is in
> this post:
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php

At first I didn't like that approach much, but after re-reading it, I am more optimistic. What I like most about it is that a transaction won't be canceled if it doesn't interfere at all (e.g. operating on different tables). There are still edge cases where it can be canceled, like when the memory is so exhausted that it can't even track a couple table-level locks, but that doesn't sound as bad (closer to the current restriction on the total number of locks).

Ultimately I think this will be one of those problems with no 100% clean solution. However, I think it's important that we try to minimize these degenerate cases. Eventually, we may need to keep statistics about the number of conflicts happening, and start to rate-limit new serializable transactions (effectively reducing parallelism) to ensure that reasonable progress can be made (hopefully faster than serial execution).

> There's a fair amount of work there, so I was hoping for some
> confirmation that this combination of steps was both sane and had
> some chance of being considered sufficient and acceptable before
> diving in to code it. I also *really* hope to add the SERIALIZABLE
> READ ONLY DEFERRABLE mode so that pg_dump and other read-only
> transactions don't push things into a state where the rollback rate
> spikes:
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php

One potential issue there is starvation. I don't see how you would guarantee that there will ever be a point to grab a safe snapshot, even if all of the other transactions are completing.

> 4. A few final details:
>
> a. We should probably have a compatibility GUC that makes
> SERIALIZABLE equal to REPEATABLE READ.
My opinion is that this should be only for compatibility, and should default to off (i.e. SSI code enabled) either in 9.1 or soon after. I'm inclined to agree with you, with the strong preference for a 9.1 setting of off. This was previously discussed, and there were people who felt that we should avoid a behavior-changing GUC like that, so I didn't add it. It wouldn't be hard to put it in, if that's the community consensus. I think that was me, but I'm OK with behavior-changing GUCs as long as they are there for compatibility and we intend to push people toward the new behavior in the long run (like standard_conforming_strings, but hopefully not as long-lived). Maybe it's not necessary at all, because your patch offers strictly better guarantees (at the cost of more serialization failures), and if they want the old behavior then REPEATABLE READ is already there. 1. For TargetTagIsCoveredBy(), why is it impossible for the covering tag to have an offset? Because a covering lock is a lock at a
Re: [HACKERS] Serializable snapshot isolation patch
Jeff Davis wrote: On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote: I assume here that you mean that you _did_ see the failure (serialization error) and therefore did not see the problem? Yeah. Also, are you sure it was using the GiST index for the searches and didn't just get a full table lock or full index lock? I think so, but I'm at home and don't have details here. Will check tomorrow. I'll try to narrow it down. It could be a problem with PERIOD, or maybe it wasn't using the GiST index for a search when I thought it was (I didn't run EXPLAIN at every point, so I'll double-check). Clearly, a problem exists somewhere though. Hmmm... When Joe was looking at the patch he exposed an intermittent problem with btree indexes which turned out to be related to improper handling of the predicate locks during index page clean-up caused by a vacuum. Easy to fix once found, but it didn't always happen, even with identical runs. (I'm guessing that was due to the randomness in picking a page to split when inserting into a group of identical keys.) Perhaps a similar bug lurks in the GiST predicate locking. Eventually, we may need to keep statistics about the number of conflicts happening, and start to rate-limit new serializable transactions (effectively reducing parallelism) to ensure that reasonable progress can be made (hopefully faster than serial execution). Ah, you've exposed just how self-serving my interest in an admission control policy mechanism is! ;-) http://archives.postgresql.org/pgsql-hackers/2009-12/msg02189.php I also *really* hope to add the SERIALIZABLE READ ONLY DEFERRABLE mode so that pg_dump and other read-only transactions don't push things into a state where the rollback rate spikes: http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php One potential issue there is starvation. I don't see how you would guarantee that there will ever be a point to grab a safe snapshot, even if all of the other transactions are completing. 
After mulling it over in greater detail than previously, I see your point. I'll think about it some more, but this particular idea might be a dead end. a. We should probably have a compatibility GUC that makes SERIALIZABLE equal to REPEATABLE READ. My opinion is that this should be only for compatibility, and should default to off (i.e. SSI code enabled) either in 9.1 or soon after. I'm inclined to agree with you, with the strong preference for a 9.1 setting of off. This was previously discussed, and there were people who felt that we should avoid a behavior-changing GUC like that, so I didn't add it. It wouldn't be hard to put it in, if that's the community consensus. I think that was me, but I'm OK with behavior-changing GUCs as long as they are there for compatibility and we intend to push people toward the new behavior in the long run (like standard_conforming_strings, but hopefully not as long-lived). Maybe it's not necessary at all, because your patch offers strictly better guarantees (at the cost of more serialization failures), and if they want the old behavior then REPEATABLE READ is already there. Anyone else with an opinion on this? So a lock can't ever cover itself? No. If a transaction requests an SIRead lock identical to one it already holds, we ignore the request. When using locks in an unconventional way, it would be helpful to describe the invalid schedules that you're preventing. Perhaps an example if you think it would be reasonably simple? Also some indication of how another process is intended to modify the list without walking it. I will review that comment and add something along those lines. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Serializable snapshot isolation patch
This is based on Kevin's git repo at: git://git.postgresql.org/git/users/kgrittn/postgres.git SHA1: 729541fa5ea94d66e6f4b22fb65bfef92214cd6b

* Trivial stuff:

I get a compiler warning:

  indexfsm.c: In function ‘RecordFreeIndexPage’:
  indexfsm.c:55: warning: implicit declaration of function ‘PageIsPredicateLocked’

* Open issues, as I see it:

1. 2PC and SSI don't mix (this may be a known issue, because there's not really any code in the current implementation to deal with 2PC):

  Session1:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    select count(*) from a;
    insert into a values(1);
    PREPARE TRANSACTION 't1';
  Session2:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    select count(*) from a;
    insert into a values(1);
    COMMIT;
  Session1:
    COMMIT PREPARED 't1';

Looks like we need to track information about prepared transactions in shared memory. I think you'll need to keep the information in the 2PC state file as well, so that it can be rebuilt after a crash or restart. It all looks solvable at first glance, but it looks like it might be some work.

2. I think there's a GiST bug (illustrating with PERIOD type):

  create table foo(p period);
  create index foo_idx on foo using gist (p);
  insert into foo select period(
      '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
      '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
    from generate_series(1,200) g;

  Session1:
    begin isolation level serializable;
    select * from foo where p '[2009-01-01, 2009-01-01]'::period;
    insert into foo values('[2009-01-01, 2009-01-01]'::period);
  Session2:
    begin isolation level serializable;
    select * from foo where p '[2009-01-01, 2009-01-01]'::period;
    insert into foo values('[2009-01-01, 2009-01-01]'::period);
    commit;
  Session1:
    commit;

In pg_locks (didn't paste here due to formatting), it looks like the SIRead locks are holding locks on different pages. Can you clarify your design for GiST and the interaction with page-level locks? 
It looks like you're making some assumption about which pages will be visited when searching for conflicting values which doesn't hold true. However, that seems odd, because even if the value is actually inserted in one transaction, the other doesn't seem to find the conflict. Perhaps the bug is simpler than that? Or perhaps I have some kind of odd bug in PERIOD's gist implementation? Also, it appears to be non-deterministic, to a degree at least, so you may not observe the problem in the exact way that I do. 3. Limited shared memory space to hold information about committed transactions that are still interesting. Relevant thread: http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php It's a challenging problem, however, and the current solution is less than ideal. Idle transactions can mean that all new serializable transactions fail until the idle transactions start to terminate. I don't like that very much, because people expect to have to retry serializable transactions, but retrying here has no real hope (except that some time has elapsed, and maybe the other transaction decided to commit). A comparison is made (in the aforementioned thread) to the existing limitation on the number of locks. However, it's much easier to understand normal locks, and for a given workload usually you can put an upper bound on the number of locks required (right?). Does it make sense to kill the existing transactions that are holding everything up, rather than the new transaction? Or would that just confuse matters more? This does not necessarily guarantee that progress can be made, either, but intuitively it seems more likely. 4. A few final details: a. We should probably have a compatibility GUC that makes SERIALIZABLE equal to REPEATABLE READ. My opinion is that this should be only for compatibility, and should default to off (i.e. SSI code enabled) either in 9.1 or soon after. b. Docs. * Questions: 1. 
For TargetTagIsCoveredBy(), why is it impossible for the covering tag to have an offset? 2. The explanation for SerializablePredicateLockListLock is a little confusing. It makes it sound like other processes can't walk the list, but they can modify it? * Summary Great patch! I didn't make it through the patch in as much detail as I would have liked, because the theory behind it is quite complex and it will take longer for me to absorb. But the implementation looks good and the use case is very important. Regards, Jeff Davis
Re: [HACKERS] Serializable Snapshot Isolation
On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: One place I'm particularly interested in using such a feature is in pg_dump. Without it we have the choice of using a SERIALIZABLE transaction, which might fail or cause failures (which doesn't seem good for a backup program) or using REPEATABLE READ (to get current snapshot isolation behavior), which might capture a view of the data which contains serialization anomalies. I'm puzzled how pg_dump could possibly have serialization anomalies. Snapshot isolation gives pg_dump a view of the database containing all modifications committed before it started and no modifications which committed after it started. Since pg_dump makes no database modifications itself it can always just be taken to occur instantaneously before any transaction which committed after it started. -- greg
Re: [HACKERS] Serializable Snapshot Isolation
[ Forgot the list, resending. ] 2010/9/25 Greg Stark gsst...@mit.edu: On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: One place I'm particularly interested in using such a feature is in pg_dump. Without it we have the choice of using a SERIALIZABLE transaction, which might fail or cause failures (which doesn't seem good for a backup program) or using REPEATABLE READ (to get current snapshot isolation behavior), which might capture a view of the data which contains serialization anomalies. I'm puzzled how pg_dump could possibly have serialization anomalies. Snapshot isolation gives pg_dump a view of the database containing all modifications committed before it started and no modifications which committed after it started. Since pg_dump makes no database modifications itself it can always just be taken to occur instantaneously before any transaction which committed after it started. I guess that Kevin is referring to [1], where the dump would take the role of T3. That would mean that the dump itself must be aborted because it read inconsistent data. AFAICS, whether that reasoning means that a dump can produce an inconsistent backup is debatable. After restoring, all transactions that would have been in-flight at the moment the dump took its snapshot are gone, so none of their effects happened. We would be in exactly the same situation as if all running transactions would be forcibly aborted at the moment that the dump would have started. OTOH, if one would compare the backup with what really happened, things may look inconsistent. The dump would show what T3 witnessed (i.e., the current date is incremented and the receipts table is empty), although the current state of the database system shows otherwise (i.e., the current date is incremented and the receipts table has an entry for the previous date). 
IOW, one could say that the backup is consistent only if it were never compared against the system as it continued running after the dump took place. This stuff will probably confuse the hell out of most DBAs :-). Nicolas [1] URL:http://archives.postgresql.org/pgsql-hackers/2010-05/msg01360.php
Re: [HACKERS] Serializable Snapshot Isolation
Greg Stark gsst...@mit.edu writes: On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: One place I'm particularly interested in using such a feature is in pg_dump. Without it we have the choice of using a SERIALIZABLE transaction, which might fail or cause failures (which doesn't seem good for a backup program) or using REPEATABLE READ (to get current snapshot isolation behavior), which might capture a view of the data which contains serialization anomalies. I'm puzzled how pg_dump could possibly have serialization anomalies. At the moment, it can't. If this patch means that it can, that's going to be a mighty good reason not to apply the patch. regards, tom lane
Re: [HACKERS] Serializable Snapshot Isolation
Greg Stark wrote: Kevin Grittner wrote: One place I'm particularly interested in using such a feature is in pg_dump. Without it we have the choice of using a SERIALIZABLE transaction, which might fail or cause failures (which doesn't seem good for a backup program) or using REPEATABLE READ (to get current snapshot isolation behavior), which might capture a view of the data which contains serialization anomalies. I'm puzzled how pg_dump could possibly have serialization anomalies. Snapshot isolation gives pg_dump a view of the database containing all modifications committed before it started and no modifications which committed after it started. Since pg_dump makes no database modifications itself it can always just be taken to occur instantaneously before any transaction which committed after it started. Well, in the SQL-92 standard[1] the definition of serializable transactions was changed to the following: | The execution of concurrent SQL-transactions at isolation level | SERIALIZABLE is guaranteed to be serializable. A serializable | execution is defined to be an execution of the operations of | concurrently executing SQL-transactions that produces the same | effect as some serial execution of those same SQL-transactions. A | serial execution is one in which each SQL-transaction executes to | completion before the next SQL-transaction begins. It hasn't changed since then. Some people in the community cling to the notion, now obsolete for almost two decades, that serializable transactions are defined by the same anomalies which define the other three levels of transaction isolation. (It's probably time to catch up on that one.) 
Note that the above does *not* say it's OK for a SELECT in a transaction executing at the serializable isolation level to produce results which are not consistent with any serial execution of serializable transactions, as long as the database *eventually* reaches a state where a repeat of the same SELECT in a new transaction produces results consistent with such execution. Under the standard, even read-only transactions have to follow the rules, and I think most people would want that. Now, commit order in itself doesn't directly affect the apparent order of execution. It only directs the apparent order of execution to the extent that multiple transactions access the same data. Read-read conflicts don't matter in this scheme, and write-write conflicts are already handled under snapshot isolation by preventing both writers from committing -- if one commits, the other is forced to roll back with a serialization failure. That leaves write-read and read-write conflicts. In write-read conflicts, one transaction writes data and then commits in time for another transaction to see it. This implies that the writing transaction came first in the apparent order of execution. Now for the tricky one: in read-write conflicts (often written as rw-conflict) the reading transaction cannot see the write of a concurrent transaction because of snapshot visibility rules. Since the reading transaction is unable to see the work of the writing transaction, it must be considered to have come first in the apparent order of execution. In order to have a serialization anomaly under snapshot isolation, you need a situation like this: a transaction which I'll call T0 (matching much discussion on the topic published in recent years) has a rw-dependency on a transaction concurrent to T0, which I'll call T1. In addition, T1 has a rw-dependency on a transaction which is concurrent to it, which I'll call TN. 
The reason it's not T2 is that it can be the same transaction as T0 or a third transaction. So, because of the rw-conflicts, T0 appears to execute before T1, which appears to execute before TN. (At this point it should be obvious why you've got a problem if T0 and TN are the same transaction.) If T0 and TN are distinct, we still haven't quite met the conditions required to produce a serialization anomaly, however. The next requirement is that TN (the third in apparent order of execution) actually commits first. At this point, the third transaction's writes are exposed for the world to see, while there are still two uncommitted transactions which appear to have committed first. There are so many ways that this can lead to a cycle in apparent order of execution, some of which can happen in the client application, that Serializable Snapshot Isolation (SSI) doesn't pretend to track that. Barring one exception that I worked out myself (although I'd be shocked if someone didn't beat me to it and I just haven't found a paper describing it in my research), the above describes the conditions under which one of the transactions must be rolled back to prevent a serialization anomaly. The exception is interesting here, though. It is that if T0 is a READ ONLY transaction, it can't participate in an anomaly unless TN commits before T0 acquires its snapshot.
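The dangerous structure described above -- a pivot T1 with a rw-conflict in from T0 and a rw-conflict out to TN, where TN commits first -- can be expressed as a small mechanical check. This is purely an illustrative sketch in Python with invented names; it is not the patch's implementation:

```python
# Toy check for the SSI "dangerous structure": T0 --rw--> T1 --rw--> TN,
# where TN commits first. Illustrative only.

def dangerous_structure(rw_conflicts, commit_order):
    # rw_conflicts: set of (reader, writer) pairs -- the reader could not
    # see the concurrent writer's write, so the reader appears to execute
    # first. commit_order: transaction ids in actual commit order.
    pos = {x: i for i, x in enumerate(commit_order)}
    for (t0, t1) in rw_conflicts:
        for (r, tn) in rw_conflicts:
            if r != t1:
                continue
            # T1 is a pivot. A potential anomaly requires that TN, the
            # *last* in apparent order of execution, commits *first*.
            if tn in pos and all(
                tn == other or pos[tn] < pos.get(other, float("inf"))
                for other in (t0, t1, tn)
            ):
                return True
    return False

# T0 appears before T1, T1 appears before TN, but TN commits first:
risky = dangerous_structure({("T0", "T1"), ("T1", "TN")}, ["TN"])
```

Note the check returns nothing risky when TN commits last, matching the statement above that the dangerous structure only arises when the transaction on the out side of the pivot commits first.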
Re: [HACKERS] Serializable Snapshot Isolation
Nicolas Barbier wrote: IOW, one could say that the backup is consistent only if it were never compared against the system as it continued running after the dump took place. Precisely. I considered making that point in the email I just sent, but figured I had rambled enough. I suppose I should have gone there; thanks for covering the omission. :-) -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On Sat, Sep 25, 2010 at 4:24 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: OK, to get back to the question -- pg_dump's transaction (T0) could see an inconsistent version of the database if one transaction (TN) writes to a table, another transaction (T1) overlaps TN and can't read something written by TN because they are concurrent, TN commits before T0 acquires its snapshot, T1 writes to a table, T0 starts before T1 commits, and T0 can't read something which T1 wrote (which is sort of a given for a database dump and overlapping transactions). Can I collapse this into a single list of events?

  TN starts
  T1 starts
  TN writes
  T1 reads
  TN commits
  T0 starts (pg_dump)
  T1 writes
  T0 reads (pg_dump)
  T1 commits

So T1 must have happened before TN because it wrote something based on data as it was before TN modified it. But T0 can see TN but not T1, so there's no complete ordering between the three transactions that makes them all make sense. The thing is that the database state is reasonable; the database state is as it would be if the ordering were T1, TN, with T0 happening any time. And the backup state is reasonable; it's as if it occurred after TN and before T1. They just don't agree. -- greg
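Whether any serial order exists for the event list above can be checked mechanically by enumerating candidate orders against the visibility constraints it implies. An illustrative sketch (my own, not from the thread):

```python
from itertools import permutations

# Visibility constraints from the event list above:
#  - T1 read data before TN's write was visible, so T1 precedes TN.
#  - T0 (pg_dump) saw TN's effects, so TN precedes T0.
#  - T0 did not see T1's write, so T0 precedes T1.
constraints = [("T1", "TN"), ("TN", "T0"), ("T0", "T1")]

def consistent(order):
    pos = {x: i for i, x in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in constraints)

# The three constraints form a cycle, so no serial order satisfies them all.
valid_orders = [o for o in permutations(["T0", "T1", "TN"]) if consistent(o)]
```

This is the precise sense in which "there's no complete ordering between the three transactions": the constraint graph is cyclic.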
Re: [HACKERS] Serializable Snapshot Isolation
Greg Stark wrote: So T1 must have happened before TN because it wrote something based on data as it was before TN modified it. But T0 can see TN but not T1 so there's no complete ordering between the three transactions that makes them all make sense. Correct. The thing is that the database state is reasonable, the database state is as it would be if the ordering were T1, TN with T0 happening any time. And the backup state is reasonable, it's as if it occurred after TN and before T1. They just don't agree. I agree that the database state eventually settles into a valid long-term condition in this particular example. The point you are conceding seems to be that the image captured by pg_dump is not consistent with that. If so, I agree. You don't see that as a problem; I do. I'm not sure where we go from there. Certainly that is better than making pg_dump vulnerable to serialization failure -- if we don't implement the SERIALIZABLE READ ONLY DEFERRABLE transactions I was describing, we can change pg_dump to use REPEATABLE READ and we will be no worse off than we are now. The new feature I was proposing was that we create a SERIALIZABLE READ ONLY DEFERRABLE transaction style which would, rather than acquiring predicate locks and watching for conflicts, potentially wait until it could acquire a snapshot which was guaranteed to be conflict-free. In the example discussed on this thread, if we changed pg_dump to use such a mode, when it went to acquire a snapshot it would see that it overlapped T1, which was not READ ONLY, which in turn overlapped TN, which had written to a table and committed. It would then block until completion of the T1 transaction and adjust its snapshot to make that transaction visible. You would now have a backup entirely consistent with the long-term state of the database, with no risk of serialization failure and no bloating of the predicate lock structures. 
The only down side is that there could be blocking when such a transaction acquires its snapshot. That seems a reasonable price to pay for backup integrity. Obviously, if we had such a mode, it would be trivial to add a switch to the pg_dump command line which would let the user choose between guaranteed dump integrity and guaranteed lack of blocking at the start of the dump. -Kevin
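The SERIALIZABLE READ ONLY DEFERRABLE behavior described above amounts to a wait loop at snapshot acquisition. A minimal illustration with invented function and field names (the real logic would live in the snapshot manager, not look like this):

```python
import time

def acquire_safe_snapshot(get_active_transactions, take_snapshot, poll_interval=0.1):
    # Wait until no active read-write serializable transaction overlaps a
    # committed transaction which wrote data; at that point the snapshot is
    # guaranteed conflict-free, and the READ ONLY DEFERRABLE transaction can
    # run without predicate locks or conflict checking.
    while True:
        risky = [x for x in get_active_transactions()
                 if not x["read_only"] and x["overlaps_committed_writer"]]
        if not risky:
            return take_snapshot()
        time.sleep(poll_interval)  # block until the risky transactions complete

# Simulated system state: a risky transaction (like T1 in the example) is
# still active on the first check, then has completed by the second check.
states = [[{"read_only": False, "overlaps_committed_writer": True}], []]
snapshot = acquire_safe_snapshot(lambda: states.pop(0), lambda: "safe snapshot",
                                 poll_interval=0)
```

The starvation concern raised earlier in the thread is visible here: nothing bounds how long the loop can spin if overlapping read-write transactions keep arriving.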
Re: [HACKERS] Serializable Snapshot Isolation
Just to be clear I wasn't saying it was or wasn't a problem, I was just trying to see if I understand the problem and if I do maybe help bring others up to speed. On 25 Sep 2010 23:28, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Greg Stark wrote: So T1 must have happened before TN because it wrote something based on data as it was before TN modified it. But T0 can see TN but not T1 so there's no complete ordering between the three transactions that makes them all make sense. Correct. The thing is that the database state is reasonable, the database state is as it would be if the ordering were T1, TN with T0 happening any time. And the backup state is reasonable, it's as if it occurred after TN and before T1. They just don't agree. I agree that the database state eventually settles into a valid long-term condition in this particular example. The point you are conceding seems to be that the image captured by pg_dump is not consistent with that. If so, I agree. You don't see that as a problem; I do. I'm not sure where we go from there. Certainly that is better than making pg_dump vulnerable to serialization failure -- if we don't implement the SERIALIZABLE READ ONLY DEFERRABLE transactions I was describing, we can change pg_dump to use REPEATABLE READ and we will be no worse off than we are now. The new feature I was proposing was that we create a SERIALIZABLE READ ONLY DEFERRABLE transaction style which would, rather than acquiring predicate locks and watching for conflicts, potentially wait until it could acquire a snapshot which was guaranteed to be conflict-free. In the example discussed on this thread, if we changed pg_dump to use such a mode, when it went to acquire a snapshot it would see that it overlapped T1, which was not READ ONLY, which in turn overlapped TN, which had written to a table and committed. It would then block until completion of the T1 transaction and adjust its snapshot to make that transaction visible. 
You would now have a backup entirely consistent with the long-term state of the database, with no risk of serialization failure and no bloating of the predicate lock structures. The only down side is that there could be blocking when such a transaction acquires its snapshot. That seems a reasonable price to pay for backup integrity. Obviously, if we had such a mode, it would be trivial to add a switch to the pg_dump command line which would let the user choose between guaranteed dump integrity and guaranteed lack of blocking at the start of the dump. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On Sat, Sep 25, 2010 at 10:45 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark gsst...@mit.edu writes: On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: One place I'm particularly interested in using such a feature is in pg_dump. Without it we have the choice of using a SERIALIZABLE transaction, which might fail or cause failures (which doesn't seem good for a backup program) or using REPEATABLE READ (to get current snapshot isolation behavior), which might capture a view of the data which contains serialization anomalies. I'm puzzled how pg_dump could possibly have serialization anomalies. At the moment, it can't. If this patch means that it can, that's going to be a mighty good reason not to apply the patch. It certainly can, as can any other read-only transaction. This has been discussed many times here before with detailed examples, mostly by Kevin. T0 reads A and writes B. T1 then reads B and writes C. T0 commits. pg_dump runs. T1 commits. What is the fully serial order of execution consistent with this chronology? Clearly, T1 must be run before T0, since it doesn't see T0's update to B. But pg_dump sees the effects of T0 but not T1, so T0 must be run before T1. Oops. Now you might say that this won't be a problem for most people in practice, and I think that's true, but it's still unserializable. And pg_dump is the reason, because otherwise T1 then T0 would be a valid serialization. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Serializable Snapshot Isolation
Greg Stark wrote: Just to be clear I wasn't saying it was or wasn't a problem, I was just trying to see if I understand the problem and if I do maybe help bring others up to speed. Thanks for that, and my apologies for misunderstanding you. It does sound like you have a firm grasp on my concern, and you expressed it clearly and concisely. To build on what you said, there are two properties which bother me about a backup based on snapshot isolation in an environment where data integrity is enforced with application code, including user-written triggers. (1) If the backup is used to make a copy of the database for running reports, while the main database continues to be updated, it could represent a state which never existed in the database, at least as visible to serializable transactions. This makes any reports run against it of dubious value. (2) It may not be possible to update such a copy of the database to the later state of the database using replay of the same transaction stream, in any order -- particularly if the recovery counts on firing of the same triggers to supply default data or enforce integrity rules which were in place during the initial run. To continue with the receipting example, replaying the receipt which was in flight during the pg_dump run would result in assigning the wrong deposit date. This could cause problems for some replication or recovery strategies, although it's not apparent that it breaks anything actually shipped in the base PostgreSQL distribution. FWIW, it seems premature to spend a lot of time on the concept of the DEFERRABLE (or whatever term we might choose) transaction snapshots. I think I'll update the patch to use REPEATABLE READ for pg_dump for now, and we can revisit this as a possible enhancement later. As you point out, it doesn't really impact the usefulness of the dump for crash recovery beyond the issue I mention in point 2 above, and that's only going to come into play for certain recovery strategies. 
Even then, it's only an issue if pg_dump gets its snapshot while certain very specific conditions exist. And we're certainly in no worse shape regarding these concerns with the patch than without; they're long-standing issues. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: My aim is still to put an upper bound on the amount of shared memory required, regardless of the number of committed but still interesting transactions. That maps nicely to a SLRU table. Well, that didn't take as long to get my head around as I feared. I think SLRU would totally tank performance if used for this, and would really not put much of a cap on the memory taken out of circulation for purposes of caching. Transactions are not referenced more heavily at the front of the list, nor are they necessarily discarded more or less in order of acquisition. In transaction mixes where all transactions last about the same length of time, the upper limit of interesting transactions is about twice the number of active transactions, so memory demands are pretty light. The problems come in where you have at least one long-lived transaction and a lot of concurrent short-lived transactions. Since all transactions are scanned for cleanup every time a transaction completes, either they would all be taking up cache space or performance would drop to completely abysmal levels as it pounded disk. So SLRU in this case would be a sneaky way to effectively dynamically allocate shared memory, but about two orders of magnitude slower, at best. Here are the things which I think might be done, in some combination, to address your concern without killing performance: (1) Mitigate memory demand through more aggressive cleanup. As an example, a transaction which is READ ONLY (or which hasn't written to a relevant table as tracked by a flag in the transaction structure) is not of interest after commit, and can be immediately cleaned up, unless there is an overlapping non-read-only transaction which overlaps a committed transaction which wrote data. This is clearly not a solution to your concern in itself, but it combines with the other suggestions to make them more effective. 
(2) Similar to SLRU, allocate pages from shared buffers for lists, but pin them in memory without ever writing them to disk. A buffer could be freed when the last list item in it was freed and the buffer count for the list was above some minimum. This could deal with the episodic need for larger-than-typical amounts of RAM without permanently taking large quantities out of circulation. Obviously, we would still need some absolute cap, so this by itself doesn't answer your concern, either -- it just allows the impact to scale with the need, dynamically and within bounds. It has the same effective impact on memory usage as SLRU for this application without the same performance penalty.

(3) Here's the meat of it. When the lists hit their maximum, have some way to gracefully degrade the accuracy of the conflict tracking. This is similar to your initial suggestion that once a transaction committed we would not track it in detail, but implemented at need, when memory resources for tracking the detail become exhausted. I haven't worked out all the details, but I have a rough outline in my head. I wanted to run this set of ideas past you before I put the work in to fully develop it. This would be an alternative to just canceling the oldest running serializable transaction, which is the solution we could use right now to live within some set limit, possibly with (1) or (2) to help push back the point at which that's necessary. Rather than deterministically canceling the oldest active transaction, it would increase the probability of transactions being canceled because of false positives, with the chance we'd get through the peak without any such cancellations.

Thoughts? -Kevin
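The graceful-degradation idea in (3) can be illustrated with a toy model: keep detailed conflict information in a bounded table, and when the table fills, fold the oldest committed entry into a coarse summary that can only answer "maybe". Accuracy then degrades to false positives instead of refusing new transactions. This is a hypothetical sketch; the class and all names are invented here, not taken from the patch:

```python
# Toy model of graceful degradation: when the bounded conflict table fills,
# detail for the oldest transaction is discarded and replaced by membership
# in a coarse summary set.  Lookups against the summary are conservative:
# they always report a possible conflict (a false positive, never a miss).

class ConflictTracker:
    def __init__(self, capacity):
        self.capacity = capacity
        self.detailed = {}      # xid -> set of xids it has a rw-conflict out to
        self.summary = set()    # xids whose detail has been discarded

    def record_conflict(self, xid, other_xid):
        self.detailed.setdefault(xid, set()).add(other_xid)
        if len(self.detailed) > self.capacity:
            self._degrade()

    def _degrade(self):
        # Discard detail for the oldest tracked transaction (lowest xid here).
        oldest = min(self.detailed)
        self.detailed.pop(oldest)
        self.summary.add(oldest)

    def might_conflict(self, xid, other_xid):
        if xid in self.detailed:
            return other_xid in self.detailed[xid]   # exact answer
        return xid in self.summary                   # conservative: maybe
```

While within capacity, answers are exact; after degradation, a summarized transaction conflicts with everything, which is the trade of memory for false-positive rollbacks described above.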
Re: [HACKERS] Serializable Snapshot Isolation
On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Thoughts? Premature optimization is the root of all evil. I'm not convinced that we should tinker with any of this before committing it and getting some real-world experience. It's not going to be perfect in the first version, just like any other major feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Serializable Snapshot Isolation
Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Thoughts? Premature optimization is the root of all evil. I'm not convinced that we should tinker with any of this before committing it and getting some real-world experience. It's not going to be perfect in the first version, just like any other major feature. In terms of pure optimization, I totally agree -- that's why I'm submitting early without a number of potential optimizations. I think we're better off getting a solid base and then attempting to prove the merits of each optimization separately. The point Heikki is on about, however, gets into user-facing behavior issues. The current implementation will give users an out of shared memory error if they attempt to start a SERIALIZABLE transaction when our preallocated shared memory for tracking such transactions reaches its limit. A fairly easy alternative would be to kill running SERIALIZABLE transactions, starting with the oldest, until a new request can proceed. The question is whether either of these is acceptable behavior for an initial implementation, or whether something fancier is needed up front. Personally, I'd be fine with out of shared memory for an excess of SERIALIZABLE transactions for now, and leave refinement for later -- I just want to be clear that there is user-visible behavior involved here. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On Fri, Sep 24, 2010 at 1:35 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Thoughts? Premature optimization is the root of all evil. I'm not convinced that we should tinker with any of this before committing it and getting some real-world experience. It's not going to be perfect in the first version, just like any other major feature. In terms of pure optimization, I totally agree -- that's why I'm submitting early without a number of potential optimizations. I think we're better off getting a solid base and then attempting to prove the merits of each optimization separately. The point Heikki is on about, however, gets into user-facing behavior issues. The current implementation will give users an out of shared memory error if they attempt to start a SERIALIZABLE transaction when our preallocated shared memory for tracking such transactions reaches its limit. A fairly easy alternative would be to kill running SERIALIZABLE transactions, starting with the oldest, until a new request can proceed. The question is whether either of these is acceptable behavior for an initial implementation, or whether something fancier is needed up front. Personally, I'd be fine with out of shared memory for an excess of SERIALIZABLE transactions for now, and leave refinement for later -- I just want to be clear that there is user-visible behavior involved here. Yeah, I understand, but I think the only changes we should make now are things that we're sure are improvements. I haven't read the code, but based on reading the thread so far, we're off into the realm of speculating about trade-offs, and I'm not sure that's a good place for us to be. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Serializable Snapshot Isolation
Robert Haas robertmh...@gmail.com wrote: I think the only changes we should make now are things that we're sure are improvements. In that vein, anyone who is considering reviewing the patch should check the latest from the git repo or request an incremental patch. I've committed a few things since the last patch post, but it doesn't seem to make sense to repost the whole thing for them. I fixed a bug in the new shared memory list code, fixed a misleading hint, and fixed some whitespace and comment issues. The changes I've committed to the repo so far based on Heikki's comments are, I feel, clear improvements. It was actually fairly embarrassing that I didn't notice some of that myself. based on reading the thread so far, we're off into the realm of speculating about trade-offs This latest issue seems that way to me. We're talking about somewhere around 100 kB of shared memory in a 64 bit build with the default number of connections, with a behavior on exhaustion which matches what we do on normal locks. This limit is easier to hit, and we should probably revisit it, but I am eager to get the feature as a whole in front of people, to see how well it works for them in other respects. I'll be quite surprised if we've found all the corner cases, but it is working, and working well, in a variety of tests. It has been for months, really; I've been holding back, as requested, to avoid distracting people from the 9.0 release. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 23/09/10 02:14, Kevin Grittner wrote: There is a rub on the other point, though. Without transaction information you have no way of telling whether TN committed before T0, so you would need to assume that it did. So on this count, there is bound to be some increase in false positives leading to transaction rollback. Without more study, and maybe some tests, I'm not sure how significant it is. (Actually, we might want to track commit sequence somehow, so we can determine this with greater accuracy.) I'm confused. AFAICS there is no way to tell if TN committed before T0 in the current patch either. Well, we can certainly infer it if the finishedBefore values differ. And, as I said, if we don't eliminate this structure for committed transactions, we could add a commitId or some such, with precedes and follows tests similar to TransactionId. The other way we can detect conflicts is a read by a serializable transaction noticing that a different and overlapping serializable transaction wrote the tuple we're trying to read. How do you propose to know that the other transaction was serializable without keeping the SERIALIZABLEXACT information? Hmm, I see. We could record which transactions were serializable in a new clog-like structure that wouldn't exhaust shared memory. And how do you propose to record the conflict without it? I thought you just abort the transaction that would cause the conflict right there. The other transaction is committed already, so you can't do anything about it anymore. No, it always requires a rw-conflict from T0 to T1 and a rw-conflict from T1 to TN, as well as TN committing first and (T0 not being READ ONLY or TN not overlapping T0). The number and complexity of the conditions which must be met to cause a serialization failure are what keep the failure rate reasonable. 
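The conditions enumerated above can be sketched as a small model: transactions carry begin/commit ordinals, the rw-conflict chain T0 -> T1 -> TN is taken as given, and the remaining tests are applied. This is an illustrative Python sketch of the rule as stated in this thread, not code from the patch, and all names are invented:

```python
# Sketch of the dangerous-structure test: given rw-conflicts T0 -> T1 and
# T1 -> TN (T1 is the pivot), a serialization failure additionally requires
# that TN commits first, and that T0 is not READ ONLY or TN does not
# overlap T0.  Begin/commit are abstract ordinals, not real timestamps.

class Xact:
    def __init__(self, begin, commit, read_only=False):
        self.begin, self.commit, self.read_only = begin, commit, read_only

def overlaps(a, b):
    # Two transactions overlap if neither committed before the other began.
    return a.begin < b.commit and b.begin < a.commit

def dangerous_structure(t0, t1, tn):
    """Assumes rw-conflicts T0 -> T1 -> TN; checks the remaining conditions."""
    tn_commits_first = tn.commit < t0.commit and tn.commit < t1.commit
    return tn_commits_first and (not t0.read_only or not overlaps(tn, t0))
```

Note how a READ ONLY T0 escapes rollback whenever it overlaps TN, which is what keeps the failure rate reasonable for read-only work.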
If we start rolling back transactions every time one transaction simply reads a row modified by a concurrent transaction I suspect that we'd have such a storm of serialization failures in most workloads that nobody would want to use it. Finally, this would preclude some optimizations which I *think* will pay off, which trade a few hundred kB more of shared memory, and some additional CPU to maintain more detailed conflict data, for a lower false positive rate -- meaning fewer transactions rolled back for hard-to-explain reasons. This more detailed information is also what seems to be desired by Dan S (on another thread) to be able to log the information needed to be able to reduce rollbacks. Ok, I think I'm ready to hear about those optimizations now :-). Dan Ports is eager to implement next key predicate locking for indexes, but wants more benchmarks to confirm the benefit. (Most of the remaining potential optimizations carry some risk of being counter-productive, so we want to go in with something conservative and justify each optimization separately.) That one only affects your proposal to the extent that the chance to consolidate locks on the same target by committed transactions would likely have fewer matches to collapse. One that I find interesting is the idea that we could set a SERIALIZABLE READ ONLY transaction with some additional property (perhaps DEFERRED or DEFERRABLE) which would cause it to take a snapshot and then wait until there were no overlapping serializable transactions which are not READ ONLY which overlap a running SERIALIZABLE transaction which is not READ ONLY. At this point it could make a valid snapshot which would allow it to run without taking predicate locks or checking for conflicts. 
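The DEFERRED/DEFERRABLE idea above can be modelled loosely as a loop: keep retaking the snapshot until no overlapping non-read-only serializable transaction remains, at which point the snapshot is known safe and the transaction can run without predicate locks or conflict checks. This is a hypothetical sketch of the control flow only; both callback names are invented:

```python
# Loose model of the proposed SERIALIZABLE READ ONLY DEFERRABLE behaviour:
# wait (by retaking the snapshot) until there is no active non-read-only
# serializable transaction overlapping it.  A snapshot passing that test
# cannot be involved in any dangerous structure, so it is "safe".

def wait_for_safe_snapshot(take_snapshot, concurrent_rw_xacts):
    """take_snapshot() -> snapshot; concurrent_rw_xacts(snapshot) -> set of
    active non-read-only serializable transactions overlapping it."""
    while True:
        snap = take_snapshot()
        if not concurrent_rw_xacts(snap):
            return snap   # safe: no rw transaction can create a conflict
```

A real implementation would presumably block on transaction completion rather than poll, but the termination condition is the same.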
It would have no chance of being rolled back with a serialization failure *or* of contributing to the failure of any other transaction, yet it would be guaranteed to see a view of the database consistent with the actions of all other serializable transactions.

One place I'm particularly interested in using such a feature is in pg_dump. Without it we have the choice of using a SERIALIZABLE transaction, which might fail or cause failures (which doesn't seem good for a backup program) or using REPEATABLE READ (to get current snapshot isolation behavior), which might capture a view of the data which contains serialization anomalies. The notion of capturing a backup which doesn't comply with business rules enforced by serializable transactions gives me the willies, but it would be better than not getting a backup reliably, so in the absence of this feature, I think we need to change pg_dump to use REPEATABLE READ. I can't see how to do this without keeping information on committed transactions.

This next paragraph is copied straight from the Wiki page: It appears that when a pivot is formed where T0 is flagged as a READ ONLY transaction, and it is concurrent with TN, we can wait to see whether anything really needs to roll back. If T1 commits before developing a
Re: [HACKERS] Serializable Snapshot Isolation
On 23/09/10 18:08, Kevin Grittner wrote: Less important than any of the above, but still significant in my book, I fear that conflict recording and dangerous structure detection could become very convoluted and fragile if we eliminate this structure for committed transactions. Conflicts among specific sets of transactions are the linchpin of this whole approach, and I think that without an object to represent each one for the duration for which it is significant is dangerous. Inferring information from a variety of sources feels wrong to me.

Ok, so if we assume that we must keep all the information we have now, let me try again with that requirement. My aim is still to put an upper bound on the amount of shared memory required, regardless of the number of committed but still interesting transactions. Cahill's thesis mentions that the per-transaction information can be kept in a table like this:

  txnID  beginTime  commitTime  inConf  outConf
  100    1000       1100        N       Y
  101    1000       1500        N       N
  102    1200       N/A         Y       N

That maps nicely to a SLRU table, truncated from the top as entries become old enough, and appended to the end.

In addition to that, we need to keep track of locks held by each transaction, in a finite amount of shared memory. For each predicate lock, we need to store the lock tag, and the list of transactions holding the lock. The list of transactions is where the problem is; there is no limit on its size. Conveniently, we already have a way of representing an arbitrary set of transactions with a single integer: multi-transactions, in multixact.c. Now, we have a little issue in that read-only transactions don't have xids, and can't therefore be part of a multixid, but it could be used as a model to implement something similar for virtual transaction ids. Just a thought, not sure what the performance would be like or how much work such a multixid-like structure would be to implement.. 
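The multixact-like idea can be sketched as a small table that maps each distinct set of (virtual) transaction ids to one integer, so a predicate lock can store a single int no matter how many transactions hold it. This Python sketch is modelled only loosely on multixact.c; the class and method names are invented for illustration:

```python
# Sketch of a multixid-like structure for virtual transaction ids: each
# distinct member set gets one small integer id, deduplicated by content.
# Like real multixacts, a multi is immutable; "adding" a member produces a
# (possibly new) multi id rather than mutating the old one.

class MultiXactTable:
    def __init__(self):
        self.by_members = {}   # frozenset of ids -> multi id
        self.by_id = []        # multi id -> frozenset of ids

    def get_multi(self, members):
        key = frozenset(members)
        if key not in self.by_members:
            self.by_members[key] = len(self.by_id)
            self.by_id.append(key)
        return self.by_members[key]

    def members(self, multi_id):
        return self.by_id[multi_id]

    def add_member(self, multi_id, new_member):
        # Immutable: extending a multi yields the id for the enlarged set.
        return self.get_multi(self.by_id[multi_id] | {new_member})
```

The open question raised in the thread remains, of course: this table itself lives somewhere, and a shared-memory version would need its own eviction and locking strategy.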
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 23/09/10 18:08, Kevin Grittner wrote: Less important than any of the above, but still significant in my book, I fear that conflict recording and dangerous structure detection could become very convoluted and fragile if we eliminate this structure for committed transactions. Conflicts among specific sets of transactions are the linchpin of this whole approach, and I think that without an object to represent each one for the duration for which it is significant is dangerous. Inferring information from a variety of sources feels wrong to me. Ok, so if we assume that we must keep all the information we have now, let me try again with that requirement. My aim is still to put an upper bound on the amount of shared memory required, regardless of the number of committed but still interesting transactions. Cahill's thesis mentions that the per-transaction information can be kept in a table like this:

  txnID  beginTime  commitTime  inConf  outConf
  100    1000       1100        N       Y
  101    1000       1500        N       N
  102    1200       N/A         Y       N

That maps nicely to a SLRU table, truncated from the top as entries become old enough, and appended to the end.

Well, the inConf and outConf were later converted to pointers in Cahill's work, and our MVCC implementation doesn't let us use times quite that way -- we're using xmins and such, but I assume the point holds regardless of such differences. (I mostly mention it to avoid confusion for more casual followers of the thread.)

In addition to that, we need to keep track of locks held by each transaction, in a finite amount of shared memory. For each predicate lock, we need to store the lock tag, and the list of transactions holding the lock. The list of transactions is where the problem is, there is no limit on its size. Conveniently, we already have a way of representing an arbitrary set of transactions with a single integer: multi-transactions, in multixact.c. 
Now, we have a little issue in that read-only transactions don't have xids, and can't therefore be part of a multixid, but it could be used as a model to implement something similar for virtual transaction ids. Just a thought, not sure what the performance would be like or how much work such a multixid-like structure would be to implement.. You're pointing toward some code I haven't yet laid eyes on, so it will probably take me a few days to really digest your suggestion and formulate an opinion. This is just to let you know I'm working on it. I really appreciate your attention to this. Thanks! -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On 19/09/10 21:57, I wrote: Putting that aside for now, we have one very serious problem with this algorithm: While they [SIREAD locks] are associated with a transaction, they must survive a successful COMMIT of that transaction, and remain until all overlapping transactions complete. Long-running transactions are already nasty because they prevent VACUUM from cleaning up old tuple versions, but this escalates the problem to a whole new level. If you have one old transaction sitting idle, every transaction that follows consumes a little bit of shared memory, until that old transaction commits. Eventually you will run out of shared memory, and will not be able to start new transactions anymore. Is there anything we can do about that? Just a thought, but could you somehow coalesce the information about multiple already-committed transactions to keep down the shared memory usage? For example, if you have this:

1. Transaction slow begins
2. 100 other transactions begin and commit

Could you somehow group together the 100 committed transactions and represent them with just one SERIALIZABLEXACT struct?

Ok, I think I've come up with a scheme that puts an upper bound on the amount of shared memory used, wrt. number of transactions. You can still run out of shared memory if you lock a lot of objects, but that doesn't worry me as much. When a transaction commits, its predicate locks must still be held, but it's not important anymore *who* holds them, as long as they're held for long enough. Let's move the finishedBefore field from SERIALIZABLEXACT to PREDICATELOCK. When a transaction commits, set the finishedBefore field in all the PREDICATELOCKs it holds, and then release the SERIALIZABLEXACT struct. The predicate locks stay without an associated SERIALIZABLEXACT entry until finishedBefore expires. 
Whenever there are two predicate locks on the same target that both belonged to an already-committed transaction, the one with a smaller finishedBefore can be dropped, because the one with higher finishedBefore value covers it already. There. That was surprisingly simple, I must be missing something. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
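The coalescing rule above is simple enough to state in a few lines of code. In this illustrative Python sketch (not patch code), a committed-transaction predicate lock is reduced to a (target, finishedBefore) pair, and per target only the highest finishedBefore survives:

```python
# Sketch of the proposed coalescing: among predicate locks that belong to
# already-committed transactions, two locks on the same target are redundant,
# and only the one with the higher finishedBefore value needs to be kept,
# since it covers the other for the full interval that matters.

def coalesce(committed_locks):
    keep = {}
    for target, finished_before in committed_locks:
        if target not in keep or finished_before > keep[target]:
            keep[target] = finished_before
    return [(t, fb) for t, fb in keep.items()]
```

The lock-tag strings here are made up; the point is only that the per-target state collapses to a single horizon value.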
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: When a transaction commits, its predicate locks must still be held, but it's not important anymore *who* holds them, as long as they're held for long enough. Let's move the finishedBefore field from SERIALIZABLEXACT to PREDICATELOCK. When a transaction commits, set the finishedBefore field in all the PREDICATELOCKs it holds, and then release the SERIALIZABLEXACT struct. The predicate locks stay without an associated SERIALIZABLEXACT entry until finishedBefore expires. Whenever there are two predicate locks on the same target that both belonged to an already-committed transaction, the one with a smaller finishedBefore can be dropped, because the one with higher finishedBefore value covers it already.

I don't think this works. Gory details follow. The predicate locks only matter when a tuple is being written which might conflict with one. In the notation often used for the dangerous structures, the conflict only occurs if TN writes something which T1 can't read or T1 writes something which T0 can't read. When you combine this with the fact that you don't have a problem unless TN commits *first*, then you can't have a problem with TN looking up a predicate lock of a committed transaction; if it's still writing tuples after T1's commit, the conflict can't matter and really should be ignored. If T1 is looking up a predicate lock for T0 and finds it committed, there are two things which must be true for this to generate a real conflict: TN must have committed before T0, and T0 must have overlapped T1 -- T0 must not have been able to see T1's write. If we have a way to establish these two facts without keeping transaction level data for committed transactions, predicate lock *lookup* wouldn't stand in the way of your proposal. 
Since the writing transaction is active, if the xmin of its starting transaction comes before the finishedBefore value, they must have overlapped; so I think we have that part covered, and I can't see a problem with your proposed use of the earliest finishedBefore value. There is a rub on the other point, though. Without transaction information you have no way of telling whether TN committed before T0, so you would need to assume that it did. So on this count, there is bound to be some increase in false positives leading to transaction rollback. Without more study, and maybe some tests, I'm not sure how significant it is. (Actually, we might want to track commit sequence somehow, so we can determine this with greater accuracy.) But wait, the bigger problems are yet to come. The other way we can detect conflicts is a read by a serializable transaction noticing that a different and overlapping serializable transaction wrote the tuple we're trying to read. How do you propose to know that the other transaction was serializable without keeping the SERIALIZABLEXACT information? And how do you propose to record the conflict without it? The wheels pretty much fall off the idea entirely here, as far as I can see. Finally, this would preclude some optimizations which I *think* will pay off, which trade a few hundred kB more of shared memory, and some additional CPU to maintain more detailed conflict data, for a lower false positive rate -- meaning fewer transactions rolled back for hard-to-explain reasons. This more detailed information is also what seems to be desired by Dan S (on another thread) to be able to log the information needed to be able to reduce rollbacks. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On 23/09/10 02:14, Kevin Grittner wrote: There is a rub on the other point, though. Without transaction information you have no way of telling whether TN committed before T0, so you would need to assume that it did. So on this count, there is bound to be some increase in false positives leading to transaction rollback. Without more study, and maybe some tests, I'm not sure how significant it is. (Actually, we might want to track commit sequence somehow, so we can determine this with greater accuracy.) I'm confused. AFAICS there is no way to tell if TN committed before T0 in the current patch either. But wait, the bigger problems are yet to come. The other way we can detect conflicts is a read by a serializable transaction noticing that a different and overlapping serializable transaction wrote the tuple we're trying to read. How do you propose to know that the other transaction was serializable without keeping the SERIALIZABLEXACT information? Hmm, I see. We could record which transactions were serializable in a new clog-like structure that wouldn't exhaust shared memory. And how do you propose to record the conflict without it? I thought you just abort the transaction that would cause the conflict right there. The other transaction is committed already, so you can't do anything about it anymore. Finally, this would preclude some optimizations which I *think* will pay off, which trade a few hundred kB more of shared memory, and some additional CPU to maintain more detailed conflict data, for a lower false positive rate -- meaning fewer transactions rolled back for hard-to-explain reasons. This more detailed information is also what seems to be desired by Dan S (on another thread) to be able to log the information needed to be able to reduce rollbacks. Ok, I think I'm ready to hear about those optimizations now :-). 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Serializable snapshot isolation error logging
A starvation scenario is what worries me: Let's say we have a slow, complex transaction with many tables involved. Concurrently, smaller transactions begin and commit. Wouldn't it be possible for a starvation scenario where the slower transaction will never run to completion, but gives a serialization failure over and over again on retry?

If I know at what SQL statement the serialization failure occurs, can I then conclude that some of the tables in that exact query were involved in the conflict? If the serialization failure occurs at commit time, what can I conclude then? They can occur at commit time, right?

What is the likelihood that there exists an update pattern that always gives the failure in the slow transaction? How would one break such a recurring pattern? You could maybe try to lock each table used in the slow transaction, but that would be prohibitively costly for concurrency. But what else, if there is no way of knowing what the slow transaction conflicts against?

As things with concurrency involved have a tendency to pop up in production and not in test, I think it is important to start thinking about them as soon as possible. Best Regards Dan S
Re: [HACKERS] Serializable snapshot isolation error logging
Dan S strd...@gmail.com wrote: A starvation scenario is what worries me: Let's say we have a slow, complex transaction with many tables involved. Concurrently, smaller transactions begin and commit. Wouldn't it be possible for a starvation scenario where the slower transaction will never run to completion, but gives a serialization failure over and over again on retry?

At least theoretically, yes. One of the reasons I want to try converting the single conflict reference to a list is to make for a better worst-case situation. Since anomalies can only occur when the TN transaction (convention used in earlier post) commits first, and by definition TN has done writes, with a list of conflicts you could make sure that some transaction which writes has successfully committed before any transaction rolls back. So progress with writes would be guaranteed. There would also be a guarantee that if you restart a canceled transaction, it would not immediately fail again on conflicts *with the same transactions*. Unfortunately, with the single field for tracking conflicts, the self-reference on multiple conflicting transactions loses detail, and you lose these guarantees. Now, could the large, long-running transaction still be the transaction canceled? Yes. Are there ways to ensure it can complete? Yes. Some are prettier than others. I've already come up with some techniques to avoid some classes of rollbacks with transactions flagged as READ ONLY, and with the conflict lists there would be a potential to recognize de facto read-only transactions and apply similar logic, so a long-running transaction which didn't write to any permanent tables (or at least not to ones which other transactions were reading) would be pretty safe -- and with one of our R&D points, you could guarantee its safety by blocking the acquisition of its snapshot until certain conditions were met. 
With conflict lists we would also always have two candidates for cancellation at the point where we found something needed to be canceled. Right now I'm taking the coward's way out and always canceling the transaction active in the process which detects the need to roll something back. As long as one process can cancel another, we can use other heuristics for that. Several possible techniques come to mind to try to deal with the situation you raise. If all else fails, the transaction could acquire explicit table locks up front, but that sort of defeats the purpose of having an isolation level which guarantees full serializable behavior without adding any blocking to snapshot isolation. :-(

If I know at what SQL statement the serialization failure occurs, can I then conclude that some of the tables in that exact query were involved in the conflict?

No. It could be related to any statements which had executed in the transaction up to that point.

If the serialization failure occurs at commit time, what can I conclude then?

That a dangerous combination of read-write dependencies occurred which involved this transaction.

They can occur at commit time, right?

Yes. Depending on the heuristics chosen, it could happen while idle in transaction. (We can kill transactions in that state now, right?)

What is the likelihood that there exists an update pattern that always gives the failure in the slow transaction?

I don't know how to quantify that. I haven't seen it yet in testing, but many of my tests so far have been rather contrived. We desperately need more testing of this patch with realistic workloads.

How would one break such a recurring pattern?

As mentioned above, the conflict list enhancement would help ensure that *something* is making progress, and we could tweak the heuristics on *what* gets canceled to try to deal with this.

You could maybe try to lock each table used in the slow transaction, but that would be prohibitively costly for concurrency. 
Exactly. But what else if there is no way of knowing what the slow transaction conflicts against. Well, that is supposed to be the situation where this type of approach is a good thing. The trick is to get enough experience with different loads to make sure we're using good heuristics to deal with various loads well. Ultimately, there may be some loads for which this technique is just not appropriate. Hopefully those cases can be addressed with the techniques made possible with Florian's patch. As things with concurrency involved have a tendency to pop up in production and not in test I think it is important to start thinking about them as soon as possible. Oh, I've been thinking about it a great deal for quite a while. The problem is exactly as you state -- it is very hard to construct tests which give a good idea of what the impact will be in production loads. I'm sure I could construct a test which would make the patch look glorious. I'm sure I could construct a test which would make the patch look
Re: [HACKERS] Serializable snapshot isolation error logging
On Tue, Sep 21, 2010 at 12:57 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: What is the likelihood that there exists an update pattern that always gives the failure in the slow transaction? I don't know how to quantify that. I haven't seen it yet in testing, but many of my tests so far have been rather contrived. We desperately need more testing of this patch with realistic workloads. I'm really hoping that Tom or Heikki will have a chance to take a serious look at this patch soon with a view to committing it. It sounds like Kevin has done a great deal of testing on his own, but we're not going to really get field experience with this until it's in the tree. It would be nice to get this in well before feature freeze so that we have a chance to see what shakes out while there's still time to adjust it. Recall that Hot Standby was committed in December and we were still adjusting the code in May. It would be much nicer to commit in September and finish up adjusting the code in February. It helps get the release out on schedule. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Serializable Snapshot Isolation
I wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: ISTM you never search the SerializableXactHash table using a hash key, except the one call in CheckForSerializableConflictOut, but there you already have a pointer to the SERIALIZABLEXACT struct. You only re-find it to make sure it hasn't gone away while you trade the shared lock for an exclusive one. If we find another way to ensure that, ISTM we don't need SerializableXactHash at all. My first thought was to forget about VirtualTransactionId and use TransactionId directly as the hash key for SERIALIZABLEXACT. The problem is that a transaction doesn't have a transaction ID when RegisterSerializableTransaction is called. We could leave the TransactionId blank and only add the SERIALIZABLEXACT struct to the hash table when an XID is assigned, but there's no provision to insert an existing struct into a hash table in the current hash table API. So, I'm not sure of the details yet, but it seems like it could be made simpler somehow. After tossing it around in my head for a bit, the only thing that I see (so far) which might work is to maintain a *list* of SERIALIZABLEXACT objects in memory rather than using a hash table. The recheck after releasing the shared lock and acquiring an exclusive lock would then go through SerializableXidHash. I think that can work, although I'm not 100% sure that it's an improvement. I'll look it over in more detail. I'd be happy to hear your thoughts on this or any other suggestions. I haven't come up with any better ideas. Pondering this one, it seems to me that a list would be better than a hash table if we had a list which would automatically allocate and link new entries, and would maintain a list of available entries for (re)use.
I wouldn't want to sprinkle such an implementation in with predicate locking and SSI code, but if there is a feeling that such a thing would be worth having in shmqueue.c or some new file which uses the SHM_QUEUE structure to provide an API for such functionality, I'd be willing to write that and use it in the SSI code. Without something like that, I have so far been unable to envision an improvement along the lines Heikki is suggesting here. Thoughts? -Kevin
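The kind of self-managing list Kevin describes -- entries allocated from a fixed pool, linked while in use, and returned to an available list for reuse -- can be sketched outside shared memory. This is a toy Python model, not the SHM_QUEUE C API; `FixedPool` and its method names are invented for illustration:

```python
class FixedPool:
    """Toy model of a shared-memory list allocator: a fixed number of
    entries, an 'available' free list, and an 'active' list of entries
    currently in use (analogous to SHM_QUEUE links)."""
    def __init__(self, size):
        self.available = list(range(size))  # indices of free entries
        self.active = []                    # indices currently in use

    def alloc(self):
        if not self.available:
            return None  # pool exhausted -- caller must degrade gracefully
        entry = self.available.pop()
        self.active.append(entry)
        return entry

    def release(self, entry):
        self.active.remove(entry)
        self.available.append(entry)  # recycle for reuse

pool = FixedPool(3)
a = pool.alloc()
b = pool.alloc()
pool.release(a)
c = pool.alloc()  # reuses the slot freed by a
```

The point of such an API is that callers never allocate or free anything at runtime; everything lives in a preallocated shared-memory segment and only moves between the two lists.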
[HACKERS] Serializable snapshot isolation error logging
Hi! I wonder if the SSI implementation will give some way of detecting the cause of a serialization failure. Something like the deadlock detection, maybe, where you get the sql-statements involved. Best Regards Dan S
Re: [HACKERS] Serializable snapshot isolation error logging
Dan S strd...@gmail.com wrote: I wonder if the SSI implementation will give some way of detecting the cause of a serialization failure. Something like the deadlock detection maybe where you get the sql-statements involved. I've been wondering what detail to try to include. There will often be three transactions involved in an SSI serialization failure, although the algorithm we're using (based on the referenced papers) may only know about one or two of them at the point of failure, because conflicts with multiple other transactions get collapsed to a self-reference. (One optimization I want to try is to maintain a list of conflicts rather than doing the above -- in which case we could always show all three transactions; but we may run out of time for that, and even if we don't, the decreased rollbacks might not pay for the cost of maintaining such a list.) The other information we would have would be the predicate locks held by whatever transactions we know about at the point of cancellation, based on what reads they've done; however, we wouldn't know about the writes done by those transactions, or which of the reads resulted in conflicts. So, given the above, any thoughts on what we *should* show? -Kevin
Re: [HACKERS] Serializable snapshot isolation error logging
Well I guess one would like some way to find out which statements in the involved transactions are the cause of the serialization failure and what programs they reside in. Also which relations were involved; the sql-statements may contain many relations but just one or a few might be involved in the failure, right? The tuples involved, if available. I don't know how helpful knowing the pages involved would be; I certainly wouldn't know what to do with that info. All this is of course to be able to guess at which statements to modify or change execution order of, take an explicit lock on, and so on, to reduce the serialization failure rate. If holding a list of the involved transactions turns out to be expensive, maybe one should be able to turn it on by a GUC only when you have a problem and need the extra information to track it down. Best Regards Dan S 2010/9/20 Kevin Grittner kevin.gritt...@wicourts.gov Dan S strd...@gmail.com wrote: I wonder if the SSI implementation will give some way of detecting the cause of a serialization failure. Something like the deadlock detection maybe where you get the sql-statements involved. I've been wondering what detail to try to include. There will often be three transactions involved in an SSI serialization failure, although the algorithm we're using (based on the referenced papers) may only know about one or two of them at the point of failure, because conflicts with multiple other transactions get collapsed to a self-reference. (One optimization I want to try is to maintain a list of conflicts rather than doing the above -- in which case we could always show all three transactions; but we may run out of time for that, and even if we don't, the decreased rollbacks might not pay for the cost of maintaining such a list.)
The other information we would have would be the predicate locks held by whatever transactions we know about at the point of cancellation, based on what reads they've done; however, we wouldn't know about the writes done by those transactions, or which of the reads resulted in conflicts. So, given the above, any thoughts on what we *should* show? -Kevin
Re: [HACKERS] Serializable snapshot isolation error logging
Dan S strd...@gmail.com wrote: Well I guess one would like some way to find out which statements in the involved transactions are the cause of the serialization failure and what programs they reside in. Unless we get the conflict list optimization added after the base patch, you might get anywhere from one to three of the two to three transactions involved in the serialization failure. We can also report the position they have in the dangerous structure and mention that there are other, unidentified, transactions participating in the conflict. Once I get through with the issue I'm working on based on Heikki's observations, I'll take a look at this. Also which relations were involved, the sql-statements may contain many relations but just one or a few might be involved in the failure, right ? The conflicts would have occurred on specific relations, but we don't store all that -- it would be prohibitively expensive. What we track is that transaction T0's read couldn't see the write from transaction T1. Once you know that, SSI doesn't require that you know which or how many relations were involved in that -- you've established that T0 must logically come before T1. That in itself is no problem, of course. But if you also establish that T1 must come before TN (where TN might be T0 or a third transaction), you've got a pivot at T1. You're still not dead in the water yet, but if that third logical transaction actually *commits* first, you're probably in trouble. The only way out is that if T0 is not TN, T0 is read only, and TN did *not* commit before T0 got its snapshot, you're OK. Where it gets complicated is that in the algorithm in the paper, which we are following for the initial commit attempt, each transaction keeps one conflictIn and one conflictOut pointer for checking all this. 
If you already have a conflict with one transaction and then detect a conflict of the same type with another, you change the conflict pointer to a self-reference -- which means you conflict with *all* other concurrent transactions in that direction. You also have lost the ability to report all transactions which are involved in the conflict. The tuples involved, if available. I don't know how helpful knowing the pages involved would be; I certainly wouldn't know what to do with that info. That information would only be available on the *read* side. We count on MVCC data on the *write* side, and I'm not aware of any way for a transaction to list everything it's written. Since we're not recording the particular points of conflict between transactions, there's probably not a lot of point in listing it anyway -- there might be a conflict on any number of tuples out of a great many read or written. All this is of course to be able to guess at which statements to modify or change execution order of, take an explicit lock on and so on to reduce serialization failure rate. I understand the motivation, but the best this technique is likely to be able to provide is the transactions involved, and that's not always going to be complete unless we convert those single-transaction conflict fields to lists. If holding a list of the involved transactions turns out to be expensive, maybe one should be able to turn it on by a GUC only when you have a problem and need the extra information to track it down. That might be doable. If we're going to add such a GUC, though, it should probably be considered a tuning GUC, with the list setting recommended for debugging problems. Of course, if you change it from field to list the problem might disappear. Hmmm. Unless we also had a debug setting which kept track of the list but ignored it for purposes of detecting the dangerous structures described above. Of course, you will always know what transaction was canceled.
That does give you something to look at. -Kevin
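The single-pointer conflict tracking discussed above -- and the way a second conflict in the same direction collapses it to a self-reference, losing the identity of the other transactions -- can be sketched in miniature. This is a toy Python model with invented names, not the patch's C structures:

```python
SELF = "self"  # sentinel: conflicts with *all* other concurrent transactions

class SerializableXact:
    def __init__(self, name):
        self.name = name
        self.conflict_in = None   # some concurrent xact couldn't see our write
        self.conflict_out = None  # we couldn't see some concurrent xact's write

def record_rw_conflict(reader, writer):
    """reader's snapshot did not see writer's write: reader -> writer.
    A second conflict in the same direction collapses the single pointer
    to a self-reference, losing the identity of the earlier conflict."""
    reader.conflict_out = writer if reader.conflict_out is None else SELF
    writer.conflict_in = reader if writer.conflict_in is None else SELF

def is_pivot(xact):
    # A pivot has both a rw-conflict in and a rw-conflict out: it is the
    # T1 in T0 -> T1 -> TN, the shape that can make execution unserializable.
    return xact.conflict_in is not None and xact.conflict_out is not None

t0, t1, t2 = (SerializableXact(n) for n in ("T0", "T1", "T2"))
record_rw_conflict(t0, t1)   # T0 read, couldn't see T1's write
record_rw_conflict(t1, t2)   # T1 read, couldn't see T2's write
```

With the conflict-list optimization mentioned in the thread, `conflict_in`/`conflict_out` would each become a list, so no identity is lost and all participants could be reported on failure.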
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: ISTM you never search the SerializableXactHash table using a hash key, except the one call in CheckForSerializableConflictOut, but there you already have a pointer to the SERIALIZABLEXACT struct. You only re-find it to make sure it hasn't gone away while you trade the shared lock for an exclusive one. If we find another way to ensure that, ISTM we don't need SerializableXactHash at all. My first thought was to forget about VirtualTransactionId and use TransactionId directly as the hash key for SERIALIZABLEXACT. The problem is that a transaction doesn't have a transaction ID when RegisterSerializableTransaction is called. We could leave the TransactionId blank and only add the SERIALIZABLEXACT struct to the hash table when an XID is assigned, but there's no provision to insert an existing struct into a hash table in the current hash table API. So, I'm not sure of the details yet, but it seems like it could be made simpler somehow. After tossing it around in my head for a bit, the only thing that I see (so far) which might work is to maintain a *list* of SERIALIZABLEXACT objects in memory rather than using a hash table. The recheck after releasing the shared lock and acquiring an exclusive lock would then go through SerializableXidHash. I think that can work, although I'm not 100% sure that it's an improvement. I'll look it over in more detail. I'd be happy to hear your thoughts on this or any other suggestions. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On 19/09/10 16:48, Kevin Grittner wrote: After tossing it around in my head for a bit, the only thing that I see (so far) which might work is to maintain a *list* of SERIALIZABLEXACT objects in memory rather than using a hash table. The recheck after releasing the shared lock and acquiring an exclusive lock would then go through SerializableXidHash. I think that can work, although I'm not 100% sure that it's an improvement. Yeah, also keep in mind that a linked list with only a few items is faster to scan through than sequentially scanning an almost empty hash table. Putting that aside for now, we have one very serious problem with this algorithm: While they [SIREAD locks] are associated with a transaction, they must survive a successful COMMIT of that transaction, and remain until all overlapping transactions complete. Long-running transactions are already nasty because they prevent VACUUM from cleaning up old tuple versions, but this escalates the problem to a whole new level. If you have one old transaction sitting idle, every transaction that follows consumes a little bit of shared memory, until that old transaction commits. Eventually you will run out of shared memory, and will not be able to start new transactions anymore. Is there anything we can do about that? Just a thought, but could you somehow coalesce the information about multiple already-committed transactions to keep down the shared memory usage? For example, if you have this: 1. Transaction 'slow' begins 2. 100 other transactions begin and commit Could you somehow group together the 100 committed transactions and represent them with just one SERIALIZABLEXACT struct? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
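Heikki's coalescing idea lines up with the mitigation Kevin sketches elsewhere in the thread: for a long-running read-only transaction, all that needs to survive about a committed overlapping transaction is whether it wrote data and had a rw-conflict out. A toy Python illustration, with invented field names rather than the SERIALIZABLEXACT layout:

```python
def summarize(committed):
    """Collapse per-transaction structs for committed transactions into
    one compact summary: the set of xids that wrote data and had a
    rw-conflict out. A long-running read-only transaction then only needs
    to ask 'is this xid in the set?' when it reads a non-visible tuple;
    membership means it must roll back with a serialization failure."""
    return {x["xid"] for x in committed if x["wrote"] and x["conflict_out"]}

# One struct per committed transaction, as today ...
committed = [
    {"xid": 100, "wrote": True,  "conflict_out": True},
    {"xid": 101, "wrote": True,  "conflict_out": False},
    {"xid": 102, "wrote": False, "conflict_out": True},
]
# ... replaced by a single small set once summarized.
dangerous = summarize(committed)
```

The full structs (predicate locks, conflict pointers) could then be freed, bounding shared memory usage at the cost of some false positives.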
Re: [HACKERS] Serializable Snapshot Isolation
[Apologies for not reply-linking this; work email is down so I'm sending from gmail.] Based on feedback from Heikki and Tom I've reworked how I find the top-level transaction. This is in the git repo, and the changes can be viewed at: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=e29927c7966adba2443fdc4f64da9d282f95a05b -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On 18/09/10 21:52, Kevin Grittner wrote: [Apologies for not reply-linking this; work email is down so I'm sending from gmail.] Based on feedback from Heikki and Tom I've reworked how I find the top-level transaction. This is in the git repo, and the changes can be viewed at: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=e29927c7966adba2443fdc4f64da9d282f95a05b Thanks, much simpler. Now let's simplify it some more ;-) ISTM you never search the SerializableXactHash table using a hash key, except the one call in CheckForSerializableConflictOut, but there you already have a pointer to the SERIALIZABLEXACT struct. You only re-find it to make sure it hasn't gone away while you trade the shared lock for an exclusive one. If we find another way to ensure that, ISTM we don't need SerializableXactHash at all. My first thought was to forget about VirtualTransactionId and use TransactionId directly as the hash key for SERIALIZABLEXACT. The problem is that a transaction doesn't have a transaction ID when RegisterSerializableTransaction is called. We could leave the TransactionId blank and only add the SERIALIZABLEXACT struct to the hash table when an XID is assigned, but there's no provision to insert an existing struct into a hash table in the current hash table API. So, I'm not sure of the details yet, but it seems like it could be made simpler somehow. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Serializable Snapshot Isolation
On 17/09/10 01:35, Kevin Grittner wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The functions are well commented, but an overview at the top of the file of all the hash tables and other data structures would be nice. What is stored in each, when are they updated, etc. I moved all the structures from predicate.h and predicate.c to a new predicate_internal.h file and added comments. You can view its current contents here: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=blob;f=src/include/storage/predicate_internal.h;h=7cdb5af6eebdc148dd5ed5030847ca50d7df4fe8;hb=7f05b21bc4d846ad22ae8c160b1bf495e254 Does this work for you? Yes, thank you, that helps a lot. So, the purpose of SerializableXidHash is to provide quick access to the SERIALIZABLEXACT struct of a top-level transaction, when you know its transaction id or any of its subtransaction ids. To implement the "or any of its subtransaction ids" part, you need to have a SERIALIZABLEXID struct for each subtransaction in shared memory. That sounds like it can eat through your shared memory very quickly if you have a lot of subtransactions. Why not use SubTransGetTopmostTransaction()? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas wrote: So, the purpose of SerializableXidHash is to provide quick access to the SERIALIZABLEXACT struct of a top-level transaction, when you know its transaction id or any of its subtransaction ids. Right. To implement the "or any of its subtransaction ids" part, you need to have a SERIALIZABLEXID struct for each subtransaction in shared memory. Close -- each subtransaction which writes any tuples. That sounds like it can eat through your shared memory very quickly if you have a lot of subtransactions. Hmmm, I've never explicitly used subtransactions, so I don't tend to think of them routinely going too deep. And the struct is pretty small. Why not use SubTransGetTopmostTransaction()? This needs to work when the xid of a transaction is found in the MVCC data of a tuple for any overlapping serializable transaction -- even if that transaction has completed and its connection has been closed. It didn't look to me like SubTransGetTopmostTransaction() would work after the transaction was gone. I guess that's something I should mention in the comments. -Kevin
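The role of SerializableXidHash described above reduces to a mapping from any xid -- top-level or writing subtransaction -- to the top-level transaction's conflict state, and that mapping must outlive the transaction itself. A toy Python sketch, with invented names (not the patch's C code):

```python
# Toy model: why SerializableXidHash maps subtransaction xids to their
# top-level transaction. When a later reader finds a subxact's xid in a
# tuple header, it must be able to find the (possibly long-gone)
# top-level transaction's conflict state.
serializable_xid_hash = {}

def register_write(top_xid, sub_xid):
    # One entry per subtransaction that writes tuples, plus the top level.
    serializable_xid_hash[sub_xid] = top_xid
    serializable_xid_hash[top_xid] = top_xid

def topmost(xid):
    # Returns None for xids that never wrote under a serializable xact.
    return serializable_xid_hash.get(xid)

register_write(500, 501)   # top-level xid 500, writing subxact 501
register_write(500, 502)   # a second writing subxact
```

Heikki's and Tom's point is that pg_subtrans already holds this subxid-to-parent mapping; the open question in the thread is retaining those entries past the normal truncation horizon rather than duplicating them in shared memory.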
Re: [HACKERS] Serializable Snapshot Isolation
On 17/09/10 14:56, Kevin Grittner wrote: Heikki Linnakangas wrote: Why not use SubTransGetTopmostTransaction()? This needs to work when the xid of a transaction is found in the MVCC data of a tuple for any overlapping serializable transaction -- even if that transaction has completed and its connection has been closed. It didn't look to me like SubTransGetTopmostTransaction() would work after the transaction was gone. You're right, it doesn't retain those old transactions. But it could easily be modified to do so. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas wrote: On 17/09/10 14:56, Kevin Grittner wrote: Heikki Linnakangas wrote: Why not use SubTransGetTopmostTransaction()? This needs to work when the xid of a transaction is found in the MVCC data of a tuple for any overlapping serializable transaction -- even if that transaction has completed and its connection has been closed. It didn't look to me like SubTransGetTopmostTransaction() would work after the transaction was gone. You're right, it doesn't retain those old transactions. But it could easily be modified to do so. I shall look into it. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
Kevin Grittner kevin.gritt...@wicourts.gov writes: Heikki Linnakangas wrote: That sounds like it can eat through your shared memory very quickly if you have a lot of subtransactions. Hmmm, I've never explicitly used subtransactions, so I don't tend to think of them routinely going too deep. And the struct is pretty small. That assumption is absolutely, totally not going to fly. Why not use SubTransGetTopmostTransaction()? This needs to work when the xid of a transaction is found in the MVCC data of a tuple for any overlapping serializable transaction -- even if that transaction has completed and its connection has been closed. It didn't look to me like SubTransGetTopmostTransaction() would work after the transaction was gone. Yes, it should work. If it doesn't, you are failing to manage the TransactionXmin horizon correctly. regards, tom lane
Re: [HACKERS] Serializable Snapshot Isolation
Tom Lane t...@sss.pgh.pa.us wrote: That assumption is absolutely, totally not going to fly. Understood; I'm already working on it based on Heikki's input. This needs to work when the xid of a transaction is found in the MVCC data of a tuple for any overlapping serializable transaction -- even if that transaction has completed and its connection has been closed. It didn't look to me like SubTransGetTopmostTransaction() would work after the transaction was gone. Yes, it should work. If it doesn't, you are failing to manage the TransactionXmin horizon correctly. So far I haven't wanted to mess with the global xmin values for fear of the possible impact on other transactions. It actually hasn't been that hard to maintain a SerializableGlobalXmin value, which is more efficient than the existing ones for predicate lock cleanup purposes. That still isn't exactly what I need to modify cleanup of the subtransaction information, though. Once I've got my head around the subtrans.c code, I think I'll need to maintain a minimum that includes the xids for serializable transactions which *overlap* SerializableGlobalXmin. That doesn't seem very hard to do; I just haven't needed it until now. Then I'll modify the subtransaction cleanup to only remove entries before the earlier of the global xmin of all transactions and the xmin of serializable transactions which overlap active serializable transactions. Does all that sound reasonable? -Kevin
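Kevin's proposed cleanup rule can be stated as a tiny function. This is a hedged sketch under the thread's own description; `subtrans_cleanup_horizon` is an invented name, and real xid comparisons would need wraparound-aware logic (TransactionIdPrecedes), which this toy model ignores:

```python
def subtrans_cleanup_horizon(global_xmin, overlapping_serializable_xmins):
    """Entries below the returned xid may be truncated from pg_subtrans.
    Per the proposal above: the earlier of (a) the global xmin of all
    transactions and (b) the xmin of serializable transactions that
    overlap active serializable transactions. Toy model: plain integer
    xids, no wraparound handling."""
    if not overlapping_serializable_xmins:
        return global_xmin
    return min(global_xmin, min(overlapping_serializable_xmins))
```

Holding the horizon back this way is what would let SubTransGetTopmostTransaction() keep answering for transactions whose connections are long gone, addressing the objection raised earlier in the thread.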
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The functions are well commented, but an overview at the top of the file of all the hash tables and other data structures would be nice. What is stored in each, when are they updated, etc. I moved all the structures from predicate.h and predicate.c to a new predicate_internal.h file and added comments. You can view its current contents here: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=blob;f=src/include/storage/predicate_internal.h;h=7cdb5af6eebdc148dd5ed5030847ca50d7df4fe8;hb=7f05b21bc4d846ad22ae8c160b1bf495e254 Does this work for you? That leaves the predicate.h file with just this: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=blob;f=src/include/storage/predicate.h;h=7dcc2af7628b860f9cec9ded6b78f55163b58934;hb=7f05b21bc4d846ad22ae8c160b1bf495e254 -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
Excerpts from Kevin Grittner's message of Wed Sep 15 14:52:36 -0400 2010: Alvaro Herrera alvhe...@commandprompt.com wrote: I think that would also solve a concern that I had, which is that we were starting to include relcache.h (and perhaps other headers as well, but that's the one that triggered it for me) a bit too liberally, so +1 from me. Unfortunately, what I proposed doesn't solve that for relcache.h, although it does eliminate lock.h from almost everywhere and htup.h from everywhere. Now that I look at your new patch, I noticed that I was actually confusing relcache.h with rel.h. The latter includes a big chunk of our headers, but relcache.h is pretty thin. Including relcache.h in another header is not much of a problem. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Serializable Snapshot Isolation
Alvaro Herrera alvhe...@commandprompt.com wrote: Now that I look at your new patch, I noticed that I was actually confusing relcache.h with rel.h. The latter includes a big chunk of our headers, but relcache.h is pretty thin. Including relcache.h in another header is not much of a problem. OK, thanks for the clarification. With the structures all brought back together in a logical order, and the new comments in front of the structure declarations, do you think a summary at the top of the file is still needed in that header file? -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On 15/09/10 00:49, Kevin Grittner wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: A short description of how the predicate locks help to implement serializable mode would be nice too. I haven't read Cahill's papers, and I'm left wondering what the RW conflicts and dependencies are, when you're supposed to grab predicate locks etc. Again -- why be stingy? Given a more complete README file, how about something like this?: Well, if it's explained in the readme, that's probably enough.

/*
 * A rw-conflict occurs when a read by one serializable transaction
 * does not see the write of a concurrent serializable transaction
 * when that write would have been visible had the writing
 * transaction committed before the start of the reading
 * transaction.  When the write occurs first, the read can detect
 * this conflict by examining the MVCC information.  When the read
 * occurs first, it must record this somewhere so that writes can
 * check for a conflict.  Predicate locks are used for this.
 * Detection of such a conflict does not cause blocking, and does
 * not, in itself, cause a transaction rollback.
 *
 * Transaction rollback is required when one transaction (called a
 * pivot) has a rw-conflict *in* (a concurrent transaction
 * couldn't see its write) as well as *out* (it couldn't see the
 * write of another transaction).  In addition, the transaction on
 * the out side of the pivot must commit first, and if the
 * transaction on the in side of the pivot is read-only, it must
 * acquire its snapshot after the successful commit of the
 * transaction on the out side of the pivot.
 */

Would something like that have helped? Yes. An example would be very nice too; that description alone is pretty hard to grasp. Having read the Wiki page, and the slides from your presentation at pg east 2010, I think I understand it now. Now that I understand what the predicate locks are for, I'm now trying to get my head around all the data structures in predicate.c.
The functions are well commented, but an overview at the top of the file of all the hash tables and other data structures would be nice. What is stored in each, when are they updated, etc. I've been meaning to look at this patch for some time, but now I'm actually glad I haven't because I'm now getting a virgin point of view on the code, seeing the problems that anyone who's not familiar with the approach will run into. :-) BTW, does the patch handle prepared transactions yet? It introduces a call to PreCommit_CheckForSerializationFailure() in CommitTransaction, I think you'll need that in PrepareTransaction as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
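The two detection paths in the comment block Kevin proposed -- write-first conflicts noticed via MVCC data, read-first conflicts noticed via predicate locks -- can be illustrated with a toy model. This is a Python sketch with invented names, not the patch's data structures:

```python
# Toy model of the read-occurs-first path: the reader records a predicate
# lock, and a later writer checks those locks to discover the rw-conflict.
# (In the write-occurs-first path, no lock is needed: the reader sees the
# invisible tuple via MVCC data.) Keys here stand in for whatever the real
# lock granularity is (tuple, page, relation).
predicate_locks = {}   # key -> set of transactions that read it

def record_read(xact, key):
    predicate_locks.setdefault(key, set()).add(xact)

def on_write(writer, key):
    """Return the set of readers whose earlier reads now constitute
    rw-conflicts (reader -> writer) with this write."""
    return {r for r in predicate_locks.get(key, set()) if r != writer}

record_read("T0", "tuple:42")
conflicts = on_write("T1", "tuple:42")   # detects the T0 -> T1 rw-conflict
```

As the comment block says, detecting such a conflict blocks nothing and rolls back nothing by itself; it only becomes fatal when the pivot pattern with the required commit ordering is completed.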
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Now that I understand what the predicate locks are for, I'm now trying to get my head around all the data structures in predicate.c. The functions are well commented, but an overview at the top of the file of all the hash tables and other data structures would be nice. What is stored in each, when are they updated, etc. It probably doesn't help that they're split between predicate.c and predicate.h. (They were originally all in predicate.c because nobody else needed to see them, but we moved some to the .h file to expose them to lockfuncs.c to support listing the locks.) I'm inclined to move everything except the function prototypes out of predicate.h to a new predicate_internal.h, and move the structures defined in predicate.c there, too. And, of course, add the overview comments in the new file. If that sounds good, I can probably post a new patch with those changes today -- would that be a good idea, or should I wait for more feedback before doing that? (It will be in the git repo either way.) BTW, does the patch handle prepared transactions yet? It introduces a call to PreCommit_CheckForSerializationFailure() in CommitTransaction, I think you'll need that in PrepareTransaction as well. Good point. In spite of the NB comment, I did not notice that. Will fix. Thanks for the feedback! -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
Excerpts from Kevin Grittner's message of Wed Sep 15 09:15:53 -0400 2010: I'm inclined to move everything except the function prototypes out of predicate.h to a new predicate_internal.h, and move the structures defined in predicate.c there, too. I think that would also solve a concern that I had, which is that we were starting to include relcache.h (and perhaps other headers as well, but that's the one that triggered it for me) a bit too liberally, so +1 from me. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Serializable Snapshot Isolation
Alvaro Herrera alvhe...@commandprompt.com wrote: I think that would also solve a concern that I had, which is that we were starting to include relcache.h (and perhaps other headers as well, but that's the one that triggered it for me) a bit too liberally, so +1 from me. Unfortunately, what I proposed doesn't solve that for relcache.h, although it does eliminate lock.h from almost everywhere and htup.h from everywhere. (The latter seemed to be left over from an abandoned approach, and was no longer needed in predicate.h in any event.) Most of the functions in predicate.c take a Relation as a parameter. I could split out the function prototypes for those which *don't* use it to a separate .h file if you think it is worthwhile. The functions would be:

void InitPredicateLocks(void);
Size PredicateLockShmemSize(void);
void RegisterSerializableTransaction(const Snapshot snapshot);
void ReleasePredicateLocks(const bool isCommit);
void PreCommit_CheckForSerializationFailure(void);

The files where these are used are:

src/backend/storage/ipc/ipci.c
src/backend/utils/time/snapmgr.c
src/backend/utils/resowner/resowner.c
src/backend/access/transam/xact.c

So any of these files which don't already include relcache.h could remain without it if we make this split. Is there an easy way to check which might already include it? Is it worth adding one more .h file to avoid including relcache.h and snapshot.h in these four files? Let me know -- I'm happy to arrange this any way people feel is most appropriate. I have a profound appreciation for the organization of this code, and want to maintain it, even if I don't possess the perspective to know how best to do so. The respect comes from developing this patch -- every time I gave my manager an estimate of how long it would take to do something, I found it actually took about one-third of that time -- and that was entirely due to the organization and documentation of the code.
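For concreteness, the split being discussed might look roughly like this. This is only a sketch of the idea, not the layout that was actually committed; the include of utils/snapshot.h and the exact set of prototypes are assumptions based on the function list above:

```c
/*
 * predicate.h -- narrow public interface: only the prototypes that
 * don't take a Relation, for callers such as ipci.c, snapmgr.c,
 * resowner.c, and xact.c.  (In a real PostgreSQL header this would
 * follow an include of postgres.h, which supplies Size and bool.)
 */
#ifndef PREDICATE_H
#define PREDICATE_H

#include "utils/snapshot.h"

extern void InitPredicateLocks(void);
extern Size PredicateLockShmemSize(void);
extern void RegisterSerializableTransaction(const Snapshot snapshot);
extern void ReleasePredicateLocks(const bool isCommit);
extern void PreCommit_CheckForSerializationFailure(void);

#endif   /* PREDICATE_H */

/*
 * predicate_internal.h (separate file) -- the shared-memory structure
 * definitions formerly in predicate.c and predicate.h, exposed only to
 * predicate.c itself and to lockfuncs.c for lock listing.
 */
```

The point of the split is that the four listed callers would no longer pull in relcache.h just to see these five prototypes.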
-Kevin
Re: [HACKERS] Serializable Snapshot Isolation
On 14/09/10 19:34, Kevin Grittner wrote: Attached is the latest Serializable Snapshot Isolation (SSI) patch. Great work! A year ago I thought it would be impossible to have a true serializable mode in PostgreSQL because of the way we do MVCC, and now we have a patch. At a quick read-through, the code looks very tidy and clear now. Some comments: Should add a citation to Cahill's work this is based on. Preferably with a hyperlink. A short description of how the predicate locks help to implement serializable mode would be nice too. I haven't read Cahill's papers, and I'm left wondering what the RW conflicts and dependencies are, when you're supposed to grab predicate locks etc. If a page- or relation-level SILOCK is taken, is it possible to get false conflicts? I.e. a transaction is rolled back because it modified a tuple on a page where some other transaction modified another tuple, even though there's no dependency between the two. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Serializable Snapshot Isolation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Great work! A year ago I thought it would be impossible to have a true serializable mode in PostgreSQL because of the way we do MVCC, and now we have a patch. At a quick read-through, the code looks very tidy and clear now. Some comments: Should add a citation to Cahill's work this is based on. Preferably with a hyperlink. I'm planning on drawing from the current Wiki page: http://wiki.postgresql.org/wiki/Serializable to put together a README file; do you think the references should go in the README file, the source code, or both? A short description of how the predicate locks help to implement serializable mode would be nice too. I haven't read Cahill's papers, and I'm left wondering what the RW conflicts and dependencies are, when you're supposed to grab predicate locks etc. Again, I summarize that in the Wiki page, and was planning on putting it into the README. If you've read the Wiki page and it's not clear, then I definitely have some work to do there. If a page- or relation level SILOCK is taken, is it possible to get false conflicts? Yes. This technique will generate some false positive rollbacks. Software will need to be prepared to retry any database transaction which fails with a serialization failure SQLSTATE. I expect that proper connection pooling will be particularly important when using SSI, and flagging transactions which don't write to permanent tables as READ ONLY transactions will help reduce the rollback rate, too. Some of the optimizations we have sketched out will definitely reduce the rate of false positives; however, we don't want to implement them without a better performance baseline because the cost of tracking the required information and the contention for LW locks to maintain the information may hurt performance more than the restart of transactions which experience serialization failure. 
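The advice that software "must be prepared to retry" can be sketched as a small retry loop. The status enum and callback shape below are invented for illustration; in a real libpq client you would instead compare the SQLSTATE from PQresultErrorField(res, PG_DIAG_SQLSTATE) against "40001" (serialization_failure) and rerun the whole transaction from scratch:

```c
#include <stdbool.h>

/* Hypothetical outcome codes modeling SQLSTATE results; not a real API. */
typedef enum
{
    TXN_OK,
    TXN_SERIALIZATION_FAILURE,  /* SQLSTATE 40001: safe to retry */
    TXN_OTHER_ERROR             /* anything else: don't retry */
} TxnStatus;

/*
 * Run a transaction callback, retrying from the top on serialization
 * failure.  Returns true if the transaction eventually committed.
 */
static bool
run_serializable(TxnStatus (*txn) (void *), void *arg, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++)
    {
        TxnStatus   s = txn(arg);

        if (s == TXN_OK)
            return true;
        if (s != TXN_SERIALIZATION_FAILURE)
            return false;       /* don't retry other errors */
        /* serialization failure: loop and rerun the transaction */
    }
    return false;               /* gave up after max_retries retries */
}

/*
 * Stub transaction that fails with a serialization error a given number
 * of times before succeeding, simulating a false-positive rollback.
 */
static TxnStatus
flaky_txn(void *arg)
{
    int        *failures_left = (int *) arg;

    if (*failures_left > 0)
    {
        (*failures_left)--;
        return TXN_SERIALIZATION_FAILURE;
    }
    return TXN_OK;
}
```

The key property is that only serialization failures are retried; other errors surface immediately, and the whole transaction body is rerun, not just the failing statement.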
I don't want to steal Dan's thunder after all the hard work he's done to get good numbers from the DBT-2 benchmark, but suffice it to say that I've been quite pleased with the performance on that benchmark. He's pulling together the data for a post on the topic. I.e. a transaction is rolled back because it modified a tuple on a page where some other transaction modified another tuple, even though there's no dependency between the two. Well, no, because this patch doesn't do anything new with write conflicts. It's all about the apparent order of execution, based on one transaction not being able to read what was written by a concurrent transaction. The reading transaction must be considered to have run first in that case (and now you know what a rw-conflict is!) -- but such conflicts can form a cycle, which is the source of all serialization anomalies in snapshot isolation. -Kevin
Re: [HACKERS] Serializable Snapshot Isolation
I've been thinking about these points, and reconsidered somewhat. Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Should add a citation to Cahill's work this is based on. Preferably with a hyperlink. I've been thinking that this should be mentioned in both the README and the source code. A short description of how the predicate locks help to implement serializable mode would be nice too. I haven't read Cahill's papers, and I'm left wondering what the RW conflicts and dependencies are, when you're supposed to grab predicate locks etc. Again -- why be stingy? Given a more complete README file, how about something like this?:

/*
 * A rw-conflict occurs when a read by one serializable transaction
 * does not see the write of a concurrent serializable transaction
 * when that write would have been visible had the writing
 * transaction committed before the start of the reading
 * transaction.  When the write occurs first, the read can detect
 * this conflict by examining the MVCC information.  When the read
 * occurs first, it must record this somewhere so that writes can
 * check for a conflict.  Predicate locks are used for this.
 * Detection of such a conflict does not cause blocking, and does
 * not, in itself, cause a transaction rollback.
 *
 * Transaction rollback is required when one transaction (called a
 * pivot) has a rw-conflict *in* (a concurrent transaction
 * couldn't see its write) as well as *out* (it couldn't see the
 * write of another transaction).  In addition, the transaction on
 * the out side of the pivot must commit first, and if the
 * transaction on the in side of the pivot is read-only, it must
 * acquire its snapshot after the successful commit of the
 * transaction on the out side of the pivot.
 */

Would something like that have helped? -Kevin
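The rollback rule in that comment can be expressed as a small predicate. This is a toy model with invented field names, not PostgreSQL's actual bookkeeping (which lives in predicate.c's shared-memory structures); it only encodes the conditions the comment states:

```c
#include <stdbool.h>

/* Toy model of one pivot transaction and its two conflict edges. */
typedef struct
{
    bool        has_conflict_in;    /* a concurrent txn couldn't see our write */
    bool        has_conflict_out;   /* we couldn't see a concurrent txn's write */
    bool        out_committed_first;/* out-side txn committed before the others */
    bool        in_is_read_only;
    bool        in_snapshot_after_out_commit;
} PivotInfo;

/*
 * Returns true if this pivot completes a "dangerous structure",
 * i.e. some transaction must be rolled back.
 */
static bool
dangerous_structure(const PivotInfo *p)
{
    if (!p->has_conflict_in || !p->has_conflict_out)
        return false;           /* not a pivot at all */
    if (!p->out_committed_first)
        return false;           /* the out side must commit first */
    if (p->in_is_read_only && !p->in_snapshot_after_out_commit)
        return false;           /* read-only in side got its snapshot too early */
    return true;
}
```

Note how the read-only case is strictly weaker: a read-only transaction on the in side only matters if it took its snapshot after the out-side transaction committed, which is exactly what makes flagging transactions as READ ONLY effective at reducing rollbacks.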