Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-10 Thread Wood, Dan
I found one glitch with our merge of the original dup row fix.  With that 
corrected AND Alvaro’s Friday fix, things are solid.
No dups.  No index corruption.

Thanks so much. 

On 10/10/17, 7:25 PM, "Michael Paquier"  wrote:

On Tue, Oct 10, 2017 at 11:14 PM, Alvaro Herrera
 wrote:
> I was seeing just the reindex problem.  I don't see any more dups.
>
> But I've tried to reproduce it afresh now, and let it run for a long
> time and nothing happened.  Maybe I made a mistake last week and
> ran an unfixed version.  I don't see any more problems now.

Okay, so that's one more person joining this trend, making three with
Peter and me.

>> If you are getting the dup rows consider the code in the block in
>> heapam.c that starts with the comment “replace multi by update xid”.
>>
>> When I repro this I find that MultiXactIdGetUpdateXid() returns 0.
>> There is an updater in the multixact array however the status is
>> MultiXactStatusForNoKeyUpdate and not MultiXactStatusNoKeyUpdate.  I
>> assume this is a preliminary status before the following row in the
>> hot chain has its multixact set to NoKeyUpdate.
>
> Yes, the "For" version is the locker version rather than the actual
> update.  That lock is acquired by EvalPlanQual locking the row just
> before doing the update.  I think GetUpdateXid has no reason to return
> such an Xid, since it's not an update.
>
>> Since a 0 is returned this does precede cutoff_xid and
>> TransactionIdDidCommit(0) will return false.  This ends up aborting
>> the multixact on the row even though the real xid is committed.  This
>> sets XMAX to 0 and that row becomes visible as one of the dups.
>> Interestingly the real xid of the updater is 122944 and the cutoff_xid
>> is 122945.
>
> I haven't seen this effect. Please keep us updated if you're able to
> verify corruption this way.

Me neither. It would be nice to not live long with such a sword of Damocles.
-- 
Michael






Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-08 Thread Wood, Dan
I’m unclear on what is being repro’d in 9.6.  Are you getting the duplicate 
rows problem or just the reindex problem?  Are you testing with asserts 
enabled (I’m not)?

If you are getting the dup rows consider the code in the block in heapam.c that 
starts with the comment “replace multi by update xid”.
When I repro this I find that MultiXactIdGetUpdateXid() returns 0.  There is an 
updater in the multixact array however the status is 
MultiXactStatusForNoKeyUpdate and not MultiXactStatusNoKeyUpdate.  I assume 
this is a preliminary status before the following row in the hot chain has its 
multixact set to NoKeyUpdate.

Since a 0 is returned this does precede cutoff_xid and 
TransactionIdDidCommit(0) will return false.  This ends up aborting the 
multixact on the row even though the real xid is committed.  This sets XMAX to 
0 and that row becomes visible as one of the dups.  Interestingly the real xid 
of the updater is 122944 and the cutoff_xid is 122945.
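
To make the failure mode concrete, here is a small stand-alone C sketch (stand-in 
types and a simplified member scan; not the real multixact.c/heapam.c code) of why 
a multixact whose only update-like member is still MultiXactStatusForNoKeyUpdate 
yields update xid 0, which the freeze path then treats as an aborted update:

    /*
     * Hypothetical stand-alone sketch -- NOT the real heapam.c/multixact.c
     * code.  A multixact whose only "update-ish" member is still in the
     * locker status ForNoKeyUpdate has no updater, so a GetUpdateXid-style
     * scan returns InvalidTransactionId (0).
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    typedef enum
    {
        STATUS_FOR_KEY_SHARE,
        STATUS_FOR_SHARE,
        STATUS_FOR_NO_KEY_UPDATE,   /* locker taken before the update proper */
        STATUS_FOR_UPDATE,
        STATUS_NO_KEY_UPDATE,       /* an actual updater */
        STATUS_UPDATE
    } MemberStatus;

    typedef struct
    {
        TransactionId xid;
        MemberStatus  status;
    } Member;

    /* Only NO_KEY_UPDATE/UPDATE members count as the updater. */
    static TransactionId
    get_update_xid(const Member *members, int n)
    {
        for (int i = 0; i < n; i++)
            if (members[i].status == STATUS_NO_KEY_UPDATE ||
                members[i].status == STATUS_UPDATE)
                return members[i].xid;
        return InvalidTransactionId;
    }

    int
    main(void)
    {
        /* The case from the report: xid 122944 is only ForNoKeyUpdate. */
        Member        members[] = {{122944, STATUS_FOR_NO_KEY_UPDATE}};
        TransactionId update_xid = get_update_xid(members, 1);
        TransactionId cutoff_xid = 122945;

        printf("update xid = %u\n", (unsigned) update_xid);    /* prints 0 */

        /*
         * Freeze code that sees xid 0 as preceding the cutoff and then asks
         * whether xid 0 committed gets "no", so it clears xmax even though
         * the real updater 122944 committed.
         */
        if (update_xid < cutoff_xid)
            printf("xid %u precedes cutoff %u; the did-it-commit check on 0 says no\n",
                   (unsigned) update_xid, (unsigned) cutoff_xid);
        return 0;
    }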

I’m still debugging but I start late so I’m passing this incomplete info along 
now.

On 10/7/17, 4:25 PM, "Alvaro Herrera"  wrote:

Peter Geoghegan wrote:
> On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera  
wrote:
> >> As you must have seen, Alvaro said he has a variant of Dan's original
> >> script that demonstrates that a problem remains, at least on 9.6+,
> >> even with today's fix. I think it's the stress-test that plays with
> >> fillfactor, many clients, etc [1].
> >
> > I just execute setup.sql once and then run this shell command,
> >
> > while :; do
> > psql -e -P pager=off -f ./repro.sql
> > for i in `seq 1 5`; do
> > psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> > done
> > wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> > psql -P pager=off -e --no-psqlrc -f ./report.sql
> > echo "done"
> > done
> 
> I cannot reproduce the problem on my personal machine using this
> script/stress-test. I tried to do so on the master branch git tip.
> This reinforces the theory that there is some timing sensitivity,
> because the remaining race condition is very narrow.

Hmm, I think I added a random sleep (max. 100ms) right after the
HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
makes the race easier to hit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-05 Thread Wood, Dan
Yes, I’ve been testing 9.6.  I’ll try Alvaro’s patch today.

I would prefer to focus on either latest 9X or 11dev.  Does Alvaro’s patch 
presume any of the other patches that set COMMITTED in the freeze code?


On 10/4/17, 7:17 PM, "Michael Paquier"  wrote:

On Thu, Oct 5, 2017 at 10:39 AM, Wood, Dan  wrote:
> Whatever you do make sure to also test 250 clients running lock.sql.  
Even with the community’s fix plus YiWen’s fix I can still get duplicate rows.  
What works for “in-block” hot chains may not work when spanning blocks.

Interesting. Which version did you test? Only 9.6?

> Once nearly all 250 clients have done their updates and everybody is 
waiting to vacuum, which one by one will take a while, I usually just “pkill -9 
psql”.  After that I have many duplicate “id=3” rows.  On top of that I 
think we might have a lock leak.  After the pkill I tried to rerun setup.sql to 
drop/create the table and it hangs.  I see an autovacuum process starting and 
exiting every couple of seconds.  Only by killing and restarting PG can I drop 
the table.

Yeah, that's more or less what I have been doing. My tests involve
using your initial script with way more sessions triggering lock.sql,
minus the kill-9 portion (good idea actually). I can of course see the
sessions queuing for VACUUM, still I cannot see duplicated rows, even
if I headshot Postgres in the middle of the VACUUM waiting queue. Note
that I have just tested Alvaro's patch on 9.3.
-- 
Michael






Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Wood, Dan
Whatever you do make sure to also test 250 clients running lock.sql.  Even with 
the community’s fix plus YiWen’s fix I can still get duplicate rows.  What 
works for “in-block” hot chains may not work when spanning blocks.

Once nearly all 250 clients have done their updates and everybody is waiting to 
vacuum, which one by one will take a while, I usually just “pkill -9 psql”.  
After that I have many duplicate “id=3” rows.  On top of that I think we 
might have a lock leak.  After the pkill I tried to rerun setup.sql to 
drop/create the table and it hangs.  I see an autovacuum process starting and 
exiting every couple of seconds.  Only by killing and restarting PG can I drop 
the table.

On 10/4/17, 6:31 PM, "Michael Paquier"  wrote:

On Wed, Oct 4, 2017 at 10:46 PM, Alvaro Herrera  
wrote:
> Wong, Yi Wen wrote:
>> My interpretation of README.HOT is the check is just to ensure the chain
>> is continuous; in which case the condition should be:
>>
>> >         if (TransactionIdIsValid(priorXmax) &&
>> >             !TransactionIdEquals(priorXmax, HeapTupleHeaderGetRawXmin(htup)))
>> >             break;
>>
>> So the difference is GetRawXmin vs GetXmin, because otherwise we get
>> FrozenTransactionId instead of the xmin of the transaction that created the tuple.
>
> I independently arrived at the same conclusion.  Since I was trying with
> 9.3, the patch differs -- in the old version we must explicitly test
> for the FrozenTransactionId value, instead of using GetRawXmin.
> Attached is the patch I'm using, and my own oneliner test (pretty much
> the same I posted earlier) seems to survive dozens of iterations without
> showing any problem in REINDEX.
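
As a stand-alone illustration of the two variants being discussed (stand-in types 
only; a sketch of the idea, not Alvaro's attached patch nor the actual PostgreSQL 
macros):

    /*
     * Stand-in sketch of the chain-continuity test.  priorXmax is the xmax
     * of the previous chain member; the xmin values stand for
     * HeapTupleHeaderGetXmin()/HeapTupleHeaderGetRawXmin() of the next one.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)
    #define FrozenTransactionId  ((TransactionId) 2)

    /*
     * 9.4+: freezing keeps the raw xmin and only sets infomask bits, so
     * comparing against the raw xmin keeps the chain intact.
     */
    static bool
    chain_continues_94(TransactionId priorXmax, TransactionId raw_xmin)
    {
        return priorXmax == InvalidTransactionId || priorXmax == raw_xmin;
    }

    /*
     * 9.3: freezing overwrites xmin with FrozenTransactionId, so the test
     * must also accept a frozen xmin instead of breaking the chain.
     */
    static bool
    chain_continues_93(TransactionId priorXmax, TransactionId xmin)
    {
        return priorXmax == InvalidTransactionId ||
               priorXmax == xmin ||
               xmin == FrozenTransactionId;
    }

    int
    main(void)
    {
        /* A frozen successor: raw xmin preserved, visible xmin replaced by 2. */
        TransactionId priorXmax = 36961;
        TransactionId raw_xmin  = 36961;
        TransactionId xmin      = FrozenTransactionId;

        printf("9.4+ raw-xmin test: %s\n",
               chain_continues_94(priorXmax, raw_xmin) ? "chain kept" : "chain broken");
        printf("9.3 frozen-aware test: %s\n",
               chain_continues_93(priorXmax, xmin) ? "chain kept" : "chain broken");
        return 0;
    }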

Confirmed, the problem goes away with this patch on 9.3.

> This patch is incomplete, since I think there are other places that need
> to be patched in the same way (EvalPlanQualFetch? heap_get_latest_tid?).
> Of course, for 9.4 and onwards we need to patch like you described.

I have just done a lookup of the source code, and here is an
exhaustive list of things in need of surgery:
- heap_hot_search_buffer
- heap_get_latest_tid
- heap_lock_updated_tuple_rec
- heap_prune_chain
- heap_get_root_tuples
- rewrite_heap_tuple
- EvalPlanQualFetch (twice)

> This bit in EvalPlanQualFetch caught my attention ... why is it saying
> xmin never changes?  It does change with freezing.
>
> /*
>  * If xmin isn't what we're expecting, the slot must have been
>  * recycled and reused for an unrelated tuple.  This implies that
>  * the latest version of the row was deleted, so we need do
>  * nothing.  (Should be safe to examine xmin without getting
>  * buffer's content lock, since xmin never changes in an existing
>  * tuple.)
>  */
> if (!TransactionIdEquals(HeapTupleHeaderGetXmin(tuple.t_data),
>                          priorXmax))

Agreed. That's not good.
-- 
Michael






Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Wood, Dan
One minor side note…   Is it weird for xmin/xmax to go backwards in a hot row 
chain?

 lp | t_ctid | lp_off | t_infomask | t_infomask2 | t_xmin | t_xmax 
----+--------+--------+------------+-------------+--------+--------
  1 | (0,1)  |   8152 |       2818 |           3 |  36957 |      0
  2 |        |      5 |            |             |        |       
  3 |        |      0 |            |             |        |       
  4 |        |      0 |            |             |        |       
  5 | (0,6)  |   8112 |       9986 |       49155 |  36962 |  36963
  6 | (0,7)  |   8072 |       9986 |       49155 |  36963 |  36961
  7 | (0,7)  |   8032 |      11010 |       32771 |  36961 |      0
(7 rows)


On 10/3/17, 6:20 PM, "Peter Geoghegan"  wrote:

On Tue, Oct 3, 2017 at 6:09 PM, Wood, Dan  wrote:
> I’ve just started looking at this again after a few weeks break.

>         if (TransactionIdIsValid(priorXmax) &&
>             !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
>             break;

> We need to understand why these TXID equal checks exist.  Can we 
differentiate the cases they are protecting against with the two exceptions 
I’ve found?

I haven't read your remarks here in full, since I'm about to stop
working for the day, but I will point out that
src/backend/access/heap/README.HOT says a fair amount about this,
under "Abort Cases".


-- 
Peter Geoghegan






Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Wood, Dan
I’ve just started looking at this again after a few weeks break.

There is a tangled web of issues here.  With the community fix we get a 
corrupted page (invalid redirect ptr from indexed item).  The cause of that is:
pruneheap.c:

  /*
   * Check the tuple XMIN against prior XMAX, if any
   */
  if (TransactionIdIsValid(priorXmax) &&
      !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
      break;

  chainitems[nchain++] = offnum;

The priorXmax is a multixact key share lock, thus not equal to xmin.  This 
terminates the scan.  Unlike the scan termination with “if (!recent_dead) 
break;” below the switch (...SatisfiesVacuum…), this “break” does not put the 
offnum into the chain even though it is in the chain.  If the first not-deleted 
item isn’t put in the chain then we’ll not call heap_prune_record_redirect().

I do not know what the above ‘if’ test is protecting against.  Normally the xmin 
is equal to the priorXmax; other than corruption, a key share lock can cause the 
mismatch.  So, I tried a fix which does the “if” check after adding the item to 
chainitems.  This will break whatever real situation this IF was protecting.  
Tom Lane put this in.
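
As a toy stand-alone sketch of that ordering effect (this is not pruneheap.c, just 
an illustration of why the first live member drops out of chainitems when the 
break comes first):

    /*
     * Toy stand-alone model -- not pruneheap.c.  "check_fails" marks the
     * member where the xmin-vs-priorXmax test does not match (here because
     * priorXmax is really a multixact key-share lock).
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define CHAIN_LEN 3

    static const bool check_fails[CHAIN_LEN] = {false, false, true};

    static int
    walk_chain(bool record_before_check, int *chainitems)
    {
        int nchain = 0;

        for (int off = 1; off <= CHAIN_LEN; off++)
        {
            if (record_before_check)
                chainitems[nchain++] = off;       /* Dan's reordering */

            if (check_fails[off - 1])
                break;                            /* the "break" in question */

            if (!record_before_check)
                chainitems[nchain++] = off;       /* original ordering */
        }
        return nchain;
    }

    int
    main(void)
    {
        int items[CHAIN_LEN];

        printf("check first, record second: %d members kept\n",
               walk_chain(false, items));         /* 2 -- live member dropped */
        printf("record first, check second: %d members kept\n",
               walk_chain(true, items));          /* 3 -- live member kept */
        return 0;
    }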

OK: with that hack of a fix the redirect now works correctly.  However, we 
still get the reindex problem of not finding the parent.  That problem is 
caused by:
pruneheap.c:heap_get_root_tuples()

    if (TransactionIdIsValid(priorXmax) &&
        !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
        break;

In this case, instead of these not being equal because of a multixact key share 
lock, it is because XMIN is frozen and FrozenTransactionId doesn’t equal the 
priorXmax.  Thus, we don’t build the entire chain from root to most current row 
version and this causes the reindex failure.

If we disable this ‘if’ as a hack then we no longer get a problem on the 
reindex.  However, YiWen reported that at the end of an installcheck run, index 
checking reported corruption in the system catalogs.  So we are still looking.

We need to understand why these TXID equal checks exist.  Can we differentiate 
the cases they are protecting against with the two exceptions I’ve found?

FYI, someone should look at the same “if” test in heapam.c: 
heap_lock_updated_tuple_rec().  Also, I hope there are no strange issues with 
concurrent index builds.

Finally, the idea behind the original fix was simply NOT to do an 
unsupported freeze on a dead tuple.  It had two drawbacks:
1) CLOG truncation.  This could have been handled by keeping track of the old 
unpruned item found and using that to update the table’s/DB’s freeze xid.
2) Not making freeze progress.  The only reason the prune would fail should be 
because of an open TXN.  Unless that TXN was so old that its XID was as old as 
the “min freeze threshold”, we would make progress.  If we were running TXNs 
that old then we’d be having problems anyway.


On 10/3/17, 5:15 PM, "Michael Paquier"  wrote:

On Wed, Oct 4, 2017 at 8:10 AM, Peter Geoghegan  wrote:
> I now think that it actually is a VACUUM problem, specifically a
> problem with VACUUM pruning. You see the HOT xmin-to-xmax check
> pattern that you mentioned within heap_prune_chain(), which looks like
> where the incorrect tuple prune (or possibly, at times, redirect?)
> takes place. (I refer to the prune/kill that you mentioned today, that
> frustrated your first attempt at a fix -- "I modified the multixact
> freeze code...".)

My lookup of the problem converges to the same conclusion. Something
is wrong with the vacuum's pruning. I have spent some time trying to
design a patch, all the solutions I tried have proved to make the
problem harder to show up, but it still showed up, sometimes after
dozens of attempts.

> The attached patch "fixes" the problem -- I cannot get amcheck to
> complain about corruption with this applied. And, "make check-world"
> passes. Hopefully it goes without saying that this isn't actually my
> proposed fix. It tells us something that this at least *masks* the
> problem, though; it's a start.

Yep.

> FYI, the repro case page contents looks like this with the patch applied:
> postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
> to_hex(t_infomask) as infomask,
> to_hex(t_infomask2) as infomask2
> from heap_page_items(get_raw_page('t', 0));
>  lp | lp_flags | t_xmin  | t_xmax | t_ctid | infomask | infomask2
> ----+----------+---------+--------+--------+----------+-----------
>   1 |        1 | 1845995 |      0 | (0,1)  | b02      |         3
>   2 |        2 |         |        |        |          |
>   3 |        0 |         |        |        |          |
>   4 |        0 |         |        |        |

[HACKERS] PGCon 2017 registration now open

2017-04-17 Thread Dan Langille
Join us in Ottawa for the 11th annual PGCon.  On May 23-26, users and 
developers from 
around the world arrive for what has become a traditional gathering of the 
PostgreSQL
community.

There will be two days of tutorials on Tuesday and Wednesday.  The best of the 
best 
will be available to help you learn great things about PostgreSQL and its tools.
See http://www.pgcon.org/2017/schedule/track/Tutorial/index.en.html

On Wednesday, there will be a Developer Unconference (non-developers are 
welcome 
too).  The Unconference first appeared at PGCon 2013 and was an instant success.
See https://wiki.postgresql.org/wiki/PgCon_2017_Developer_Unconference

For Thursday and Friday, the submitted talks will be presented as everyone 
gathers in 
one location to learn, discuss, and collaborate.

Full list of talks: http://www.pgcon.org/2017/schedule/events.en.html

In summary:

• Tutorials: 23-24 May 2017 (Tue & Wed)
• Unconference: 24 May 2017
• Talks: 25-26 May 2017 (Thu-Fri).

Registration is now open at http://www.pgcon.org/2017/registration.php

-- 
Dan Langille - BSDCan / PGCon
d...@langille.org



[HACKERS] potential hardware donation

2017-01-27 Thread Dan Langille
If someone wanted to donate a SuperServer 6028TR-D72R 
(http://www.supermicro.com/products/system/2U/6028/SYS-6028TR-D72R.cfm) to the 
PostgreSQL project, would it be used?

-- 
Dan Langille - BSDCan / PGCon
d...@langille.org




[HACKERS] reminder: PGCon 2017 CFP

2017-01-17 Thread Dan Langille
Hello,

There are two days left in the PGCon 2017 CFP, which closes on 19 January.
Please get your submissions in soon.

PGCon 2017 will be on 23-26 May 2017 at University of Ottawa.

* 23-24 (Tue-Wed) tutorials
* 24 (Wed) The Unconference
* 25-26 (Thu-Fri) talks - the main part of the conference

See http://www.pgcon.org/2017/


We are now accepting proposals for the main part of the conference (25-26 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2016 Proposal acceptance begins
19 Jan 2017 Proposal acceptance ends
19 Feb 2017 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also <http://www.pgcon.org/2017/papers.php>

Instructions for submitting a proposal to PGCon 2017 are available
from: <http://www.pgcon.org/2017/submissions.php>

-- 
Dan Langille - BSDCan / PGCon
d...@langille.org



[HACKERS] PGCon 2016 CFP - one week left

2016-01-12 Thread Dan Langille
Hello

There is one week left in the PGCon CFP.  Details below.  Please submit.  
Thanks.

PGCon 2016 will be on 17-21 May 2016 at University of Ottawa.

* 17-18 (Tue-Wed) tutorials
* 19 & 20 (Thu-Fri) talks - the main part of the conference
* 17 & 21 (Wed & Sat) The Developer Unconference & the User Unconference (both 
very popular)

PLEASE NOTE: PGCon 2016 is in May.

See http://www.pgcon.org/2016/

We are now accepting proposals for the main part of the conference (19-20 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2015 Proposal acceptance begins
19 Jan 2016 Proposal acceptance ends
19 Feb 2016 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also 

Instructions for submitting a proposal to PGCon 2016 are available
from: 





[HACKERS] PGCon 2016 call for papers

2016-01-03 Thread Dan Langille
In case you've overlooked it, you have about two weeks to submit your proposal.

PGCon 2016 will be on 17-21 May 2016 at University of Ottawa.

* 17-18 (Tue-Wed) tutorials
* 19 & 20 (Thu-Fri) talks - the main part of the conference
* 17 & 21 (Wed & Sat) The Developer Unconference & the User Unconference (both 
very popular)

PLEASE NOTE: PGCon 2016 is in May.

See http://www.pgcon.org/2016/

We are now accepting proposals for the main part of the conference (19-20 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2015 Proposal acceptance begins
19 Jan 2016 Proposal acceptance ends
19 Feb 2016 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also 

Instructions for submitting a proposal to PGCon 2016 are available
from: 





Re: [HACKERS] could not truncate directory "pg_subtrans": apparent wraparound

2015-06-08 Thread Dan Langille
If there's anything I can try on my servers to help diagnose the issues,
please let me know.  If desired, I can arrange access for debugging.

On Sat, Jun 6, 2015 at 12:51 AM, Thomas Munro  wrote:

> On Sat, Jun 6, 2015 at 1:25 PM, Alvaro Herrera 
> wrote:
> > Thomas Munro wrote:
> >
> >> My idea was that if I could get oldestXact == next XID in
> >> TruncateSUBTRANS, then TransactionIdToPage(oldestXact) for a value of
> >> oldestXact that happens to be immediately after a page boundary (so
> >> that xid % 2048 == 0) might give page number that is >=
> >> latest_page_number, causing SimpleLruTruncate to print that message.
> >> But I can't figure out how to get next XID == oldest XID, because
> >> vacuumdb --freeze --all consumes xids itself, so in my first attempt
> >> at this, next XID is always 3 ahead of the oldest XID when a
> >> checkpoint is run.
> >
> > vacuumdb starts by querying pg_database, which eats one XID.
> >
> > Vacuum itself only uses one XID when vac_truncate_clog() is called.
> > This is called from vac_update_datfrozenxid(), which always happen at
> > the end of each user-invoked VACUUM (so three times for vacuumdb if you
> > have three databases); autovacuum does it also at the end of each run.
> > Maybe you can get autovacuum to quit before doing it.
> >
> > OTOH, if the values in the pg_database entry do not change,
> > vac_truncate_clog is not called, and thus vacuum would finish without
> > consuming an XID.
>
> I have manage to reproduce it a few times but haven't quite found the
> right synchronisation hacks to make it reliable so I'm not posting a
> repro script yet.
>
> I think it's a scary sounding message but very rare and entirely
> harmless (unless you really have wrapped around...).  The fix is
> probably something like: if oldest XID == next XID, then just don't
> call SimpleLruTruncate (truncation is deferred until the next
> checkpoint), or perhaps (if we can confirm this doesn't cause problems
> for dirty pages or that there can't be any dirty pages before cutoff
> page because of the preceding flush (as I suspect)) we could use
> cutoffPage = TransactionIdToPage(oldestXact - 1) if oldest == next, or
> maybe even always.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
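
To put numbers on the page-boundary case described above, a small stand-alone C 
sketch (stand-in constants and a simplified comparison; not the actual 
slru.c/subtrans.c code):

    /*
     * Stand-alone sketch of the arithmetic above.  With the default 8 kB
     * block size, pg_subtrans stores one 4-byte parent XID per transaction,
     * i.e. 2048 XIDs per page.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    #define SUBTRANS_XACTS_PER_PAGE 2048    /* BLCKSZ / sizeof(TransactionId) */

    static uint32_t
    transaction_id_to_page(TransactionId xid)
    {
        return xid / SUBTRANS_XACTS_PER_PAGE;
    }

    int
    main(void)
    {
        /* Next XID sits exactly on a page boundary and nothing is running. */
        TransactionId nextXid    = 3 * SUBTRANS_XACTS_PER_PAGE;   /* 6144 */
        TransactionId oldestXact = nextXid;

        uint32_t latest_page = transaction_id_to_page(nextXid - 1);   /* 2 */
        uint32_t cutoff_page = transaction_id_to_page(oldestXact);    /* 3 */

        printf("latest page %u, cutoff page %u%s\n",
               (unsigned) latest_page, (unsigned) cutoff_page,
               cutoff_page > latest_page ? "  -> looks like wraparound" : "");

        /* The workaround floated above: base the cutoff on oldestXact - 1. */
        printf("cutoff from (oldestXact - 1) = %u\n",
               (unsigned) transaction_id_to_page(oldestXact - 1));
        return 0;
    }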


[HACKERS] could not truncate directory "pg_subtrans": apparent wraparound

2015-06-05 Thread Dan Langille
I noticed this today on my 9.4.2 server running on FreeBSD 10.1:

Jun  5 18:59:40 slocum postgres[986]: [3957-1] LOG:  could not truncate 
directory "pg_subtrans": apparent wraparound

Looking at a post from 2010, Tom Lane suggested this information was useful:

[root@slocum:/usr/local/pgsql/data/pg_subtrans] # ls -l
total 1
-rw---  1 pgsql  pgsql  8192 Jun  5 19:04 0032

This is not a high-throughput server.

—
Dan Langille
http://langille.org/









Re: [HACKERS] PGCon hacker lounge

2015-05-28 Thread Dan Langille

> On May 27, 2015, at 12:06 PM, Alexander Korotkov  
> wrote:
> 
> On Wed, May 27, 2015 at 7:00 PM, Dan Langille  wrote:
> Have you been to PGCon before?  Do you remember the hacker lounge?  Do you 
> remember going there to work on stuff?  Do you recall anything about it?
> 
> I remember I've tried to visit it in 2012 or 2013. That time I found an empty 
> room and nobody there. Didn't try to visit it anytime after.

The reason I asked: I was trying to gauge the usefulness of the PGCon hacking 
lounge since it was first added to the schedule in 2012.

It seems it goes unused, and I was trying to see if anyone found it useful in 
the past.  At BSDCan, for example, you can find people there every night 
discussing and working.  Or perhaps just socializing.  It's a major gathering 
point.

If there is interest, we'll retain it for 2015, but it seems best to remove it 
from the schedule.

—
Dan Langille
http://langille.org/









[HACKERS] PGCon hacker lounge

2015-05-27 Thread Dan Langille
Have you been to PGCon before?  Do you remember the hacker lounge?  Do you 
remember going there to work on stuff?  Do you recall anything about it?

—
Dan Langille
http://langille.org/









[HACKERS] PGCon 2015

2015-05-04 Thread Dan Langille
In 6 weeks, people start arriving in Ottawa for PGCon 2015. Have you 
registered? There's still time. Get in today.

We have a great list of talks: http://www.pgcon.org/2015/schedule/events.en.html

Given by great speakers: http://www.pgcon.org/2015/schedule/speakers.en.html

You'll want to be there.  Don't leave it much longer.

—
Dan Langille
http://langille.org/









[HACKERS] HEADS UP: PGCon 2015 major schedule changes

2015-02-05 Thread Dan Langille
Hello,

By request, the format of PGCon 2015 will differ significantly from previous 
years.
Our goal is to give you more of what you want while still keeping the stuff 
you've always liked.

In June 2015, PGCon will be structured as follows:

Unconference: 16-17 June 2015 (Tue afternoon & all day Wed)

Beginner Tutorials: 17 June 2015 (Wed)

Talks: 18-19 June 2015 (Thu-Fri)

Advanced Tutorial: 20 June 2015 (Sat)

The big changes are:
- Unconference moved to weekdays and now 1.5 days (was one day; Saturday)
- Tutorials split between beginner and advanced, and now on Wednesday & Saturday
  (was Tuesday & Wednesday)

Why?

The unconference has become a bigger and more significant part of PGCon
for PostgreSQL contributors.  It has moved to earlier in the week to
coordinate with other developer meetings, in order to expand the
participation in development discussions and meetings around PGCon.
Additionally, the shift of some tutorials to Saturday allows tutorials to 
involve key PostgreSQL contributors without schedule conflicts.

Unfortunately, this meant moving something else to Saturday, at least for this 
year.
We considered moving the talks to earlier in the week, but we felt that our 
changes
were already disruptive and wanted to minimize the effects this late change may
have on people who have already booked travel / accommodation.  To those 
affected, we apologize and hope that this new structure will benefit everyone.

— 
Dan Langille
http://langille.org/









[HACKERS] PGCon 2015 - last day

2015-01-19 Thread Dan Langille
Today is your last day to submit your PGCon 2015 proposal.

-- 
Dan Langille
http://langille.org/





[HACKERS] PGCon 2015

2015-01-18 Thread Dan Langille
Is your PGCon 2015 submission going in today or tomorrow?

-- 
Dan Langille
http://langille.org/





[HACKERS] PGCon 2015 call for papers - reminder

2015-01-12 Thread Dan Langille
A reminder, only about a week to submit your proposal: 
http://www.pgcon.org/2015/papers.php

PGCon 2015 will be on 18-19 June 2015 at University of Ottawa.

* 16-17 (Tue-Wed) tutorials
* 18-19 (Thu-Fri) talks - the main part of the conference
* 20 (Sat) The Unconference (very popular)

PLEASE NOTE: PGCon 2015 is in June.

See http://www.pgcon.org/2015/

We are now accepting proposals for the main part of the conference (18-19 June).
Proposals can be quite simple. We do not require academic-style papers.

You have about two weeks left before submissions close.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2014 Proposal acceptance begins
19 Jan 2015 Proposal acceptance ends
19 Feb 2015 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also <http://www.pgcon.org/2015/papers.php>

Instructions for submitting a proposal to PGCon 2015 are available
from: <http://www.pgcon.org/2015/submissions.php>

—
Dan Langille
http://langille.org/





[HACKERS] PGCon 2015 call for papers

2015-01-05 Thread Dan Langille
PGCon 2015 will be on 18-19 June 2015 at University of Ottawa.

* 16-17 (Tue-Wed) tutorials
* 18-19 (Thu-Fri) talks - the main part of the conference
* 20 (Sat) The Unconference (very popular)

PLEASE NOTE: PGCon 2015 is in June.

See http://www.pgcon.org/2015/

We are now accepting proposals for the main part of the conference (18-19 June).
Proposals can be quite simple. We do not require academic-style papers.

You have about two weeks left before submissions close.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2014 Proposal acceptance begins
19 Jan 2015 Proposal acceptance ends
19 Feb 2015 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also <http://www.pgcon.org/2015/papers.php>

Instructions for submitting a proposal to PGCon 2015 are available
from: <http://www.pgcon.org/2015/submissions.php>

—
Dan Langille
http://langille.org/





Re: [HACKERS] Validating CHECK constraints with SPI

2014-10-29 Thread Dan Robinson
On Wed, Oct 29, 2014 at 7:17 AM, Alvaro Herrera 
wrote:

> Dan Robinson wrote:
> > Hi all,
> >
> > If I'm reading correctly in src/backend/commands/tablecmds.c, it looks
> like
> > PostgreSQL does a full table scan in validateCheckConstraint and in the
> > constraint validation portion of ATRewriteTable.
> >
> > Since the table is locked to updates while the constraint is validating,
> > this means you have to jump through hoops if you want to add a CHECK
> > constraint to a large table in a production setting. This validation
> could
> > be considerably faster if we enabled it to use relevant indexes or other
> > constraints. Is there a reason not to make an SPI call here, instead?
>
> I don't think SPI would help you here.  But I think you would like to
> add the constraint as NOT VALID and then do an ALTER TABLE .. VALIDATE
> CONSTRAINT command afterwards.  In 9.4, this doesn't require
> AccessExclusive lock on the table.


Interesting! I hadn't seen the patch that makes ALTER TABLE ... VALIDATE
CONSTRAINT require only ShareUpdateExclusive. Very cool.

Yes, that makes this change totally unnecessary.

-Dan


[HACKERS] Validating CHECK constraints with SPI

2014-10-29 Thread Dan Robinson
Hi all,

If I'm reading correctly in src/backend/commands/tablecmds.c, it looks like
PostgreSQL does a full table scan in validateCheckConstraint and in the
constraint validation portion of ATRewriteTable.

Since the table is locked to updates while the constraint is validating,
this means you have to jump through hoops if you want to add a CHECK
constraint to a large table in a production setting. This validation could
be considerably faster if we enabled it to use relevant indexes or other
constraints. Is there a reason not to make an SPI call here, instead?

This would make it possible to do something like:
postgres=# CREATE INDEX CONCURRENTLY foo_temp_idx ON my_table (id) WHERE
foo = 'bar';
postgres=# SELECT COUNT(*) FROM my_table WHERE foo = 'bar';  -- Make sure
this is 0.
postgres=# ALTER TABLE my_table ADD CONSTRAINT my_check CHECK (foo !=
'bar');
postgres=# DROP INDEX foo_temp_idx;

The third step here would be fast, because it would be able to use
foo_temp_idx under the hood. Additionally, it would be easy to get a sense
for how long this operation will lock your table by timing the query in the
second step. (I suppose the latter is true already if you do the same with
enable_indexscan off, but that requires knowing that PostgreSQL is going to
do the seq scan no matter what.)

Would y'all be open to a patch that made this change?

Best,
-Dan


[HACKERS] HEADS UP: PGCon 2015 is in June

2014-09-27 Thread Dan Langille
HEADS UP.

PGCon 2015 will be in June.  That’s a few weeks later than in previous years.

— 
Dan Langille





[HACKERS] PGCon 2014 - last chance

2014-01-19 Thread Dan Langille
Today is your last chance to submit a proposal for PGCon 2014.

PGCon 2014 will be on 20-24 May 2014 at University of Ottawa.

* 20-21 (Tue-Wed) tutorials
* 22-23 (Thu-Fri) talks - the main part of the conference
* 24 (Sat) The Unconference (very popular in 2013, the first year)

See http://www.pgcon.org/2014/


We are now accepting proposals for the main part of the conference (22-23 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2013 Proposal acceptance begins
19 Jan 2014 Proposal acceptance ends
19 Feb 2014 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also <http://www.pgcon.org/2014/papers.php>

Instructions for submitting a proposal to PGCon 2014 are available
from: <http://www.pgcon.org/2014/submissions.php>

-- 
Dan Langille - http://langille.org





Re: [HACKERS] Re: How to reproduce serialization failure for a read only transaction.

2014-01-07 Thread Dan Ports
On Mon, Jan 06, 2014 at 05:14:12PM -0800, AK wrote:
> Also I cannot reproduce a scenario when "applications must not depend on
> results read during a transaction that later aborted;". In this example the
> SELECT itself has failed.
> Can you show an example where a SELECT completes, but the COMMIT blows up?

Actually, no, not for a read-only transaction. It happens that the
final serialization failure check executed on COMMIT only affects
read/write transactions, not read-only ones. That's a pretty specific
implementation detail, though, so I wouldn't necessarily rely on it...

Here's an example of why applications must not depend on results read
during a transaction that later aborted:

   W2: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
   W2: UPDATE t SET count=1 WHERE id=1;
   W1: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
   W1: SELECT * FROM t WHERE id=1;
   W2: COMMIT;
   R : BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY
   R : SELECT * FROM t;
   R : COMMIT;
 ! W1: UPDATE t SET count=1 WHERE id=2;
   W1: COMMIT;

If you try this, it'll cause a serialization failure on the line marked
with a '!'. W1 saw (1,0) in the table, so W1 appears to have executed
before W2. But R saw both (1,1) and (2,0) in the table, and that has to
be a consistent snapshot of the database state, meaning W2 appears to
have executed before W1. That's an inconsistency, so something has to
be rolled back. This particular anomaly requires all three of the
transactions, and so it can't be detected until W1 does its UPDATE.
Postgres detects the conflict at that point and rolls back W1.

So what does this have to do with relying on the results of read-only
transactions that abort? Well, what if you had instead had R ROLLBACK
instead of COMMIT -- maybe because you expected ROLLBACK and COMMIT to
be equivalent for transactions that don't modify the database, or maybe
because something else caused the transaction to abort? When W1 does
its update, it will be checked for serialization failures, but aborted
transactions are (intentionally) not included in those checks. W1 is
therefore allowed to commit; the apparent serial order of execution is
W1 followed by W2, and the results of the aborted transaction R aren't
consistent with that.

Dan

-- 
Dan R. K. Ports        UW CSE        http://drkp.net/




[HACKERS] PGCon 2014 call for papers

2013-12-12 Thread Dan Langille
PGCon 2014 will be on 20-24 May 2014 at University of Ottawa.

* 20-21 (Tue-Wed) tutorials
* 22-23 (Thu-Fri) talks - the main part of the conference
* 24 (Sat) The Unconference (very popular in 2013, the first year)

See http://www.pgcon.org/2014/

We are now accepting proposals for the main part of the conference (22-23 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2013 Proposal acceptance begins
19 Jan 2014 Proposal acceptance ends
19 Feb 2014 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also <http://www.pgcon.org/2014/papers.php>

Instructions for submitting a proposal to PGCon 2014 are available
from: <http://www.pgcon.org/2014/submissions.php>

-- 
Dan Langille - http://langille.org





Re: [HACKERS] SSI freezing bug

2013-10-07 Thread Dan Ports
On Mon, Oct 07, 2013 at 12:26:37PM +0300, Heikki Linnakangas wrote:
> When updating a tuple, CheckTargetForConflictsIn() only marks a
> conflict if the transaction holding the predicate lock overlapped
> with the updating transaction.

Ah, this is the bit I was forgetting. (I really ought to have
remembered that, but it's been a while...)

I think it's possible, then, to construct a scenario where a slot is
reused before predicate locks on the old tuple are eligible for
cleanup -- but those locks will never cause a conflict.

So I agree: it's correct to just remove the xmin from the key
unconditionally.

And this is also true:

> And if there's a hole in that thinking I can't see right now,
> the worst that will happen is some unnecessary conflicts, ie. it's
> still correct. It surely can't be worse than upgrading the lock to a
> page-level lock, which would also create unnecessary conflicts.

Dan

-- 
Dan R. K. Ports        UW CSE        http://drkp.net/




Re: [HACKERS] SSI freezing bug

2013-10-03 Thread Dan Ports
On Thu, Oct 03, 2013 at 06:19:49AM -0700, Kevin Grittner wrote:
> Heikki Linnakangas  wrote:
> > IMHO it would be better to remove xmin from the lock key, and vacuum
> > away the old predicate locks when the corresponding tuple is vacuumed.
> > The xmin field is only required to handle the case that a tuple is
> > vacuumed, and a new unrelated tuple is inserted to the same slot.
> > Removing the lock when the tuple is removed fixes that.

This seems definitely safe: we need the predicate locks to determine if
someone is modifying a tuple we read, and certainly if it's eligible
for vacuum nobody's going to be modifying that tuple anymore.

> > In fact, I cannot even come up with a situation where you would have a
> > problem if we just removed xmin from the key, even if we didn't vacuum
> > away old locks. I don't think the old lock can conflict with anything
> > that would see the new tuple that gets inserted in the same slot. I have
> > a feeling that you could probably prove that if you stare long enough at
> > the diagram of a dangerous structure and the properties required for a
> > conflict.

This would also be safe, in the sense that it's OK to flag a
conflict even if one doesn't exist. I'm not convinced that it isn't
possible to have false positives this way. I think it's possible for a
tuple to be vacuumed away and the ctid reused before the predicate
locks on it are eligible for cleanup. (In fact, isn't this what was
happening in the thread Kevin linked?)

> You are the one who suggested adding xmin to the key:
> 
> http://www.postgresql.org/message-id/4d5a36fc.6010...@enterprisedb.com
> 
> I will review that thread in light of your recent comments, but the
> fact is that xmin was not originally in the lock key, testing
> uncovered bugs, and adding xmin fixed those bugs.  I know I tried
> some other approach first, which turned out to be complex and quite
> messy -- it may have been similar to what you are proposing now.

At the time, we thought it was necessary for a predicate lock to lock
*all future versions* of a tuple, and so we had a bunch of code to
maintain a version chain. That was fraught with bugs, and turned out
not to be necessary (IIRC, we worked that out at the pub at PGcon).
That made it critical to distinguish different tuples that had the same
ctid because they could wind up in the wrong chain or cause a cycle.
With that code ripped out, that's no longer an issue.

But all this is an exceptionally subtle part of what was an
exceptionally complex patch, so a lot of careful thought is needed
here...

> It seems to me that a change such as you are now suggesting is
> likely to be too invasive to back-patch.  Do you agree that it
> would make sense to apply the patch I have proposed, back to 9.1,
> and then consider any alternative as 9.4 material?

I agree with this.

Dan

-- 
Dan R. K. Ports        UW CSE        http://drkp.net/




Re: [HACKERS] lwlock contention with SSI

2013-04-10 Thread Dan Ports
On Tue, Apr 09, 2013 at 07:49:51PM -0400, Robert Haas wrote:
> These locks are all SSI-related and they're all really hot.  Lock 28
> is SerializableXactHashLock and lock 29 is
> SerializableFinishedListLock; both are acquired an order of magnitude
> more often than any non-SSI lock, and cause two orders of magnitude
> more blocking than any other lock whatsoever.  Lock 30 is
> SerializablePredicateLockListLock, which has no exclusive lock
> acquisitions at all on this test, but the shared acquisitions result
> in significant spinlock contention.

This matches what I saw when I looked into this a while ago. I even
started sketching out some plans of how we might deal with it, but
unfortunately I never had much time to work on it, and that seems
unlikely to change any time soon. :-\

As it is, pretty much any operation involving SSI requires acquiring
SerializableXactHashLock (usually exclusive), except for checking
whether a read or write indicates a conflict. That includes starting
and ending a transaction.

Two things make this hard to fix:
 - SSI is about checking for rw-conflicts, which are inherently about
   *pairs* of transactions. This makes it hard to do fine-grained
   locking, because a lot of operations involve looking at or modifying
   the conflict list of more than one transaction.
 - SerializableXactHashLock protects many things. Besides the 
   SERIALIZABLEXACT structures themselves, there's also the free lists
   for SERIALIZABLEXACTs and RWConflicts, the SerializableXidHash
   table, the latest SxactCommitSeqno and SxactGlobalXmin, etc.

I'm trying to swap back in my notes about how to address this. It is
bound to be a substantial project, however.

Dan

-- 
Dan R. K. Ports        UW CSE        http://drkp.net/




Re: [HACKERS] leaking lots of unreferenced inodes (pg_xlog files?), maybe after moving tables and indexes to tablespace on different volume

2013-03-13 Thread Dan Thomas
We're seeing similar behaviour on several of our FreeBSD servers too.
It doesn't look like open files, or filesystem snapshots. Rebooting
does reset it, but restarting PG makes no difference.

We've got an assortment of different versions of both FreeBSD and
PostgreSQL, some of which are demonstrating this behaviour, some
aren't. Here's a quick breakdown of versions and what we've got
running:

FreeBSD   PostgreSQL   Leaking?
8.0       8.4.4        no
8.2       9.0.4        no
8.3       9.1.4        yes
8.3       9.2.3        yes
9.1       9.2.3        yes

All of these machines are under similar load patterns and (apart from
the version differences), are set up essentially the same and are
doing the same job. They all have hot standbys yet this problem
doesn't exist on any of the standby servers. We haven't done anything
with tablespaces, the database has its own dedicated partition
(although pg_log/pg_xlog are both symlinked out to /usr).

However (just to throw a spanner in the works) we do have another
server running fbsd8.3/pg9.1.4 which ISN'T showing this behaviour -
although its load patterns are quite different.

I'm not sure if this is going to help, but here's a graph of this disk
space disparity over the last few days (Y axis is in gigabytes). The
flat-ish part in the middle is the weekend where we have little
traffic, so we can at least say it's not constant:
http://i.imgur.com/jlbgzNI.png

Up until now we've been upgrading things in the hope that the problem
will go away, but since we've got one server up to fbsd9.1/pg9.2.3 and
still seeing the problem we're a little stumped. Any ideas about how
we can go about debugging this would be appreciated.

Thanks,

Dan

On 13 March 2013 07:39, Magnus Hagander  wrote:
>
> On Mar 13, 2013 3:04 AM, "Tom Lane"  wrote:
>>
>> Palle Girgensohn  writes:
>> > ... I got lots of space freed
>> > up, but it seems that after that the disk usage grows linearly (it seems
>> > to leave many inodes unreferenced).
>>
>> Hm.  We've seen issues in the past with PG processes failing to close
>> no-longer-useful files promptly, but ...
>>
>> > Strange thing is I cannot find any open files.
>>
>> ... that suggests there's something else going on.
>>
>> > The unreferenced inodes are almost exclusively around 16 MB in size, so
>> > i.e. they would most probably all be pg_xlog files.
>>
>> Have you got any sort of WAL archiving active, and if so maybe that's
>> holding onto WAL files?  Not that it's clear how come lsof wouldn't
>> tattle on an archiving process either.
>>
>> > Stopping postgresql briefly did not help, I tried that.
>>
>> That seems to point the finger at some non-postgres cause.  I confess
>> I can't guess what.
>>
>
> Yeah, unreferenced inodes with no open files, and only discoverable with
> fsck sounds like a filesystem bug to me. Particularly since it showed up just
> after an operating system upgrade, and doesn't go away with a postgres
> restart...
>
> /Magnus




[HACKERS] PGCon 2013 - CFP & unconference day

2013-01-03 Thread Dan Langille
Folks,

The PGCon Call for Papers went out last month.  But you have about two weeks
left to respond.  If you are doing something interesting with PostgreSQL, 
please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

See this URL for details:

   http://lists.pgcon.org/pipermail/pgcon-announce/2012-December/90.html 

New this year, we are having an unconference on the Saturday right after the
conference. The content of the unconference will be determined, on the day,
by the attendees. We expect heavy attendance by developers and users of 
PostgreSQL.

Be sure to submit your proposal soon because time is running out.

-- 
Dan Langille - http://langille.org



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange errors from 9.2.1 and 9.2.2 (I hope I'm missing something obvious)

2012-12-16 Thread Dan Scott
On Dec 11, 2012 9:28 PM, "David Gould"  wrote:
>
> Thank you. I got the example via cut and paste from email and pasted it
> into psql on different hosts. od tells me it ends each line with:
>
>   \n followed by 0xC2 0xA0 and then normal spaces. The C2A0 thing is
>   apparently NO-BREAK SPACE. Invisible, silent, odorless but still deadly.
>
> Which will teach me not to accept text files from the sort of people who
> write code in Word I guess.

It's not just Word... I was bitten by this last week by a WYSIWYG HTML
widget I was using to write some documentation. When I copied the examples
I had created out of said environment during a final technical accuracy
pass and they failed to run in psql, I panicked for a few minutes.

I eventually determined that, rather than just wrapping my code in <pre>
tags, the widget had created &nbsp; entities that were faithfully converted
into Unicode non-breaking spaces in the psql input.


[HACKERS] PGCon 2013 - call for papers

2012-12-08 Thread Dan Langille
PGCon 2013 will be on 23-24 May 2013 at University of Ottawa.

This year, we are planning to have an un-conference day around PGCon.
This is currently being scheduled.  More information on the
un-conference will be available within a few weeks.

NOTE: the un-conference day content will be set on the day by those turning up
on that day.  We expect heavy attendance by developers and users of PostgreSQL.

We are now accepting proposals for the main part of the conference (23-24 May).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump-start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2012 Proposal acceptance begins
19 Jan 2013 Proposal acceptance ends
19 Feb 2013 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also <http://www.pgcon.org/2013/papers.php>

Instructions for submitting a proposal to PGCon 2013 are available
from: <http://www.pgcon.org/2013/submissions.php>

-- 
Dan Langille - http://langille.org



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Dan Scott
Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
 wrote:
> Dear Dan,
>
> thank you for your reply.
>
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
> a utf8 locale
>
> A short 5 line dictionary file  is sufficient to test:-
>
> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
> 󶒘󴮬
>
> line 1 "raeuz" Zhuang word written using English letters and show up
> under ts_vector ok
> line 2 "我们" uses everyday Chinese word and show up under ts_vector ok
> line 3 "𦘭𥎵" Zhuang word written using rather old Chinese characters
> found in Unicode 3.1 which came in about the year 2000  and show up
> under ts_vector ok
> line 4 "𪽖𫖂" Zhuang word written using rather old Chinese characters
> found in Unicode 5.2 which came in about the year 2009 but do not show
> up under ts_vector ok
> line 5 "󶒘󴮬" Zhuang word written using rather old Chinese characters
> found in PUA area of the font Sawndip.ttf but do not show up under
> ts_vector ok (Font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> The last two words even though included in a dictionary do not get
> accepted by ts_vector.

Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
work using the default text search configuration (albeit with one
crucial note: I created the database with the "lc_ctype=C
lc_collate=C" options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('󶒘󴮬');
ts_debug

 (word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
  lc_ctype
-
 en_US.UTF-8
(1 row)

foobaz=# select ts_debug('󶒘󴮬');
ts_debug
-
 (blank,"Space symbols",󶒘󴮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch, normalize search_path in index

2012-09-30 Thread Dan Scott
On Fri, Sep 28, 2012 at 1:40 PM, Karl O. Pinc  wrote:
> Hi,
>
> The attached patch (against git head)
> normalizes "search_path" as the thing indexed
> and uses a secondary index term to distinguish
> the configuration parameter from the run-time
> setting.

Makes sense to me, although I suspect the conceptual material is
better served by the "search path"-the-concept index entry and the
reference material by the "search_path configuration parameter" entry
(so, from that perspective, perhaps the patch should just be to remove
the "search_path" index entry from the DDL schemas conceptual
section).

> "search path" the concept remains distinguished
> in the index from "search_path" the setting/config param.
> It's hard to say whether it's useful to make this
> distinction.

I think that indexing "search path"-the-concept is useful for
translations, and the Japanese translation includes an index (I
couldn't find the index for the French translation).

-- 
Dan Scott
Laurentian University


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending range of to_tsvector et al

2012-09-30 Thread Dan Scott
On Sun, Sep 30, 2012 at 1:56 PM, johnkn63  wrote:
> When using to_tsvector  a number of newer unicode characters and pua
> characters are not included. How do I add the characters which I desire to
> be found?

I've just started digging into this code a bit, but from what I've
found src/backend/tsearch/wparser_def.c defines much of the parser
functionality, and in the area of Unicode includes a number of
comments like:

* with multibyte encoding and C-locale isw* function may fail or give
wrong result.
* multibyte encoding and C-locale often are used for Asian languages.
* any non-ascii symbol with multibyte encoding with C-locale is an
alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
:)

Also note that src/test/regress/sql/tsearch.sql and
regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your
environment looks like (OS, PostgreSQL version, encoding and locale
settings for the database) and show some sample to_tsquery() @@
to_tsvector() queries that don't behave the way you think they should
behave - and we could start building some test cases as a first step?

-- 
Dan Scott
Laurentian University


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql gram.y make rule

2012-09-24 Thread Dan Scott
On Mon, Sep 24, 2012 at 10:21 PM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> I wanted to refactor the highly redundant flex and bison rules
>> throughout the source into common pattern rules.  (Besides saving some
>> redundant code, this could also help some occasionally flaky code in
>> pgxs modules.)  The only outlier that breaks this is in plpgsql
>
>> pl_gram.c: gram.y
>
>> I would like to either rename the intermediate file(s) to gram.{c,h}, or
>> possibly rename the source file to pl_gram.y.  Any preferences or other
>> comments?
>
> Hmmm ... it's annoyed me for a long time that that file is named the
> same as the core backend's gram.y.  So renaming to pl_gram.y might be
> better.  On the other hand I have very little confidence in git's
> ability to preserve change history if we do that.  Has anyone actually
> done a file rename in a project with lots of history, and how well did
> it turn out?  (For instance, does git blame still provide any useful
> tracking of pre-rename changes?  If you try to cherry-pick a patch
> against the new file into a pre-rename branch, does it work?)

git handles renaming just fine with cherry-picks, no special options
necessary. (Well, there are probably corner cases, but it's code,
there are always corner cases!)

For "git log", you'll want to add the --follow parameter if you're
asking for the history of a specific file or directory beyond a
renaming event.

git blame will show you the commit that renamed the file, by default,
but then you can request the revision prior to that using the commit
hash || '^', for example. "git blame 2fb6cc90^ --
src/backend/parser/gram.y" to work your way back through history.

-- 
Dan Scott
Laurentian University


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-11 Thread Dan Ports
On Mon, Sep 10, 2012 at 10:44:57PM -0700, Jeff Davis wrote:
> For the archives, and for those not following the paper in detail, there
> is one situation in which SSI will abort a read-only transaction.
> 
> When there are three transactions forming a dangerous pattern where T1
> (read-only) has a conflict out to T2, and T2 has a conflict out to T3;
> and T3 is committed and T2 is prepared (for two-phase commit). In that
> situation, SSI can't roll back the committed or prepared transactions,
> so it must roll back the read-only transaction (T1).

This is true, but it isn't the only situation where a read-only
transaction can be rolled back -- this can happen even without
two-phase commit involved. 

You can have a situation where two read/write transactions T2 and T3
conflict such that T2 appears to have executed first in the serial
order, but T3 commits before T2. If there's a read-only transaction T1
that takes its snapshot between when T3 and T2 commit, it can't be
allowed to read the data that the other two transactions modified: it'd
see the changes made by T3 but not T2, violating the serial order.

Given a choice, we'd prevent this by aborting one of the read/write
transactions. But if they've both already committed by the time the
read-only transaction T1 does its reads, we'd have to abort it instead.

(Note that this is still an improvement over two-phase locking, which
wouldn't allow any of the transactions to execute concurrently!)


What I was getting at in my previous mail was that there aren't any
situations where COMMIT will return a serialization failure for
a read-only transaction.

Dan

-- 
Dan R. K. PortsUW CSEhttp://drkp.net/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Doc typo: lexems -> lexemes

2012-09-11 Thread Dan Scott
I ran across a minor typo while reviewing the full-text search
documentation. Attached is a patch to address the one usage of "lexems"
in a sea of "lexemes".

diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
new file mode 100644
index 978aa54..5305198
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
*** ts_rank(

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about SSI, subxacts, and aborted read-only xacts

2012-09-10 Thread Dan Ports
On Sat, Sep 08, 2012 at 11:34:56AM -0700, Jeff Davis wrote:
> If so, I think we need a documentation update. The serializable
> isolation level docs don't quite make it clear that serializability only
> applies to transactions that commit. It might not be obvious to a user
> that there's a difference between commit and abort for a RO transaction.
> I think that, in S2PL, serializability applies even to aborted
> transactions (though I haven't spent much time thinking about it), so
> users accustomed to other truly-serializable implementations might be

Yes, I agree that this is probably worth mentioning in the
documentation.

It might be worth noting that serializable mode will not cause
read-only transactions to fail to commit (as might be possible in some
optimistic concurrency control systems). However, it might require
other transactions to be aborted to ensure serializability. If the
user aborts the read-only transaction, that won't necessarily happen.

Figure 2 of the aforementioned paper is actually a nice example of
this. The read-only transaction T1 is allowed to commit, but as a
result T2 has to be aborted. If T1 had ABORTed instead of COMMIT, T2
would be allowed to proceed.

Dan

-- 
Dan R. K. PortsUW CSEhttp://drkp.net/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats

2012-03-25 Thread Dan McGee
This is a bit of a corner case in all honesty, but if you have a short
table (under 20 rows), the 10% heuristic used to decide whether
distinct values scale with the row count will result in rather odd
values for stadistinct in pg_statistic, such as '-0.2' or '-0.67',
rather than the expected '2'. Additionally, this can cause only one of
{t, f} to appear in the most common values array.
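
For anyone not familiar with the heuristic in question, here is a rough
sketch of it (illustrative only; this is not the actual analyze.c code and
the function name is made up), which also shows where values like -0.67
and -0.2 come from:

/*
 * Rough sketch of the heuristic described above: if the distinct-value
 * count exceeds 10% of the rows, assume n_distinct scales with the table
 * size and store it as a negative ratio instead of an absolute count.
 */
static double
sketch_stadistinct(double ndistinct, double totalrows)
{
    if (ndistinct > 0.1 * totalrows)
        return -(ndistinct / totalrows);    /* "scales with row count" */
    return ndistinct;                       /* fixed distinct-value count */
}

/* With 3 rows and 2 distinct booleans: 2 > 0.3, so -(2/3) = -0.67 is
 * stored rather than the expected 2; with 10 rows the result is -0.2. */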

Does this actually affect query planning in any way? Probably not, but
it is extremely odd to look at pg_stats for these columns, and the
solution seems easy.
---

The only other minor changes included here were to make it clear when we were
comparing float values, so use 0.0 instead of 0.

Example stats output from the database I noticed this on:

archweb=# SELECT c.relname, a.attname, pg_stat_get_live_tuples(c.oid) AS 
n_live_tup, stadistinct, stanullfrac, stawidth, stavalues1, stanumbers1 FROM 
pg_statistic s JOIN pg_class c ON c.oid = s.starelid JOIN pg_attribute a ON 
c.oid = a.attrelid AND a.attnum = s.staattnum LEFT JOIN pg_namespace n ON n.oid 
= c.relnamespace JOIN pg_type t ON t.oid = a.atttypid WHERE NOT a.attisdropped 
AND nspname = 'public' AND t.typname = 'bool' ORDER BY stadistinct, n_live_tup;
            relname            |    attname    | n_live_tup | stadistinct | stanullfrac | stawidth | stavalues1 |      stanumbers1
-------------------------------+---------------+------------+-------------+-------------+----------+------------+-----------------------
 mirrors_mirrorprotocol        | is_download   |          3 |       -0.67 |           0 |        1 | {t}        | {0.67}
 arches                        | agnostic      |          3 |       -0.67 |           0 |        1 | {f}        | {0.67}
 repos                         | staging       |         10 |        -0.2 |           0 |        1 | {f,t}      | {0.7,0.3}
 repos                         | testing       |         10 |        -0.2 |           0 |        1 | {f,t}      | {0.7,0.3}
 devel_pgpsignature            | valid         |        264 |           1 |           0 |        1 | {t}        | {1}
 packages_flagrequest          | is_spam       |        415 |           1 |           0 |        1 | {f}        | {1}
 donors                        | visible       |        716 |           1 |           0 |        1 | {t}        | {1}
 auth_user                     | is_superuser  |         95 |           2 |           0 |        1 | {f,t}      | {0.957895,0.0421053}
 user_profiles                 | notify        |         95 |           2 |           0 |        1 | {t,f}      | {0.957895,0.0421053}
 auth_user                     | is_active     |         95 |           2 |           0 |        1 | {t,f}      | {0.621053,0.378947}
 auth_user                     | is_staff      |         95 |           2 |           0 |        1 | {f,t}      | {0.873684,0.126316}
 releng_iso                    | active        |        158 |           2 |           0 |        1 | {f,t}      | {0.89,0.106667}
 mirrors_mirror                | isos          |        180 |           2 |           0 |        1 | {t,f}      | {0.972678,0.0273224}
 mirrors_mirror                | active        |        180 |           2 |           0 |        1 | {t,f}      | {0.672131,0.327869}
 mirrors_mirror                | public        |        180 |           2 |           0 |        1 | {t,f}      | {0.978142,0.0218579}
 mirrors_mirrorurl             | has_ipv6      |        379 |           2 |           0 |        1 | {f,t}      | {0.709763,0.290237}
 mirrors_mirrorurl             | has_ipv4      |        379 |           2 |           0 |        1 | {t}        | {0.997361}
 packages_flagrequest          | is_legitimate |        415 |           2 |           0 |        1 | {t,f}      | {0.992754,0.00724638}
 packages_signoffspecification | enabled       |       1130 |           2 |           0 |        1 | {t,f}      | {0.977578,0.0224215}
 packages_signoffspecification | known_bad     |       1130 |           2 |           0 |        1 | {f,t}      | {0.993722,0.00627803}
 mirrors_mirrorlog             | is_success    |      12715 |           2 |           0 |        1 | {t,f}      | {0.953345,0.0466552}
 package_depends               | optional      |      28592 |           2 |           0 |        1 | {f,t}      | {0.880322,0.119678}
 package_files                 | is_directory  |     225084 |           2 |           0 |        1 | {f,t}      | {0.829933,0.170067}
(23 rows)


 src/backend/commands/analyze.c |   18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9cd6e67..995ed9d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2110,7 +2110,7 @@ compute_minimal_stats(VacAttrStatsP stats,
 * least 2 instances in the sample.
 */
if (track_cnt < track_max && to

Re: [HACKERS] a slightly stale comment

2012-03-07 Thread Dan Ports
On Wed, Mar 07, 2012 at 07:46:32AM +, Simon Riggs wrote:
> There is much wisdom there and much wisdom in leaving ancient warnings
> as we find them.

The comment is a wise and insightful statement -- about a totally
different system than we have today.

> Are these the words you object to?
> 
> "we don't need to
> >  *  check commit time against the start time of this transaction
> >  *  because 2ph locking protects us from doing the wrong thing."

Yes, that clearly isn't true, and the subsequent bit about catalog
accesses isn't right either -- they may not be serializable, but that
isn't the reason why.

I don't particularly object to the warning that "the tests in this
routine are correct" (although indeed the fact that they've changed
over the years does seem to belie it).

So I'm also in favor of just removing the comment entirely.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] a slightly stale comment

2012-03-06 Thread Dan Ports
While mucking around in src/backend/utils/time/tqual.c today, I noticed
the following comment attached to HeapTupleSatisfiesNow:

 *  mao says 17 march 1993:  the tests in this routine are correct;
 *  if you think they're not, you're wrong, and you should think
 *  about it again.  i know, it happened to me.  we don't need to
 *  check commit time against the start time of this transaction
 *  because 2ph locking protects us from doing the wrong thing.
 *  if you mess around here, you'll break serializability.  the only
 *  problem with this code is that it does the wrong thing for system
 *  catalog updates, because the catalogs aren't subject to 2ph, so
 *  the serializability guarantees we provide don't extend to xacts
 *  that do catalog accesses.  this is unfortunate, but not critical.

Much as I hate to disturb a comment just before its 19th birthday, the
bit about two-phase locking and serializability hasn't been correct
since around 1999 (when MVCC was added). :-)

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible new option for wal_sync_method

2012-03-02 Thread Dan Scales
Hi,

> Got any result so far?

I measured the results with barrier=0, and yes, you are correct -- it seems 
that most of the benefit of the open_direct wal_sync_method is probably from 
not doing the barrier operation at the end of fsync():

                          wal_sync_method
                        fdatasync   open_direct   open_sync
no archive, barrier=1:  17309       18507         17138
no archive, barrier=0:  17771       18369         18045

archive, barrier=1   :  15789       16592         15645
archive, barrier=0   :  16616       16785         16547


It took me a while to look through Linux, and understand why barrier=1 had such 
an effect, even for disks with battery-backed caches.  As you
pointed out, the barrier operation not only flushes the disk cache, but also 
has some queue implications, particularly for Linux releases below
2.6.37.  I've been using 2.6.32, and in that case, the barrier at the end of 
fsync requires that all previously-queued operations be finished before the 
barrier occurs and flushes the disk cache.  This means that each fsync of the 
WAL log is likely waiting for completely unrelated in-flight operations of the 
data files.  That is why getting rid of the fsync of the WAL log has such a 
good performance win, even for disks that don't have a disk cache flush 
(because the cache is battery backed).  This option will probably have less 
benefit for Linux 2.6.37 and above, where
barriers are eliminated, and operations are written more specifically in terms 
of disk cache flushes.

fsync() on ext3 (even for Linux 2.6.37 and above) does still wait for any 
outstanding meta-data transaction to commit.  So, there is still another
reason to put the WAL log and data files on different logical disks (even if 
backed by the same physical disk).

It does still seem to me that sync_file_range() is unsafe in the case of 
non-battery-backed disk write caches, since it doesn't sync the disk
cache.  However, if sync_file_range() was being used to optimize checkpoint 
fsyncs, then one final fsync() to an unused file on the same block
device would do the trick of flushing the disk cache.
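
As a concrete illustration of that workaround, here is a minimal sketch
(this is just an illustration, not code from any patch; it assumes the
scratch file lives on the same filesystem, and whether its fsync() really
forces a drive cache flush depends on the filesystem and kernel version,
as discussed above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Push a file range to the device with sync_file_range(), which does not
 * flush the drive's write cache, then fsync() a small, otherwise-unused
 * scratch file on the same block device purely to trigger a cache flush.
 */
static int
flush_range_then_cache(int datafd, off_t offset, off_t nbytes, int scratchfd)
{
    if (sync_file_range(datafd, offset, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) < 0)
        return -1;

    return fsync(scratchfd);    /* scratchfd: unused file, same filesystem */
}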

Dan

- Original Message -
From: "Andres Freund" 
To: pgsql-hackers@postgresql.org
Cc: "Dan Scales" 
Sent: Monday, February 27, 2012 12:43:49 PM
Subject: Re: [HACKERS] possible new option for wal_sync_method

Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks.  From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not.  So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.
The experiments I know which played with disabling write caches nearly always 
had the result that write caching was worth the overhead of syncing.

> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush.  The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array.  And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.
Well, I think it has some io queue implications which could explain some of 
the difference. With that regard I think it heavily depends on the kernel 
version as that's an area which had loads of pretty radical changes in nearly 
every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for.)  I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations.  It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).
I think you are right there. I think the metadata issue could be relieved a 
lot by doing the growing of files in way much larger bits than currently. I 
have seen profiles which indicated that lots of time was spent on increasing 
the file size. I would be very interested in seing how much changes in that 
area would benefit real-world benchmarks.

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive.  In that case, this
> wal_sync_method option might help ext4 as well.
The journaling code for ext4 is significantly different so I think it very 
well might play a role here - although you're probably right and it won't be in 
*_sync_file.

> With respect to sync_file_range(), the Linux c

Re: [HACKERS] possible new option for wal_sync_method

2012-02-16 Thread Dan Scales
Good point, thanks.  From the ext3 source code, it looks like
ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
block device, whereas simple direct IO does not.  So, that would make
this wal_sync_method option less useful, since, as you say, the user
would have to know if the block device is doing write caching.

For the numbers I reported, I don't think the performance gain is from
not doing the block device flush.  The system being measured is a Fibre
Channel disk which should have a fully-nonvolatile disk array.  And
measurements using systemtap show that blkdev_issue_flush() always takes
only in the microsecond range.

I think the overhead is still from the fact that ext3_sync_file() waits
for the current in-flight transaction if there is one (and does an
explicit device flush if there is no transaction to wait for.)  I do
think there are lots of meta-data operations happening on the data files
(especially for a growing database), so the WAL log commit is waiting for
unrelated data operations.  It would be nice if there were a simple file
system operation that just flushed the cache of the block device
containing the filesystem (i.e. just does the blkdev_issue_flush() and
not the other things in ext3_sync_file()).

The ext4_sync_file() code looks fairly similar, so I think it may have
the same problem, though I can't be positive.  In that case, this
wal_sync_method option might help ext4 as well.

With respect to sync_file_range(), the Linux code that I'm looking at
doesn't really seem to indicate that there is a device flush (since it
never calls a f_op->fsync_file operation).  So sync_file_range() may not be
as useful as thought.

By the way, all the numbers were measured with "data=writeback,
barrier=1" options for ext3.  I don't think that I have seen a
significant different when the DBT2 workload for ext3 option
data=ordered.

I will measure all these numbers again tonight, but with barrier=0, so as
to try to confirm that the write flush itself isn't costing a lot for
this configuration.

Dan


- Original Message -
From: "Andres Freund" 
To: pgsql-hackers@postgresql.org
Cc: "Dan Scales" 
Sent: Thursday, February 16, 2012 10:32:09 AM
Subject: Re: [HACKERS] possible new option for wal_sync_method

Hi,

On Thursday, February 16, 2012 06:18:23 PM Dan Scales wrote:
> When running Postgres on a single ext3 filesystem on Linux, we find that
> the attached simple patch gives significant performance benefit (7-8% in
> numbers below).  The patch adds a new option for wal_sync_method, which
> is "open_direct".  With this option, the WAL is always opened with
> O_DIRECT (but not O_SYNC or O_DSYNC).  For Linux, the use of only
> O_DIRECT should be correct.  All WAL logs are fully allocated before
> being used, and the WAL buffers are 8K-aligned, so all direct writes are
> guaranteed to complete before returning.  (See
> http://lwn.net/Articles/348739/)
I don't think that behaviour is safe in the face of write caches in the IO 
path. Linux takes care to issue flush/barrier instructions when necessary if 
you issue an fsync/fdatasync, but to my knowledge it does not when O_DIRECT is 
used (That would suck performancewise).
I think that behaviour is safe if you have no externally visible write caching 
enabled but that's not exactly easy to get/document knowledge.


Why should there otherwise be any performance difference between O_DIRECT|
O_SYNC and O_DIRECT in the WAL write case? There is no metadata that needs to be 
written and I have a hard time imagining that the check whether there is 
metadata is that expensive.

I guess a more interesting case would be comparing O_DIRECT|O_SYNC with 
O_DIRECT + fdatasync() or even O_DIRECT +  
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | 
SYNC_FILE_RANGE_WAIT_AFTER)

Any special reason you've done that comparison on ext3? Especially with 
data=ordered its behaviour regarding syncs is pretty insane performancewise. 
Ext4 would be a bit more interesting...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] possible new option for wal_sync_method

2012-02-16 Thread Dan Scales
When running Postgres on a single ext3 filesystem on Linux, we find that
the attached simple patch gives significant performance benefit (7-8% in
numbers below).  The patch adds a new option for wal_sync_method, which
is "open_direct".  With this option, the WAL is always opened with
O_DIRECT (but not O_SYNC or O_DSYNC).  For Linux, the use of only
O_DIRECT should be correct.  All WAL logs are fully allocated before
being used, and the WAL buffers are 8K-aligned, so all direct writes are
guaranteed to complete before returning.  (See
http://lwn.net/Articles/348739/)

The advantage of using O_DIRECT is that there is no fsync/fdatasync()
used.  All of the other wal_sync_methods use fsync/fdatasync(), either
explicitly or implicitly (via the O_SYNC and O_DATASYNC options).
fsync/fdatasync can be very slow on ext3, because it seems to have to
always wait for the current filesystem meta-data transaction to complete,
even if that meta-data operation is completely unrelated to the file
being fsync'ed.  There can be many metadata operations happening on the
data files, so the WAL log fsync can wait for metadata operations on
the data files.  Since O_DIRECT does not do any fsync/fdatasync operation,
it avoids this bottleneck, and can finish more quickly on average.
The open_sync and open_dsync options do not have this benefit, because
they do an equivalent of an fsync/fdatasync after every WAL write.
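
To illustrate what the open_direct path is relying on (this is not the
patch itself, just a self-contained sketch; it assumes a fully preallocated
segment and an 8K-aligned buffer, offset and length, and it deliberately
says nothing about volatile drive caches, which is a separate question):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192

/*
 * Write one aligned chunk to a preallocated WAL segment opened with
 * O_DIRECT only (no O_SYNC/O_DSYNC): because the blocks are already
 * allocated, the write goes straight to them and no fsync/fdatasync is
 * issued afterwards.
 */
static int
direct_wal_write(const char *path, off_t offset, const char *data, size_t len)
{
    void   *buf;
    int     fd = open(path, O_WRONLY | O_DIRECT);

    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, XLOG_BLCKSZ, len) != 0)
    {
        close(fd);
        return -1;
    }
    memcpy(buf, data, len);     /* len and offset must be multiples of 8K */
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
    {
        free(buf);
        close(fd);
        return -1;
    }
    free(buf);
    return close(fd);
}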

For the open_sync and open_dsync options, O_DIRECT is used for writes
only if the xlog will not need to be consumed by the archiver or
hot-standby.  I am not keying the open_direct behavior based on whether
XLogIsNeeded() is true, because we see performance gain even when
archiving is enabled (using a simple script that copies and compresses
the log segments).  For 2-processor, 50-warehouse DBT2 run on SLES 11, I
get the following NOTPM results:

                 wal_sync_method
               fdatasync   open_direct   open_sync

archiving off: 17076       18481         17094
archiving on:  15704       16923         15898


Do folks have any interest in this change, or comments on its
usefulness/correctness?  It would be just an extra option for
wal_sync_method that users can try out and has benefits for certain
configurations.

Dan
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..a830a01 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -122,6 +122,7 @@ const struct config_enum_entry sync_method_options[] = {
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
 #endif
+	{"open_direct", SYNC_METHOD_OPEN_DIRECT, false},
 	{NULL, 0, false}
 };
 
@@ -1925,7 +1926,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 * fsync more than one file.
 		 */
 		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
+			sync_method != SYNC_METHOD_OPEN_DSYNC &&
+			sync_method != SYNC_METHOD_OPEN_DIRECT)
 		{
 			if (openLogFile >= 0 &&
 !XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
@@ -8958,6 +8960,15 @@ get_sync_bit(int method)
 		case SYNC_METHOD_OPEN_DSYNC:
 			return OPEN_DATASYNC_FLAG | o_direct_flag;
 #endif
+   case SYNC_METHOD_OPEN_DIRECT:
+			/*
+			 * Open the log with O_DIRECT flag only.  O_DIRECT guarantees
+			 * that data is written to disk when the IO completes if and
+			 * only if the file is fully allocated.  Fortunately, the log
+			 * files are always fully allocated by XLogFileInit() (or are
+			 * recycled from a fully-allocated log).
+			 */
+			return O_DIRECT;
 		default:
 			/* can't happen (unless we are out of sync with option array) */
 			elog(ERROR, "unrecognized wal_sync_method: %d", method);
@@ -9031,6 +9042,7 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
+		case SYNC_METHOD_OPEN_DIRECT:
 			/* write synced it already */
 			break;
 		default:
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 400c52b..97acde5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -564,3 +564,4 @@
 #--
 
 # Add settings for extensions here
+wal_sync_method = open_direct
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..b888ee7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -83,6 +83,7 @@ typedef struct XLogRecord
 #define SYNC_METHOD_OPEN		2		/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4		/* for O_DSYNC */
+#define SYNC_METHOD_OPEN_DIRECT	5		/* for O_DIRECT */
 extern int	sync_method;
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Dan Ports
On Tue, Feb 14, 2012 at 09:27:58AM -0600, Kevin Grittner wrote:
> Heikki Linnakangas  wrote:
> > On 14.02.2012 04:57, Dan Ports wrote:
> >> The easiest answer would be to just treat every prepared
> >> transaction found during recovery as though it had a conflict in
> >> and out. This is roughly a one-line change, and it's certainly
> >> safe.
>  
> Dan, could you post such a patch, please?

Sure. See attached.

> Should we add anything to the docs to warn people that if they crash
> with serializable prepared transactions pending, they will see this
> behavior until the prepared transactions are either committed or
> rolled back, either by the transaction manager or through manual
> intervention?

Hmm, it occurs to me that if we have to abort a transaction due to
serialization failure involving a prepared transaction, we might want
to include the prepared transaction's gid in the errdetail. 

This seems like it'd be especially useful for prepared transactions. We
can generally pick the transaction to abort to ensure the safe retry
property -- if that transaction is immediately retried, it won't
fail with the same conflict -- but we can't always guarantee that when
prepared transactions are involved. And prepared transactions already
have a convenient, user-visible ID we can report.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index b75b73a..b102e19 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -4733,14 +4733,11 @@ AtPrepare_PredicateLocks(void)
 	xactRecord->flags = MySerializableXact->flags;
 
 	/*
-	 * Tweak the flags. Since we're not going to output the inConflicts and
-	 * outConflicts lists, if they're non-empty we'll represent that by
-	 * setting the appropriate summary conflict flags.
+	 * Note that we don't include the list of conflicts in or out in
+	 * the statefile, because new conflicts can be added even after the
+	 * transaction prepares. We'll just make a conservative assumption
+	 * during recovery instead.
 	 */
-	if (!SHMQueueEmpty(&MySerializableXact->inConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
-	if (!SHMQueueEmpty(&MySerializableXact->outConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
 
 	RegisterTwoPhaseRecord(TWOPHASE_RM_PREDICATELOCK_ID, 0,
 		   &record, sizeof(record));
@@ -4875,15 +4872,6 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 
 		sxact->SeqNo.lastCommitBeforeSnapshot = RecoverySerCommitSeqNo;
 
-
-		/*
-		 * We don't need the details of a prepared transaction's conflicts,
-		 * just whether it had conflicts in or out (which we get from the
-		 * flags)
-		 */
-		SHMQueueInit(&(sxact->outConflicts));
-		SHMQueueInit(&(sxact->inConflicts));
-
 		/*
 		 * Don't need to track this; no transactions running at the time the
 		 * recovered xact started are still active, except possibly other
@@ -4905,6 +4893,17 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
    (MaxBackends + max_prepared_xacts));
 		}
 
+		/*
+		 * We don't know whether the transaction had any conflicts or
+		 * not, so we'll conservatively assume that it had both a
+		 * conflict in and a conflict out, and represent that with the
+		 * summary conflict flags.
+		 */
+		SHMQueueInit(&(sxact->outConflicts));
+		SHMQueueInit(&(sxact->inConflicts));
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
+		
 		/* Register the transaction's xid */
 		sxidtag.xid = xid;
 		sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Dan Ports
On Tue, Feb 14, 2012 at 10:04:15AM +0200, Heikki Linnakangas wrote:
> Perhaps it would be simpler to add the extra information to the commit 
> records of the transactions that commit after the first transaction is 
> prepared. In the commit record, you would include a list of prepared 
> transactions that this transaction conflicted with. During recovery, you 
> would collect those lists in memory, and use them at the end of recovery 
> to flag the conflicts in prepared transactions that are still in 
> prepared state.

Yeah, doing it that way might be a better strategy if we wanted to go
that route. I hadn't really considered it because I'm not that familiar
with the xlog code (plus, the commit record already contains a variable
length field, making it that much more difficult to add another).

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SSI rw-conflicts and 2PC

2012-02-13 Thread Dan Ports
Looking over the SSI 2PC code recently, I noticed that I overlooked a
case that could lead to non-serializable behavior after a crash.

When we PREPARE a serializable transaction, we store part of the
SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD
locks). One of the pieces of information we record is whether the
transaction had any conflicts in or out. The problem is that that can
change if a new conflict occurs after the transaction has prepared.

Here's an example of the problem (based on the receipt-report test):

-- Setup
CREATE TABLE ctl (k text NOT NULL PRIMARY KEY, deposit_date date NOT NULL);
INSERT INTO ctl VALUES ('receipt', DATE '2008-12-22');
CREATE TABLE receipt (receipt_no int NOT NULL PRIMARY KEY, deposit_date date 
NOT NULL, amount numeric(13,2));

-- T2
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt VALUES (3, (SELECT deposit_date FROM ctl WHERE k = 
'receipt'), 4.00);
PREPARE TRANSACTION 't2';

-- T3
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE ctl SET deposit_date = DATE '2008-12-23' WHERE k = 'receipt';
COMMIT;

-- T1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM ctl WHERE k = 'receipt';
SELECT * FROM receipt WHERE deposit_date = DATE '2008-12-22';
COMMIT;

Running this sequence of transactions normally, T1 will be rolled back
because of the pattern of conflicts T1 -> T2 -> T3, as we'd expect. This
should still be true even if we restart the database before executing
the last transaction -- but it's not. The problem is that, when T2
prepared, it had no conflicts, so we recorded that in the statefile.
The T2 -> T3 conflict happened later, so we didn't know about it during
recovery.

I discussed this a bit with Kevin and we agreed that this is important
to fix, since it's a false negative that violates serializability. The
question is how to fix it. There are a couple of options...

The easiest answer would be to just treat every prepared transaction
found during recovery as though it had a conflict in and out. This is
roughly a one-line change, and it's certainly safe. But the downside is
that this is pretty restrictive: after recovery, we'd have to abort any
serializable transaction that tries to read anything that a prepared
transaction wrote, or modify anything that it read, until that
transaction is either committed or rolled back.

To do better than that, we want to know accurately whether the prepared
transaction had a conflict with a transaction that prepared or
committed before the crash. We could do this if we had a way to append
a record to the 2PC statefile of an already-prepared transaction --
then we'd just add a new record indicating the conflict. Of course, we
don't have a way to do that. It'd be tricky to add support for this,
since it has to be crash-safe, so the question is whether the improved
precision justifies the complexity it would require.

A third option is to observe that the only conflicts *in* that matter
from a recovered prepared transaction are from other prepared
transactions. So we could have prepared transactions include in their
statefile the xids of any prepared transactions they conflicted with
at prepare time, and match them up during recovery to reconstruct the
graph. This is a middle ground between the other two options. It
doesn't require modifying the statefile after prepare. However, conflicts
*out* to non-prepared transactions do matter, and this doesn't record
those, so we'd have to do the conservative thing -- which means that
after recovery, no one can read anything a prepared transaction wrote.

I thought I'd throw these options out there to see which ones people
think are reasonable (or any better ideas). Of the three, I think the
first (simplest) solution is the only one we could plausibly backpatch
to 9.1. But if the extra aborts after recovery seem too expensive, we
may want to consider one of the other options for future releases.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: Making TRUNCATE more "MVCC-safe"

2012-02-11 Thread Dan Ports
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
> I guess I'm not particularly excited by the idea of trying to make
> TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
> READ isolation level, which is already known to be busted in a variety
> of ways; that's why we now have SERIALIZABLE, and why most people use
> READ COMMITTED.  Are there examples of this behavior at other
> isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-08 Thread Dan Scales
> Is there any problem if the double-write happens only by bgwriter or 
> checkpoint. 
> Something like whenever backend process has to evict the buffer, it will do 
> same as you have described that write in a double-write buffer, but 
> bgwriter will check this double-buffer and flush from it.
> Also whenever any backend will see that the double buffer is more than 2/3rd 
> or some threshold value full it will tell bgwriter to flush from 
> double-write buffer.
> This can ensure very less I/O by any backend.

Yes, I think this is a good idea.  I could make changes so that the backends 
hand off the responsibility to flush batches of the double-write buffer to the 
bgwriter whenever possible.  This would avoid some long IO waits in the 
backends, though the backends may of course eventually wait anyways for the 
bgwriter if IO is not fast enough.  I did write the code so that any process 
can write a completed batch if the batch is not currently being flushed (so as 
to deal with crashes by backends).  Having the backends flush the batches as 
they fill them up was just simpler for a first prototype.

Dan

- Original Message -----
From: "Amit Kapila" 
To: "Dan Scales" , "PG Hackers" 

Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

>> I think it is a good idea, and can help double-writes perform better in the 
>> case of lots of backend evictions.
   I don't understand this point, because from the data in your mail, it 
appears that when shared buffers are smaller (meaning more evictions can 
happen), the performance is lower.

ISTM that the performance is less when shared buffers size is less because 
I/O might happen by the backend process
which can degrade performance. 
Is there any problem if the double-write happens only by bgwriter or 
checkpoint. 
Something like whenever backend process has to evict the buffer, it will do 
same as you have described that write in a double-write buffer, but bgwriter  
will check this double-buffer and flush from it.
Also whenever any backend will see that the double buffer is more than 2/3rd or 
some threshhold value full it will tell bgwriter to flush from double-write 
buffer.
This can ensure very less I/O by any backend.


-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using "double-write buffer" approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes generally are copied to a double-write buffer, 
rather than being immediately written.  Note that a full copy of the page is 
required, but it can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes have been added), 
some writes are pushed out using double writes -- the pages are first written 
and fsynced to a double-write file, then written to the data files, which are 
then fsynced.  Then double writes allow for fixing torn pages, so 
full_page_writes can be turned off (thus greatly reducing the size of the WAL 
log).

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
enough batch has accumulated, then flush the batch using double
writes.  [I don't think I need to intercept calls to smgrextend(),
but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
needs a double write or not.  (I think eventually this would be an
attribute of the buffer, set when the page is WAL-logged, rather than
a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
rarely happens).

To get good performance, I needed to have two double-write buffers, one for the 
checkpointer and one for all other processes.  The double-write buffers are 
circular buffers.  The checkpointer double-write buffer is just a single batch 
of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches 
of 64 pages each.  Each batch goes to a different double-write file, so that 
they can be issued independently as soon as each batch 

Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-06 Thread Dan Scales
I don't know a lot about base backup, but it sounds like full_page_writes must 
be turned on for base backup, in order to deal with the inconsistent reads of 
pages (which you might call torn pages) that can happen when you backup the 
data files while the database is running.  The relevant parts of the WAL log 
are then copied separately (and consistently) once the backup of the data files 
is done, and used to "recover" the database into a consistent state later.

So, yes, good point -- double writes cannot replace the functionality of 
full_page_writes for base backup.  If double writes were in use, they might be 
automatically switched over to full page writes for the duration of the base 
backup.  And the double write file should not be part of the base backup.

Dan

- Original Message -
From: "Fujii Masao" 
To: "Dan Scales" 
Cc: "PG Hackers" 
Sent: Monday, February 6, 2012 3:08:15 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales  wrote:
> Let me know if you have any thoughts/comments, etc.  The patch is
> enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write can prevent torn-pages in neither double-write file
nor data file in *base backup*. Because both double-write file and data file can
be backed up while being written. Is this right? To avoid the torn-page problem,
we should write FPI to WAL during online backup even if the double-write has
been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-05 Thread Dan Scales
Thanks for the detailed followup.  I do see how Postgres is tuned for
having a bunch of memory available that is not in shared_buffers, both
for the OS buffer cache and other memory allocations.  However, Postgres
seems to run fine in many "large shared_memory" configurations that I
gave performance numbers for, including 5G shared_buffers for an 8G
machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes
implementation has with BULK_WRITEs.  As you point out, the
BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.
I'm not sure if there is a great solution that always works for that
issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
archiving/replication is happening.  As I understand it, if the
BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
double-written either.  The BULK_WRITE data is not officially synced and
committed until it is all written, so there doesn't have to be any
torn-page protection for that data, which is why the WAL logging can be
omitted.  The double-write implementation can be improved by marking each
buffer that doesn't need torn-page protection.  These buffers would be
those new pages that are explicitly not WAL-logged, even when
full_page_writes is enabled.  When such a buffer is eventually synced
(perhaps because of an eviction), it would not be double-written.  This
would often avoid double-writes for BULK_WRITE, etc., especially since
the administrator is often not archiving or doing replication when doing
bulk loads.
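
A tiny sketch of that idea (entirely hypothetical; neither the flag nor
the helper functions below exist in the patch or in PostgreSQL, they just
show the shape of the decision):

#include <stdbool.h>

#define BLCKSZ 8192
typedef struct { char data[BLCKSZ]; } Page;

/* hypothetical stand-ins for the two write paths */
static void copy_into_doublewrite_buffer(const Page *page) { (void) page; }
static void write_page_directly(const Page *page) { (void) page; }

/*
 * Pages that were WAL-logged need torn-page protection and go through the
 * double-write buffer; pages that were never WAL-logged (e.g. BULK_WRITE
 * data loaded with archiving and replication off) could be written
 * directly and skip the double write.
 */
static void
write_dirty_page(const Page *page, bool was_wal_logged)
{
    if (was_wal_logged)
        copy_into_doublewrite_buffer(page);
    else
        write_page_directly(page);
}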

However, overall, I think the idea is that double writes are an optional
optimization.  The user would only turn it on in existing configurations
where it helps or only slightly hurts performance, and where greatly
reducing the size of the WAL logs is beneficial.  It might also be
especially beneficial when there is a small amount of FLASH or other
kind of fast storage that the double-write files can be stored on.

Thanks,

Dan


- Original Message -
From: "Robert Haas" 
To: "Dan Scales" 
Cc: "PG Hackers" 
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales  wrote:
> Thanks for the feedback!  I think you make a good point about the small size 
> of dirty data in the OS cache.  I think what you can say about this 
> double-write patch is that it will not work well for configurations that 
> have a small Postgres cache and a large OS cache, since every write from the 
> Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL.  I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers.  But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway.  I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
 The entire database fits in shared_buffers, with most of it left
over.  However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly.  Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache trashing effects.  It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection.  And even if we did, that would just
delay the day of reckoning.  You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory.  Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS.  If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that.  Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the 

Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-03 Thread Dan Scales
Hi Robert,

Thanks for the feedback!  I think you make a good point about the small size of 
dirty data in the OS cache.  I think what you can say about this double-write 
patch is that it will not work well for configurations that have a small 
Postgres cache and a large OS cache, since every write from the Postgres cache 
requires double-writes and an fsync.  However, it should work much better for 
configurations with a much larger Postgres cache and relatively smaller OS cache 
(including the configurations that I've given performance results for).  In 
that case, there is a lot more capacity for dirty pages in the Postgres cache, 
and you won't have nearly as many dirty evictions.  The checkpointer is doing a 
good number of the writes, and this patch sorts the checkpointer's buffers so 
its IO is efficient.

Of course, I can also increase the size of the non-checkpointer ring buffer to 
be much larger, though I wouldn't want to make it too large, since it is 
consuming memory.  If I increase the size of the ring buffers significantly, I 
will probably need to add some data structures so that the ring buffer lookups 
in smgrread() and smgrwrite() are more efficient.

Can you let me know what the shared_buffers and RAM sizes were for your pgbench 
run?  I can try running the same workload.  If the size of shared_buffers is 
especially small compared to RAM, then we should increase the size of 
shared_buffers when using double_writes.

Thanks,

Dan 


- Original Message -
From: "Robert Haas" 
To: "Dan Scales" 
Cc: "PG Hackers" 
Sent: Thursday, February 2, 2012 7:19:47 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales  wrote:
> I've been prototyping the double-write buffer idea that Heikki and Simon
> had proposed (as an alternative to a previous patch that only batched up
> writes by the checkpointer).  I think it is a good idea, and can help
> double-writes perform better in the case of lots of backend evictions.
> It also centralizes most of the code change in smgr.c.  However, it is
> trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small
buffers.  AIUI, you must retained every page in the double-write
buffer until it's been written and fsync'd to disk.  That means the
most dirty data you'll ever be able to have in the operating system
cache with this implementation is (128 + 64) * 8kB = 1.5MB.  Granted,
we currently have occasional problems with the OS caching too *much*
dirty data, but that seems like it's going way, way too far in the
opposite direction.  That's barely enough for the system to do any
write reordering at all.

I am particularly worried about what happens when a ring buffer is in
use.  I tried running "pgbench -i -s 10" with this patch applied,
full_page_writes=off, double_writes=on.  It took 41.2 seconds to
complete.  The same test with the stock code takes 14.3 seconds; and
the actual situation is worse for double-writes than those numbers
might imply, because the index build time doesn't seem to be much
affected, while the COPY takes a small eternity with the patch
compared to the usual way of doing things.  I think the slowdown on
COPY once the double-write buffer fills is on the order of 10x.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PGCon 2012 Call for Papers - extension

2012-01-29 Thread Dan Langille
We apologize that http://www.bsdcan.org/ was offline for 12 hours from early 
Sunday morning.

The deadline for submissions has been extended to Tuesday 31 January.

PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of
Ottawa.  It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks.  Proposals can be quite 
simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

8 Jan 2012 Proposal acceptance begins
31 Jan 2012 Proposal acceptance ends
19 Feb 2012 Confirmation of accepted proposals

See also <http://www.pgcon.org/2012/papers.php>

Instructions for submitting a proposal to PGCon 2012 are available
from: <http://www.pgcon.org/2012/submissions.php>

-- 
Dan Langille - http://langille.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-27 Thread Dan Scales
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) 
is that it catches a bunch of page writes that don't go through the buffer pool 
(see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Also, I missed this before:  don't you want to add the checksum calculation 
(PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well?  
Otherwise, you won't be checksumming a bunch of the new pages.

Dan

- Original Message -
From: "Robert Haas" 
To: "Dan Scales" 
Cc: "Noah Misch" , "Heikki Linnakangas" 
, "Andres Freund" , 
"Kevin Grittner" , da...@fetter.org, 
ai...@highrise.ca, st...@mit.edu, pgsql-hackers@postgresql.org, "Simon Riggs" 

Sent: Friday, January 27, 2012 5:19:32 AM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales  wrote:
> I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) 
> to mdwrite() rather than smgrwrite().  If there were ever another storage 
> backend, it would have to duplicate the checksum check, right?  Is there a 
> disadvantage to putting it in smgrwrite()?

The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way.  It seems like the
right place for this is in some higher layer, like bufmgr.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PGCon 2012 Call for Papers - reminder

2012-01-26 Thread Dan Langille
A reminder, there are three days left to get in before the deadline…

PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of
Ottawa.  It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks.  Proposals can be quite 
simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

8 Jan 2012 Proposal acceptance begins
29 Jan 2012 Proposal acceptance ends
19 Feb 2012 Confirmation of accepted proposals

See also <http://www.pgcon.org/2012/papers.php>

Instructions for submitting a proposal to PGCon 2012 are available
from: <http://www.pgcon.org/2012/submissions.php>

-- 
Dan Langille - http://langille.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-26 Thread Dan Scales
Some other comments on the checksum patch:

I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) 
to mdwrite() rather than smgrwrite().  If there were ever another storage 
backend, it would have to duplicate the checksum check, right?  Is there a 
disadvantage to putting it in smgrwrite()?

You may have already noticed this, but a bunch of the comments are incorrect, 
now that you have moved the checksum calculation to mdwrite().

  config.sgml - says writes via temp_buffers (which I think means local 
buffers) are not checksummed -- that's no longer true, right?
  bufmgr.c, line 1914 - bufToWrite no longer exists.  You could revert changes 
from 1912-1920
  localbuf.c, line 203 - as mentioned below, this comment is no longer 
relevant, you are checksumming local buffers now


Dan

- Original Message -
From: "Noah Misch" 
To: "Simon Riggs" 
Cc: "Heikki Linnakangas" , "Robert Haas" 
, "Andres Freund" , "Kevin Grittner" 
, da...@fetter.org, ai...@highrise.ca, 
st...@mit.edu, pgsql-hackers@postgresql.org
Sent: Thursday, January 26, 2012 12:20:39 PM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Wed, Jan 11, 2012 at 10:12:31PM +, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.
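
A minimal sketch of idea (b), assuming a new pg_control field and made-up
helper names (none of this is from an actual patch):

/* At startup, after reading pg_control. */
void
MaybeSetChecksumEnableLSN(bool checksums_enabled)
{
    if (!checksums_enabled)
        ControlFile->checksumEnableLSN = InvalidXLogRecPtr;    /* assumed new field */
    else if (XLogRecPtrIsInvalid(ControlFile->checksumEnableLSN))
        ControlFile->checksumEnableLSN = GetInsertRecPtr();    /* "the next LSN" */
    UpdateControlFile();
}

/* On a checksum mismatch, decide how loudly to complain. */
int
ChecksumFailureElevel(XLogRecPtr pageLSN)
{
    /* Pages last written before checksums were enabled may carry torn hints. */
    if (XLByteLT(pageLSN, ControlFile->checksumEnableLSN))
        return WARNING;
    return ERROR;
}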

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data.  They would need a dump/reload
to get the performance of a never-checksummed database.  Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

> +   
> +Turning this parameter off speeds normal operation, but
> +might allow data corruption to go unnoticed. The checksum uses
> +16-bit checksums, using the fast Fletcher 16 algorithm. With this
> +parameter enabled there is still a non-zero probability that an error
> +could go undetected, as well as a non-zero probability of false
> +positives.
> +   

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
>  #include "postgres.h"
>  
>  #include "access/visibilitymap.h"
> +#include "access/transam.h"
>  #include "access/xact.h"
>  #include "access/xlogutils.h"
>  #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of 
> linked list */
>  /* XLOG gives us high 4 bits */
>  #define XLOG_SMGR_CREATE 0x10
>  #define XLOG_SMGR_TRUNCATE   0x20
> +#define XLOG_SMGR_HINT   0x40
>  
>  typedef struct xl_smgr_create
>  {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
>   smgrDoPendingDeletes(false);
>  }
>  
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or c

Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-17 Thread Dan Scales
We have some numbers for 9.2 runs with and without double writes now.  We
are still using the double-write patch that assumes checksums on data
pages, so checksums must be turned on for double writes.

The first set of runs consists of 50-warehouse 2-processor DBT2 60-minute runs,
with checkpoints every 5 minutes.  Machine memory is 8G, cache size is
5G.  Database size is about 9G.  The disks are enterprise Fibre Channel
disks, so there is good disk write-caching at the array.  All runs are
for virtual machines.  (We expect that the virtual machine numbers would
be representative of performance for non-virtual machines, but we know
that we need to get non-virtual numbers as well.)

              orig 9.2          |  9.2 + DW patch
              ------------------+-------------------------------------
              FPW off  FPW off  |  FPW off  FPW on  DW on/FPW off
              CK off   CK on    |  CK on    CK on   CK on

one disk:     15574    15308    |  15135    13337   13052   [5G shared_buffers, 8G RAM]
sep log disk: 18739    18134    |  18063    15823   16033

(First row is everything on one disk, second row is where the WAL log is
on a separate disk.)

So, in this case where cache is large and disks probably have
write-caching, we get about the same performance with full_page_writes on and
double-writes on.  We need to run these numbers more to get a good
average -- in some runs last night, double writes did better, closer to
what we were seeing with 9.0 (score of 17721 instead of 16033).

Note that, for one disk, there is no significant difference between the
original 9.2 code and the patched code with checksums (and double-writes)
turned off.  For two disks, there is a bigger difference (3.3%), but I'm
not sure that is really significant.

The second set of numbers is for a hard disk with write cache turned off,
closer to internal hard disks of servers (people were quite interested in
that result).  These runs are for 50-warehouse 8-processor DBT2 60-minute
runs, with checkpoints every 5 minutes.  The RAM size is 8G, and the cache
size is 6G.

              9.2 + DW patch
              -----------------------------------
              FPW off  FPW on  DW on/FPW off
              CK on    CK on   CK on
one disk:     12084    7849    9766    [6G shared_buffers, 8G RAM]

So, here we see a performance advantage for double writes where the cache
is large and the disks do not have write-caching.  Presumably, the cost
of fsyncing the big writes (with full pages) to the WAL log on a slow
disk is traded against the fsyncs of the double writes.

Third set of numbers is back to the first hardware setup, but with much
smaller shared_buffers.  Again, the runs are 50-warehouse 2-processor DBT2
60-minute runs, with checkpoints every 5 minutes.  But shared_buffers is
set to 1G, so there will be a great many more dirty evictions by the
backends.

              9.2 + DW patch
              -----------------------------------
              FPW off  FPW on  DW on/FPW off
              CK on    CK on   CK on
one disk:     11078    10394   3296    [1G shared_buffers, 8G RAM]
sep log disk: 13605    12015   3412

one disk:      7731     6613   2670    [1G shared_buffers, 2G RAM]
sep log disk:  6752     6129   2722

Here we see that double writes do very badly, because of all the double
writes being done for individual blocks by the backends.  With the small
shared cache, the backends are now writing 3 times as many blocks as the
checkpointer.

Clearly, the double write option would have to be completely optional,
available for use for database configurations which have a well-sized
cache.

It would still be preferable for performance not to fall off such a cliff
when dirty evictions become frequent, so, with that in mind, I am doing some
prototyping of the double-write buffer idea that folks have proposed on
this thread. 

Happy to hear all comments/suggestions.  Thanks,

Dan

- Original Message -----
From: "Dan Scales" 
To: "Heikki Linnakangas" 
Cc: "PG Hackers" , jks...@gmail.com, "David 
Fetter" 
Sent: Wednesday, January 11, 2012 1:25:21 PM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

Thanks for all the comments and suggestions on the double-write patch.  We are 
working on generating performance results for the 9.2 patch, but there is 
enough difference between 9.0 and 9.2 that it will take some time.

One thing in 9.2 that may be causing problems with the current patch is the 
fact that the checkpointer and bgwriter are separated and can run at the same 
time (I think), and therefore will contend on the double-write file.  Is there 
any thought that the bgwriter might be paused while the checkpointer is doing a 
checkpoint, since the checkpointer is doing some of the cleaning that the 
bgwriter wants to do anyways?

The current patch (as mentioned) also may not do well if there are a lot of 
dirty-page evictions by backends, because of th

[HACKERS] PGCon 2012 Call for Papers

2012-01-11 Thread Dan Langille
PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of
Ottawa.  It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks.  Proposals can be quite 
simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

 8 Jan 2012 Proposal acceptance begins
29 Jan 2012 Proposal acceptance ends
19 Feb 2012 Confirmation of accepted proposals

See also <http://www.pgcon.org/2012/papers.php>

Instructions for submitting a proposal to PGCon 2012 are available
from: <http://www.pgcon.org/2012/submissions.php>

-- 
Dan Langille - http://langille.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Dan Scales
Thanks for all the comments and suggestions on the double-write patch.  We are 
working on generating performance results for the 9.2 patch, but there is 
enough difference between 9.0 and 9.2 that it will take some time.

One thing in 9.2 that may be causing problems with the current patch is the 
fact that the checkpointer and bgwriter are separated and can run at the same 
time (I think), and therefore will contend on the double-write file.  Is there 
any thought that the bgwriter might be paused while the checkpointer is doing a 
checkpoint, since the checkpointer is doing some of the cleaning that the 
bgwriter wants to do anyways?

The current patch (as mentioned) also may not do well if there are a lot of 
dirty-page evictions by backends, because of the extra fsyncing just to write 
individual buffers.  I think Heikki's (and Simon's) idea of a growing shared 
double-write buffer (only doing double-writes when it gets to a certain size) 
instead is a great idea that could deal with the dirty-page eviction issue with 
less performance hit.  It could also deal with the checkpointer/bgwriter 
contention, if we can't avoid that.  I will think about that approach and any 
issues that might arise.  But for now, we will work on getting performance 
numbers for the current patch.

With respect to all the extra fsyncs, I agree they are expensive if done on 
individual buffers by backends.  For the checkpointer, there will be extra 
fsyncs, but the batching helps greatly, and the fsyncs per batch are traded off 
against the often large & unpredictable fsyncs at the end of checkpoints.  In 
our performance runs on 9.0, the configuration was such that there were not a 
lot of dirty evictions, and the checkpointer/bgwriter was able to finish the 
checkpoint on time, even with the double writes.

And just wanted to reiterate one other benefit of double writes -- it greatly 
reduces the size of the WAL logs.

Thanks,

Dan

- Original Message -
From: "Heikki Linnakangas" 
To: "David Fetter" 
Cc: "PG Hackers" , jks...@gmail.com
Sent: Wednesday, January 11, 2012 4:13:01 AM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

On 10.01.2012 23:43, David Fetter wrote:
> Please find attached a new revision of the double-write patch.  While
> this one still uses the checksums from VMware, it's been
> forward-ported to 9.2.
>
> I'd like to hold off on merging Simon's checksum patch into this one
> for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the 
checksum patches, please? That would make the patch smaller and easier 
to review, and it would allow benchmarking the performance impact of 
double-writes vs full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the 
double-write area first and the data files immediately after that. 
That's probably beneficial if you have a BBU, and/or a fairly large 
shared_buffers setting, so that pages don't get swapped between OS and 
PostgreSQL cache too much. But when those assumptions don't hold, it 
would be interesting to treat the double-write buffers more like a 2nd 
WAL for full-page images. Whenever a dirty page is evicted from 
shared_buffers, write it to the double-write area, but don't fsync it or 
write it back to the data file yet. Instead, let it sit in the 
double-write area, and grow the double-write file(s) as necessary, until 
the next checkpoint comes along.
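
One plausible reading of that idea, sketched with invented function names
(nothing here is from a posted patch):

/* On dirty eviction: append the page image to the double-write area only. */
static void
DoubleWriteOnEvict(Buffer buf)
{
    Page        page = BufferGetPage(buf);

    AppendToDoubleWriteArea(page);      /* hypothetical: sequential append, file grows as needed */
    /* no fsync and no write to the data file here; both wait for checkpoint */
}

/* At checkpoint time the ordering matters: DW area first, then data files. */
static void
FlushDoubleWritesAtCheckpoint(void)
{
    FsyncDoubleWriteArea();             /* hypothetical */
    WriteAndFsyncDataFiles();           /* hypothetical: the real page writes */
    RecycleDoubleWriteArea();           /* safe once the data files are durable */
}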

In general, I must say that I'm pretty horrified by all these extra 
fsync's this introduces. You really need a BBU to absorb them, and even 
then, you're fsyncing data files to disk much more frequently than you 
otherwise would.

Jignesh mentioned having run some performance tests with this. I would 
like to see those results, and some analysis and benchmarks of how 
settings like shared_buffers and the presence of BBU affect this, 
compared to full_page_writes=on and off.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] patch: fix SSI finished list corruption

2012-01-06 Thread Dan Ports
There's a corner case in the SSI cleanup code that isn't handled
correctly. It can arise when running workloads that are comprised
mostly (but not 100%) of READ ONLY transactions, and can corrupt the
finished SERIALIZABLEXACT list, potentially causing a segfault. The
attached patch fixes it.

Specifically, when the only remaining active transactions are READ
ONLY, we do a "partial cleanup" of committed transactions because
certain types of conflicts aren't possible anymore. For committed r/w
transactions, we release the SIREAD locks but keep the
SERIALIZABLEXACT. However, for committed r/o transactions, we can go
further and release the SERIALIZABLEXACT too. The problem was with the
latter case: we were returning the SERIALIZABLEXACT to the free list
without removing it from the finished list.

The only real change in the patch is the SHMQueueDelete line, but I
also reworked some of the surrounding code to make it obvious that r/o
and r/w transactions are handled differently -- the existing code felt
a bit too clever.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
>From 39f8462332f998d7363058adabac412c7654befe Mon Sep 17 00:00:00 2001
From: "Dan R. K. Ports" 
Date: Thu, 29 Dec 2011 23:11:35 -0500
Subject: [PATCH] Read-only SERIALIZABLEXACTs are completely released when
 doing partial cleanup, so remove them from the finished
 list. This prevents the finished list from being corrupted.
 Also make it more clear that read-only transactions are
 treated differently here.

---
 src/backend/storage/lmgr/predicate.c |   26 +++---
 1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index aefa698..c0b3cb4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3531,10 +3531,29 @@ ClearOldPredicateLocks(void)
 		else if (finishedSxact->commitSeqNo > PredXact->HavePartialClearedThrough
 		   && finishedSxact->commitSeqNo <= PredXact->CanPartialClearThrough)
 		{
+			/*
+			 * Any active transactions that took their snapshot before
+			 * this transaction committed are read-only, so we can
+			 * clear part of its state.
+			 */
 			LWLockRelease(SerializableXactHashLock);
-			ReleaseOneSerializableXact(finishedSxact,
-	   !SxactIsReadOnly(finishedSxact),
-	   false);
+
+			if (SxactIsReadOnly(finishedSxact))
+			{
+/* A read-only transaction can be removed entirely */
+SHMQueueDelete(&(finishedSxact->finishedLink));
+ReleaseOneSerializableXact(finishedSxact, false, false);
+			}
+			else
+			{
+/*
+ * A read-write transaction can only be partially
+ * cleared. We need to keep the SERIALIZABLEXACT but
+ * can release the SIREAD locks and conflicts in.
+ */
+ReleaseOneSerializableXact(finishedSxact, true, false);
+			}
+			
 			PredXact->HavePartialClearedThrough = finishedSxact->commitSeqNo;
 			LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 		}
@@ -3640,6 +3659,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 
 	Assert(sxact != NULL);
 	Assert(SxactIsRolledBack(sxact) || SxactIsCommitted(sxact));
+	Assert(partial || !SxactIsOnFinishedList(sxact));
 	Assert(LWLockHeldByMe(SerializableFinishedListLock));
 
 	/*
-- 
1.7.8.2


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and default_transaction_isolation

2011-11-29 Thread Dan Ports
On Tue, Nov 29, 2011 at 07:04:23PM -0500, Tom Lane wrote:
> Hmm.  Shouldn't we make the autovac launcher use READ COMMITTED, too?

Yeah, probably. That one doesn't seem so important because its
transactions aren't long-running (IIRC, it only starts a transaction to
scan pg_database). But it wouldn't hurt to keep it from pointlessly
registering a serializable transaction.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] autovacuum and default_transaction_isolation

2011-11-29 Thread Dan Ports
I was doing some tests recently with default_transaction_isolation set
to 'serializable' in postgresql.conf when I noticed pg_locks filling up
with SIReadLocks rather more quickly than I expected.

After some investigation, I found that an autovacuum worker was
starting a transaction at the default isolation level. While using a
serializable transaction doesn't affect its behavior (because it's not
using an MVCC snapshot), having a serializable transaction open prevents
other concurrent serializable transactions and their predicate locks
from being cleaned up. Since VACUUM on a large table can take a long
time, this could affect many concurrent transactions.

My one-liner fix for this was to set
  DefaultXactIsoLevel = XACT_READ_COMMITTED;
in AutoVacWorkerMain.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FlexLocks

2011-11-15 Thread Dan Ports
On Tue, Nov 15, 2011 at 10:55:49AM -0600, Kevin Grittner wrote:
> And I would be
> surprised if some creative thinking didn't yield a far better FL
> scheme for SSI than we can manage with existing LW locks.

One place I could see it being useful is for
SerializableFinishedListLock, which protects the queue of committed
sxacts that can't yet be cleaned up. When committing a transaction, it
gets added to the list, and then scans the queue to find and clean up
any sxacts that are no longer needed. If there's contention, we don't
need multiple backends doing that scan; it's enough for one to complete
it on everybody's behalf.

I haven't thought it through, but it may also help with the other
contention bottleneck on that lock: that every transaction needs to add
itself to the cleanup list when it commits.


Mostly unrelatedly, the other thing that looks like it would be really
useful is some support for atomic integer operations. This would
be useful for some SSI things like writableSxactCount, and some things
elsewhere like the strong lock count in the regular lock manager.
I've been toying with the idea of creating an AtomicInteger type with a
few operations like increment-and-get, compare-and-set, swap, etc. This
would be implemented using the appropriate hardware operations on
platforms that support them (x86_64, perhaps others) and fall back on a
spinlock implementation on other platforms. I'll probably give it a try
and see what it looks like, but if anyone has any thoughts, let me know.
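
A first sketch of that type, just to make the idea concrete
(HAVE_GCC_INT_ATOMICS is an invented configure symbol; the gcc __sync
builtins become LOCK-prefixed instructions on x86_64, and the spinlock
branch is the portable fallback):

#include "storage/spin.h"       /* slock_t, SpinLockAcquire/Release */

typedef struct
{
    volatile int32  value;
    slock_t         mutex;      /* used only by the fallback path; SpinLockInit'd at creation */
} AtomicInteger;

static inline int32
AtomicIncrementAndGet(AtomicInteger *a)
{
#ifdef HAVE_GCC_INT_ATOMICS
    return __sync_add_and_fetch(&a->value, 1);
#else
    int32       result;

    SpinLockAcquire(&a->mutex);
    result = ++a->value;
    SpinLockRelease(&a->mutex);
    return result;
#endif
}

static inline bool
AtomicCompareAndSet(AtomicInteger *a, int32 expected, int32 newval)
{
#ifdef HAVE_GCC_INT_ATOMICS
    return __sync_bool_compare_and_swap(&a->value, expected, newval);
#else
    bool        swapped = false;

    SpinLockAcquire(&a->mutex);
    if (a->value == expected)
    {
        a->value = newval;
        swapped = true;
    }
    SpinLockRelease(&a->mutex);
    return swapped;
#endif
}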

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-20 Thread Dan Ports
On Thu, Oct 20, 2011 at 07:33:59AM -0500, Kevin Grittner wrote:
> Dan Ports  wrote:
> > The part that's harder is building the list of potential conflicts
> > that's used to identify safe snapshots for r/o transactions. That
> > (currently) has to happen atomically with taking the snapshot.
>  
> That's not obvious to me; could you elaborate on the reasons?  If the
> commit sequence number is incremented under cover of an existing
> ProcArrayLock, and the current value is assigned to a snapshot under
> cover of same, acquiring SerializableXactHashLock after we get the
> snapshot seems to work for SxactGlobalXmin and WritableSxactCount,
> AFAICS.

Well, whenever a r/w transaction commits or aborts, we need to check
whether that caused any concurrent r/o transactions to have a
known-safe or unsafe snapshot. The way that happens now is that, when a
r/o transaction starts, it scans the list of r/w transactions and adds
a pointer to itself in their sxact->possibleUnsafeConflicts. When one
of them commits, it scans the list of possible conflicts and does the
appropriate thing.

That's not ideal because:

  - it requires modifying another transaction's sxact when registering a
serializable transaction, so that means taking
SerializableXactHashLock exclusive.

  - the set of concurrent transactions used to identify the possible
conflicts needs to match the one used to build the snapshot.
Otherwise, a transaction might commit between when the snapshot is
taken and when we find possible conflicts. (Holding
SerializableXactHashLock prevents this.)

> Yeah, I don't see how we can avoid taking a LW lock to get a
> serializable transaction which needs a SERIALIZABLEXACT set up, but
> it might be possible and worthwhile to split the uses of
> SerializableXactHashLock so that it's not such a hotspot.

Oh, right, one other problem is that the sxact free list is also
protected by SerializableXactHashLock, so allocating from it requires
locking. That one could be fixed by protecting it with its own lock, or
(better yet) eventually moving to a lock-free implementation.

In general, the contention problem is that SerializableXactHashLock
basically protects all SSI state except the predicate locks themselves
(notably, the dependency graph). This isn't partitioned at all, so
looking at or modifying a single sxact requires locking the whole
graph. I'd like to replace this with something finer-grained.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Update on documentation builds on OSX w/ macports

2011-10-19 Thread Dan Ports
On Thu, Oct 20, 2011 at 02:02:09AM +0200, Florian Pflug wrote:
> I've patched the ports for openjade, iso8879 and docbook-dsssl,
> and added a new port for docbook-sgml-4.2. These patches are sitting
> in the macports trac now, waiting to be applied.

I'll try to take a look at them in the next couple days (with my
MacPorts hat on), unless someone beats me to it.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
I think it would be fairly sensible to push some of this into
ProcArray, actually. The commit sequence numbers just involve assigning/
incrementing a global counter when taking a snapshot and finishing a
transaction -- that's not too much work, doesn't require any additional
locking beyond ProcArrayLock, and isn't too tied to SSI. (I could
imagine it being useful for other purposes, though I'm not going to
make that argument too seriously without actually having one in mind.)
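
For illustration, the counter itself could be about this simple (all of these
names are invented; none of these fields exist today):

/* Shared-memory counter, advanced only while ProcArrayLock is held exclusively. */
static volatile uint64 *nextCommitSeqNo;

/*
 * Transaction end: called from ProcArrayEndTransaction(), which already
 * holds ProcArrayLock exclusively when clearing the xid.
 */
static void
AssignCommitSeqNo(PGPROC *proc)
{
    proc->commitSeqNo = (*nextCommitSeqNo)++;           /* hypothetical PGPROC field */
}

/* Snapshot build: called from GetSnapshotData() under shared ProcArrayLock. */
static void
StampSnapshotSeqNo(Snapshot snapshot)
{
    snapshot->takenAfterCommitSeqNo = *nextCommitSeqNo; /* hypothetical field */
}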

SxactGlobalXmin and WritableSxactCount are obviously more SSI-specific,
but I think we can come up with something reasonable to do with them.

The part that's harder is building the list of potential conflicts
that's used to identify safe snapshots for r/o transactions. That
(currently) has to happen atomically with taking the snapshot. We'll
probably have to do this in some significantly different way, but I
haven't quite worked out what it is yet.

On the bright side, if we can address these three issues, we shouldn't
need to take SerializableXactHashLock at all when starting a
transaction. (Realistically, we might have to take it or some other
lock shared to handle one of them -- but I really want starting a
serializable xact to not take any exclusive locks.)

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
On Wed, Oct 19, 2011 at 05:04:52PM -0400, Tom Lane wrote:
> I wonder whether it would be prudent to set the synchronized-snapshots
> patch aside until you've finished that work (assuming you're actively
> working on it).  It's evidently going to require some nontrivial changes
> in predicate.c, and I don't think the feature should take precedence
> over SSI performance improvement.

I wouldn't hold the patch up on my account. Improving the SSI locking
situation looks to be a fairly substantial project. I've been drawing
up a plan to fix it, but I'm also travelling for most of the next two
weeks and probably won't be able to do any serious hacking on it until
I'm back to the office.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
On Wed, Oct 19, 2011 at 04:36:41PM -0400, Tom Lane wrote:
> No, the intention of the synchronized-snapshots feature is just to be
> able to start multiple transactions using exactly the same snapshot.
> They're independent after that.  The aspect of it that is worrying me
> is that if xact A starts, gets a snapshot and publishes it, and then
> xact B starts and wants to adopt that snapshot, then
> 
> (2) as things stand, xact A need not be running in serializable mode ---
> if B is serializable, does *that* break any assumptions?

[taking these in opposite order]

Yes, I think that's going to be a problem. The obvious case where it's
clearly not going to work is if A is older than the oldest active
serializable xact (i.e. SxactGlobalXmin would have to move backwards).
It's probably possible to make it work when that's not the case, but I
think it's better to require A to be serializable -- if nothing else,
it's a far simpler rule to document!

There is another case that could be problematic, if A was READ ONLY,
and B isn't. It sounds to me like that would also be a reasonable thing
to forbid.

> (1) other transactions may have started or ended meanwhile; does that
> break any of SSI's assumptions?

Mostly, no, if A is still running. There's one case that needs to be
handled a bit carefully, but shouldn't be a problem: if A was
SERIALIZABLE READ ONLY, and its snapshot was found to be safe, then
it's actually running (safely) at REPEATABLE READ. If we start a new
read-only transaction at the same snapshot, we need to make it run at
REPEATABLE READ if it requests SERIALIZABLE.

> We already have to have an interlock to ensure that GlobalXmin doesn't
> go backwards, by means of requiring A to still be running at the instant
> B adopts the snapshot and sets its MyProc->xmin accordingly.  However,
> there is not currently anything that guarantees that A is still running
> by the time B does GetSerializableTransactionSnapshotInt, shortly later.
> So if your answer to question (1) involves an assumption that A is still
> running, we're going to have to figure out how to arrange that without
> deadlocking on ProcArrayLock vs SerializableXactHashLock.

Yep, I think we're going to have to do that. I haven't had a chance to
look at the synchronized snapshots patch yet, so I can't (yet) offer
any suggestions about how to implement it.

> Which might
> be another good reason for changing predicate.c so that we don't hold
> the latter while taking a snapshot ...

It'd be great if we could do that, but I don't really see it being
possible...

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-19 Thread Dan Ports
On Wed, Oct 19, 2011 at 12:56:58PM -0400, Tom Lane wrote:
> Is it really necessary for GetSerializableTransactionSnapshotInt to
> acquire an empty SERIALIZABLEXACT before it acquires a snapshot?
> If so, why? 

*That* isn't necessary, no. It is necessary, however, to acquire the
snapshot while SerializableXactHashLock is held. There are a couple
reasons for this: the sxact's lastCommitBeforeSnapshot needs to match
the snapshot, SxactGlobalXmin needs to be set to the correct value,
etc. That's why the call to GetSnapshotData happens from where it does.

> The proposed synchronized-snapshots feature will mean
> that the allegedly-new snapshot actually was taken some time before,
> so it seems to me that either this is not necessary or we cannot use
> a synchronized snapshot in a serializable xact.

There are definitely potential problems here. If the original snapshot
doesn't belong to an active serializable transaction, we may have
discarded the state we need to do SSI, e.g. we might have already
cleaned up SIREAD locks from concurrent committed transactions.

I assume the answer here is going to have to be to either refuse to
start a serializable transaction if that's the case, or make saving a
snapshot inhibit some of the SSI cleanup.

> In the same vein, why is it necessary to be holding
> SerializableXactHashLock (exclusively, yet) while acquiring the
> snapshot?  That seems rather bad from a concurrency standpoint, and
> again it's going to be pretty meaningless if we're just installing a
> pre-existing snapshot.

Yes, it's bad. I'm working on a design to address
SerializableXactHashLock contention, but there needs to be some locking
here for the reasons I mentioned above. I think the best we can do here
is to acquire a lock in shared mode when registering a serializable
transaction and in exclusive mode when committing. (Which is what you'd
expect, I guess; it's the same story as ProcArrayLock, and for most of
the same reasons.) Obviously, we'll also want to minimize the amount of
work we're doing while holding that lock.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Overhead cost of Serializable Snapshot Isolation

2011-10-10 Thread Dan Ports
On Mon, Oct 10, 2011 at 04:10:18PM -0500, Kevin Grittner wrote:
> Did you ever see much contention on
> SerializablePredicateLockListLock, or was it just
> SerializableXactHashLock?  I think the former might be able to use
> the non-blocking techniques, but I fear the main issue is with the
> latter, which seems like a harder problem.

No, not that I recall -- if SerializablePredicateLockListLock was on
the list of contended locks, it was pretty far down.

SerializableXactHashLock was the main bottleneck, and
SerializableXactFinishedListLock was a lesser but still significant
one.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Overhead cost of Serializable Snapshot Isolation

2011-10-10 Thread Dan Ports
On Mon, Oct 10, 2011 at 02:59:04PM -0500, Kevin Grittner wrote:
> I do have some concern about whether the performance improvements
> from reduced LW locking contention elsewhere in the code may (in
> whack-a-mole fashion) cause the percentages to go higher in SSI. 
> The biggest performance issues in some of the SSI benchmarks were on
> LW lock contention, so those may become more noticeable as other
> contention is reduced.  I've been trying to follow along on the
> threads regarding Robert's work in that area, with hopes of applying
> some of the same techniques to SSI, but it's not clear whether I'll
> have time to work on that for the 9.2 release.  (It's actually
> looking improbable at this point.)

I spent some time thinking about this a while back, but didn't have
time to get very far. The problem isn't contention in the predicate
lock manager (which is partitioned) but the single lock protecting the
active SerializableXact state.

It would probably help things a great deal if we could make that lock
more fine-grained. However, it's tricky to do this without deadlocking
because the serialization failure checks need to examine a node's
neighbors in the dependency graph.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Overhead cost of Serializable Snapshot Isolation

2011-10-10 Thread Dan Ports
On Mon, Oct 10, 2011 at 02:25:59PM -0400, Greg Sabino Mullane wrote:
> I agree it is better versus SELECT FOR, but what about repeatable read versus
> the new serializable? How much overhead is there in the 'monitoring of
> read/write dependencies'? This is my only concern at the moment. Are we 
> talking insignificant overhead? Minor? Is it measurable? Hard to say without 
> knowing the number of txns, number of locks, etc.?

I'd expect that in most cases the main cost is not going to be overhead
from the lock manager but rather the cost of having transactions
aborted due to conflicts. (But the rollback rate is extremely
workload-dependent.)

We've seen CPU overhead from the lock manager to be a few percent on a
CPU-bound workload (in-memory pgbench). Also, if you're using a system
with many cores and a similar workload, SerializableXactHashLock might
become a scalability bottleneck.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-09-26 Thread Dan McGee
On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas  wrote:
> On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee  wrote:
>> [ patch ]
>
> I suppose it's Tom who really needs to comment on this, but I'm not
> too enthusiastic about this approach.  Duplicating the Linux kernel
> calculation into our code means that we could drift if the formula
> changes again.
Why would the formula ever change? This seems like a different way of
really saying you don't appreciate the hacky approach, which I
can understand completely. However, it simply doesn't make sense for
them to change this formula, as it scales the old -17 to 15 range to
the new -1000 to 1000 range. Those endpoints won't be changing unless
a third method is introduced, in which case this whole thing is moot
and we'd need to fix it yet again.

> I like Tom's previous suggestion (I think) of allowing both constants
> to be defined - if they are, then we try oom_score_adj first and fall
> back to oom_adj if that fails.  For bonus points, we could have
> postmaster stat() its own oom_score_adj file before forking and set a
> global variable to indicate the results.  That way we'd only ever need
> to test once per postmaster startup (at least until someone figures
> out a way to swap out the running kernel without stopping the
> server...!).
This would be fine, but it seems unreasonably complicated, not to
mention unnecessary. I might as well point this out [1]. I don't think
a soul out there has built without defining this to 0 (if they define
it at all), and not even that many people are using it. Is it all that
bad of an idea to just force it to 0 for both knobs and be done with
it?

-Dan McGee

[1] http://www.google.com/codesearch#search/&q=LINUX_OOM_ADJ=&type=cs
- Slackware and Fedora are the only hits not in the PG codebase, and
both define it to 0.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] POC: inline int4 comparison in tuplesort

2011-09-21 Thread Dan McGee
This attempts to be as simple as it gets while reducing function call
depth, and should be viewed as a proof of concept. It is also untested
as of now, but I will try to do that and report back.

I'm hoping I followed the rabbit hole correctly and am correctly
comparing the right pointers to each other in order to short circuit the
case where we are using the int4 comparison operator.

Peter, if you want to compare stock vs. your patch vs. this patch, we might
be able to get some sort of read on where the maintainability vs. performance
curve lies. Note that this version should still allow sorting of anything,
and simply shifts gears for int4 tuples...

---
 src/backend/utils/sort/tuplesort.c |   23 +--
 1 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..ddd5ced 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2652,6 +2652,22 @@ myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
return result;
 }
 
+static inline
+int int4cmp(Datum first, Datum second)
+{
+   int32   a = DatumGetInt32(first);
+   int32   b = DatumGetInt32(second);
+
+   if (a > b)
+   return 1;
+   else if (a == b)
+   return 0;
+   else
+   return -1;
+}
+
+extern Datum btint4cmp(PG_FUNCTION_ARGS);
+
 /*
  * Apply a sort function (by now converted to fmgr lookup form)
  * and return a 3-way comparison result.  This takes care of handling
@@ -2683,8 +2699,11 @@ inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
}
else
{
-   compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
-                                               datum1, datum2));
+   if (sortFunction->fn_addr == btint4cmp)
+       compare = int4cmp(datum1, datum2);
+   else
+       compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
+                                                   datum1, datum2));
 
if (sk_flags & SK_BT_DESC)
compare = -compare;
-- 
1.7.6.3


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-09-19 Thread Dan McGee
On Mon, Sep 19, 2011 at 3:11 PM, Dan McGee  wrote:
> This is one way to prevent the kernel warning message without having to
> introduce a new constant. Scale the old oom_adj-style value the same way
> the kernel internally does it and use that instead if oom_score_adj is
> available for writing.
>
> Signed-off-by: Dan McGee 
> ---
>
> This addresses some of the concerns raised on the ML and will at least keep
> those of us running modern kernels happy.
>
> Alternatively one could switch the symbol used to be the new style and have 
> the
> old one computed; however this is a pain for legacy vs. current versions.
>
>  src/backend/postmaster/fork_process.c |   22 +-
>  1 files changed, 21 insertions(+), 1 deletions(-)
>
> diff --git a/src/backend/postmaster/fork_process.c 
> b/src/backend/postmaster/fork_process.c
> index b2fe9a1..3cded54 100644
> --- a/src/backend/postmaster/fork_process.c
> +++ b/src/backend/postmaster/fork_process.c
> @@ -81,16 +81,36 @@ fork_process(void)
>                         * Use open() not stdio, to ensure we control the open 
> flags. Some
>                         * Linux security environments reject anything but 
> O_WRONLY.
>                         */
> -                       int                     fd = 
> open("/proc/self/oom_adj", O_WRONLY, 0);
> +                       int                     fd = 
> open("/proc/self/oom_score_adj", O_WRONLY, 0);
>
>                        /* We ignore all errors */
>                        if (fd >= 0)
>                        {
>                                char            buf[16];
> +                               int             oom_score_adj;
>
> +                               /*
> +                                * The compile-time value is the old style 
> oom_adj;
> +                                * scale it the same way the kernel does to
> +                                * convert to the new style oom_score_adj. 
> This
> +                                * should become a constant at compile time.
> +                                * Valid values range from -17 (never kill) to
> +                                * 15; no attempt of validation is done.
> +                                */
> +                               oom_score_adj = LINUX_OOM_ADJ * 1000 / 17;
>                                snprintf(buf, sizeof(buf), "%d\n", 
> LINUX_OOM_ADJ);
Of course it would help to actually use the computed value here; this should be:
snprintf(buf, sizeof(buf), "%d\n",
oom_score_adj);

>                                (void) write(fd, buf, strlen(buf));
>                                close(fd);
> +                       } else if (errno == EEXIST) {
> +                               int             fd = 
> open("/proc/self/oom_adj", O_WRONLY, 0);
> +                               if (fd >= 0)
> +                               {
> +                                       char    buf[16];
> +
> +                                       snprintf(buf, sizeof(buf), "%d\n", 
> LINUX_OOM_ADJ);
> +                                       (void) write(fd, buf, strlen(buf));
> +                                       close(fd);
> +                               }
>                        }
>                }
>  #endif   /* LINUX_OOM_ADJ */
> --
> 1.7.6.1
>
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-09-19 Thread Dan McGee
This is one way to prevent the kernel warning message without having to
introduce a new constant. Scale the old oom_adj-style value the same way
the kernel internally does it and use that instead if oom_score_adj is
available for writing.

Signed-off-by: Dan McGee 
---

This addresses some of the concerns raised on the ML and will at least keep
those of us running modern kernels happy.

Alternatively one could switch the symbol used to be the new style and have the
old one computed; however this is a pain for legacy vs. current versions.

 src/backend/postmaster/fork_process.c |   22 +-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c
index b2fe9a1..3cded54 100644
--- a/src/backend/postmaster/fork_process.c
+++ b/src/backend/postmaster/fork_process.c
@@ -81,16 +81,36 @@ fork_process(void)
 			 * Use open() not stdio, to ensure we control the open flags. Some
 			 * Linux security environments reject anything but O_WRONLY.
 			 */
-			int			fd = open("/proc/self/oom_adj", O_WRONLY, 0);
+			int			fd = open("/proc/self/oom_score_adj", O_WRONLY, 0);
 
 			/* We ignore all errors */
 			if (fd >= 0)
 			{
 				char		buf[16];
+				int			oom_score_adj;
 
+				/*
+				 * The compile-time value is the old style oom_adj;
+				 * scale it the same way the kernel does to
+				 * convert to the new style oom_score_adj. This
+				 * should become a constant at compile time.
+				 * Valid values range from -17 (never kill) to
+				 * 15; no attempt of validation is done.
+				 */
+				oom_score_adj = LINUX_OOM_ADJ * 1000 / 17;
 				snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
 				(void) write(fd, buf, strlen(buf));
 				close(fd);
+			} else if (errno == EEXIST) {
+				int			fd = open("/proc/self/oom_adj", O_WRONLY, 0);
+				if (fd >= 0)
+				{
+					char		buf[16];
+
+					snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
+					(void) write(fd, buf, strlen(buf));
+					close(fd);
+				}
 			}
 		}
 #endif   /* LINUX_OOM_ADJ */
-- 
1.7.6.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dan Ports
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
> 1. Machines with strong memory ordering.  On this category of machines
> (which include x86), the CPU basically does not perform loads or
> stores out of order.  On some of these machines, it is apparently
> possible for there to be some ordering of stores relative to loads,
> but if the program stores two values or loads two values, those
> operations will be performed in the same order they appear in the
> program. 

This is all correct, but...

> The main thing you need to make your code work reliably on
> these machines is a primitive that keeps the compiler from reordering
> your code during optimization. 

If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.

Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
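
To make that concrete, here is a self-contained sketch of the pattern (written
with C11 atomics for brevity; it is not PostgreSQL code). Without the two
fences, both threads can observe 0 on x86, because each store may still be
sitting in the store buffer when the other thread's load executes:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int flag0, flag1;
static int saw0, saw1;

static void *
thread0(void *arg)
{
	(void) arg;
	atomic_store_explicit(&flag0, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the barrier x86 still needs */
	saw0 = atomic_load_explicit(&flag1, memory_order_relaxed);
	return NULL;
}

static void *
thread1(void *arg)
{
	(void) arg;
	atomic_store_explicit(&flag1, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	saw1 = atomic_load_explicit(&flag0, memory_order_relaxed);
	return NULL;
}

int
main(void)
{
	pthread_t	t0, t1;

	pthread_create(&t0, NULL, thread0, NULL);
	pthread_create(&t1, NULL, thread1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* With the fences, at least one thread must see the other's flag set. */
	printf("saw0=%d saw1=%d\n", saw0, saw1);
	return 0;
}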

> 2. Machines with weak memory ordering.  On this category of machines
> (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
> reorders memory accesses arbitrarily unless you explicitly issue
> instructions that enforce synchronization.  You still need to keep the
> compiler from moving things around, too.  Alpha is particularly
> pernicious, because something like a->b can fetch the pointed-to value
> before loading the pointer itself.  This is otherwise known as "we
> have basically no cache coherency circuits on this chip at all".  On
> these machines, you need to issue an explicit memory barrier
> instruction at each sequence point, or just acquire and release a
> spinlock.

The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).
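
A sketch of the dependent-load case (generic C, not from the tree; the types
and the use of __sync_synchronize() as the barrier are assumptions for the
example). One side publishes a structure through a shared pointer, the other
loads the pointer and dereferences it:

typedef struct
{
	int			value;
} Thing;

static Thing *volatile published;	/* written by one side, read by the other */

void
publish(Thing *t)
{
	t->value = 42;
	__sync_synchronize();	/* write barrier: fill in the struct before exposing it */
	published = t;
}

int
consume(void)
{
	Thing	   *t = published;

	if (t == NULL)
		return -1;
#ifdef __alpha__
	__sync_synchronize();	/* only the Alpha may reorder the dependent load */
#endif
	return t->value;
}

On everything except the Alpha the data dependency already orders the two
loads, so the conditional barrier compiles to nothing -- which is exactly the
read_barrier_depends()-style behavior described above.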

Alternatively, one might question how important it is to support the
Alpha these days...

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI atomic commit

2011-07-07 Thread Dan Ports
We should also apply the attached patch, which corrects a minor issue
with the conditions for flagging transactions that could potentially
make a snapshot unsafe.

There's a small window wherein a transaction is committed but not yet
on the finished list, and we shouldn't flag it as a potential conflict
if so. We can also skip adding a doomed transaction to the list of
possible conflicts because we know it won't commit.

This is not really a related issue, but Kevin and I found it while
looking into this issue, and it was included in the patch we sent out.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***
*** 1669,1676  RegisterSerializableTransactionInt(Snapshot snapshot)
  			 othersxact != NULL;
  			 othersxact = NextPredXact(othersxact))
  		{
! 			if (!SxactIsOnFinishedList(othersxact) &&
! 				!SxactIsReadOnly(othersxact))
  			{
  SetPossibleUnsafeConflict(sxact, othersxact);
  			}
--- 1676,1684 
  			 othersxact != NULL;
  			 othersxact = NextPredXact(othersxact))
  		{
! 			if (!SxactIsCommitted(othersxact)
! 				&& !SxactIsDoomed(othersxact)
! 				&& !SxactIsReadOnly(othersxact))
  			{
  SetPossibleUnsafeConflict(sxact, othersxact);
  			}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI atomic commit

2011-07-07 Thread Dan Ports
On Thu, Jul 07, 2011 at 04:48:55PM -0400, Tom Lane wrote:
> Seems to me there's a more fundamental reason not to do that, which is
> that once you've done PREPARE it is no longer legitimate to decide to
> roll back the transaction to get out of a "dangerous" structure --- ie,
> you have to target one of the other xacts involved instead.  Shouldn't
> the assignment of a prepareSeqNo correspond to the point where the xact
> is no longer a target for SSI rollback?

That part is already accomplished by setting SXACT_FLAG_PREPARED (and
choosing a new victim if we think we want to abort a transaction with
that flag set).

prepareSeqNo is being used as a lower bound on the transaction's commit
sequence number. It's currently set at the same time as the PREPARED
flag, but it doesn't have to be.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI 2PC coverage

2011-07-05 Thread Dan Ports
On Tue, Jul 05, 2011 at 09:14:30PM +0300, Heikki Linnakangas wrote:
> I think that needs some explanation, why only those SxactIsCommitted() 
> tests need to be replaced with SxactIsPrepared()?

Here is the specific problem this patch addresses:

If there's a dangerous structure T0 ---> T1 ---> T2, and T2 commits
first, we need to abort something. If T2 commits before both conflicts
appear, then it should be caught by
OnConflict_CheckForSerializationFailure. If both conflicts appear
before T2 commits, it should be caught by
PreCommit_CheckForSerializationFailure. But that is actually run before
T2 *prepares*.

So the problem occurs if T2 is prepared but not committed when the
second conflict is detected. OnConflict_CFSF deems that OK, because T2
hasn't committed yet. PreCommit_CFSF doesn't see a problem either,
because the conflict didn't exist when it ran (before the transaction
prepared).

This patch fixes that by having OnConflict_CFSF declare a serialization
failure in this circumstance if T2 is already prepared, not just if
it's committed.

Although it fixes the situation described above, I wasn't initially
confident that there weren't other problematic cases. I wrote the
attached test case to convince myself of that. Because it tests all
possible sequences of conflict/prepare/commit that should lead to
serialization failure, the fact that they do all fail (with this patch)
indicates that these are the only checks that need to be changed.

> What about the first 
> SxactIsCommitted() test in OnConflict_CheckForSerializationFailure(), 
> for instance?

That one only comes up if the SERIALIZABLEXACT for one of the
transactions involved has been freed, and the RWConflict that points to
it has been replaced by a flag. That only happens if "writer" has
previously called ReleasePredicateLocks.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI atomic commit

2011-07-05 Thread Dan Ports
On Tue, Jul 05, 2011 at 01:15:13PM -0500, Kevin Grittner wrote:
> Heikki Linnakangas  wrote:
>  
> > Hmm, I think it would be simpler to decide that instead of 
> > SerializableXactHashLock, you must hold ProcArrayLock to access 
> > LastSxactCommitSeqNo, and move the assignment of commitSeqNo to 
> > ProcArrayTransaction(). It's probably easiest to move 
> > LastSxactCommitSeqno to ShmemVariableCache too. There's a few
> > places that would then need to acquire ProcArrayLock to read 
> > LastSxactCommitSeqno, but I feel it might still be much simpler
> > that way.
>  
> We considered that.  I think the biggest problem was that when there
> is no XID it wouldn't be covered by the lock on assignment.

One other issue is that after the sequence number is assigned, it still
needs to be stored in MySerializableXact->commitSeqNo. Modifying that
does require taking SerializableXactHashLock.

With the proposed patch, assigning the next commitSeqNo and storing it
in MySerializableXact happen atomically. That makes it possible to say
that a transaction that has a commitSeqNo must have committed before
one that doesn't. If the two steps are separated, that isn't true: two
transactions might get their commitSeqNos in one order and make them
visible in the other. We should be able to deal with that, but it will
make some of the commit ordering checks more complicated.
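
For illustration, a stripped-down sketch of the invariant being preserved
(generic C; a pthread mutex stands in for SerializableXactHashLock and the
names are made up, not the real data structures):

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t seqno_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t next_commit_seqno = 1;

typedef struct
{
	uint64_t	commitSeqNo;	/* 0 means "not yet visible as committed" */
} FakeSxact;

static void
record_commit(FakeSxact *sxact)
{
	pthread_mutex_lock(&seqno_lock);
	sxact->commitSeqNo = next_commit_seqno++;	/* draw and publish in one step */
	pthread_mutex_unlock(&seqno_lock);
}

If the increment and the assignment were done under different locks, one
transaction could draw a smaller number than another yet publish it later,
and "has a commitSeqNo" would no longer imply "committed earlier".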

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade version check improvements and small fixes

2011-06-24 Thread Dan McGee
On Wed, Jun 22, 2011 at 8:54 PM, Bruce Momjian  wrote:
>> 0003 is what I really wanted to solve, which was my failure with
>> pg_upgrade. The call to pg_ctl didn't succeed because the binaries
>> didn't match the data directory, thus resulting in this:
>>
>> The error had nothing to do with "trust" at all; it was simply that I
>> tried to use 9.0 binaries with an 8.4 data directory. My patch checks
>> for this and ensures that the -B bindir is the correct version, just
>> as the -D datadir has to be the correct version.
>
> I had not thought about testing if the bin and data directory were the
> same major version, but you are right that it generates an odd error if
> they are not.
>
> I changed your sscanf format string because the version can be just
> "9.2dev" and there is no need for the minor version.
No problem- you're going to know way more about this than me, and I
was just going off what I saw in my two installed versions and a quick
look over at the code of pg_ctl to make sure that string was never
translated.

On a side note I don't think I ever mentioned to everyone else why
parsing the version from pg_ctl rather than pg_config was done- this
is so we don't introduce any additional binary requirements. Tom Lane
noted in an earlier cleanup series that pg_config is not really needed
at all in the old cluster, so I wanted to keep it that way, but pg_ctl
has always been required.
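
For illustration, the parsing itself is the easy part; something along these
lines (a sketch only -- the banner text and format string here are
assumptions, not the committed code):

#include <stdio.h>

/* e.g. "pg_ctl (PostgreSQL) 8.4.8" or "pg_ctl (PostgreSQL) 9.2devel" */
static int
version_from_banner(const char *line)
{
	int			major = 0,
				minor = 0;

	if (sscanf(line, "%*s %*s %d.%d", &major, &minor) < 1)
		return -1;				/* unrecognized banner */
	return major * 100 + minor;	/* 8.4 -> 804, 9.2devel -> 902 */
}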

>   I saw no reason
> to test if the binary version matches the pg_upgrade version because we
> already test the cluster version, and we test the cluster version is the
> same as the binary version.
Duh. That makes sense. Thanks for getting to these so quickly!

To partially toot my own horn but also show where a user like me
encountered this, after some packaging hacking, anyone running Arch
Linux should be able to do a pg_upgrade from here on out by installing
the postgresql-old-upgrade package
(http://www.archlinux.org/packages/extra/x86_64/postgresql-old-upgrade/)
and possibly consulting the Arch wiki.

-Dan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Repeated PredicateLockRelation calls during seqscan

2011-06-22 Thread Dan Ports
On Wed, Jun 22, 2011 at 12:07:04PM +0300, Heikki Linnakangas wrote:
> Hmm, I wonder if we should move this logic to heapam.c. The optimization 
> to acquire a relation lock straight away should apply to all heap scans, 
> not only those coming from ExecSeqScan. The distinction is academic at 
> the moment, because that's the only caller that uses a regular MVCC 
> snapshot, but it seems like a modularity violation for nodeSeqscan.c to 
> reach into the HeapScanDesc to set the flag and grab the whole-relation 
> lock, while heapam.c contains the PredicateLockTuple and 
> CheckForSerializableConflictOut() calls.

On modularity grounds, I think that's a good idea. The other
PredicateLock* calls live in the access methods: heapam, nbtree, and
indexam for the generic index support. heap_beginscan_internal seems
like a reasonable place, as long as we're OK with taking the lock even
if the scan is initialized but never called.

Note that this hadn't been a reasonable option until last week when we
added the check for non-MVCC snapshots, since there are lots of things
that use heap scans but SeqScan is the only (currently-existing) one we
want to lock.

I am rather uneasy about making changes here unless we can be
absolutely certain they're right...

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_upgrade version check improvements and small fixes

2011-06-21 Thread Dan McGee
Not sure what the normal process is for patches, but I put together a
few small patches for pg_upgrade after trying to use it earlier today
and staring at a non-helpful error message before I finally figured out
what was going on.

0001 is just a simple typo fix, but I didn't want to mix it in with the rest.
0002 moves a function around to be declared in the only place it is
needed, and prevents a "sh: /oasdfpt/pgsql-8.4/bin/pg_config: No such
file or directory" error message when you give it a bogus bindir.

0003 is what I really wanted to solve, which was my failure with
pg_upgrade. The call to pg_ctl didn't succeed because the binaries
didn't match the data directory, thus resulting in this:

$ pg_upgrade --check -d /tmp/olddata -D /tmp/newdata -b /usr/bin/ -B /usr/bin/
Performing Consistency Checks
-
Checking old data directory (/tmp/olddata)  ok
Checking old bin directory (/usr/bin)   ok
Checking new data directory (/tmp/newdata)  ok
Checking new bin directory (/usr/bin)   ok
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Trying to start old server  .ok

 Unable to start old postmaster with the command: "/usr/bin/pg_ctl" -l
"/dev/null" -D "/tmp/olddata" -o "-p 5432 -c autovacuum=off -c
autovacuum_freeze_max_age=20" start >> "/dev/null" 2>&1
Perhaps pg_hba.conf was not set to "trust".

The error had nothing to do with "trust" at all; it was simply that I
tried to use 9.0 binaries with an 8.4 data directory. My patch checks
for this and ensures that the -B bindir is the correct version, just
as the -D datadir has to be the correct version.

I'm not on the mailing list nor do I have a lot of free time to keep
up with normal development, but if there are quick things I can do to
get these patches in let me know.

-Dan
From 840bdd22b62c8d45796abf7eb9e7b3da0329dce8 Mon Sep 17 00:00:00 2001
From: Dan McGee 
Date: Tue, 21 Jun 2011 18:48:01 -0500
Subject: [PATCH 1/3] pg_upgrade: fix typo in consistency check message

---
 contrib/pg_upgrade/check.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
index 376d25a..2b481da 100644
--- a/contrib/pg_upgrade/check.c
+++ b/contrib/pg_upgrade/check.c
@@ -30,7 +30,7 @@ output_check_banner(bool *live_check)
 		if (old_cluster.port == new_cluster.port)
 			pg_log(PG_FATAL, "When checking a live server, "
    "the old and new port numbers must be different.\n");
-		pg_log(PG_REPORT, "PerForming Consistency Checks on Old Live Server\n");
+		pg_log(PG_REPORT, "Performing Consistency Checks on Old Live Server\n");
 		pg_log(PG_REPORT, "\n");
 	}
 	else
-- 
1.7.5.4

From f3e393318ba36ef543f77fb8983902d466ebe8c8 Mon Sep 17 00:00:00 2001
From: Dan McGee 
Date: Tue, 21 Jun 2011 18:49:47 -0500
Subject: [PATCH 2/3] pg_upgrade: remove libpath from cluster info struct

We only use this item in one check and then no longer need it.
Additionally, get_pkglibdir() is currently run before we do our checks
on the bin/ directory, so an incorrectly specified bin/ directory will
evoke failures at the "wrong" point.

Move the entire function into the file that uses it so it can remain
static.
---
 contrib/pg_upgrade/check.c  |   35 +-
 contrib/pg_upgrade/option.c |   45 ---
 contrib/pg_upgrade/pg_upgrade.h |1 -
 3 files changed, 34 insertions(+), 47 deletions(-)

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
index 2b481da..5ff2d81 100644
--- a/contrib/pg_upgrade/check.c
+++ b/contrib/pg_upgrade/check.c
@@ -19,6 +19,7 @@ static void check_is_super_user(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
+static char *get_pkglibdir(const char *bindir);
 
 
 void
@@ -246,14 +247,17 @@ void
 check_cluster_compatibility(bool live_check)
 {
 	char		libfile[MAXPGPATH];
+	char	   *libpath;
 	FILE	   *lib_test;
 
 	/*
 	 * Test pg_upgrade_support.so is in the proper place.	 We cannot copy it
 	 * ourselves because install directories are typically root-owned.
 	 */
-	snprintf(libfile, sizeof(libfile), "%s/pg_upgrade_support%s", new_cluster.libpath,
+	libpath = get_pkglibdir(new_cluster.bindir);
+	snprintf(libfile, sizeof(libfile), "%s/pg_upgrade_support%s", libpath,
 			 DLSUFFIX);
+	pg_free(libpath);
 
 	if ((lib_test = fopen(libfile, "r")) == NULL)
 		pg_log(PG_FATAL,
@@ -730,3 +734,32 @@ check_for_reg_data_type_usag

Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-21 Thread Dan Ports
On Wed, Jun 22, 2011 at 01:31:11AM -0400, Dan Ports wrote:
> Yes, I suspect it can be done better. The reason it's tricky is a lock
> ordering issue; part of releasing a SerializableXact has to be done
> while holding SerializableXactHashLock and part has to be done without
> it (because it involves taking partition locks). Reworking the order of
> these things might work, but would require some careful thought since
> most of the code is shared with the non-abort cleanup paths. And yes,
> it's definitely the time for that.

...by which I mean it's definitely *not* the time for that, of course.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-21 Thread Dan Ports
On Tue, Jun 21, 2011 at 03:01:48PM +0300, Heikki Linnakangas wrote:
> Thanks, committed.

Thanks.

> In the long term, I'm not sure this is the best way to handle this. It 
> feels a bit silly to set the flag, release the SerializableXactHashLock, 
> and reacquire it later to remove the transaction from the hash table. 
> Surely that could be done in some more straightforward way. But I don't 
> feel like fiddling with it this late in the release cycle.

Yes, I suspect it can be done better. The reason it's tricky is a lock
ordering issue; part of releasing a SerializableXact has to be done
while holding SerializableXactHashLock and part has to be done without
it (because it involves taking partition locks). Reworking the order of
these things might work, but would require some careful thought since
most of the code is shared with the non-abort cleanup paths. And yes,
it's definitely the time for that.

I've been meaning to take another look at this part of the code anyway,
with an eye towards performance. SerializableXactHashLock can be a
bottleneck on certain in-memory workloads.

> > One of the prepared_xacts regression tests actually hits this bug.
> > I removed the anomaly from the duplicate-gids test so that it fails in
> > the intended way, and added a new test to check serialization failures
> > with a prepared transaction.
> 
> Hmm, I have ran "make installcheck" with 
> default_transaction_isolation='serializable' earlier. I wonder why I 
> didn't see that.

The affected test was being run at SERIALIZABLE already, so that
wouldn't have made a difference. One reason I didn't notice an issue
when I looked at that test before before, is that it was intended to
fail anyway, just with a different error. I didn't think too hard about
which error would take precedence.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Repeated PredicateLockRelation calls during seqscan

2011-06-21 Thread Dan Ports
I was looking at ExecSeqScan today and noticed that it invokes
PredicateLockRelation each time it's called, i.e. for each tuple
returned. Any reason we shouldn't skip that call if
rs_relpredicatelocked is already set, as in the attached patch?

That would save us a bit of overhead, since checking that flag is
cheaper than doing a hash lookup in the local predicate lock table
before bailing out.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f356874..32a8f56 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -113,9 +113,13 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 TupleTableSlot *
 ExecSeqScan(SeqScanState *node)
 {
-	PredicateLockRelation(node->ss_currentRelation,
-		  node->ss_currentScanDesc->rs_snapshot);
-	node->ss_currentScanDesc->rs_relpredicatelocked = true;
+	if (!node->ss_currentScanDesc->rs_relpredicatelocked)
+	{
+		PredicateLockRelation(node->ss_currentRelation,
+			  node->ss_currentScanDesc->rs_snapshot);
+		node->ss_currentScanDesc->rs_relpredicatelocked = true;		
+	}
+	
 	return ExecScan((ScanState *) node,
 	(ExecScanAccessMtd) SeqNext,
 	(ExecScanRecheckMtd) SeqRecheck);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Coding style point: "const" in function parameter declarations

2011-06-21 Thread Dan Ports
On Tue, Jun 21, 2011 at 06:51:20PM -0400, Tom Lane wrote:
> I find this to be poor style, and would like to see if there's any
> support for getting rid of the "const" keywords. 

I'm in favor of removing them too.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-20 Thread Dan Ports
While testing the fix for this one, I found another bug. Patches for
both are attached.

The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
in a more limited form than its previous incarnation.

We need to be able to distinguish transactions that have already
called ReleasePredicateLocks and are thus eligible for cleanup from
those that have been merely marked for abort by other
backends. Transactions that are ROLLED_BACK are excluded from
SxactGlobalXmin calculations, but those that are merely DOOMED need to
be included.

Also update a couple of assertions to ensure we only try to clean up
ROLLED_BACK transactions.


The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
This function checks whether there's a dangerous structure of the form
 far ---> near ---> me
where neither the "far" or "near" transactions have committed. If so, 
it aborts the "near" transaction by marking it as DOOMED. However, that
transaction might already be PREPARED. We need to check whether that's
the case and, if so, abort the transaction that's trying to commit
instead.
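
Schematically (not the actual code -- just the shape of the check, using the
Sxact macros that appear elsewhere in this thread, with "near" being the
pivot and the current backend the one trying to commit):

	if (!SxactIsCommitted(far) && !SxactIsCommitted(near))
	{
		if (SxactIsPrepared(near))
			ereport(ERROR,
					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
					 errmsg("could not serialize access due to read/write dependencies among transactions")));
		near->flags |= SXACT_FLAG_DOOMED;	/* otherwise doom the pivot */
	}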

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6c55211..3678878 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -246,7 +246,6 @@
 
 #define SxactIsCommitted(sxact) (((sxact)->flags & SXACT_FLAG_COMMITTED) != 0)
 #define SxactIsPrepared(sxact) (((sxact)->flags & SXACT_FLAG_PREPARED) != 0)
-#define SxactIsRolledBack(sxact) (((sxact)->flags & SXACT_FLAG_ROLLED_BACK) != 0)
 #define SxactIsDoomed(sxact) (((sxact)->flags & SXACT_FLAG_DOOMED) != 0)
 #define SxactIsReadOnly(sxact) (((sxact)->flags & SXACT_FLAG_READ_ONLY) != 0)
 #define SxactHasSummaryConflictIn(sxact) (((sxact)->flags & SXACT_FLAG_SUMMARY_CONFLICT_IN) != 0)
@@ -3047,7 +3046,7 @@ SetNewSxactGlobalXmin(void)
 
 	for (sxact = FirstPredXact(); sxact != NULL; sxact = NextPredXact(sxact))
 	{
-		if (!SxactIsRolledBack(sxact)
+		if (!SxactIsDoomed(sxact)
 			&& !SxactIsCommitted(sxact)
 			&& sxact != OldCommittedSxact)
 		{
@@ -3114,7 +3113,6 @@ ReleasePredicateLocks(const bool isCommit)
 	Assert(!isCommit || SxactIsPrepared(MySerializableXact));
 	Assert(!isCommit || !SxactIsDoomed(MySerializableXact));
 	Assert(!SxactIsCommitted(MySerializableXact));
-	Assert(!SxactIsRolledBack(MySerializableXact));
 
 	/* may not be serializable during COMMIT/ROLLBACK PREPARED */
 	if (MySerializableXact->pid != 0)
@@ -3153,22 +3151,7 @@ ReleasePredicateLocks(const bool isCommit)
 			MySerializableXact->flags |= SXACT_FLAG_READ_ONLY;
 	}
 	else
-	{
-		/*
-		 * The DOOMED flag indicates that we intend to roll back this
-		 * transaction and so it should not cause serialization
-		 * failures for other transactions that conflict with
-		 * it. Note that this flag might already be set, if another
-		 * backend marked this transaction for abort.
-		 *
-		 * The ROLLED_BACK flag further indicates that
-		 * ReleasePredicateLocks has been called, and so the
-		 * SerializableXact is eligible for cleanup. This means it
-		 * should not be considered when calculating SxactGlobalXmin.
-		 */
 		MySerializableXact->flags |= SXACT_FLAG_DOOMED;
-		MySerializableXact->flags |= SXACT_FLAG_ROLLED_BACK;
-	}
 
 	if (!topLevelIsDeclaredReadOnly)
 	{
@@ -3544,7 +3527,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 nextConflict;
 
 	Assert(sxact != NULL);
-	Assert(SxactIsRolledBack(sxact) || SxactIsCommitted(sxact));
+	Assert(SxactIsDoomed(sxact) || SxactIsCommitted(sxact));
 	Assert(LWLockHeldByMe(SerializableFinishedListLock));
 
 	/*
diff --git a/src/include/storage/predicate_internals.h b/src/include/storage/predicate_internals.h
index 34c661d..495983f 100644
--- a/src/include/storage/predicate_internals.h
+++ b/src/include/storage/predicate_internals.h
@@ -90,22 +90,21 @@ typedef struct SERIALIZABLEXACT
 	int			pid;			/* pid of associated process */
 } SERIALIZABLEXACT;
 
-#define SXACT_FLAG_COMMITTED			0x0001	/* already committed */
-#define SXACT_FLAG_PREPARED				0x0002	/* about to commit */
-#define SXACT_FLAG_ROLLED_BACK			0x0004	/* already rolled back */
-#define SXACT_FLAG_DOOMED				0x0008	/* will roll back */
+#define SXACT_FLAG_COMMITTED	0x0001	/* already committed */
+#define SXACT_FLAG_PREPARED		0x0002	/* about to commit */
+#define SXACT_FLAG_DOOMED		0x0004	/* will roll back */
 /*
  * The following flag actually means that the flagged transaction has a
  * conflict out *to

Re: [HACKERS] pika buildfarm member failure on isolationCheck tests

2011-06-20 Thread Dan Ports
On Sun, Jun 19, 2011 at 09:10:02PM -0400, Robert Haas wrote:
> Is this an open item for 9.1beta3?

Yes. I've put it on the list.

The SxactGlobalXmin and its refcount were getting out of sync with the
actual transaction state. This is timing-dependent but I can reproduce it
fairly reliably under concurrent workloads.

It looks like the problem comes from the change a couple days ago that
removed the SXACT_FLAG_ROLLED_BACK flag and changed the
SxactIsRolledBack checks to SxactIsDoomed. That's the correct thing to
do everywhere else, but gets us in trouble here. We shouldn't be
ignoring doomed transactions in SetNewSxactGlobalXmin, because they're
not eligible for cleanup until the associated backend aborts the
transaction and calls ReleasePredicateLocks.

However, it isn't as simple as just removing the SxactIsDoomed check
from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
SetNewSxactGlobalXmin *before* it releases the aborted transaction's
SerializableXact (it pretty much has to, because we must drop and
reacquire SerializableXactHashLock in between). But we don't want the
aborted transaction included in the SxactGlobalXmin computation.

It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, even
though it's only set for this brief interval. We need to be able to
distinguish a transaction that's just been marked for death (doomed)
from one that's already called ReleasePredicateLocks.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI work for 9.1

2011-06-16 Thread Dan Ports
On Fri, Jun 17, 2011 at 12:32:46AM -0400, Robert Haas wrote:
> Perhaps it would be best to remove the general item and replace it
> with a list of more specific things that need doing - which might just
> mean #5.

Done.

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI work for 9.1

2011-06-16 Thread Dan Ports
On Thu, Jun 16, 2011 at 11:49:48PM -0400, Robert Haas wrote:
> Does this mean that the open item "more SSI loose ends" can now be
> marked resolved?

I was just looking at it and contemplating moving it to the non-blockers
list. Of the five items:
 - (1) and (4) are resolved
 - (2) isn't an issue -- maybe we want to add a comment someplace, but
   I'm not convinced even that is necessary
 - (3) is a regression test, and is already on the list separately
 - (5) is a doc issue only

There are no open issues with the code that I'm aware of.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: update README-SSI

2011-06-16 Thread Dan Ports
On Thu, Jun 16, 2011 at 04:39:09PM +0300, Heikki Linnakangas wrote:
> There's no mention on what T1 is. I believe it's supposed to be Tin, in 
> the terminology used in the graph.

Yes, I changed the naming after I originally wrote it, and missed a
couple spots. T1 should be Tin.

> I don't see how there can be a ww-dependency between T0 and Tin. There 
> can't be a rw-conflict because Tin is read-only, so surely there can't 
> be a ww-conflict either?

Yes, it can only be a wr-conflict. Good catch.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] patch: update README-SSI

2011-06-15 Thread Dan Ports
The attached patch updates README-SSI. In addition to some minor edits,
changes include:

 - add a section at the beginning that more clearly describes the SSI
   rule and defines "dangerous structure" with a diagram. It describes
   the optimizations we use about the relative commit times, and the
   case where one transaction is read-only. It includes a proof for the
   latter (novel) optimization, per Heikki's request.

 - note that heap page locks do not lock "gaps" like index pages

 - be clear about what's been implemented (parts of the README used the
   future tense, probably because they were written long ago), and
   remove a couple items from the "R&D Issues" list that have since
   been addressed.
   
Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
diff --git a/src/backend/storage/lmgr/README-SSI b/src/backend/storage/lmgr/README-SSI
index c685bee..09a8136 100644
--- a/src/backend/storage/lmgr/README-SSI
+++ b/src/backend/storage/lmgr/README-SSI
@@ -51,13 +51,13 @@ if a transaction can be shown to always do the right thing when it is
 run alone (before or after any other transaction), it will always do
 the right thing in any mix of concurrent serializable transactions.
 Where conflicts with other transactions would result in an
-inconsistent state within the database, or an inconsistent view of
+inconsistent state within the database or an inconsistent view of
 the data, a serializable transaction will block or roll back to
 prevent the anomaly. The SQL standard provides a specific SQLSTATE
 for errors generated when a transaction rolls back for this reason,
 so that transactions can be retried automatically.
 
-Before version 9.1 PostgreSQL did not support a full serializable
+Before version 9.1, PostgreSQL did not support a full serializable
 isolation level. A request for serializable transaction isolation
 actually provided snapshot isolation. This has well known anomalies
 which can allow data corruption or inconsistent views of the data
@@ -77,7 +77,7 @@ Serializable Isolation Implementation Strategies
 
 Techniques for implementing full serializable isolation have been
 published and in use in many database products for decades. The
-primary technique which has been used is Strict 2 Phase Locking
+primary technique which has been used is Strict Two-Phase Locking
 (S2PL), which operates by blocking writes against data which has been
 read by concurrent transactions and blocking any access (read or
 write) against data which has been written by concurrent
@@ -112,54 +112,90 @@ visualize the difference between the serializable implementations
 described above, is to consider that among transactions executing at
 the serializable transaction isolation level, the results are
 required to be consistent with some serial (one-at-a-time) execution
-of the transactions[1]. How is that order determined in each?
+of the transactions [1]. How is that order determined in each?
 
-S2PL locks rows used by the transaction in a way which blocks
-conflicting access, so that at the moment of a successful commit it
-is certain that no conflicting access has occurred. Some transactions
-may have blocked, essentially being partially serialized with the
-committing transaction, to allow this. Some transactions may have
-been rolled back, due to cycles in the blocking. But with S2PL,
-transactions can always be viewed as having occurred serially, in the
-order of successful commit.
+In S2PL, each transaction locks any data it accesses. It holds the
+locks until committing, preventing other transactions from making
+conflicting accesses to the same data in the interim. Some
+transactions may have to be rolled back to prevent deadlock. But
+successful transactions can always be viewed as having occurred
+sequentially, in the order they committed.
 
 With snapshot isolation, reads never block writes, nor vice versa, so
-there is much less actual serialization. The order in which
-transactions appear to have executed is determined by something more
-subtle than in S2PL: read/write dependencies. If a transaction
-attempts to read data which is not visible to it because the
-transaction which wrote it (or will later write it) is concurrent
-(one of them was running when the other acquired its snapshot), then
-the reading transaction appears to have executed first, regardless of
-the actual sequence of transaction starts or commits (since it sees a
-database state prior to that in which the other transaction leaves
-it). If one transaction has both rw-dependencies in (meaning that a
-concurrent transaction attempts to read data it writes) and out
-(meaning it attempts to read data a concurrent transaction writes),
-and a couple other conditions are met, there can appear to be a cycle
-in execution order of the transactions. This is when the anomalies
-occur.
-
-SSI works by watching for the conditions mentioned above, and rolling
-back a tr

Re: [HACKERS] SSI patch renumbered existing 2PC resource managers??

2011-06-13 Thread Dan Ports
On Mon, Jun 13, 2011 at 03:33:24PM -0400, Tom Lane wrote:
> We can either change that now, or undo the
> unnecessary change in existing RM IDs.  I vote for the latter.

Sounds good to me. I'd offer a patch, but it'd probably take you longer
to apply than to make the change yourself.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI patch renumbered existing 2PC resource managers??

2011-06-13 Thread Dan Ports
On Mon, Jun 13, 2011 at 10:22:19PM +0300, Heikki Linnakangas wrote:
> As far as I can tell it was for purely cosmetic reasons, to have lock 
> and predicate lock lines together.

Yes, that is the only reason.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small SSI issues

2011-06-11 Thread Dan Ports
On Sat, Jun 11, 2011 at 01:38:31PM -0500, Kevin Grittner wrote:
> I'm not concerned about references covered by
> SerializableXactHashLock.  I am more concerned about some of the
> tests for whether the (MySerializableXact == InvalidSerializableXact)
> checks and any other tests not covered by that lock are OK without it
> (and OK with it).  Since my knowledge of weak memory ordering
> behavior is, well, weak, I didn't want to try to make that call.

Oh, those checks are definitely not an issue -- MySerializableXact
itself (the pointer, not the thing it's pointing to) is in
backend-local memory, so it won't change under us.

The volatile qualifier (as written) doesn't help with that anyway, it
attaches to the data being pointed to, not the pointer itself.
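
For clarity, the difference in C declarator terms:

	volatile SERIALIZABLEXACT *a;	/* what a points to is treated as volatile */
	SERIALIZABLEXACT *volatile b;	/* the pointer variable itself is volatile */

The usage being discussed is the first form, which is why it says nothing
about the pointer.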

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

