Re: [HACKERS] buffer assertion tripping under repeat pgbench load

2012-12-23 Thread Andres Freund
Hi,

On 2012-12-23 02:36:42 -0500, Greg Smith wrote:
> I'm testing a checkout from a few days ago and trying to complete a day-long
> pgbench stress test, with assertions and debugging on.  I want to make sure
> the base code works as expected before moving on to testing checksums.  It's
> crashing before finishing though.  Here's a sample:
>
> 2012-12-20 20:36:11 EST [26613]: LOG:  checkpoint complete: wrote 32 buffers
> (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.108
> s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 s, average=0.139
> s
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
> 2012-12-20 20:36:44 EST [26611]: LOG:  server process (PID 4834) was
> terminated by signal 6: Aborted
> 2012-12-20 20:36:44 EST [26611]: DETAIL:  Failed process was running: END;
>
> Running the same test set again gave the same crash, so this looks
> repeatable.  It's not trivial to trigger given the average runtime to crash
> I'm seeing.  Second sample:
>
> 2012-12-22 21:27:17 EST [6006]: LOG:  checkpoint complete: wrote 34 buffers
> (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.310
> s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 s, average=0.415
> s
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
> 2012-12-22 21:27:23 EST [26611]: LOG:  server process (PID 12143) was
> terminated by signal 6: Aborted
> 2012-12-22 21:27:23 EST [26611]: DETAIL:  Failed process was running: END;
>
> The important part:
>
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
>
> That's the check done by "AtEOXact_Buffers - clean up at end of
> transaction".  It fired while executing the "END;"" statement at the end of
> the standard pgbench test.  The test looks for loose things that weren't
> pinned/unpinned correctly by the transaction.  I'm not sure if getting a
> stack trace at that point will be useful.  The ref count mistake could
> easily have been made by an earlier statement in the transaction block,
> right?
>
> This is on a server where shared_buffers=2GB, checkpoint_segments=64. Test
> runs were automated with pgbench-tools, using the following configuration:
>
> MAX_WORKERS="4"
> SCRIPT="tpc-b.sql"
> SCALES="500 1000"
> SETCLIENTS="4 8 16 32 64 96"
> SETTIMES=3
> RUNTIME=600
>
> All of the crashes so far have been at scale=500, and I haven't seen those
> finish yet.  It failed 7 hours in the first time, then 4 hours in the second.
>
> For code reference, the last commit in the source code tree I built against
> was:
>
> af275a12dfeecd621ed9899a9382f26a68310263
> Thu Dec 20 14:23:31 2012 +0200
>
> Looking at recent activity, the only buffer-pin-related change I saw was
> this one:
>
> commit 62656617dbe49cca140f3da588a5e311b3fc35ea
> Date:   Mon Dec 3 18:53:31 2012 +
> Avoid holding vmbuffer pin after VACUUM.

I don't immediately see anything bad here.

> This is my first test like this against 9.3 development though, so the cause
> could be an earlier commit.  I'm just starting with the most recent work as
> the first suspect.  Next I think I'll try autovacuum=off and see if the
> crash goes away.  Other ideas are welcome.

Something like the (untested) debug message below might be helpful:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7be767b..20f61a1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1700,7 +1700,13 @@ AtEOXact_Buffers(bool isCommit)
 
for (i = 0; i < NBuffers; i++)
{
-   Assert(PrivateRefCount[i] == 0);
+   if (PrivateRefCount[i] != 0)
+   {
+   BufferDesc *bufHdr = &BufferDescriptors[i];
+   elog(PANIC, "refcount of %s is %u should be 0, 
globally: %u",
+relpathbackend(bufHdr->tag.rnode, 
InvalidBackendId, bufHdr->tag.forkNum),
+PrivateRefCount[i], bufHdr->refcount);
+   }
}
}
 #endif

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto seeding problem when ssl=on

2012-12-23 Thread Marko Kreen
On Sat, Dec 22, 2012 at 9:20 PM, Tom Lane  wrote:
> I'm not thrilled with the suggestion to do RAND_cleanup() after forking
> though, as that seems like it'll guarantee that each session starts out
> with only a minimal amount of entropy.

No, that's actually the right fix - it forces OpenSSL to do a new
reseed with system randomness, thus making the backend (including
SSL_accept) maximally disconnected from the static pool in the
postmaster.

This also makes the behaviour equal to the current ssl=off and
exec-backend modes, which already do initial seeding in the backend.

The fact that PRNG behaviour is affected by a complex set of compile-
and run-time switches makes the current situation rather fragile and
hard to understand.

>  What seems to me like it'd be
> most secure is to make the postmaster do RAND_add() with a gettimeofday
> result after each successful fork, say at the bottom of
> BackendStartup().  That way the randomness accumulates in the parent
> process, and there's no way to predict the initial state of any session
> without exact knowledge of every child process launch since postmaster
> start.  So it'd go something like
>
> #ifdef USE_SSL
> if (EnableSSL)
> {
> struct timeval tv;
>
> gettimeofday(&tv, NULL);
> RAND_add(&tv, sizeof(tv), 0);
> }
> #endif

If you decide against RAND_cleanup in the backend and instead do
workarounds in the backend or postmaster, then please also do a
periodic RAND_cleanup in the postmaster.  This should make it harder
to exploit weaknesses in the reused, slowly moving state.
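
Roughly what I have in mind is this kind of thing (a hypothetical
sketch only - the call site and the interval are my assumptions, not
part of any patch):

#include <openssl/rand.h>

/* Hypothetical sketch: throw away the postmaster's PRNG state every so
 * often, so OpenSSL reseeds from system randomness on next use.  The
 * call site (e.g. ServerLoop()) and the interval are assumed. */
#define RESEED_EVERY 1000

static int conns_since_reseed = 0;

static void
maybe_reseed_postmaster_pool(void)
{
	if (++conns_since_reseed >= RESEED_EVERY)
	{
		RAND_cleanup();
		conns_since_reseed = 0;
	}
}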

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup from cascading standby after timeline switch

2012-12-23 Thread Fujii Masao
On Fri, Dec 21, 2012 at 9:54 PM, Heikki Linnakangas
 wrote:
> Yes, this should be backpatched to 9.2. I came up with the attached.

In this patch, if '-X stream' is specified in pg_basebackup, the timeline
history files are not backed up. Should we change the pg_basebackup
background process and walsender so that they also stream timeline history
files, for example by using a 'TIMELINE_HISTORY' replication command?
Or should basebackup.c send all timeline history files at the end of the
backup even if '-X stream' is specified?

> However, thinking about this some more, there's a another bug in the way WAL
> files are included in the backup, when a timeline switch happens.
> basebackup.c includes all the WAL files on ThisTimeLineID, but when the
> backup is taken from a standby, the standby might've followed a timeline
> switch. So it's possible that some of the WAL files should come from
> timeline 1, while others should come from timeline 2. This leads to an error
> like "requested WAL segment 0001000C has already been
> removed" in pg_basebackup.
>
> Attached is a script to reproduce that bug, if someone wants to play with
> it. It's a bit sensitive to timing, and needs tweaking the paths at the top.
>
> One solution to that would be to pay more attention to the timelines to
> include WAL from. basebackup.c could read the timeline history file, to see
> exactly where the timeline switches happened, and then construct the
> filename of each WAL segment using the correct timeline id. Another approach
> would be to do readdir() on pg_xlog, and include all WAL files, regardless
> of timeline IDs, that fall in the right XLogRecPtr range. The latter seems
> easier to backpatch.

The latter sounds good to me. But how would all the WAL files with
different timelines be shipped in pg_basebackup -X stream mode?
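
For reference, my reading of the readdir() idea is something like this
untested sketch (the helper and its start/end inputs are assumptions,
not code from Heikki's patch):

#include <stdio.h>
#include <string.h>

/* Decide whether a pg_xlog entry is a WAL segment inside the wanted
 * range, ignoring the timeline ID in its name. */
static int
segment_in_range(const char *name,
				 unsigned startlog, unsigned startseg,
				 unsigned endlog, unsigned endseg)
{
	unsigned	tli, log, seg;

	/* WAL segment names are 24 hex digits: 8 each for TLI, log, seg */
	if (strlen(name) != 24 ||
		sscanf(name, "%8X%8X%8X", &tli, &log, &seg) != 3)
		return 0;				/* not a WAL segment file */

	if (log < startlog || (log == startlog && seg < startseg))
		return 0;
	if (log > endlog || (log == endlog && seg > endseg))
		return 0;

	return 1;					/* include it, whatever tli says */
}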

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of Row Level Security

2012-12-23 Thread Simon Riggs
On 22 December 2012 20:13, Kevin Grittner  wrote:

> I apologize again for coming in so late with strong opinions, but I
> thought I knew what "row level security" meant, and it was just a
> question of how to do it, but I can't reconcile what I thought the
> feature was about with the patch I'm seeing; perhaps it's just a
> lack of the high-level context that's making it difficult.

Agreed, I think we're all feeling that. I'll do my best to accommodate
all viewpoints.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] pg_ping utility

2012-12-23 Thread Michael Paquier
On Sat, Dec 22, 2012 at 4:07 AM, Phil Sorber  wrote:

>
> Added new version with default verbose and quiet option. Also updated
> docs to reflect changes.
>
Thanks for the updated patches.

Here is the status of the binary patch:
- The code compiles without any warnings.
- After testing the patch, it behaves as expected: the default output is now
verbose, and it can be hidden using -q or --quiet.
However, I still have the following comments:
- In pg_isready.c, the function "help" needs to be static.
- The list of options passed to getopt_long should be sorted in
alphabetical order (the option q is not at the correct position):
d:h:p:U:qV => d:h:p:qU:V
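
To be explicit, the two fixes would look roughly like this (a sketch
only - the real pg_isready.c code such as long_options and
option_index is assumed, and help() is stubbed here):

#include <getopt.h>
#include <stdio.h>

static void
help(const char *progname)		/* now static */
{
	printf("Usage:\n  %s [OPTION]...\n", progname);
}

static int
next_option(int argc, char **argv,
			const struct option *long_options, int *option_index)
{
	/* option letters sorted alphabetically: 'q' moved before 'U' */
	return getopt_long(argc, argv, "d:h:p:qU:V", long_options, option_index);
}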

Then, about the doc patch:
- The docs compile correctly (I did a manual check).
- I am not a native English speaker, but the docs look correct to me. There
are enough examples and the descriptions are sufficient. No problems either
with the sgml format.

Once the 2 small things I noticed are fixed, this patch can be marked as
ready for committer.
Thanks,
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] [WIP] pg_ping utility

2012-12-23 Thread Phil Sorber
On Sun, Dec 23, 2012 at 9:29 AM, Michael Paquier
 wrote:
>
>
> On Sat, Dec 22, 2012 at 4:07 AM, Phil Sorber  wrote:
>>
>>
>> Added new version with default verbose and quiet option. Also updated
>> docs to reflect changes.
>
> Thanks for the updated patches.
>
> Here is the status of the binary patch:
> - The code compiles without any warnings.
> - After testing the patch, it behaves as expected: the default output is now
> verbose, and it can be hidden using -q or --quiet.
> However, I still have the following comments:
> - In pg_isready.c, the function "help" needs to be static.

I have no objection to making this static, but I'm curious - what is
your reason for it here?

> - The list of options passed to getopt_long should be sorted in
> alphabetical order (the option q is not at the correct position):
> d:h:p:U:qV => d:h:p:qU:V
>
> Then, about the doc patch:
> - The docs compile correctly (I did a manual check).
> - I am not a native English speaker, but the docs look correct to me. There
> are enough examples and the descriptions are sufficient. No problems either
> with the sgml format.
>
> Once the 2 small things I noticed are fixed, this patch can be marked as
> ready for committer.

Ok, I will get an updated version later today.

> Thanks,
> --
> Michael Paquier
> http://michael.otacoo.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switching timeline over streaming replication

2012-12-23 Thread Fujii Masao
On Fri, Dec 21, 2012 at 1:48 AM, Fujii Masao  wrote:
> On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao  wrote:
>> On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas
>>  wrote:
>>> On 06.12.2012 15:39, Amit Kapila wrote:

 On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
>
> On 05.12.2012 14:32, Amit Kapila wrote:
>>
>> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
>>>
>>> After some diversions to fix bugs and refactor existing code, I've
>>> committed a couple of small parts of this patch, which just add some
>>> sanity checks to notice incorrect PITR scenarios. Here's a new
>>> version of the main patch based on current HEAD.
>>
>>
>> After testing with the new patch, the following problems are observed.
>>
>> Defect - 1:
>>
>>   1. start primary A
>>   2. start standby B following A
>>   3. start cascade standby C following B.
>>   4. start another standby D following C.
>>   5. Promote standby B.
>>   6. After successful timeline switch in cascade standby C & D,
>>  stop D.
>>
>>   7. Restart D, Startup is successful and connecting to standby C.
>>   8. Stop C.
>>   9. Restart C, startup is failing.
>
>
> Ok, the error I get in that scenario is:
>
> C 2012-12-05 19:55:43.840 EET 9283 FATAL:  requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG:  startup process (PID 9283) exited with exit code 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG:  aborting startup due to startup process failure
>

>
> That mismatch causes the error. I'd like to fix this by always treating
> the checkpoint record to be part of the new timeline. That feels more
> correct. The most straightforward way to implement that would be to peek
> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
> it's a checkpoint record that changes TLI, set replayEndTLI to the new
> timeline before calling the redo-function. But it's a bit of a
> modularity violation to peek into the record like that.
>
> Or we could just revert the sanity check at beginning of recovery that
> throws the "requested timeline 2 does not contain minimum recovery point
> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
> record that says "unexpected timeline ID %u in checkpoint record, before
> reaching minimum recovery point %X/%X on timeline %u" checks basically
> the same thing, but at a later stage. However, the way
> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
> fix that.
>
> I'm thinking of something like the attached (with some more comments
> before committing). Thoughts?


 This has fixed the problem reported.
 However, I am not able to think whether there will be any problem if we
 remove the check "requested timeline 2 does not contain minimum recovery
 point 0/3023F08 on timeline 1" at the beginning of recovery and just
 update replayEndTLI with ThisTimeLineID?
>>>
>>>
>>> Well, it seems wrong for the control file to contain a situation like this:
>>>
>>> pg_control version number:932
>>> Catalog version number:   201211281
>>> Database system identifier:   5819228770976387006
>>> Database cluster state:   shut down in recovery
>>> pg_control last modified: pe  7. joulukuuta 2012 17.39.57
>>> Latest checkpoint location:   0/3023EA8
>>> Prior checkpoint location:0/260
>>> Latest checkpoint's REDO location:0/3023EA8
>>> Latest checkpoint's REDO WAL file:00020003
>>> Latest checkpoint's TimeLineID:   2
>>> ...
>>> Time of latest checkpoint:pe  7. joulukuuta 2012 17.39.49
>>> Min recovery ending location: 0/3023F08
>>> Min recovery ending loc's timeline:   1
>>>
>>> Note the latest checkpoint location and its TimelineID, and compare them
>>> with the min recovery ending location. The min recovery ending location is
>>> ahead of latest checkpoint's location; the min recovery ending location
>>> actually points to the end of the checkpoint record. But how come the min
>>> recovery ending location's timeline is 1, while the checkpoint record's
>>> timeline is 2.
>>>
>>> Now maybe that would happen to work if remove the sanity check, but it still
>>> seems horribly confusing. I'm afraid that discrepancy will come back to
>>> haunt us later if we leave it like that. So I'd like to fix that.
>>>
>>> Mulling over this for some more, I propose the attached patch. With the
>>> patch, we peek into the checkpoint record, and actually perform the timeline
>>> switch (by changing ThisTimeLineID) before replaying it. That way the
>>> checkpoint record is really considered to be part of the new timeline.

Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

2012-12-23 Thread Simon Riggs
On 20 December 2012 14:56, Amit Kapila  wrote:

>> > 1. There is no performance change for cloumns that have all valid
>> > values(non- NULLs).

I don't see any tests (at all) that measure this.

I'm particularly interested in lower numbers of columns, so we can
show no regression for the common case.


>> > 2. There is a visible performance increase when number of columns
>> containing
>> > NULLS are more than > 60~70% in table have large number of columns.
>> >
>> > 3. There are visible space savings when number of columns containing
>> NULLS
>> > are more than > 60~70% in table have large number of columns.

Agreed.

I would call that quite disappointing though and was expecting better.
Are we sure the patch works and the tests are correct?

The lack of any space saving for lower % values is strange and
somewhat worrying. There should be a 36? byte saving for 300 null
columns in an 800 column table - how does that not show up at all?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] pg_ping utility

2012-12-23 Thread Erik Rijkers
On Sun, December 23, 2012 15:29, Michael Paquier wrote:
>
> Once the 2 small things I noticed are fixed, this patch can be marked as
> ready for committer.

I wasn't going to complain about it, but if we're going for small things 
anyway...

The output is now capitalised:

/tmp:6543 - Accepting Connections
/tmp:6000 - No Response

which looks unusual to me, could we please make it all lower-case?


Thanks,

Erik Rijkers





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] pg_ping utility

2012-12-23 Thread Phil Sorber
On Sun, Dec 23, 2012 at 9:57 AM, Erik Rijkers  wrote:
> On Sun, December 23, 2012 15:29, Michael Paquier wrote:
>>
>> Once the 2 small things I noticed are fixed, this patch can be marked as
>> ready for committer.
>
> I wasn't going to complain about it, but if we're going for small things 
> anyway...
>
> The output is now capitalised:
>
> /tmp:6543 - Accepting Connections
> /tmp:6000 - No Response
>
> which looks unusual to me, could we please make it all lower-case?

I'm not opposed to it in principle. Is that how it is done elsewhere
in the code? If there are no objections from anyone else I will roll
that change in with the others.

>
>
> Thanks,
>
> Erik Rijkers
>
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-23 Thread Fujii Masao
On Sat, Dec 22, 2012 at 5:14 AM, Heikki Linnakangas
 wrote:
> On 21.12.2012 21:43, Simon Riggs wrote:
>>
>> On 21 December 2012 19:35, Bruce Momjian  wrote:
>>
 It's not too complex. You just want that to be true. The original
 developer has actually literally gone away, but not because of this.
>>>
>>>
>>> Well, Robert and I remember it differently.
>>>
>>> Anyway, I will ask for a vote now.
>>
>>
>> And what will you ask for a vote on? Why not spend that effort on
>> solving the problem? Why is it OK to waste so much time?
>>
>> Having already explained how to do this, I'll add backwards
>> compatibility within 1 day of the commit of the patch you claim was
>> blocked by this. I think it will take me about an hour and not be very
>> invasive, just to prove what a load of hot air is being produced here.
>
>
> I haven't been following this.. Could you two post a link to the patch we're
> talking about, and the explanation of how to add backwards compatibility to
> it?

The latest patch is the following. Of course, this cannot be applied
cleanly in HEAD.
http://archives.postgresql.org/message-id/CAHGQGwHB==cjehme6jiuy-knutrx-k3ywqzieg1gpnb3ck6...@mail.gmail.com

> Just by looking at the last few posts, this seems like a no brainer. The
> impression I get is that there's a patch that's otherwise ready to be
> applied, but Simon and some others want to have backwards-compatibility added
> to it first. And Simon has a plan on how to do it, and can do it in one day.
> The obvious solution is that Simon posts the patch, with the
> backwards-compatibility added. We can then discuss that, and assuming no
> surprises there, commit it. And everyone lives happily ever after.

Sounds reasonable.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] pg_ping utility

2012-12-23 Thread Phil Sorber
On Sun, Dec 23, 2012 at 10:07 AM, Phil Sorber  wrote:
> On Sun, Dec 23, 2012 at 9:57 AM, Erik Rijkers  wrote:
>> On Sun, December 23, 2012 15:29, Michael Paquier wrote:
>>>
>>> Once the 2 small things I noticed are fixed, this patch can be marked as
>>> ready for committer.
>>
>> I wasn't going to complain about it, but if we're going for small things 
>> anyway...
>>
>> The output is now capitalised:
>>
>> /tmp:6543 - Accepting Connections
>> /tmp:6000 - No Response
>>
>> which looks unusual to me, could we please make it all lower-case?
>
> I'm not opposed to it in principle. Is that how it is done elsewhere
> in the code? If there are no objections from anyone else I will roll
> that change in with the others.
>
>>
>>
>> Thanks,
>>
>> Erik Rijkers
>>
>>
>>
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers

Updated patch attached.


pg_isready_bin_v9.diff
Description: Binary data


pg_isready_docs_v9.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

2012-12-23 Thread Tom Lane
Simon Riggs  writes:
> The lack of any space saving for lower % values is strange and
> somewhat worrying. There should be a 36? byte saving for 300 null
> columns in an 800 column table - how does that not show up at all?

You could only fit about 4 such rows in an 8K page (assuming the columns
are all int4s).  Unless the savings is enough to allow 5 rows to fit in
a page, the effective savings will be zilch.

This may well mean that the whole thing is a waste of time in most
scenarios --- the more likely it is to save anything, the more likely
that the savings will be lost anyway due to page alignment
considerations, because wider rows inherently pack less efficiently.
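
For what it's worth, here is that arithmetic spelled out (a rough
sketch with assumed layout constants - 8K pages, 24-byte page header,
4-byte line pointers, 23-byte tuple header, 8-byte MAXALIGN):

#include <stdio.h>

#define BLCKSZ			8192
#define PAGE_HEADER		24
#define LINE_POINTER	4
#define TUPLE_HEADER	23
#define MAXALIGN(x)		(((x) + 7) & ~7)

static int
rows_per_page(int bitmap_cols, int nonnull_int4_cols)
{
	int		bitmap = (bitmap_cols + 7) / 8;		/* null bitmap bytes */
	int		tuple = MAXALIGN(TUPLE_HEADER + bitmap) + 4 * nonnull_int4_cols;

	return (BLCKSZ - PAGE_HEADER) / (tuple + LINE_POINTER);
}

int
main(void)
{
	/* 800 int4 columns, 500 leading non-NULL, 300 trailing NULLs */
	printf("full bitmap:      %d rows/page\n", rows_per_page(800, 500));
	printf("truncated bitmap: %d rows/page\n", rows_per_page(500, 500));
	return 0;
}

On those assumptions both variants come out at the same number of rows
per page, so the bitmap truncation never buys an extra row - which is
the point above.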

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

2012-12-23 Thread Simon Riggs
On 23 December 2012 17:38, Tom Lane  wrote:
> Simon Riggs  writes:
>> The lack of any space saving for lower % values is strange and
>> somewhat worrying. There should be a 36? byte saving for 300 null
>> columns in an 800 column table - how does that not show up at all?
>
> You could only fit about 4 such rows in an 8K page (assuming the columns
> are all int4s).  Unless the savings is enough to allow 5 rows to fit in
> a page, the effective savings will be zilch.

If that's the case, the use case is tiny, especially considering how
sensitive the saving is to the exact location of the NULLs.

> This may well mean that the whole thing is a waste of time in most
> scenarios --- the more likely it is to save anything, the more likely
> that the savings will be lost anyway due to page alignment
> considerations, because wider rows inherently pack less efficiently.

ISTM that we'd get a better gain and a wider use case by compressing
the whole block, with some bits masked out to allow updates/deletes.
The string of zeroes in the null bitmap would compress easily, but so
would other aspects also.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] buffer assertion tripping under repeat pgbench load

2012-12-23 Thread Tom Lane
Andres Freund  writes:
>> This is my first test like this against 9.3 development though, so the cause
>> could be an earlier commit.  I'm just starting with the most recent work as
>> the first suspect.  Next I think I'll try autovacuum=off and see if the
>> crash goes away.  Other ideas are welcome.

> Something like the (untested) debug message below might be helpful:

It might also be interesting to know if there is more than one
still-pinned buffer --- that is, if you're going to hack the code, fix
it to elog(LOG) each pinned buffer and then panic after completing the
loop.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of Row Level Security

2012-12-23 Thread Kohei KaiGai
2012/12/22 Kevin Grittner :
> Kohei KaiGai wrote:
>
>> RLS entry of wiki has not been updated for long time, I'll try to
>> update the entry for high-level design in a couple of days.
>
> Thanks, I think that is essential for a productive discussion of
> the issue.
>
I tried to update http://wiki.postgresql.org/wiki/RLS

I went back to the definition of a feature for information security; that
requires ensuring confidentiality, integrity and availability (C.I.A.)
of the information assets managed by the system.
Access control contributes to the first two elements.
So, I'm inclined to have the RLS feature "eventually" support both the
reader side and the writer side, to prevent unprivileged rows from being
read or written.

If I could introduce the most conceptual point in one statement,
it would be:
"Overall, RLS prevents users from reading and writing rows that do not
satisfy the row-security policy configured on the table by the table
owner. The reader side ensures confidentiality of data; the writer side
ensures integrity of data."
Also note that I believe this criterion never rules out having multiple
(asymmetric) row-security policies, one per command type, as long as we
handle the problematic scenarios properly.

Thanks,
-- 
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of Row Level Security

2012-12-23 Thread Simon Riggs
On 21 December 2012 16:51, Kevin Grittner  wrote:

>> It would be easy enough to include read/write variable clauses by
>> using a function similar to TG_OP for triggers, e.g. StatementType.
>> That would make the security clause that applied only to reads into
>> something like this (StatementType('INSERT, UPDATE') OR other-quals).
>
> In the case where the logic in encapsulated into a function, what
> are the logical differences from a BEFORE EACH ROW trigger?

Logically it is broadly similar to a CHECK constraint, which is also
similar to a trigger in a few ways.

Implemented as a BEFORE EACH ROW trigger it would need to be a new
class of trigger that always executed after all other BEFORE EACH ROW
triggers had executed, in the same way that security barrier views act
last. The "act last" bit seems to be the most important thing here,
just as it was for security barriers and by analogy a string enough
reason to add specific syntax to support this case. (Syntax as yet
undecided...)

> If
> none, and this is strictly an optimization, what are the benchmarks
> showing?

AFAIK it's well known that a check constraint is much faster than a
trigger. The objection to using triggers is partially for that reason
and partially because of the admin overhead, as I already said.

Adding a trigger could be done automatically, just as the current
patch does with the check constraint approach. So if anyone has
benchmarks that show triggers are actually faster, then we could add
that automatically instead, I guess.

Anyway, hope you can make call on 28th so we can discuss this and
agree a way forwards you're happy with.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of Row Level Security

2012-12-23 Thread Tom Lane
Simon Riggs  writes:
> On 21 December 2012 16:51, Kevin Grittner  wrote:
>> If none, and this is strictly an optimization, what are the benchmarks
>> showing?

> AFAIK it's well known that a check constraint is much faster than a
> trigger.

I don't believe that that's "well known" at all, at least not for
apples-to-apples comparison cases.  A C-coded BEFORE trigger doesn't
have very much overhead; I suspect it's probably comparable to
expression evaluation setup overhead.  I think if you want to argue
for this on performance grounds, you need to actually prove there's
a significant performance advantage, not just assume there will be.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of Row Level Security

2012-12-23 Thread Simon Riggs
On 23 December 2012 19:16, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 21 December 2012 16:51, Kevin Grittner  wrote:
>>> If none, and this is strictly an optimization, what are the benchmarks
>>> showing?
>
>> AFAIK it's well known that a check constraint is much faster than a
>> trigger.
>
> I don't believe that that's "well known" at all, at least not for
> apples-to-apples comparison cases.  A C-coded BEFORE trigger doesn't
> have very much overhead; I suspect it's probably comparable to
> expression evaluation setup overhead.  I think if you want to argue
> for this on performance grounds, you need to actually prove there's
> a significant performance advantage, not just assume there will be.

If you want to see some tests, I'm sure those can be arranged, no
problem. But honestly, if its low enough, then which is the fastest
will likely be moot in comparison with the cost of a non-C coded
role-based security check. So I think our attention is best spent on
providing a few likely C-coded security checks, so we're able to
address the whole performance concern not just the constraint/trigger
debate.

That still leaves the points about ensuring the trigger/checks are
executed last and also that they are added automatically, rather than
requiring them to be added manually. As KaiGai points out, if they are
added automatically, it doesn't really matter which we pick.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] buffer assertion tripping under repeat pgbench load

2012-12-23 Thread Greg Smith

On 12/23/12 1:10 PM, Tom Lane wrote:


It might also be interesting to know if there is more than one
still-pinned buffer --- that is, if you're going to hack the code, fix
it to elog(LOG) each pinned buffer and then panic after completing the
loop.



Easy enough; I kept it so the actual source of panic is still an 
assertion failure:


diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..df43643 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1697,11 +1697,21 @@ AtEOXact_Buffers(bool isCommit)
if (assert_enabled)
{
int i;
+   int RefCountErrors = 0;

for (i = 0; i < NBuffers; i++)
{
-   Assert(PrivateRefCount[i] == 0);
+
+   if (PrivateRefCount[i] != 0)
+   {
+   BufferDesc *bufHdr = &BufferDescriptors[i];
+   elog(LOG, "refcount of %s is %u should 
be 0, globally: %u",
+relpathbackend(bufHdr->tag.rnode, 
InvalidBackendId, bufHdr->tag.forkNum),

+PrivateRefCount[i], bufHdr->refcount);
+   RefCountErrors++;
+   }
}
+   Assert(RefCountErrors == 0);
}
 #endif


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto seeding problem when ssl=on

2012-12-23 Thread Tom Lane
Marko Kreen  writes:
> On Sat, Dec 22, 2012 at 9:20 PM, Tom Lane  wrote:
>> I'm not thrilled with the suggestion to do RAND_cleanup() after forking
>> though, as that seems like it'll guarantee that each session starts out
>> with only a minimal amount of entropy.

> No, that's actually the right fix - that forces OpenSSL to do new reseed
> with system randomness, thus making backend (including SSL_accept)
> maximally disconnected from static pool in postmaster.

I don't think that "maximal disconnectedness" is the deciding factor
here, or even a very important factor.  If we do RAND_cleanup() then
each new session will be forced to suck entropy from /dev/urandom
(immediately, if an SSL connection is attempted).  This opens the door
to quasi-denial-of-service attacks that bleed all the entropy out of the
system, forcing long delays for new PRNG-using processes, Postgres or
otherwise.  And long delays are the *best case* scenario --- worst case,
if the system lacks decent /dev/random support, is that new processes
are starting up with highly correlated PRNG states.

If such an approach were a good thing, the OpenSSL guys would have
implemented it internally --- it'd be easy enough to do, just by forcing
RAND_cleanup anytime getpid() reads differently than it did when the
pool was set up.

gettimeofday() might not be the very best way of changing the inherited
PRNG state from child to child, but I think it's a more attractive
solution than RAND_cleanup.

> This also makes behaviour equal to current ssl=off and exec-backend
> mode, which already do initial seeding in backend.

No, it's not "equal", because when ssl=off seeding attempts won't happen
immediately after fork and thus an attacker doesn't have such an easy
way of draining entropy.  If we do what you're suggesting, the attacker
doesn't even need a valid database login to drain entropy --- he can
just fire-and-forget NEGOTIATE_SSL startup packets.  (The exec-backend
case will have such issues anyway, but who thought Windows was a secure
OS?)

> If you decide against RAND_cleanup in backend and instead
> do workarounds in backend or postmaster, then please
> also to periodic RAND_cleanup in postmaster.  This should
> make harder to exploit weaknesses in reused slowly moving state.

We could consider that, but it's not apparent to me that it has any
real value.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto seeding problem when ssl=on

2012-12-23 Thread Tom Lane
Noah Misch  writes:
> On Sat, Dec 22, 2012 at 02:20:56PM -0500, Tom Lane wrote:
>> I believe that we'd be better off doing something in postmaster.c to
>> positively ensure that each session has a distinct seed value.  Notice
>> that BackendRun() already takes measures to ensure that's the case for
>> the regular libc random() function; it seems like a reasonable extension
>> to also worry about OpenSSL's PRNG.

>> #ifdef USE_SSL
>> if (EnableSSL)
>> {
>>  struct timeval tv;
>> 
>>  gettimeofday(&tv, NULL);
>>  RAND_add(&tv, sizeof(tv), 0);
>> }
>> #endif

> Take the caution one step further and make it independent of EnableSSL.  In a
> stock installation, a !EnableSSL postmaster will never seed its PRNG, and
> there's no vulnerability.  Add a shared_preload_libraries module that uses the
> OpenSSL PRNG in its _PG_init(), and suddenly you're vulnerable again.

Meh.  In a postmaster that wasn't built with SSL support at all, such
a module is still dangerous (and I'm not convinced anybody would build
such a module anyway).  I think we should confine our ambitions to
preventing security issues caused by our own code.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto seeding problem when ssl=on

2012-12-23 Thread Tom Lane
Andres Freund  writes:
> On 2012-12-22 14:20:56 -0500, Tom Lane wrote:
>> I believe that we'd be better off doing something in postmaster.c to
>> positively ensure that each session has a distinct seed value.

> I am entirely unconvinced that adding in a gettimeofday() provides more
> entropy than what openssl gathers internally after a
> RAND_cleanup().

No, it doesn't provide more entropy, but what it does do is avoid
draining entropy from the kernel (see my reply to Marko just now).

> gettimeofday() will yield the same result in close
> enough forks on a significant number of systems whereas openssl should
> use stuff like /dev/urandom to initialize its PRNG after a cleanup.

Well, in the first place, that doesn't matter much because the PRNG
state will still change from repeated mix-ins of gettimeofday, even if
we obtain the same reading successive times.  In the second place, if we
add it where I suggest, it would be easy to also include the child
process PID in the added entropy -- we could xor it into the tv_sec
value for instance -- thus further reducing the chance of identical
values getting added in.
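
Concretely, the combined suggestion would look about like this in the
postmaster, right after a successful fork (untested; "pid" is assumed
to be the child's PID returned by fork()):

#ifdef USE_SSL
	if (EnableSSL)
	{
		struct timeval tv;

		gettimeofday(&tv, NULL);
		tv.tv_sec ^= pid;		/* fold the child's PID into the reading */
		RAND_add(&tv, sizeof(tv), 0);
	}
#endif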

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] buffer assertion tripping under repeat pgbench load

2012-12-23 Thread Simon Riggs
On 23 December 2012 19:42, Greg Smith  wrote:

> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index dddb6c0..df43643 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1697,11 +1697,21 @@ AtEOXact_Buffers(bool isCommit)
> if (assert_enabled)
> {
> int i;
> +   int RefCountErrors = 0;
>
>
> for (i = 0; i < NBuffers; i++)
> {
> -   Assert(PrivateRefCount[i] == 0);
> +
> +   if (PrivateRefCount[i] != 0)
> +   {
> +   BufferDesc *bufHdr = &BufferDescriptors[i];
> +   elog(LOG, "refcount of %s is %u should be 0,
> globally: %u",
>
> +relpathbackend(bufHdr->tag.rnode,
> InvalidBackendId, bufHdr->tag.forkNum),
> +PrivateRefCount[i], bufHdr->refcount);
> +   RefCountErrors++;
> +   }
> }
> +   Assert(RefCountErrors == 0);
> }
>  #endif
>

We already have PrintBufferLeakWarning() for this, which might be a bit neater.
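
That is, something like this (untested; note that shared buffer numbers
are 1-based):

	for (i = 0; i < NBuffers; i++)
	{
		if (PrivateRefCount[i] != 0)
		{
			PrintBufferLeakWarning(i + 1);	/* Buffer = array index + 1 */
			RefCountErrors++;
		}
	}
	Assert(RefCountErrors == 0);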

If that last change was the cause, then it's caused within VACUUM. I'm
running a thrash test with autovacuums set much more frequently but
nothing yet.

Are you by any chance running with unlogged tables? There was a change
there recently, something around checkpoint IIRC. Can you set
checkpoints more frequent also?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making view dump/restore safe at the column-alias level

2012-12-23 Thread Robert Haas
On Fri, Dec 21, 2012 at 9:46 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> I'm having a hard time following this.  Can you provide a concrete example?
>
> regression=# create table t1 (x int, y int);
> CREATE TABLE
> regression=# create table t2 (x int, z int);
> CREATE TABLE
> regression=# create view v1 as select * from t1 join t2 using (x);
> CREATE VIEW
> regression=# \d+ v1
>View "public.v1"
>  Column |  Type   | Modifiers | Storage | Description
> --------+---------+-----------+---------+-------------
>  x  | integer |   | plain   |
>  y  | integer |   | plain   |
>  z  | integer |   | plain   |
> View definition:
>  SELECT t1.x, t1.y, t2.z
>FROM t1
>JOIN t2 USING (x);
> regression=# alter table t2 rename column x to q;
> ALTER TABLE
> regression=# \d+ v1
>View "public.v1"
>  Column |  Type   | Modifiers | Storage | Description
> --------+---------+-----------+---------+-------------
>  x  | integer |   | plain   |
>  y  | integer |   | plain   |
>  z  | integer |   | plain   |
> View definition:
>  SELECT t1.x, t1.y, t2.z
>FROM t1
>JOIN t2 USING (x);
>
> At this point the dumped view definition is wrong: if you try to execute
> it you get
>
> regression=# SELECT t1.x, t1.y, t2.z
> regression-#FROM t1
> regression-#JOIN t2 USING (x);
> ERROR:  column "x" specified in USING clause does not exist in right table
>
> I'm suggesting that we could fix this by emitting something that forces
> the right alias to be assigned to t2.q:
>
> SELECT t1.x, t1.y, t2.z
>FROM t1
>JOIN t2 AS t2(x,z)
>USING (x);

Sneaky.  I didn't know that would even work, but it seems like a
sensible approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] buffer assertion tripping under repeat pgbench load

2012-12-23 Thread Greg Smith

On 12/23/12 3:17 PM, Simon Riggs wrote:

If that last change was the cause, then it's caused within VACUUM. I'm
running a thrash test with autovacuums set much more frequently but
nothing yet.


I am not very suspicious of that VACUUM change; just pointed it out for 
completeness sake.



Are you by any chance running with unlogged tables? There was a change
there recently, something around checkpoint IIRC. Can you set
checkpoints more frequent also?


This is straight pgbench, modifying only scale, client count, workers, 
and duration.  No unlogged tables involved.


The crash is so intermittent that I'm trying not to change anything 
until I get more productive output out of one.  The last run I kicked 
off has been chugging for 14 hours without a hiccup yet.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] buffer assertion tripping under repeat pgbench load

2012-12-23 Thread Simon Riggs
On 23 December 2012 21:52, Greg Smith  wrote:
> On 12/23/12 3:17 PM, Simon Riggs wrote:
>>
>> If that last change was the cause, then it's caused within VACUUM. I'm
>> running a thrash test with autovacuums set much more frequently but
>> nothing yet.
>
>
> I am not very suspicious of that VACUUM change; just pointed it out for
> completeness sake.

Nothing seen here either after 100 minutes of thrash with ~1 VACUUM
per sec, at SF1

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers: adding information

2012-12-23 Thread Robert Haas
On Fri, Dec 21, 2012 at 11:35 AM, Dimitri Fontaine
 wrote:
> It's hard to devise exactly what kind of refactoring we need to do
> before trying to write a patch that benefits from it, in my experience.
> In some cases the refactoring will make things more complex, not less.

Perhaps.  I have a feeling there's some more simplification that can
be done here, somehow.

> That's a good idea. I only started quite late in that patch to work on
> the ObjectID piece of information, that's why I didn't split it before
> doing so.

That's fine.  I've committed that part.  I think the remainder of the
patch is really several different features that ought to be split
apart and considered independently.  We may want some but not others,
or some may be ready to go in but not others.

> Also, keep in mind we want the ObjectID in all CREATE, ALTER and DROP
> statements, so my current patch is still some bricks shy of a load… (I
> added ObjectID only in the commands I added rewrite support for, apart
> from DROP).

I shall rely on you to provide those bricks which are still missing.

>> change to gram.y that affects any of the supported commands will
>> require a compensating change to ddl_rewrite.c.  That is a whole lot
>> of work and it is almost guaranteed that future patch authors will
>> sometimes fail to do it correctly.   At an absolute bare minimum I
>> think we would need some kind of comprehensive test suite for this
>> functionality, as opposed to the less than 60 lines of new test cases
>> this patch adds which completely fail to test this code in any way at
>> all, but really I think we should just not have it.  It WILL get
>
> I had a more complete test suite last round and you opposed it
> for good reasons. I'm ok to revisit that now, and I think the test case
> should look like this:
>
>   - install an auditing command trigger that will insert any DDL command
> string into a table with a sequence ordering
>
>   - select * from audit
>
>   - \d… of all objects created
>
>   - drop schema cascade
>
>   - DO $$ loop for sql in select command from audit do execute sql …
>
>   - \d… of all objects created again
>
> Then any whacking of the grammar should alter the output here and any
> case that's not supported will fail too.
>
> We might be able to have a better way of testing the feature here, but I
> tried to stay within the realm of what I know pg_regress is able to do.

I was thinking that we might need to go beyond what pg_regress can do
in this case.  For example, one idea would be to install an audit
trigger that records all the DDL that happens during the regression
tests.  Then, afterwards, replay that DDL into a new database.  Then
do a schema-only dump of the old and new databases and diff the dump
files.  That's a little wacky by current community standards but FWIW
EDB has a bunch of internal tests that we run to check our proprietary
stuff; some of them work along these lines and it's pretty effective
at shaking out bugs.

In this particular case, it would significantly reduce the maintenance
burden of putting a proper test framework in place, because we can
hope that the regression tests create (and leave around) at least one
object of any given type, and that any new DDL features will be
accompanied by similar regression tests.  We already rely on the
regression database for pg_upgrade testing, so if it's not complete,
it impacts pg_upgrade test coverage as well.

>> broken (if it isn't already) and it WILL slow down future development
>> of SQL features.  It also WILL have edge cases where it doesn't work
>> properly, such as where the decision of the actual index name to use
>> is only decided at execution time and cannot be inferred from the
>> parse tree.  I know that you feel that all of these are tolerable
>
> The way to solve that, I think, as Tom said, is to only rewrite the
> command string when the objects exist in the catalogs. That's why we now
> have ddl_command_trace which is an alias to either start or end
> depending on whether we're doing a DROP or a CREATE or ALTER operation.

Hmm.  I have to study that more, maybe.  I certainly agree that if you
can look at the catalogs, you should be able to reliably reconstruct
what happened - which isn't possible just looking at the parse tree.
However, it feels weird to me to do something that's partly based on
the parse tree and partly based on the catalog contents.  Moreover,
the current pre-create hook isn't the right place to gather
information from the catalogs anyway as it happens before locks have
been taken.

Now, there's another thing that is bothering me here, too.  How
complete is the support you've implemented in this patch?  Does it
cover ALL CREATE/ALTER/DROP commands, including all options to and all
variants of those commands?  Because while I'm not very sanguine about
doing this at all, it somehow feels like doing it only halfway would
be even worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[HACKERS] initdb and share/postgresql.conf.sample

2012-12-23 Thread Jeff Janes
In some of my git branches I have
editorialized src/backend/utils/misc/postgresql.conf.sample to contain my
configuration preferences for whatever it is that that branch is for
testing.  Then this gets copied to share/postgresql.conf.sample during
install and from there to data/postgresql.conf during initdb, and I don't
need to remember to go make the necessary changes.

Am I insane to be doing this?  Is there a better way to handle this
branch-specific configuration needs?

Anyway, I was recently astonished to discover that the contents
of share/postgresql.conf.sample during the initdb affected the performance
of the server, even when the conf file was replaced with something else
before the server was started up.  To make a very long story short,
if share/postgresql.conf.sample is set up for archiving, then somewhere in
the initdb process some bootstrap process pre-creates a bunch of extra xlog
files.

Is this alarming?  It looks like initdb takes some pains, at least in one
place, to make an empty config file rather than using
postgresql.conf.sample, but it seems like a sub-process is not doing that.

Those extra log files then give the newly started server a boost (whether
it is started in archive mode or not) because it doesn't have to create
them itself.  It isn't so much a boost, as the absence of a new-server
penalty.  I want to remove that penalty to get better numbers from
benchmarking.  What I am doing now is this, between the initdb and the
pg_ctl start:

for g in `perl -e 'printf("0000000100000000000000%02X\n",$_) foreach 2..120'`; do
  cp -i /tmp/data/pg_xlog/000000010000000000000001 /tmp/data/pg_xlog/$g < /dev/null;
done

The "120" comes from 2 * checkpoint_segments.  That's mighty ugly, is there
a better trick?

You could say that benchmarks should run long enough to average out such
changes, but needing to run a benchmark that long can make some kinds of
work (like git bisect) unrealistic rather than merely tedious.

Cheers,

Jeff


Re: [HACKERS] WIP: store additional info in GIN index

2012-12-23 Thread Tomas Vondra
Hi!

On 22.12.2012 17:15, Alexander Korotkov wrote:
>> I'm not saying this is a perfect benchmark, but the differences (of
>> querying) are pretty huge. Not sure where this difference comes from,
>> but it seems to be quite consistent (I usually get +-10% results, which
>> is negligible considering the huge difference).
>>
>> Is this an expected behaviour that will be fixed by another patch?
>
> More patches that significantly accelerate index search will be
> provided. This patch changes only the GIN posting list/tree storage.
> However, it wasn't expected that this patch would significantly change
> index scan speed in either direction.

That was exactly my expectation - probably not an improvement, but
definitely not worse performance.

> 
>> The database contains ~680k messages from the mailing list archives,
>> i.e. about 900 MB of data (in the table), and the GIN index on tsvector
>> is about 900MB too. So the whole dataset nicely fits into memory (8GB
>> RAM), and it seems to be completely CPU bound (no I/O activity at all).
>>
>> The configuration was exactly the same in both cases:
>>
>> shared buffers = 1GB
>> work mem = 64 MB
>> maintenance work mem = 256 MB
>>
>> I can either upload the database somewhere, or provide the benchmarking
>> script if needed.
>
> Unfortunately, I can't reproduce such a huge slowdown on my test cases.
> Could you share both the database and the benchmarking script?

It's strange, but no matter what I do I can't reproduce those results
(with the significant performance decrease). So either I made some
strange mistake when running those tests, or there was something wrong
with my system, or whatever :-(

But when running the benchmarks now (double-checked everything, properly
repeated the tests, ...), I've noticed a different behaviour. But first
some info about the scripts I use for testing.

All the scripts are available here:

  https://bitbucket.org/tvondra/archie

It's my "hobby project" implementing fulltext mbox archive. It should be
usable but it's still a bit WIP so let me know in case of any issues.

The README should give you all the instructions on how to setup and load
the database. I'm using ~1700 mbox files downloaded from
http://archives.postgresql.org/ for these lists (until 2012/11):

   pgadmin-hackers
   pgsql-advocacy
   pgsql-announce
   pgsql-bugs
   pgsql-general
   pgsql-hackers
   pgsql-jdbc
   pgsql-novice
   pgsql-odbc
   pgsql-patches
   pgsql-sql

which in the end gives ~677k rows in the 'messages' table, occupying
~5.5GB disk space (including all the indexes etc).

Once you have the data loaded, you need to warm up the database and then
start benchmarking it - I'm using the warmup.py script for both things
(more on what it does below).

To warm up the DB, just run this

./warmup.py --db archie --duration 300

until the %util drops near 0 (assuming you have enough RAM to fit the
whole database into memory). Then I usually do this for benchmarking:

./warmup.py --db archie --duration 300 --no-hash \
--no-thread --words 1

./warmup.py --db archie --duration 300 --no-hash \
--no-thread --words 2

which runs 5-minute tests and outputs one line per worker (the number
of workers defaults to the number of CPUs).

The script itself is very simple: it fetches a random message and uses
the tsvector column as a source of words for the actual benchmark. It
takes N words from the tsvector, splits them into groups and performs a
simple fulltext query using plainto_tsquery('word1 word2 ...'). At the
end it prints info including the number of queries per second.

I've run the tests on the current master with and without the v3 patch.
I've tested it with 1GB or 2GB shared buffers, and 32MB or 64MB work mem.

The tests were run for 1, 2, 3, 4 and 5 words, and I've repeated each
test five times for each configuration. Each run lasted 5 minutes.

These are the averages (from the 5 runs) of queries per second for each
combination of parameters:

                       1     2     3     4     5   (words)

  master  1GB/32MB    19   179   165   127    99
  patched 1GB/32MB    19   175   163   124    96

  master  1GB/64MB    20   181   165   127    99
  patched 1GB/64MB    19   174   159   120    95

  master  2GB/32MB    27   181   165   127    98
  patched 2GB/32MB    25   176   156   120    93

  master  2GB/64MB    27   180   166   128   102
  patched 2GB/64MB    40   402   364   245   176

There's no significant difference in performance, except for the
2GB/64MB combination. And in that case it's actually the opposite
direction from what I reported before - i.e. this time it's up to 100%
faster than the unpatched master. The results are pretty consistent
(very small variance across the repeated runs), so I'm not sure about
the previous results.

Any idea what might cause this?

Re: [HACKERS] initdb and share/postgresql.conf.sample

2012-12-23 Thread Greg Stark
On Sun, Dec 23, 2012 at 11:11 PM, Jeff Janes  wrote:
> You could say that benchmarks should run long enough to average out such
> changes

I would say any benchmark needs to be run long enough to reach a
steady state before the measurements are taken. The usual practice is
to run a series of groups and observe the aggregate measurements for each
group. For instance, run 10 groups, each consisting of 1000
repetitions of the transaction. Then you can observe at which point
the averages for individual groups begin to behave consistently. If
the first three are outliers but the remaining 7 are stable then
discard the first three and take the average (or often median) of the
remaining 7.

If you include the early runs which are affected by non-steady-state
conditions such as cache effects or file fragmentation then it can
take a very long time for those effects to be erased by averaging with
later results. Worse, it's very difficult to tell whether you've
waited long enough.
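
As a toy illustration of what I mean (the numbers are invented, not
from any real run):

#include <stdio.h>

int
main(void)
{
	/* per-group averages: three warm-up outliers, then steady state */
	double	group_avg[10] = {812.0, 655.3, 611.9, 540.2, 538.7,
							 541.1, 539.9, 540.8, 539.5, 540.4};
	int		first_stable = 3;	/* judged from where the averages settle */
	double	sum = 0.0;
	int		i, n = 0;

	for (i = first_stable; i < 10; i++)
	{
		sum += group_avg[i];
		n++;
	}
	printf("steady-state average over %d groups: %.1f\n", n, sum / n);
	return 0;
}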


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto seeding problem when ssl=on

2012-12-23 Thread Noah Misch
On Sun, Dec 23, 2012 at 02:49:08PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Dec 22, 2012 at 02:20:56PM -0500, Tom Lane wrote:
> >> #ifdef USE_SSL
> >> if (EnableSSL)
> >> {
> >>struct timeval tv;
> >> 
> >>gettimeofday(&tv, NULL);
> >>RAND_add(&tv, sizeof(tv), 0);
> >> }
> >> #endif
> 
> > Take the caution one step further and make it independent of EnableSSL.  In a
> > stock installation, a !EnableSSL postmaster will never seed its PRNG, and
> > there's no vulnerability.  Add a shared_preload_libraries module that uses the
> > OpenSSL PRNG in its _PG_init(), and suddenly you're vulnerable again.
> 
> Meh.  In a postmaster that wasn't built with SSL support at all, such
> a module is still dangerous (and I'm not convinced anybody would build
> such a module anyway).  I think we should confine our ambitions to
> preventing security issues caused by our own code.

You're adding lines of code to prematurely micro-optimize the backend fork
cycle.  If code introduced into the postmaster, by us or others, ever violates
the assumption behind that micro-optimization, certain users get a precipitous
loss of security with no clear alarm bells.  I don't like that trade.

nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-23 Thread Tomas Vondra
On 20.12.2012 12:33, Andres Freund wrote:
> On 2012-12-20 02:54:48 +0100, Tomas Vondra wrote:
>> On 19.12.2012 02:18, Andres Freund wrote:
>>> On 2012-12-17 00:31:00 +0100, Tomas Vondra wrote:
>>>
>>> I think except for the temp buffer issue mentioned below it's ready.
>>>
 -DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
 +DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes)
  {
 -  int i;
 +  int i, j;
 +
 +  /* sort the list of rnodes */
 +  pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);

/* If it's a local relation, it's localbuf.c's problem. */
 -  if (RelFileNodeBackendIsTemp(rnode))
 +  for (i = 0; i < nnodes; i++)
{
 -  if (rnode.backend == MyBackendId)
 -  DropRelFileNodeAllLocalBuffers(rnode.node);
 -  return;
 +  if (RelFileNodeBackendIsTemp(rnodes[i]))
 +  {
 +  if (rnodes[i].backend == MyBackendId)
 +  DropRelFileNodeAllLocalBuffers(rnodes[i].node);
 +  }
}
>>>
>>> While you deal with local buffers here you don't anymore in the big loop
>>> over shared buffers. That wasn't needed earlier since we just returned
>>> after noticing we have local relation, but thats not the case anymore.
>>
>> Hmm, but that would require us to handle the temp relations explicitly
>> wherever we call DropRelFileNodeAllBuffers. Currently there are two such
>> places - smgrdounlink() and smgrdounlinkall().
>>
>> By placing it into DropRelFileNodeAllBuffers() this code is shared and I
>> think it's a good thing.
>>
>> But that does not mean the code is perfect - it was based on the
>> assumption that if there's a mix of temp and regular relations, the temp
>> relations will be handled in the first part and the rest in the second one.
>>
>> Maybe it'd be better to improve it so that the temp relations are
>> removed from the array after the first part (and skip the second one if
>> there are no remaining relations).
> 
> The problem is that afaik without the backend-local part a temporary
> relation could match the same relfilenode as a "full" relation which
> would have pretty bad consequences.

Attached is a patch with fixed handling of temporary relations. I've
chosen to keep the logic in DropRelFileNodeAllBuffers and instead make
a local copy of the array without the local relations.
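
The filtering step looks roughly like this (a sketch of the idea, not
the exact patch code):

	/* copy out only the shared relations; temp ones go to localbuf.c */
	RelFileNodeBackend *nodes = palloc(nnodes * sizeof(RelFileNodeBackend));
	int			n = 0;

	for (i = 0; i < nnodes; i++)
	{
		if (RelFileNodeBackendIsTemp(rnodes[i]))
		{
			if (rnodes[i].backend == MyBackendId)
				DropRelFileNodeAllLocalBuffers(rnodes[i].node);
		}
		else
			nodes[n++] = rnodes[i];
	}

	/* nothing left for shared buffers? then we're done */
	if (n == 0)
	{
		pfree(nodes);
		return;
	}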

regards
Tomas
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 8dc622a..8c12a43 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -312,6 +312,11 @@ smgrDoPendingDeletes(bool isCommit)
PendingRelDelete *pending;
PendingRelDelete *prev;
PendingRelDelete *next;
+
+   SMgrRelation   *srels = palloc(sizeof(SMgrRelation));
+   int nrels = 0,
+   i = 0,
+   maxrels = 1;
 
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -335,14 +340,31 @@ smgrDoPendingDeletes(bool isCommit)
SMgrRelation srel;
 
			srel = smgropen(pending->relnode, pending->backend);
-   smgrdounlink(srel, false);
-   smgrclose(srel);
+   
+   /* extend the array if needed (double the size) */
+   if (maxrels <= nrels) {
+   maxrels *= 2;
+   srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+   }
+   
+   srels[nrels++] = srel;
}
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
}
}
+
+   if (nrels > 0)
+   {
+   smgrdounlinkall(srels, nrels, false);
+   
+   for (i = 0; i < nrels; i++)
+   smgrclose(srels[i]);
+   }
+   
+   pfree(srels);
+
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..8a7190b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -62,6 +62,7 @@
 #define BUF_WRITTEN0x01
 #define BUF_REUSABLE   0x02
 
+#define BSEARCH_LIMIT  10
 
 /* GUC variables */
 bool   zero_damaged_pages = false;
@@ -108,6 +109,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 
+static int rnode_comparator(const void * p1, const 

[HACKERS] compliation error in 9.3 devel

2012-12-23 Thread Hari Babu
 

The following commit is causing some compilation errors. 

c504513f83a9ee8dce4a719746ca73102cae9f13 



Error: 
../../../src/include/catalog/pg_aggregate.h:246: error: previous declaration of AggregateCreate was here
make: *** [pg_aggregate.o] Error 1 




Author: Robert Haas   2012-12-24 04:55:03 
Committer: Robert Haas   2012-12-24 05:07:58 
Parent: 31bc839724439440b2e94ea616b28ce5be94e19c (Prevent failure when RowExpr or XmlExpr is parse-analyzed twice.)
Child:  (Local uncommitted changes, not checked in to index)
Branches: Truncate-trailing-null-columns, master 
Follows: REL9_2_BETA2 
Precedes: 

Adjust many backend functions to return OID rather than void. 

Extracted from a larger patch by Dimitri Fontaine.  It is hoped that 
this will provide infrastructure for enriching the new event trigger 
functionality, but it seems possibly useful for other purposes as 
well. 


Regards, 
Hari babu.