Re: [HACKERS] MinGW / Windows / printf format specifiers

2016-02-18 Thread Craig Ringer
On 19 February 2016 at 12:15, Chapman Flack  wrote:


>
> The cause seems to be that Windows conventions have int = long = int32
> (even on 64-bit platforms) and only 'long long' = int64.


Yes, it's an LLP64 platform.

https://en.wikipedia.org/wiki/64-bit_computing


> The Java JNI
> headers of course know this, so they type jlong as 'long long', while
> jint they type as 'long' - curiously, because they could just call it
> int and get the same width. Maybe a habit from a 16-bit C environment?
>

They should be using the (u)int(nn)_t typedefs like int64_t, but some
compilers lag in their support for them.


> Have issues like this been dealt with in PostgreSQL code before, and did
> a favorite approach emerge?
> 


INT64_FORMAT and UINT64_FORMAT

src/include/c.h

git grep INT64_FORMAT
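For illustration, a minimal sketch of the usual backend idiom (the function
and message here are made up; INT64_FORMAT is real and includes the '%'):

#include "postgres.h"   /* pulls in c.h, which defines INT64_FORMAT */

static void
log_row_count(int64 nrows)
{
    /* INT64_FORMAT expands to "%ld", "%lld" or "%I64d" as the platform needs */
    elog(DEBUG1, "processed " INT64_FORMAT " rows", nrows);
}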


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-02-18 Thread Michael Paquier
On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas  wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> I am going to look into that very soon. For now and to not forget
> about this bug, I have added an entry in the CF app:
> https://commitfest.postgresql.org/9/528/

Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its
CREATE TABLE. Take this example with wal_level = minimal:
1) Run transaction
=# begin;
BEGIN
=# create table ab (a int primary key);
CREATE TABLE
=# truncate ab;
TRUNCATE TABLE
=# commit;
COMMIT
2) Restart server with immediate mode.
3) Failure:
=# table ab;
ERROR:  XX001: could not read block 0 in file "base/16384/16388": read
only 0 of 8192 bytes
LOCATION:  mdread, md.c:728

The case where a COPY is issued after TRUNCATE is fixed though, so
that's still an improvement.

Here are other comments.

+   /* Flush updates to relations that we didn't WAL-logged */
+   smgrDoPendingSyncs(true);
"Flush updates to relations there were not WAL-logged"?

+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+   FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
islocal is always set to false; I'd rather remove this argument
from FlushRelationBuffersWithoutRelCache.

for (i = 0; i < nrels; i++)
+   {
smgrclose(srels[i]);
+   }
Looks like noise.

+   if (!found)
+   {
+   pending->truncated_to = InvalidBlockNumber;
+   pending->sync_above = nblocks;
+
+   elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at block %u",
+        rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+   }
+   else if (pending->sync_above == InvalidBlockNumber)
+   {
+   elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+        rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+   pending->sync_above = nblocks;
+   }
+   else
Couldn't it be possible here that, when (sync_above !=
InvalidBlockNumber), nblocks is higher than sync_above? In that
case we had better increase sync_above to nblocks, no?

+   if (!pendingSyncs)
+   createPendingSyncsHash();
+   pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                            (void *) &rel->rd_node,
+                                            HASH_ENTER, &found);
This is lacking comments.

-   if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+   BufferGetTag(buffer, &rnode, &forknum, &blknum);
+   if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+   !smgrIsSyncPending(rnode, blknum))
Here as well, explaining in more detail why the buffer does not need
to go through XLogSaveBufferForHint would be nice.
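For instance, something along these lines (wording mine, not from the patch):

/*
 * If a sync is pending for this relation, the file will be fsync'ed at
 * commit rather than WAL-logged, and after a crash before commit the
 * relation is discarded anyway, so there is no need to emit a full-page
 * image for the hint bit here.
 */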
-- 
Michael




Re: [HACKERS] Typo in bufmgr.c that results in wasted memory

2016-02-18 Thread Amit Kapila
On Fri, Feb 19, 2016 at 8:28 AM, Takashi Horikawa 
wrote:
>
> Hi all,
>
> I have just found a typo in the source code (not in a comment) of bufmgr.c
> that results in wasted memory. It might be a 'bug', but it does not cause
> any incorrect operation; it merely wastes a small amount of memory.
>
> As sizeof(PrivateRefCountArray) found in InitBufferPoolAccess() is 64 and
> sizeof(PrivateRefCountEntry), which should be used here, is 8, this typo
> produces 56 bytes of unused memory per PrivateRefCount entry in the
> hash table. I think this not only wastes memory but also reduces the
> cache hit ratio.
>
> 
> void
> InitBufferPoolAccess(void)
> {
> HASHCTL hash_ctl;
>
> memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
>
> MemSet(&hash_ctl, 0, sizeof(hash_ctl));
> hash_ctl.keysize = sizeof(int32);
> X  hash_ctl.entrysize = sizeof(PrivateRefCountArray);   /* current (wrong) */
> O  hash_ctl.entrysize = sizeof(PrivateRefCountEntry);   /* corrected */
>
> PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
>   HASH_ELEM | HASH_BLOBS);
> }
> 
>

Your proposed change seems right to me.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Joshua D. Drake

On 02/18/2016 08:22 PM, Tom Lane wrote:


Now, I have heard it argued that the OpenSSH/L authors are a bunch of
idiots who know nothing about security.  But it's not like insisting
on restrictive permissions on key files is something we invented out
of the blue.  It's pretty standard practice, AFAICT.

regards, tom lane


I think Tom has the right compromise. It must be 0600 when owned by us, 
and 0640 or less when owned by root. That opens up the ability for other 
systems to have what they need (although I am unsure of how Windows 
handles this) and allows us to keep a modicum of self-respect in terms of 
what we allow.
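A rough sketch, not the actual be-secure.c code, of the check that
compromise implies:

#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

static bool
key_file_permissions_ok(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;
    if (st.st_uid == geteuid())     /* owned by the database user */
        return (st.st_mode & (S_IRWXG | S_IRWXO)) == 0;     /* 0600 or less */
    if (st.st_uid == 0)             /* owned by root */
        return (st.st_mode & (S_IWGRP | S_IXGRP | S_IRWXO)) == 0;  /* 0640 */
    return false;
}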


Sincerely,

JD


--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> I completely disagree that those file-permissions checks are useless.

> If that's something you're concerned with then the right answer is to
> monitor the file permissions- and there are tools which do exactly that.
> I disagree that it's PG's charter to do that and, frankly, you *won't*
> be told, in most cases, promptly about such a change.

I will just quote this bit from "man ssh":

 ~/.ssh/identity
 ~/.ssh/id_dsa
 ~/.ssh/id_ecdsa
 ~/.ssh/id_rsa
 Contains the private key for authentication.  These files contain
 sensitive data and should be readable by the user but not
 accessible by others (read/write/execute).  ssh will simply ignore a
 private key file if it is accessible by others.

Now, I have heard it argued that the OpenSSH/L authors are a bunch of
idiots who know nothing about security.  But it's not like insisting
on restrictive permissions on key files is something we invented out
of the blue.  It's pretty standard practice, AFAICT.

regards, tom lane




[HACKERS] MinGW / Windows / printf format specifiers

2016-02-18 Thread Chapman Flack
So I announce a PL/Java 1.5.0-BETA1 with a comment that windows-MSVC
was well tested but that I hadn't found anyone to test windows-mingw,
and behold, what should happen but someone turns up with an interest
in testing the mingw build, so we might have that back in shape as
well for -BETA2.

Haven't run into any showstoppers, but gcc is issuing printf-format
warnings in three places where JNI values of type jint or jlong are
printed.

The cause seems to be that Windows conventions have int = long = int32
(even on 64-bit platforms) and only 'long long' = int64. The Java JNI
headers of course know this, so they type jlong as 'long long', while
jint they type as 'long' - curiously, because they could just call it
int and get the same width. Maybe a habit from a 16-bit C environment?

gcc warns about %ld used with a jlong argument (and it is right,
because for Windows that really should be %lld or %I64d).

It also complains about %d used with a jint argument (a bit spuriously,
because it wants to see %d paired with 'int' while 'jint' is typed as
'long', producing a warning _even though those have the same width_).

I'm familiar with a common way of handling this using a macro like
PRId64 that expands to the correct printf format code on each platform
(printf("My value is %10" PRId64 "\n", v)) ... which then becomes a
potential localization headache because if the format string is the
message key, it's now not the same between platforms.
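Spelled out (standard C99, nothing PostgreSQL-specific):

#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
    int64_t v = 1234567890123;

    /* PRId64 expands to "ld" on LP64 systems, "lld"/"I64d" on LLP64 ones */
    printf("My value is %10" PRId64 "\n", v);
    return 0;
}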

In my particular case, PL/Java has never had any localization effort
yet, and the only affected sites right now are two debug messages and
one error, not places where the urgency to localize burns hottest.

But if others here have already considered these issues and settled
on a good approach, I'd be happy to not spend time inventing another.

I found some of the printf format specifier differences, Windows to
other platforms, described well in this StackOverflow thread:

http://stackoverflow.com/questions/6299083/cross-platform-printing-of-64-bit-integers-with-printf

And the MinGW wiki has a directly relevant page:

https://sourceforge.net/p/mingw-w64/wiki2/gnu%20printf/

They provide their own printf that supports %lld (you can get it by
default by defining __USE_MINGW_ANSI_STDIO) ... and, to avoid
spurious compiler warnings, they also define a macro __MINGW_PRINTF_FORMAT
that can be used in __attribute__((__format__ ...)) so gcc's format
checker applies the right checks.
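A sketch of how those two pieces could fit together (PRINTF_ATTR is a
made-up wrapper, not a PL/Java or PostgreSQL macro):

/* must be defined before any system header is included */
#define __USE_MINGW_ANSI_STDIO 1
#include <stdio.h>

#if defined(__MINGW32__)
#define PRINTF_ATTR(f, a) __attribute__((format(__MINGW_PRINTF_FORMAT, f, a)))
#else
#define PRINTF_ATTR(f, a) __attribute__((format(printf, f, a)))
#endif

extern void debug_log(const char *fmt, ...) PRINTF_ATTR(1, 2);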

Have issues like this been dealt with in PostgreSQL code before, and did
a favorite approach emerge?

-Chap




Re: [HACKERS] about google summer of code 2016

2016-02-18 Thread Atri Sharma
On 19 Feb 2016 8:30 am, "Chapman Flack"  wrote:
>
> On 02/18/16 19:35, Amit Langote wrote:
>
> > Apparently, the deadline is: February 20, 2016 at 04:00 (+0900 UTC)
> >
> > https://summerofcode.withgoogle.com/
>
> For anybody finding that web site as anti-navigable as I did, here
> are more direct links to the actual rules, and terms of agreement
> for the various participants:
>
> https://summerofcode.withgoogle.com/rules/
> https://summerofcode.withgoogle.com/terms/org
> https://summerofcode.withgoogle.com/terms/mentor
> https://summerofcode.withgoogle.com/terms/student
>
> Here is a question: does it ever happen that PostgreSQL acts as
> the org for a project that is PostgreSQL-related but isn't
> directly PGDG-led?
>
> ... there are definitely interesting and promising areas for further
> development in PL/Java beyond what I would ever have time to tackle
> solo, and I could easily enjoy mentoring someone through one or
> another of them over a summer, which could also help reinvigorate
> the project and get another developer familiar with it at a
> non-superficial level.  While I could easily see myself mentoring,
> I think it would feel like overkill to apply individually as a
> one-trick 'organization'.
>
> I see that there was a "based on PL/Java" GSoC'12 project, so maybe
> there is some room for non-core ideas under the PostgreSQL ægis?

FWIW it wasn't a PL/Java-based project per se, it was a JDBC FDW.

I agree, there might be scope for non-core projects, and PL/Java sounds like
a good area.

Regards,

Atri


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-18 Thread Michael Paquier
On Fri, Feb 19, 2016 at 11:53 AM, Peter Eisentraut  wrote:
> On 2/18/16 12:20 PM, Joe Conway wrote:
>> Just to be clear, AFAIK there is no issue with the committed changes on
>> Windows. If there is please provide a concrete example that we can discuss.
>
> Originally it was mentioned that this feature could be used in the
> regression tests by making certain tests conditional on configure
> options.  Which presumably won't work if the build was on Windows.

MSVC code passes VAL_CONFIGURE to pg_config.h by calling
GetFakeConfigure() and makes the output of pg_config consistent with
when ./configure is used. So for CONFIGURE I see no issues. Things
like CPPFLAGS or LIBS, though, become listed as "not recorded" with this
change, so the output of pg_config is more verbose when MSVC is used.
This still seems an acceptable trade-off even after reviewing this
patch to make this information available in the backend. And it seems
as well that this would become set, at least partially, when using a
cmake build instead of the MSVC cruft in src/tools/msvc.
-- 
Michael




Re: [HACKERS] about google summer of code 2016

2016-02-18 Thread Chapman Flack
On 02/18/16 19:35, Amit Langote wrote:

> Apparently, the deadline is: February 20, 2016 at 04:00 (+0900 UTC)
> 
> https://summerofcode.withgoogle.com/

For anybody finding that web site as anti-navigable as I did, here
are more direct links to the actual rules, and terms of agreement
for the various participants:

https://summerofcode.withgoogle.com/rules/
https://summerofcode.withgoogle.com/terms/org
https://summerofcode.withgoogle.com/terms/mentor
https://summerofcode.withgoogle.com/terms/student

Here is a question: does it ever happen that PostgreSQL acts as
the org for a project that is PostgreSQL-related but isn't
directly PGDG-led?

... there are definitely interesting and promising areas for further
development in PL/Java beyond what I would ever have time to tackle
solo, and I could easily enjoy mentoring someone through one or
another of them over a summer, which could also help reinvigorate
the project and get another developer familiar with it at a
non-superficial level.  While I could easily see myself mentoring,
I think it would feel like overkill to apply individually as a
one-trick 'organization'.

I see that there was a "based on PL/Java" GSoC'12 project, so maybe
there is some room for non-core ideas under the PostgreSQL ægis?

In any case, I am quite confident that I could *not* complete a
separate org application by tomorrow 2 pm EST. In reading the rules,
it looks possible that the Ideas List does not have to accompany
the org application, but would be needed shortly after acceptance?

If acceptance announcements are 29 February, I could have some
ideas drafted by then.

Is this a thinkable thought?

-Chap




[HACKERS] Typo in bufmgr.c that results in wasted memory

2016-02-18 Thread Takashi Horikawa
Hi all,

I have just found a typo in the source code (not in a comment) of bufmgr.c
that results in wasted memory. It might be a 'bug', but it does not cause
any incorrect operation; it merely wastes a small amount of memory.

As sizeof(PrivateRefCountArray) found in InitBufferPoolAccess() is 64 and
sizeof(PrivateRefCountEntry), which should be used here, is 8, this typo
produces 56 bytes of unused memory per PrivateRefCount entry in the
hash table. I think this not only wastes memory but also reduces the
cache hit ratio.


void
InitBufferPoolAccess(void)
{
HASHCTL hash_ctl;

memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));

MemSet(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(int32);
X  hash_ctl.entrysize = sizeof(PrivateRefCountArray);   /* current (wrong) */
O  hash_ctl.entrysize = sizeof(PrivateRefCountEntry);   /* corrected */

PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
  HASH_ELEM | HASH_BLOBS);
}
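For context, this is roughly how the two types are declared in bufmgr.c
(sizes assume a typical 64-bit build), which is where the 64-vs-8 figures
come from:

typedef struct PrivateRefCountEntry
{
    Buffer      buffer;                 /* 4 bytes */
    int32       refcount;               /* 4 bytes */
} PrivateRefCountEntry;                 /* sizeof() == 8 */

#define REFCOUNT_ARRAY_ENTRIES 8

/* the fast-path array: 8 entries * 8 bytes == 64 bytes */
static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];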


Attached patch is for the latest development version at the git repository.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories







Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-18 Thread Peter Eisentraut
On 2/18/16 12:20 PM, Joe Conway wrote:
> Just to be clear, AFAIK there is no issue with the committed changes on
> Windows. If there is please provide a concrete example that we can discuss.

Originally it was mentioned that this feature could be used in the
regression tests by making certain tests conditional on configure
options.  Which presumably won't work if the build was on Windows.

I don't doubt that your code actually works on Windows.





Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-18 Thread Peter Eisentraut
On 2/18/16 3:30 AM, Andres Freund wrote:
> I don't understand why you're so opposed to this. Several people said
> that they're interested in this information in the current discussion
> and it has been requested repeatedly over the years.

I think no one except Andrew Dunstan has requested this, and his use
case is disputed.  Everyone else is either confusing this with the
pg_controldata part or is just transitively claiming that someone else
wanted it.

I don't have a problem with the implementation, but I don't understand
what this feature is meant for.





Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Peter Eisentraut
On 2/18/16 10:17 AM, Tom Lane wrote:
> Christoph Berg  writes:
>> Currently the server insists on ssl_key_file's permissions to be 0600
>> or less, and be owned by the database user. Debian has been patching
>> be-secure.c since forever (the git history goes back to 8.2beta1) to
>> relax that to 0640 or less, and owned by root or the database user.
> 
> Debian can do that if they like, but it's entirely unacceptable as an
> across-the-board patch.  Not all systems treat groups as being narrow
> domains in which it's okay to assume that group-readable files are
> secure enough to be keys.  As an example, on OS X user files tend to be
> group "staff" or "admin" which'd be close enough to world readable.
> 
> We could allow group-readable if we had some way to know whether to
> trust the specific group, but I don't think there's any practical
> way to do that.  System conventions vary too much.

Wouldn't POSIX ACLs bypass this anyway?




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Further, the notion that *this* is the footgun is completely off the
> > reservation- if the files have been changed to allow untrusted users to
> > have access to them, there isn't diddly we can do about it.
> 
> I completely disagree that those file-permissions checks are useless.
> What they accomplish is, if you accidentally set up an insecure key file,
> you'll get told about it fairly promptly, and have the opportunity to
> either fix the permissions or generate a new key, depending on your
> opinion of how likely it is that somebody stole the key already.  If we
> made no checks then, more than likely, an insecure key file would just
> sit there indefinitely, waiting for some passing blackhat to grab it.

If that's something you're concerned with then the right answer is to
monitor the file permissions- and there are tools which do exactly that.

I disagree that it's PG's charter to do that and, frankly, you *won't*
be told, in most cases, promptly about such a change.  We don't check
the permissions on every file or even every directory on every query or
even every connection.  With uptimes of days, weeks and months quite
often seen, this idea that PG is protecting you with these checks in
any way is instead creating a false sense of security.

> We can certainly discuss whether we need more than one model of what
> appropriate permissions are, but I do not believe that "rip out the
> check" is a helpful answer.

We're over-claiming what these protections are actually providing.  At
best, they help a new user out when they're first setting up their
database, if they're doing it from source and fully understand how to
correctly set up pg_hba.conf and friends to secure *those*.  Users
installing packages from distributions will have the benefit that the
distribution will get the permissions correct initially and not have to
worry about them.  Permission changes to these files after the system is
set up and running indicate that it's already too late because PG won't
tell the user there's an issue until a DB restart a month or two later,
and by then, if there's an untrusted user who was able to gain access
thanks to such changes, you can be sure they will have.

Thanks!

Stephen




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Tom Lane
Stephen Frost  writes:
> Further, the notion that *this* is the footgun is completely off the
> reservation- if the files have been changed to allow untrusted users to
> have access to them, there isn't diddly we can do about it.

I completely disagree that those file-permissions checks are useless.
What they accomplish is, if you accidentally set up an insecure key file,
you'll get told about it fairly promptly, and have the opportunity to
either fix the permissions or generate a new key, depending on your
opinion of how likely it is that somebody stole the key already.  If we
made no checks then, more than likely, an insecure key file would just
sit there indefinitely, waiting for some passing blackhat to grab it.

We can certainly discuss whether we need more than one model of what
appropriate permissions are, but I do not believe that "rip out the
check" is a helpful answer.

regards, tom lane




Re: [HACKERS] WIP: Access method extendability

2016-02-18 Thread Michael Paquier
On Fri, Feb 19, 2016 at 12:51 AM, Alexander Korotkov wrote:
>> 11 I'd really like to see regression tests (TAP-tests?) for replication
>> with generic xlog.
>
>
> A TAP test for replication has been added to the bloom contrib. This test
> runs under an additional make target, wal-check.

Just putting my eyes on that...

diff --git a/contrib/bloom/WALTest.pm b/contrib/bloom/WALTest.pm
new file mode 100644
index ...b2daf8b
*** a/contrib/bloom/WALTest.pm
--- b/contrib/bloom/WALTest.pm

This is basically a copy of RewindTest.pm, and far from generic. If this
patch gets committed first, I would suggest waiting for the more generic
routines that will be added to PostgresNode.pm and then coming back to it.
-- 
Michael




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Andres Freund  writes:
> > ... We don't prevent the user from making the
> > configuration file world-writable either,
> 
> Maybe we should.  It wasn't an issue originally, because the config files
> were necessarily inside $PGDATA which we restrict permissions on.  But
> these days you can place the config files in places where untrustworthy
> people could get at them.

No, we should be improving our support of systems which provide more
specific groups, not destroying it.  Being able to run backups as a user
who is not able to modify the database would be great too, and that case
isn't covered by your approach to "allow group rights if the file is
owned by root."

Further, the notion that *this* is the footgun is completely off the
reservation- if the files have been changed to allow untrusted users to
have access to them, there isn't diddly we can do about it.  All we're
doing with this is imposing our own idea of what the system policy
should be, even though there are clear examples where that's just
blatently wrong.

If we really want to force these checks to happen (and I'm not
convinced that they're actually useful at all), then we need to provide
a way for users and distributions to control the specifics of the checks
as they chose.  Maybe that's a command-line switch instead of a GUC, or
it's something else, but there clearly isn't "one true way" here and we
should be flexible.

Thanks!

Stephen




Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-18 Thread Michael Paquier
On Sat, Feb 13, 2016 at 2:26 PM, Michael Paquier
 wrote:
> On Sat, Feb 13, 2016 at 1:01 PM, Robert Haas  wrote:
>> On Fri, Feb 12, 2016 at 12:56 PM, Andres Freund  wrote:
>>> On 2016-02-12 12:37:35 -0500, Robert Haas wrote:
 On Mon, Feb 8, 2016 at 4:18 AM, Andres Freund  wrote:
 > I'm not really a fan. I'd rather change existing callers to add a
 > 'flags' bitmask argument. Right now there can't really be XLogInserts()
 > in extension code, so that's pretty ok to change.

 Yeah, but to what benefit?  You're just turning a smaller patch into a
 bigger one and requiring churning a bunch of code that wouldn't
 otherwise need to be touched.  I think Michael has a good point.
>>>
>>> It has the advantage of not ending up with an extra interface, that
>>> we're otherwise never going to get rid of? If not now, when would we
>>> remove it? Sure it touches a few more lines, but that's entirely trivial
>>> mechanical changes, so what?
>
> Note: the patch has grown from 15kB to 46kB by switching from the
> extended interface to the addition of an argument in XLogInsert().
>
>> I don't feel that there's only one right way to do this, but I think
>> Michael's position is both reasonable and similar to what we have done
>> in previous cases of this sort.
>
> To be honest, my heart still leans toward the Extended() interface.
> This reduces the risk of conflicts when back-patching to 9.5.

Andres, others, what else can I do to make this thread move on? I can
produce any version of this patch depending on the committer's
requirements.
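For reference, the two shapes under discussion, as declarations only (the
Extended() name and exact signatures are illustrative; nothing here is
committed):

/* today's interface */
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);

/* option A: a separate variant, existing callers stay untouched */
extern XLogRecPtr XLogInsertExtended(RmgrId rmid, uint8 info, uint8 flags);

/* option B: fold a flags bitmask into XLogInsert() itself, i.e.
 *     extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, uint8 flags);
 * touching every caller but avoiding a second interface. */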
-- 
Michael




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Feb 18, 2016 at 4:15 PM, Tom Lane  wrote:
>> Because they've been removed from the right-link/left-link chains.

> That isn't the same thing as being inaccessible by scans, clearly
> (just what you call the "leaf scan sequence").

Only a physical-order scan, ie vacuum, would visit a dead page
(ignoring transient corner cases like a page getting deleted while an
indexscan is in flight to it).  So I think treating it as part of the
fragmentation measure is completely wrong: the point of that measure,
AFAICS, is to model how close an index-order traversal is to linear.
Half-dead pages are also normally very transient --- the only way they
persist is if there's a crash partway through a page deletion.  So I think
it's appropriate to assume that future indexscans won't visit those,
either.

> there are usage patterns where half-dead pages might accumulate.

Other than a usage pattern of "randomly SIGKILL backends every few
seconds", I don't see how that would happen.

regards, tom lane




Re: [HACKERS] about google summer of code 2016

2016-02-18 Thread Amit Langote
On 2016/02/18 22:44, Alexander Korotkov wrote:
> On Wed, Feb 17, 2016 at 10:40 AM, Amit Langote wrote:
>> I didn't find for 2016 but here is the PostgreSQL wiki page for the last
>> year's GSoC page: https://wiki.postgresql.org/wiki/GSoC_2015#Project_Ideas
> 
> 
> I've created wiki page for GSoC 2016. It contains unimplemented ideas from
> 2015 page.
> Now, GSoC is accepting proposals from organizations. Typically, we have a
> call for mentors on the hackers mailing list in this period.
> Thom, do we apply this year?

Apparently, the deadline is: February 20, 2016 at 04:00 (+0900 UTC)

https://summerofcode.withgoogle.com/

Thanks,
Amit






Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
On Thu, Feb 18, 2016 at 4:15 PM, Tom Lane  wrote:
> Because they've been removed from the right-link/left-link chains.

That isn't the same thing as being inaccessible by scans, clearly
(just what you call the "leaf scan sequence"). Besides, half-dead
pages still have right-link/left-link chains, and there are usage
patterns where half-dead pages might accumulate.


-- 
Peter Geoghegan




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
Peter Geoghegan  writes:
> Why would dead pages not get traversed by scans?

Because they've been removed from the right-link/left-link chains.

regards, tom lane




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
On Thu, Feb 18, 2016 at 3:06 PM, Tom Lane  wrote:
> Don't see why; the documentation and field names clearly imply that those
> numbers are accumulated only over leaf pages.

Then I guess what might be lacking is a delineation of what a leaf
page is for the purposes of pgstatindex().

> I certainly wouldn't expect
> the fragmentation measure to include dead pages, for example, since they
> would not get traversed by scans.  (Whether the "rightlink points to a
> higher page number" rule for fragmentation is actually very useful is a
> separate question; but as long as that's the measure, only pages that
> are part of the leaf scan sequence should count.)

Why would dead pages not get traversed by scans? It's clear that they
would be traversed to the extent that there will be buffer locking and
inspection of the B-Tree special area flags to determine that the page
should be ignored.

>> Having looked at the 2008 commit d287818eb514d431 myself, ISTM
>> that your intent might well have been to have that happen -- why else
>> would any reasonable person have changed the order at all?
>
> My best guess is that I was thinking that the tests were independent

Dunno. I'd have thought that the basic fact that the ordering had some
significance was fairly obvious. One of the tests is "if
(P_IGNORE(...))", and I can't imagine that not provoking thought.

I'm particular about things like this because I care about making sure
these tools are useful for teaching novice hackers about Postgres
internals. I've made my point and will leave it at that, though.

-- 
Peter Geoghegan




Re: [HACKERS] Remove or weaken hints about "effective resolution of sleep delays is 10 ms"?

2016-02-18 Thread Peter Geoghegan
On Tue, Feb 16, 2016 at 7:10 AM, Tom Lane  wrote:
> Given this, I'm on board with just removing the weasel-wording about
> timer resolution, except maybe for commit_delay where useful values
> are small enough that it's a hazard on old systems.

+1, but I'd move the guidance for commit_delay's effective resolution
from "29.4. WAL Configuration" to where commit_delay is introduced,
"18.5. Write Ahead Log".


-- 
Peter Geoghegan




Re: [HACKERS] New pg_upgrade data directory inside old one?

2016-02-18 Thread Bruce Momjian
On Mon, Feb 15, 2016 at 06:32:06PM +0100, Magnus Hagander wrote:
> 
> 
> On Mon, Feb 15, 2016 at 6:29 PM, Bruce Momjian  wrote:
> 
> Someone on IRC reported that if they had run the pg_upgrade-created
> delete_old_cluster.sh shell script it would have deleted their old _and_
> new data directories.  (Fortunately they didn't run it.)
> 
> I was confused how this could have happened, and the user explained that
> their old cluster was in /u/pgsql/data, and that they wanted to switch to
> a per-major-version directory naming schema, so they put the new data
> directory in /u/pgsql/data/9.5.  (They could have just moved the
> directory while the server was down, but didn't.)
> 
> Unfortunately, there is no check for having the new cluster data
> directory inside the old data directory, only a check for tablespace
> directories in the old cluster.  (I never anticipated someone would do
> this.)
> 
> 
> Interesting - I definitely wouldn't have expected that either. And it
> definitely seems like a foot-gun we should protect the users against. 

Patch applied back through 9.3 where delete script tests were added.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
Peter Geoghegan  writes:
> I think that the P_ISLEAF() instrumentation of free space and
> fragments might still need to happen for deleted and/or half dead
> pages.

Don't see why; the documentation and field names clearly imply that those
numbers are accumulated only over leaf pages.  I certainly wouldn't expect
the fragmentation measure to include dead pages, for example, since they
would not get traversed by scans.  (Whether the "rightlink points to a
higher page number" rule for fragmentation is actually very useful is a
separate question; but as long as that's the measure, only pages that
are part of the leaf scan sequence should count.)

> Having looked at the 2008 commit d287818eb514d431 myself, ISTM
> that your intent might well have been to have that happen -- why else
> would any reasonable person have changed the order at all?

My best guess is that I was thinking that the tests were independent,
and rearranged them so that the most common case would be tested first.
I'm quite sure I didn't intend to change the statistical behavior, else
I would have updated docs and/or mentioned it in the commit message.

regards, tom lane




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
On Thu, Feb 18, 2016 at 11:52 AM, Tom Lane  wrote:
> It looks like this was done correctly to begin with, and I broke it in
> d287818eb514d431b1a68e1f3940cd958f82aa34.  Not sure what I was thinking :-(

I think that you might not have simply changed the order in a totally
misguided way back in 2008, as you seem to imply. Consider what this
block does following your commit just now:

...

else if (P_IGNORE(opaque))
    indexStat.empty_pages++;    /* this is the "half dead" state */
else if (P_ISLEAF(opaque))
{
    int         max_avail;

    max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special +
                          SizeOfPageHeaderData);
    indexStat.max_avail += max_avail;
    indexStat.free_space += PageGetFreeSpace(page);

    indexStat.leaf_pages++;

    /*
     * If the next leaf is on an earlier block, it means a
     * fragmentation.
     */
    if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
        indexStat.fragments++;
}

...

I think that the P_ISLEAF() instrumentation of free space and
fragments might still need to happen for deleted and/or half dead
pages. Having looked at the 2008 commit d287818eb514d431 myself, ISTM
that your intent might well have been to have that happen -- why else
would any reasonable person have changed the order at all?

The avg_leaf_density and leaf_fragmentation fields might now be argued
to misrepresent the true picture. This is not clearly a departure from
their documented behavior, if only because the descriptions are almost
the same as the names of the fields themselves. If you think that the
instrumentation of free space is the most useful possible behavior as
of today, which it might well be, then you might have clarified that
this behavior was your intent in today's commit, for example by
updating the descriptions of the fields avg_leaf_density and
leaf_fragmentation in the docs.

Just a thought.

-- 
Peter Geoghegan




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
I wrote:
> Peter Geoghegan  writes:
>> I think we should change it. It seems like a bug to me.

> Me too.  Is it bug-like enough to be something to back-patch, or should
> we just change it in HEAD?

Actually, there's a significantly worse bug here: I just realized that the
page type tests are done in the wrong order.  A deleted page that was
formerly a leaf will be reported as though it was a live leaf page,
because both the BTP_LEAF and BTP_DELETED flags are set for such a page.

It looks like this was done correctly to begin with, and I broke it in
d287818eb514d431b1a68e1f3940cd958f82aa34.  Not sure what I was thinking :-(

Anyway, I think that puts the final nail in the coffin of the idea that
the current code's behavior is sane enough to preserve.  I think we should
fix all these things and back-patch 'em all.

regards, tom lane




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
I vote back patch. Subtle differences between the branches should be
avoided.

--
Peter Geoghegan


Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Feb 18, 2016 at 10:49 AM, Tom Lane  wrote:
>> Yeah, that seems a bit strange to me as well.  Should we change it to
>> count the root as an internal page, or is that going too far?

> I think we should change it. It seems like a bug to me.

Me too.  Is it bug-like enough to be something to back-patch, or should
we just change it in HEAD?

regards, tom lane




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
On Thu, Feb 18, 2016 at 10:49 AM, Tom Lane  wrote:
> Yeah, that seems a bit strange to me as well.  Should we change it to
> count the root as an internal page, or is that going too far?

I think we should change it. It seems like a bug to me. I've had the
same point come up ("leaf-ness/internal-ness and root-ness are
orthogonal") a couple of times with Heikki over the years. I just
haven't used pgstattuple very much for some reason, and so didn't
catch it before now.

> Note that it's already the case that in a one-page index (root is also
> a leaf), the root will be included in the leaf_pages count.  So it
> sure seems inconsistent that it's not included in the internal_pages
> count when it's not a leaf.

That's what I was thinking.

> Well, actually, since we don't have write lock on the index it'd be
> possible to see zero or multiple roots because the root's location
> changes.  That's already mentioned in the documentation, if somewhat
> obliquely.

Ah, yes. Another consequence of going in physical order.

-- 
Peter Geoghegan




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
Peter Geoghegan  writes:
> That's odd. Having taken a quick look at pgstatindex_impl(), I dislike
> that it counts the root separately from internal pages. That's not how
> they're actually presented and understood in the code.

Yeah, that seems a bit strange to me as well.  Should we change it to
count the root as an internal page, or is that going too far?

Note that it's already the case that in a one-page index (root is also
a leaf), the root will be included in the leaf_pages count.  So it
sure seems inconsistent that it's not included in the internal_pages
count when it's not a leaf.

> It's also possible to have more than one root page, since we do a scan
> in physical order, which the code considers. There could be a fast
> root and a true root. Looks like one is counted as deleted, though:

Well, actually, since we don't have write lock on the index it'd be
possible to see zero or multiple roots because the root's location
changes.  That's already mentioned in the documentation, if somewhat
obliquely.

regards, tom lane




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
On Thu, Feb 18, 2016 at 10:01 AM, Tom Lane  wrote:
> Yeah ... the numbers already appear not to add up because the root page
> is counted in index_size but not any other column, so there's already
> something worthy of explanation there.  Maybe something like "The reported
> index_size will normally correspond to two more pages than are accounted
> for by internal_pages + leaf_pages + empty_pages + deleted_pages, because
> it also includes the index's metapage and root page".

That's odd. Having taken a quick look at pgstatindex_impl(), I dislike
that it counts the root separately from internal pages. That's not how
they're actually presented and understood in the code.

It's also possible to have more than one root page, since we do a scan
in physical order, which the code considers. There could be a fast
root and a true root. Looks like one is counted as deleted, though:

regression=# select index_size/2^13 as total_pages, root_block_no,
internal_pages, leaf_pages, deleted_pages from
pgstatindex('btree_tall_idx');
 total_pages │ root_block_no │ internal_pages │ leaf_pages │ deleted_pages
─────────────┼───────────────┼────────────────┼────────────┼───────────────
         206 │            81 │              3 │        160 │            42
(1 row)

BTW, I am actively working on the amcheck B-Tree checker tool, and
hope to post it soon.

-- 
Peter Geoghegan




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Feb 18, 2016 at 9:40 AM, Tom Lane  wrote:
>> I think this is a bug and we ought to fix the code to include the
>> metapage in the reported index_size.  Thoughts?

> I tend to agree, but I think you should note that specifically in the
> documentation. I'm in favor of tools like pgstattuple and pageinspect
> going into this kind of detail in their documentation generally.

Yeah ... the numbers already appear not to add up because the root page
is counted in index_size but not any other column, so there's already
something worthy of explanation there.  Maybe something like "The reported
index_size will normally correspond to two more pages than are accounted
for by internal_pages + leaf_pages + empty_pages + deleted_pages, because
it also includes the index's metapage and root page".

regards, tom lane




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Peter Geoghegan
On Thu, Feb 18, 2016 at 9:40 AM, Tom Lane  wrote:
> I think this is a bug and we ought to fix the code to include the
> metapage in the reported index_size.  Thoughts?

I tend to agree, but I think you should note that specifically in the
documentation. I'm in favor of tools like pgstattuple and pageinspect
going into this kind of detail in their documentation generally.

-- 
Peter Geoghegan




Re: [HACKERS] [DOCS] The number of bytes is stored in index_size of pgstatindex() ?

2016-02-18 Thread Tom Lane
大塚憲司  writes:
> It seems that the description of index_size returned by pgstatindex() is
> wrong.
> The documentation says "Total number of pages in index".
> But it looks like the number of bytes is stored in index_size.

Hmm, you're right, because what the code does is

values[j++] = psprintf(INT64_FORMAT,
                       (indexStat.root_pages +
                        indexStat.leaf_pages +
                        indexStat.internal_pages +
                        indexStat.deleted_pages +
                        indexStat.empty_pages) * BLCKSZ);

so the result is clearly measured in bytes not pages.

I was going to just change the docs to say "Total index size in bytes",
but then I noticed that this number is not that either: it does not
count the metapage and therefore is BLCKSZ less than the true file
size.  As an example:

regression=# select * from pgstatindex('tenk1_unique1');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          1 |     237568 |             3 |              0 |         28 |           0 |             0 |            87.91 |                  0
(1 row)

regression=# select relfilenode from pg_class where relname = 'tenk1_unique1';
 relfilenode
-------------
       27449
(1 row)

$ ls -l 27449
-rw---. 1 postgres postgres 245760 Feb 17 18:58 27449


I think this is a bug and we ought to fix the code to include the
metapage in the reported index_size.  Thoughts?
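A minimal sketch of such a fix (only the "1 +" for the metapage is new):

values[j++] = psprintf(INT64_FORMAT,
                       (1 +    /* metapage */
                        indexStat.root_pages +
                        indexStat.leaf_pages +
                        indexStat.internal_pages +
                        indexStat.deleted_pages +
                        indexStat.empty_pages) * BLCKSZ);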

regards, tom lane




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-18 Thread Anastasia Lubennikova

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
  wrote:

I fixed it in the new version (attached).


Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it 
looks much better.
I described all details of the compression in this document 
https://goo.gl/50O8Q0 (the same text without pictures is attached in 
btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about 
tricky moments of implementation and questions about future work.

Please don't hesitate to comment on it.


Sorry, the previous patch was dirty. A hotfix is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Compression. To be precise, it's not actually compression, but just an
effective layout of ItemPointers on an index page.

compressed tuple = IndexTuple (with metadata in the TID field + key) + PostingList

1. A GIN index fits extremely well for really large sets of repeating keys,
but on the other hand it completely fails to handle unique keys. For btree
it is essential to have good performance and concurrency in any corner
case, with any number of duplicates, which is why we can't just copy the
GIN implementation of item pointer compression.

The first difference is that the btree algorithm performs compression (or,
in other words, changes the index tuple layout) only if there is more than
one tuple with a given key. That allows us to avoid the overhead of storing
useless metadata for mostly-distinct keys (see picture below). It seems
that compression could be useful for unique indexes under heavy
write/update load (because of MVCC copies), but I'm not sure whether this
use case really exists; those tuples should be deleted by microvacuum as
soon as possible. Anyway, I think it's worth adding a storage parameter for
btree which enables/disables compression for each particular index, and
setting compression to off by default for unique indexes.

System indexes do not support compression, for several reasons. First of
all, because of the WIP state of the patch (debugging the system catalog
isn't a big pleasure). Next, I know many places in the code where hardcoded
assumptions or non-obvious syscache routines are used, and I do not feel
brave enough to change that code. And last but not least, I don't see good
reasons to do so.

2. If the index key is very small (smaller than the metadata) and the
number of duplicates is small, compression could lead to index bloat
instead of an index size decrease (see picture below). I'm not sure whether
it's worth handling this case separately, because it's really rare, and I
consider it the DBA's job to disable compression on such indexes. But if
you see any clear way to do this, it would be great.

3. For GIN indexes, if a posting list is too large, a posting tree is
created. That design rests on the following assumptions:
- Indexed keys are never deleted. It makes all tree algorithms much easier.
- There are always many duplicates. Otherwise, GIN becomes really
  inefficient.
- Concurrency is low. In order to add a new entry to a posting tree, we
  hold a lock on its root, so only one backend at a time can perform
  insertion.

In btree we can't afford these assumptions, so we have to handle big
posting lists in another way: if there are too many ItemPointers to fit
into a single posting list, we just create another one. The overhead of
this approach is that we have to store a duplicate of the key and metadata.
That leads to the problem of big keys: if the key size is close to
BTMaxItemSize, compression will give us a really small benefit, if any at
all (see picture below).

4. The more item pointers fit into a single posting list, the more rarely
we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize
is, the better. The comment in nbtree.h says: "We actually need to be able
to fit three items on every page, so restrict any one item to 1/3 the
per-page available space." That is quite right for regular items, but a
compressed index tuple already contains more than one item. Taking that
into account, we can assert that BTMaxItemSize ~ 1/3 pagesize for regular
items, and ~ 1/2 pagesize for compressed items. Are there any objections?
I wonder if we can increase BTMaxItemSize under some other assumption? The
problem I see here is that a varlena high key could be as big as the
compressed tuple.

5. CREATE INDEX. _bt_load. The btree build algorithm is the following: do
the heap scan, add tuples to the spool, sort the data, insert the ordered
data from the spool into leaf index pages (_bt_load), then build the inner
pages and the root. The main changes are applied to the _bt_load function.
While loading tuples, we do not insert them one by one; instead, we compare
each tuple with the previous one, and if they are equal we put them into a
posting list. If the posting list is large enough to fit into an index
tuple (maxpos
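In rough pseudocode, the modified _bt_load loop described in item 5 might
look like this (all helper names are hypothetical, not the patch's actual
functions):

prev = NULL;
plist = NULL;
while ((itup = next_spooled_tuple(spool)) != NULL)
{
    if (prev != NULL && keys_equal(index, prev, itup))
        posting_add(plist, &itup->t_tid);   /* same key: keep only the TID */
    else
    {
        /*
         * New key: flush the previous one, compressed only if it holds
         * more than one TID (see item 1).
         */
        if (prev != NULL)
            emit_tuple(state, prev, plist);
        prev = itup;
        plist = posting_new(&itup->t_tid);
    }
}
if (prev != NULL)
    emit_tuple(state, prev, plist);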

Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-18 Thread Joe Conway
On 02/18/2016 12:30 AM, Andres Freund wrote:
> On 2016-02-17 21:19:08 -0500, Peter Eisentraut wrote:
>> On 2/17/16 9:08 PM, Michael Paquier wrote:
>>> On Thu, Feb 18, 2016 at 11:02 AM, Peter Eisentraut  wrote:
 On 2/17/16 5:20 PM, Josh berkus wrote:
> I have a use-case for this feature, as part of containerized
> PostgreSQL. Right now, there is certain diagnostic information (like
> timeline) which is exposed ONLY in pg_controldata.

 I'm talking about the pg_config() function, not pg_controldata.
>>>
>>> Andrew has mentioned a use case he had at the beginning of this thread
>>> to enhance a bit the regression tests related to libxml.
>>
>> While that idea was useful, I think we had concluded that there are
>> better ways to do this and that this way probably wouldn't even work
>> (Windows?).

Just to be clear, AFAIK there is no issue with the committed changes on
Windows. If there is please provide a concrete example that we can discuss.

> I don't understand why you're so opposed to this. Several people said
> that they're interested in this information in the current discussion
> and it has been requested repeatedly over the years. For superusers you
> can already hack access, but it's darn ugly.

Exactly.


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] proposal: PL/Pythonu - function ereport

2016-02-18 Thread Catalin Iacob
On 2/18/16, Pavel Stehule  wrote:
> it doesn't look bad. Is there any way to emulate it with
> Python2? What do you think about a similar implementation for Python2?

The example code I gave works as is in Python2.

The Python3 keyword only arguments are only syntactic sugar. See
https://www.python.org/dev/peps/pep-3102 for the details. But, as the
PEP notes,

def f(a, b, *, key1, key2)

is similar to doing this which also works in Python2

def f(a, b, *ignore, **kwargs):
    if ignore:
        raise TypeError('too many positional arguments')
    # key1 and key2 are then picked out of kwargs

For our case, we want to accept any number of positional arguments due
to compatibility so we don't need or want the check for 'too many
positional arguments'.

Note that in both Python 2 and 3, invoking f(1, 2, key1='k1',
key2='k2') is just syntactic sugar for constructing the (1, 2) tuple
and {'key1': 'k1', 'key2': 'k2'} dict and passing those to f which
then unpacks them into a, b, key1 and key2. You see that reflected in
the C API where you get PyObject* args and PyObject* kw and you unpack
them explicitly with PyArg_ParseTupleAndKeywords or just use tuple and
dict calls to peek inside.
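For instance, a hedged sketch at the C level (plpy_error_impl and its glue
are illustrative, not the actual PL/Python source):

#include <Python.h>

static PyObject *
plpy_error_impl(PyObject *self, PyObject *args, PyObject *kw)
{
    PyObject   *detail = NULL;

    /* args is the whole positional tuple; left untouched for compatibility */
    if (kw != NULL)
        detail = PyDict_GetItemString(kw, "detail");    /* borrowed ref */

    /* ... build the message from args, pass detail/hint/... to ereport ... */
    Py_RETURN_NONE;
}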

What you lose by not having Python3 is that you can't use
PyArg_ParseTupleAndKeywords and tell it that detail and so on are
keyword-only arguments. But because we don't want to unpack args, we
probably couldn't use that anyway even if we didn't have to support Python2.

Therefore you need to look inside the kw dictionary manually, as my
example shows. What we could also do is check that the kw dictionary
*only* contains detail, hint and so on, and raise a TypeError if it has
anything more, to avoid silently accepting stuff like
plpy.error('message', some_param='abc'). In Python3,
PyArg_ParseTupleAndKeywords would ensure that, but we need to do it
manually.




Re: [HACKERS] Question about Restart point and checkpoint_segments

2016-02-18 Thread Benoit Lobréau
Sorry for the noise, I got my answer.

2016-02-16 18:02 GMT+01:00 Benoit Lobréau :

> Hello,
>
> I am using a hot_standby setup on PostgreSQL 9.1
> While I was testing, I found out that only checkpoint_timeout (+ a
> checkpoint since the last restart point) could trigger a restart point.
>
> The code (bgwriter.c) seems to confirm this:
>
> /*
> * Check progress against WAL segments written and checkpoint_segments.
> *
> * We compare the current WAL insert location against the location
> * computed before calling CreateCheckPoint. The code in XLogInsert that
> * actually triggers a checkpoint when checkpoint_segments is exceeded
> * compares against RedoRecptr, so this is not completely accurate.
> * However, it's good enough for our purposes, we're only calculating an
> * estimate anyway.
> */
> if (!RecoveryInProgress())  ===> Only in case of primary
> {
>recptr = GetInsertRecPtr();
>elapsed_xlogs =
>(((double) (int32) (recptr.xlogid -
> ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
>((double) recptr.xrecoff - (double)
> ckpt_start_recptr.xrecoff) / XLogSegSize) /
>CheckPointSegments;
>
>if (progress < elapsed_xlogs)  ===> progress in volume
>{
>ckpt_cached_elapsed = elapsed_xlogs;
>return false;
>}
> }
>
> /*
> * Check progress against time elapsed and checkpoint_timeout.
> */
> gettimeofday(&now, NULL);
> elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
>
> now.tv_usec / 1000000.0) / CheckPointTimeout;
>
>
> if (progress < elapsed_time)  ===> progress in time
> {
>ckpt_cached_elapsed = elapsed_time;
>return false;
> }
>
> /* It looks like we're on schedule. */
> return true;
>
> I also found a post from Simon Riggs [1]: "checkpoint_segments is ignored
> on standby."
>
> The documentation is stating the opposite [2]: "In standby mode, a
> restartpoint is also triggered if checkpoint_segments log segments have
> been replayed since last restartpoint and at least one checkpoint record
> has been replayed."
>
> Since I am not a native English speaker, maybe I misunderstood the
> documentation, but to me it looks wrong.
> If it is indeed wrong, could you explain why checkpoint_segments doesn't
> trigger a restartpoint in standby mode?
>
> Thank you
> Benoit
>
> [1]
> http://www.postgresql.org/message-id/CA+U5nMKdf7odZzYNnoRkkCZmJpGEy=oqbu9nan_zva_rtzi...@mail.gmail.com
> [2] http://www.postgresql.org/docs/9.1/static/wal-configuration.html
>


Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Tom Lane
Andres Freund  writes:
> ... We don't prevent the user from making the
> configuration file world-writable either,

Maybe we should.  It wasn't an issue originally, because the config files
were necessarily inside $PGDATA which we restrict permissions on.  But
these days you can place the config files in places where untrustworthy
people could get at them.

regards, tom lane




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Andres Freund
On 2016-02-18 10:17:49 -0500, Tom Lane wrote:
> Christoph Berg  writes:
> > Currently the server insists on ssl_key_file's permissions to be 0600
> > or less, and be owned by the database user. Debian has been patching
> > be-secure.c since forever (the git history goes back to 8.2beta1) to
> > relax that to 0640 or less, and owned by root or the database user.
> 
> Debian can do that if they like, but it's entirely unacceptable as an
> across-the-board patch.  Not all systems treat groups as being narrow
> domains in which it's okay to assume that group-readable files are
> secure enough to be keys.  As an example, on OS X user files tend to be
> group "staff" or "admin" which'd be close enough to world readable.
> 
> We could allow group-readable if we had some way to know whether to
> trust the specific group, but I don't think there's any practical
> way to do that.  System conventions vary too much.

Isn't that a bit overly restrictive? Asking users to patch out checks
for perfectly reasonable configurations strikes me as a bit unbalanced.
There's never a reason to make the file world read/writable; but there
seem to be plenty of scenarios where the file should be read/writable by
specific groups.  We don't prevent the user from making the
configuration file world-writable either, so there's not really that
much of a security benefit to being overly restrictive with other
parameters - you can just configure an archive command of your choosing
and such.

Andres




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Tom Lane
Bruce Momjian  writes:
> On Thu, Feb 18, 2016 at 10:17:49AM -0500, Tom Lane wrote:
>> We could allow group-readable if we had some way to know whether to
>> trust the specific group, but I don't think there's any practical
>> way to do that.  System conventions vary too much.

> Should we have a GUC to control the group permissions restriction?  I
> can certainly see value in allowing for group access to the certificate.

Meh ... I think such a GUC would mostly be a way to shoot yourself in
the foot.  (For example, imagine an OS X user who sets it to "staff"
instead of doing the right thing and adjusting the file's permissions.)

I did have a thought though: could we allow two distinct permissions
configurations?  That is, allow either:

* file is owned by us, mode 0600 or less

* file is owned by root, mode 0640 or less

The first case is what we allow today.  (We don't need an explicit
ownership check; if the mode is 0600 and we can read it, we must be
the owner.)  The second case is what Debian wants.  We already know
we are not root, so if we can read the file, we must be part of the
group that root has allowed to read the file, and at that point it's
on root's head whether or not that group is secure.  I don't have a
problem with trusting root's judgment on security matters --- if the
root admin is incompetent, there are probably holes everywhere anyway.

The problem with the proposed patch is that it's conflating these
distinct cases, but that's easily fixed.
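
In rough C, that dual test could look about like this (just a sketch,
not tested):

#include <sys/stat.h>
#include <unistd.h>
#include <stdbool.h>

/* Accept "owned by us, 0600 or less" or "owned by root, 0640 or less". */
static bool
ssl_key_perms_ok(const struct stat *buf)
{
    if (!S_ISREG(buf->st_mode))
        return false;

    /* owned by us: no group or other access at all */
    if (buf->st_uid == geteuid() &&
        (buf->st_mode & (S_IRWXG | S_IRWXO)) == 0)
        return true;

    /* owned by root: at most group read, nothing for other */
    if (buf->st_uid == 0 &&
        (buf->st_mode & (S_IWGRP | S_IXGRP | S_IRWXO)) == 0)
        return true;

    return false;
}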

regards, tom lane




Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-18 Thread Filip Rembiałkowski
Another update - separated out a new internal function to satisfy opr_sanity.sql.
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 7352b29..3a6d4f5 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -160,7 +160,7 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 	strcpy_quoted(payload, SPI_getvalue(trigtuple, tupdesc, colno), '\'');
 }
 
-Async_Notify(channel, payload->data);
+Async_Notify(channel, payload->data, false);
 			}
 			ReleaseSysCache(indexTuple);
 			break;
diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml
index 4dd5608..933c76c 100644
--- a/doc/src/sgml/ref/notify.sgml
+++ b/doc/src/sgml/ref/notify.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-NOTIFY channel [ , payload ]
+NOTIFY [ ALL | DISTINCT ] channel [ , payload ]
 
  
 
@@ -105,6 +105,10 @@ NOTIFY channel [ , payload ]
+   If ALL is specified (contrary to DISTINCT, the
+   default), the server will deliver all notifications, including duplicates.
+   Removal of duplicate notifications takes place within a transaction block,
+   finished with COMMIT, END or SAVEPOINT.
   
 
   
@@ -190,6 +194,12 @@ NOTIFY channel [ , payload ]
     NOTIFY command if you need to work with
 non-constant channel names and payloads.

+   
+There is a three-argument version, pg_notify(text,
+text, boolean). The third argument acts like
+the ALL keyword when set to true, and
+DISTINCT when set to false. 
+   
   
  
 
@@ -210,6 +220,21 @@ Asynchronous notification "virtual" with payload "This is the payload" received
 LISTEN foo;
 SELECT pg_notify('fo' || 'o', 'pay' || 'load');
 Asynchronous notification "foo" with payload "payload" received from server process with PID 14728.
+
+/* Identical messages from same (sub-) transaction can be eliminated - unless you use the ALL keyword */
+LISTEN bar;
+BEGIN;
+NOTIFY bar, 'Coffee please';
+NOTIFY bar, 'Coffee please';
+NOTIFY bar, 'Milk please';
+NOTIFY ALL bar, 'Milk please';
+SAVEPOINT s;
+NOTIFY bar, 'Coffee please';
+COMMIT;
+Asynchronous notification "bar" with payload "Coffee please" received from server process with PID 31517.
+Asynchronous notification "bar" with payload "Milk please" received from server process with PID 31517.
+Asynchronous notification "bar" with payload "Milk please" received from server process with PID 31517.
+Asynchronous notification "bar" with payload "Coffee please" received from server process with PID 31517.
 
  
 
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c39ac3a..54d1680 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -524,7 +524,42 @@ pg_notify(PG_FUNCTION_ARGS)
 	/* For NOTIFY as a statement, this is checked in ProcessUtility */
 	PreventCommandDuringRecovery("NOTIFY");
 
-	Async_Notify(channel, payload);
+	Async_Notify(channel, payload, false);
+
+	PG_RETURN_VOID();
+}
+
+
+/*
+ * pg_notify_3args
+ *SQL function to send a notification event, 3-argument version
+ */
+Datum
+pg_notify_3args(PG_FUNCTION_ARGS)
+{
+	const char *channel;
+	const char *payload;
+	bool use_all;
+
+	if (PG_ARGISNULL(0))
+		channel = "";
+	else
+		channel = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+	if (PG_ARGISNULL(1))
+		payload = "";
+	else
+		payload = text_to_cstring(PG_GETARG_TEXT_PP(1));
+
+	if (PG_ARGISNULL(2))
+		use_all = false;
+	else
+		use_all = PG_GETARG_BOOL(2);
+
+	/* For NOTIFY as a statement, this is checked in ProcessUtility */
+	PreventCommandDuringRecovery("NOTIFY");
+
+	Async_Notify(channel, payload, use_all);
 
 	PG_RETURN_VOID();
 }
@@ -540,7 +575,7 @@ pg_notify(PG_FUNCTION_ARGS)
  *		^^
  */
 void
-Async_Notify(const char *channel, const char *payload)
+Async_Notify(const char *channel, const char *payload, bool use_all)
 {
 	Notification *n;
 	MemoryContext oldcontext;
@@ -570,9 +605,10 @@ Async_Notify(const char *channel, const char *payload)
 	 errmsg("payload string too long")));
 	}
 
-	/* no point in making duplicate entries in the list ... */
-	if (AsyncExistsPendingNotify(channel, payload))
-		return;
+	if (!use_all)
+		/* remove duplicate entries in the list */
+		if (AsyncExistsPendingNotify(channel, payload))
+			return;
 
 	/*
 	 * The notification list needs to live until end of transaction, so store
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index b307b48..7203f4a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -8528,11 +8528,12 @@ DropRuleStmt:
  *
  */
 
-NotifyStmt: NOTIFY ColId notify_payload
+NotifyStmt: NOTIFY all_or_distinct ColId notify_payload
 {
 	NotifyStmt *n = makeNode(NotifyStmt);
-	n->conditionname = $2;
-	n->payload = $3;
+	n->use_all = $2;
+	n->conditionname = $3;
+	n->payload = $4;
 	$$ = (Node *)n;
 }
 		;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 045f7f0..0e50561 100644
--- a/

Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Bruce Momjian
On Thu, Feb 18, 2016 at 10:17:49AM -0500, Tom Lane wrote:
> Christoph Berg  writes:
> > Currently the server insists on ssl_key_file's permissions to be 0600
> > or less, and be owned by the database user. Debian has been patching
> > be-secure.c since forever (the git history goes back to 8.2beta1) to
> > relax that to 0640 or less, and owned by root or the database user.
> 
> Debian can do that if they like, but it's entirely unacceptable as an
> across-the-board patch.  Not all systems treat groups as being narrow
> domains in which it's okay to assume that group-readable files are
> secure enough to be keys.  As an example, on OS X user files tend to be
> group "staff" or "admin" which'd be close enough to world readable.
> 
> We could allow group-readable if we had some way to know whether to
> trust the specific group, but I don't think there's any practical
> way to do that.  System conventions vary too much.

Should we have a GUC to control the group permissions restriction?  I
can certainly see value in allowing for group access to the certificate.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Tom Lane
Christoph Berg  writes:
> Currently the server insists on ssl_key_file's permissions to be 0600
> or less, and be owned by the database user. Debian has been patching
> be-secure.c since forever (the git history goes back to 8.2beta1) to
> relax that to 0640 or less, and owned by root or the database user.

Debian can do that if they like, but it's entirely unacceptable as an
across-the-board patch.  Not all systems treat groups as being narrow
domains in which it's okay to assume that group-readable files are
secure enough to be keys.  As an example, on OS X user files tend to be
group "staff" or "admin" which'd be close enough to world readable.

We could allow group-readable if we had some way to know whether to
trust the specific group, but I don't think there's any practical
way to do that.  System conventions vary too much.

regards, tom lane




Re: [HACKERS] [WIP] speeding up GIN build with parallel workers

2016-02-18 Thread Constantin S. Pan
On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov  wrote:

> My feedback is (Mac OS X 10.11.3)
> 
> set gin_parallel_workers=2;
> create index message_body_idx on messages using gin(body_tsvector);
> LOG:  worker process: parallel worker for PID 5689 (PID 6906) was
> terminated by signal 11: Segmentation fault

Fixed this, try the new patch. The bug was in incorrect handling
of some GIN categories.

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..2f6f142 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,14 +16,20 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC parameter */
+int gin_parallel_workers = 0;
 
 typedef struct
 {
@@ -265,6 +271,148 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	MemoryContextReset(buildstate->funcCtx);
 }
 
+#define KEY_TASK 42
+#define GIN_MAX_WORKERS 24
+#define GIN_BLOCKS_PER_WORKER 4
+#define GIN_RESULT_LEN 1024
+#define GIN_RESULT_KEYLEN 1024
+
+typedef struct {
+	bool ready;
+	bool finished;
+
+	Datum			key;
+	OffsetNumber	attnum;
+	GinNullCategory	category;
+
+	char keycoded[GIN_RESULT_KEYLEN];
+	int keylen;
+
+	ItemPointerData list[GIN_RESULT_LEN];
+	int			nlist;
+
+	Latch	blatch;
+	Latch	wlatch;
+} WorkerResult;
+
+/*
+ * This structure describes the GIN build task for the parallel workers. We use
+ * OIDs here because workers are separate processes and pointers may become
+ * meaningless for them. The "lock" is used to protect the "scanned" and
+ * "reltuples" fields as the workers modify them.
+ */
+typedef struct {
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+	double	reltuples;
+	WorkerResult results[GIN_MAX_WORKERS];
+} PGinBuildTask;
+
+static volatile PGinBuildTask *task;
+
+static void waitBool(volatile bool *actual, bool wanted, volatile Latch *l)
+{
+	if (*actual == wanted) return;
+
+	while (*actual != wanted)
+		WaitLatch(l, WL_LATCH_SET, 0);
+	ResetLatch(l);
+}
+
+static void setBool(volatile bool *actual, bool newvalue, volatile Latch *l)
+{
+	*actual = newvalue;
+	SetLatch(l);
+}
+
+static void ginDumpEntry(GinState *ginstate,
+		 volatile WorkerResult *r,
+		 OffsetNumber attnum,
+		 Datum key,
+		 GinNullCategory category,
+		 ItemPointerData *list,
+		 int nlist)
+{
+	volatile char *addr;
+	bool isnull;
+	Form_pg_attribute att;
+
+	Assert(nlist > 0);
+	waitBool(&r->ready, false, &r->wlatch);
+
+	Assert(r->keylen == 0);
+	addr = r->keycoded;
+	isnull = category == GIN_CAT_NULL_KEY;
+	att = ginstate->origTupdesc->attrs[attnum - 1];
+
+	r->attnum = attnum;
+	r->category = category;
+	if (r->category == GIN_CAT_EMPTY_ITEM || r->category == GIN_CAT_NULL_ITEM)
+	{
+		r->keylen = 1;
+	}
+	else
+	{
+		r->keylen = datumEstimateSpace(key, isnull, att->attbyval, att->attlen);
+		Assert(r->keylen > 0);
+		Assert(r->keylen <= GIN_RESULT_KEYLEN);
+
+		datumSerialize(key, isnull, att->attbyval, att->attlen, (char**)&addr);
+	}
+
+	while (nlist > 0)
+	{
+		if (nlist > GIN_RESULT_LEN)
+			r->nlist = GIN_RESULT_LEN;
+		else
+			r->nlist = nlist;
+		nlist -= r->nlist;
+
+		memcpy((void*)r->list, list, r->nlist * sizeof(ItemPointerData));
+		setBool(&r->ready, true, &r->blatch);
+		waitBool(&r->ready, false, &r->wlatch);
+	}
+}
+
+static void ginDumpAccumulator(GinBuildState *buildstate)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+			  &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		if (IsParallelWorker())
+		{
+			volatile WorkerResult *r = &task->results[ParallelWorkerNumber];
+			ginDumpEntry(&buildstate->ginstate, r, attnum, key, category, list, nlist);
+		}
+		else
+			ginEntryInsert(&buildstate->ginstate,
+		   attnum, key, category,
+		   list, nlist,
+		   &buildstate->buildStats);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
  bool *isnull, bool tupleIsAlive, void *state)
@@ -283,52 +431,315 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)
 	{
-		ItemPointerData *list;
-		Datum		key;

[HACKERS] [PATH] Jsonb, insert a new value into an array at arbitrary position

2016-02-18 Thread Dmitry Dolgov
Hi

As far as I see there is one basic update function for jsonb, that can't be
covered by `jsonb_set` - insert a new value into an array at arbitrary
position.
Using `jsonb_set` function we can only append into array at the end/at the
beginning, and it looks more like a hack:

```
=# select jsonb_set('{"a": [1, 2, 3]}', '{a, 100}', '4');
      jsonb_set
---------------------
 {"a": [1, 2, 3, 4]}
(1 row)


=# select jsonb_set('{"a": [1, 2, 3]}', '{a, -100}', '4');
      jsonb_set
---------------------
 {"a": [4, 1, 2, 3]}
(1 row)
```

I think the possibility to insert into arbitrary position will be quite
useful,
something like `json_array_insert` in MySql:

```
mysql> set @j = '["a", {"b": [1, 2]}, [3, 4]]';
mysql> select json_array_insert(@j, '$[1].b[0]', 'x');

+------------------------------------------+
| json_array_insert(@j, '$[1].b[0]', 'x')  |
+------------------------------------------+
| ["a", {"b": ["x", 1, 2]}, [3, 4]]        |
+------------------------------------------+
```

It can look like `jsonb_insert` function in our case:

```
=# select jsonb_insert('{"a": [0,1,2]}', '{a, 1}', '"new_value"');
         jsonb_insert
-------------------------------
 {"a": [0, "new_value", 1, 2]}
(1 row)
```

I attached a possible implementation, which is basically quite small (all
logic-related modifications amount to only about 4 lines in the `setPath`
function). This implementation assumes a flag to separate the "insert
before"/"insert after" operations, and an alias to `jsonb_set` in case we
are working with a jsonb object, not an array.

What do you think about this?


jsonb_insert.patch
Description: Binary data



Re: [HACKERS] [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadl

2016-02-18 Thread Amit Kapila
On Mon, Feb 15, 2016 at 12:03 AM, Tom Lane  wrote:
>
> Robert Haas  writes:
>
> > With respect to pg_locks - and for that matter also pg_stat_activity -
> > I think you are right that improvement is needed.
>
> This is really the core of my concern at the moment.  I think that
> isolationtester is probably outright broken for any situation where the
> queries-under-test are being parallel executed, and so will be any other
> client that's trying to identify who blocks whom from pg_locks.
>
> > The simplest thing we could do to make that easier is, in
> > pg_stat_activity, have parallel workers advertise the PID that
> > launched them in a new field; and in pg_locks, have members of a lock
> > group advertise the leader's PID in a new field.
> >

The lock information for parallel query which uses one parallel worker and
another unrelated backend trying to acquire Access Exclusive lock on the
table which parallel query is using will be displayed as below by using lock
monitoring query [1]

 blocked_pid | blocking_pid |            blocked_statement             |  current_statement_in_blocking_process
-------------+--------------+------------------------------------------+-----------------------------------------
        5052 |         3464 | Lock table t1 in Access Exclusive Mode; |
        5052 |         4128 | Lock table t1 in Access Exclusive Mode; | select count(*) from t1 where c1 < 10;
(2 rows)

Here, backend 5052 is waiting to acquire an Access Exclusive lock on t1,
which is held in Access Share mode by master backend 4128 and parallel
worker 3464.  Now, I think it is tricky for the user to find out what
exactly is going on using the current lock monitoring queries.  Do you
think that by adding an additional leader pid column we can eliminate the
duplication in the above case, and all such cases, without an additional
join or making the above query more complex?

Can we think of not having lock information in pg_locks for a parallel
worker in certain cases, where the parallel worker acquires a lock in the
same mode on the same resource as the leader backend?

Currently, pg_stat_activity displays NULL for both query and state for
parallel workers.  I think we should call pgstat_report_activity() in
ParallelWorkerMain() or ParallelQueryMain() to display that information.
One thing which needs some thought is what query we should display for a
parallel worker; currently, while forming the query descriptor in the
parallel worker, we use "", so one idea is to just display
that, or to display the query used by the master backend (if that is not
currently available, we might need to pass it down from the master backend).
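
A hedged sketch of that suggestion (the exact call site in
ParallelWorkerMain() and the placeholder text are assumptions, not
settled):

/* Make the worker visible in pg_stat_activity instead of showing NULLs. */
static void
advertise_parallel_worker(const char *master_query)
{
    pgstat_report_activity(STATE_RUNNING,
                           master_query ? master_query : "<parallel worker>");
}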

>
> That would be simple for us, but it would break every existing client-side
> query that tries to identify blockers/blockees; and not only would those
> queries need work but they would become very substantially more complex
> and slower (probably at least 4-way joins not 2-way).  We already know
> that isolationtester's query has performance problems in the buildfarm.
> I think more thought is needed here,

Agreed.  I think before deciding what exactly to add or change in pg_locks
or lock monitoring queries, it might be helpful to define how we want
to convey the information to the user/DBA when a group of parallel processes
is involved as blockers/blockees.  Some of the cases are:
a) a standalone backend waits for a group of parallel processes holding
a lock in the same mode on the same resource.
b) a group of parallel workers is waiting for a lock held by a standalone
backend or some other parallel group.
c) a standalone backend waits for a group of parallel processes holding
a lock in a different mode on the same resource.
and other similar cases.

Before jumping into a discussion about the solution, I would like to know:
do you and/or Robert also have the above kinds of cases in mind where you
want a better way to display information to the user, or something else?



[1] -
SELECT blocked_locks.pid AS blocked_pid,
 blocking_locks.pid AS blocking_pid,
 blocked_activity.query   AS blocked_statement,
 blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity  ON
blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM
blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM
blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM
blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM
blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM
blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.ob

Re: [HACKERS] [patch] Proposal for \crosstabview in psql

2016-02-18 Thread Daniel Verite
Dean Rasheed wrote:

> If I want to sort the rows coming out of a query, my first thought
> is always going to be to add/adjust the query's ORDER BY clause, not
> use some weird +/- psql syntax.

About the vertical sort, I agree on all your points.
It's best to rely on ORDER BY for all the reasons mentioned,
as opposed to a separate sort in a second step.

But you're considering the case when a user is designing
or adapting a query for the purpose of crosstab
viewing. As mentioned in my previous reply (about the
methods to achieve horizontal sort), that scenario is not really
what motivates the feature in the first place.

If removing that sort option is required to move forward
with the patch because it's controversial, so be it,
but overall I don't see this as a benefit for the end user,
it's just an option.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] about google summer of code 2016

2016-02-18 Thread Alexander Korotkov
On Wed, Feb 17, 2016 at 10:40 AM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> On 2016/02/17 16:27, Shubham Barai wrote:
> > Hello everyone,
> >
> > I am currently pursuing my bachelor of engineering in computer science
> > at Maharashtra
> > Institute of Technology, Pune ,India. I am very excited about
> contributing
> > to postgres through google summer of code program.
> >
> > Is postgres applying for GSoC 2016 as a mentoring organization?
>
> I think it does.  Track this page for updates:
> http://www.postgresql.org/developer/summerofcode/
>
> You can contact one of the people listed on that page for the latest.
>
> I didn't find one for 2016, but here is the PostgreSQL wiki page for last
> year's GSoC: https://wiki.postgresql.org/wiki/GSoC_2015#Project_Ideas


I've created a wiki page for GSoC 2016. It contains the unimplemented ideas
from the 2015 page.
GSoC is now accepting proposals from organizations. Typically, we have a
call for mentors on the hackers mailing list in this period.
Thom, do we apply this year?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [patch] Proposal for \crosstabview in psql

2016-02-18 Thread Daniel Verite
Peter Eisentraut wrote:

> On 2/9/16 11:21 AM, Daniel Verite wrote:
> > Note that NULL values in the column that pivots are discarded
> > by \crosstabview, because NULL as the name of a column does not
> > make sense.
> 
> Why not?
> 
> All you're doing is printing it out, and psql is quite capable of
> printing a null value.

Initially it's by analogy with the crosstab SRF, but it's true
that the same principle does not have to apply to crosstabview.

The code could put into the header whatever text "\pset null" is set to,
at the place where a pivoted NULL would go if it were not filtered out
in the first place.

I'll consider implementing that change if there's no objection.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




[HACKERS] Relaxing SSL key permission checks

2016-02-18 Thread Christoph Berg
Hi,

Currently the server insists on ssl_key_file's permissions to be 0600
or less, and be owned by the database user. Debian has been patching
be-secure.c since forever (the git history goes back to 8.2beta1) to
relax that to 0640 or less, and owned by root or the database user.

The reason for that is that we hooked the SSL certificate handling
into the system's /etc/ssl/ directory structure where private keys are
supposed to have permissions 0640 root:ssl-cert. The postgres user is
member of the ssl-cert group so it can read the key.

In the old days before 9.2 the server expected the SSL files in
PGDATA, and we created symlinks from there to /etc/ssl/. Since 9.2,
these certs are used in the ssl_*_file options.

Using symlinks in PGDATA to use system-wide certificates might have
been a hack, but with the "new" ssl_*_file options I think it might be
prudent to get the "allow group ssl-cert" patch upstreamed.

Comments? (There's no documentation yet, I'll add that if the feedback
is positive.)

Thanks,
Christoph
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
new file mode 100644
index 1e3dfb6..b42737f
*** a/src/backend/libpq/be-secure-openssl.c
--- b/src/backend/libpq/be-secure-openssl.c
*** be_tls_init(void)
*** 215,226 
  		 * directory permission check in postmaster.c)
  		 */
  #if !defined(WIN32) && !defined(__CYGWIN__)
! 		if (!S_ISREG(buf.st_mode) || buf.st_mode & (S_IRWXG | S_IRWXO))
  			ereport(FATAL,
  	(errcode(ERRCODE_CONFIG_FILE_ERROR),
    errmsg("private key file \"%s\" has group or world access",
  		 ssl_key_file),
!    errdetail("Permissions should be u=rw (0600) or less.")));
  #endif
  
  		if (SSL_CTX_use_PrivateKey_file(SSL_context,
--- 215,229 
  		 * directory permission check in postmaster.c)
  		 */
  #if !defined(WIN32) && !defined(__CYGWIN__)
! 		if (!S_ISREG(buf.st_mode) || (buf.st_mode & (S_IWGRP | S_IRWXO)) ||
! 			((buf.st_uid != geteuid()) && buf.st_uid != 0))
  			ereport(FATAL,
  	(errcode(ERRCODE_CONFIG_FILE_ERROR),
    errmsg("private key file \"%s\" has group or world access",
  		 ssl_key_file),
!    errdetail("File must be owned by the \
! database user or root, must have no write permission for \"group\", and must \
! have no permissions for \"other\".")));
  #endif
  
  		if (SSL_CTX_use_PrivateKey_file(SSL_context,



Re: [HACKERS] [patch] Proposal for \crosstabview in psql

2016-02-18 Thread Daniel Verite
Daniel Verite wrote:

> > >   ORDER BY name
> > > \crosstabview cols = (select to_char(d, 'Mon') from
> > > generate_series('2000-01-01'::date, '2000-12-01', '1 month') d)
> > 
> > My concern with that is that often you don't know what the columns will 
> > be, because you don't know what exact data the query will produce. So to 
> > use this syntax you'd have to re-create a huge chunk of the original 
> > query. :(
> 
> Also, if that additional query refers to tables, it should be executed
> with the same data visibility as the main query. Doesn't that mean
> that both queries should happen within the same repeatable
> read transaction?
> 
> Another  impractical aspect of this approach is that a
> meta-command invocation in psql must fit on a single line, so
> queries containing newlines are not acceptable as argument.
> This problem exists with "\copy (select...) to ..."  already.

Thinking more about that, it occurs to me that if the sort must come
from a user-supplied bit of SQL, it would be simpler to just direct the
user to submit it in the main query, in an additional dedicated column.

For instance, to get a specific, separate order on "h",
let the user change this:

  SELECT v, h, c FROM v_data ORDER BY v;

into that:

  SELECT v, h, row_number() over(order by h) as hn, c
   FROM v_data ORDER BY v;

then with a relatively simple modification to the patch,
this invocation:

 \crosstabview v h:hn c

would display "h" in the horizontal header ordered by "hn".

ISTM this handles two objections raised upthread:

1. The ORDER BY inside OVER() can be augmented with additional
clauses such as lc_collate, desc, nulls last, etc... contrary to
the controversial "+/-" syntax.

2. a post-sort "backdoor" query is no longer necessary.

The drawback for me is that this change doesn't play out with
my original scenario for the command, which is to give the ability to
scrutinize query results in crosstab mode, playing with variations on
what column is pivoted and how headers for both directions get sorted,
while ideally not changing _at all_ the original query in the query
buffer, but just invoking  successive \crosstabview [args] commands
with varying arguments.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] a raft of parallelism-related bug fixes

2016-02-18 Thread Michael Paquier
On Thu, Feb 18, 2016 at 9:45 PM, Craig Ringer  wrote:
> That lets you  make assertions about replication behaviour. It was built for
> BDR and I think we'll need something along those lines in core if/when any
> kind of logical replication facilities land, for things like testing
> failover slots, etc.
>
> The patch is at:
>
> http://git.postgresql.org/gitweb/?p=2ndquadrant_bdr.git;a=commit;h=d859de3b13d39d4eddd91f3e6f316a48d31ee0fe
>
> and might be something it's worth having in core as we expand testing of
> replication, failover, etc.

Maybe there is an advantage to having it, but it's hard to form an
opinion without a complicated test case. Both of those things could
clearly work with each other at first sight: PostgresNode can set up a
set of nodes, and this patch would be in charge of more complex
scenarios where the same connection or transaction block is needed.
-- 
Michael




Re: [HACKERS] a raft of parallelism-related bug fixes

2016-02-18 Thread Craig Ringer
On 18 February 2016 at 20:35, Michael Paquier 
wrote:

> On Thu, Feb 18, 2016 at 5:35 PM, Amit Langote
>  wrote:
> > On 2016/02/18 16:38, Craig Ringer wrote:
> >> I should resurrect Abhijit's patch to allow the isolationtester to talk to
> >> multiple servers. We'll want that when we're doing tests like "assert that
> >> this change isn't visible on the replica before it becomes visible on the
> >> master". (Well, except we violate that one with our funky
> >> synchronous_commit implementation...)
> >
> > How much does (or does not) that overlap with the recovery test suite
> work
> > undertaken by Michael et al? I saw some talk of testing for patches in
> > works on the N synchronous standbys thread.
>
> This sounds like poll_query_until in PostgresNode.pm (already on HEAD)
> where the query used is something on pg_stat_replication for a given
> LSN to see if a standby has reached a given replay position.
>

No, it's quite different, though that's something handy to have that I've
emulated in the isolationtester using a plpgsql function.

The isolationtester changes in question allow isolationtester specs to run
different blocks against different hosts/ports/DBs.

That lets you  make assertions about replication behaviour. It was built
for BDR and I think we'll need something along those lines in core if/when
any kind of logical replication facilities land, for things like testing
failover slots, etc.

The patch is at:

http://git.postgresql.org/gitweb/?p=2ndquadrant_bdr.git;a=commit;h=d859de3b13d39d4eddd91f3e6f316a48d31ee0fe

and might be something it's worth having in core as we expand testing of
replication, failover, etc.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] a raft of parallelism-related bug fixes

2016-02-18 Thread Michael Paquier
On Thu, Feb 18, 2016 at 5:35 PM, Amit Langote
 wrote:
> On 2016/02/18 16:38, Craig Ringer wrote:
>> I should resurrect Abhijit's patch to allow the isolationtester to talk to
>> multiple servers. We'll want that when we're doing tests like "assert that
>> this change isn't visible on the replica before it becomes visible on the
>> master". (Well, except we violate that one with our funky
>> synchronous_commit implementation...)
>
> How much does (or does not) that overlap with the recovery test suite work
> undertaken by Michael et al? I saw some talk of testing for patches in
> works on the N synchronous standbys thread.

This sounds like poll_query_until in PostgresNode.pm (already on HEAD)
where the query used is something on pg_stat_replication for a given
LSN to see if a standby has reached a given replay position.
-- 
Michael




Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-18 Thread Kharage, Suraj
Hello,

>Remaining tasks are;
>- Document patch.
>- Regression test patch.
>- Syntax error message for s_s_names improvement.

Please find attached a patch adding regression tests for multi-sync
replication. I have created this patch on top of Michael's
recovery-test-suite patch. Please review it.

Regards
Suraj kharage


regression_test_for_multisync_rep_v2.patch
Description: regression_test_for_multisync_rep_v2.patch



Re: [HACKERS] Convert pltcl from strings to objects

2016-02-18 Thread Victor Wagner
On Tue, 9 Feb 2016 16:23:21 -0600
Jim Nasby  wrote:

> Currently, pl/tcl is implemented through nothing but string 
> manipulation. In other words, the C code is effectively creating a
> giant string that the tcl interpreter must re-parse every time the
> function is executed. Additionally, all arguments are treated as raw
> strings, instead of the far more efficient internal tcl object types.
> 
> The biggest win comes with how pltcl interfaces with SPI result sets. 
> Currently, the entire chunk of tcl code that is executed for each
> result row must be reparsed and recompiled from scratch. Now, the
> code is compiled once and the bytecode is stored.
> 
> This work is sponsored by Flight Aware (http://flightaware.com/).

I've looked at your patch. As far as I know the Tcl object API, it looks
quite good.

Of course, it breaks compatibility with pre-8.0 versions of Tcl, but it
is almost twenty years since Tcl 8.0 came out.

I think that checking for pre-8.0 Tcl and defining Tcl_StringResult in
that case should be removed from the code, because there is no object
API in pre-8.0 Tcl anyway, and it isn't worth the effort to maintain
compatibility (line 51 of pltcl.c).

There is a suspicious place at the end of the compile_pltcl_function
function, where you've put a comment that the old prodesc cannot be
deallocated because it might be in use by another call.

It seems that the reference counting mechanism which Tcl already has can
help here. Would there be a serious performance impact if you used a Tcl
list instead of a C structure to store the procedure description?
If not, you can rely on Tcl_IncrRefCount/Tcl_DecrRefCount to free a
procedure description when the last active call finishes.

The function pltcl_elog has a big switch statement to convert the enum
logpriority, local to this function, to PostgreSQL log levels.

That seems not the most readable solution, because this enumeration is
designed to be passed to Tcl_GetIndexFromObj, so it can be (and is)
used to index an array (the logpriorities array of string
representations). You could define a similar array of PostgreSQL integer
constants and replace the page-sized switch with just a couple of lines -
the array definition and a lookup by index, roughly:

static CONST84 int
loglevels[] = {DEBUG2, LOG, INFO, NOTICE, WARNING, ERROR, FATAL};

if (Tcl_GetIndexFromObj(interp, objv[1], logpriorities, "priority",
                        TCL_EXACT, &priIndex) != TCL_OK)
    return TCL_ERROR;
level = loglevels[priIndex];


It seems that you are losing some diagnostic information by extracting
just the message field from ErrorData, which you do in
pltcl_elog and pltcl_subtrans_abort.

Tcl has mechanisms for passing around additional error information;
see the Tcl_SetErrorCode and Tcl_AddErrorInfo functions.

pltcl_process_SPI_result might benefit from providing the SPI result code
in a machine-readable way via Tcl_SetErrorCode too.
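
For instance (a sketch; it assumes the SPI return code has already been
mapped to its string name):

#include <tcl.h>

/* Report an SPI failure with a machine-readable errorCode in addition
 * to the human-readable result string. */
static int
report_spi_error(Tcl_Interp *interp, const char *spi_result_name)
{
    Tcl_SetObjResult(interp,
                     Tcl_NewStringObj("SPI execution failed", -1));
    Tcl_SetErrorCode(interp, "POSTGRESQL", "SPI", spi_result_name, NULL);
    return TCL_ERROR;
}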


In some cases you use Tcl_SetObjResult(interp, Tcl_NewStringObj("error
message", -1)) to report an error with a constant message, while in other
places Tcl_SetResult(interp, "error message", TCL_STATIC) is left in place.

I see no harm in using the old API with static messages, especially if
Tcl_AppendResult is used, but mixing the two patterns above seems a bit
inconsistent.

As far as I can understand, using Tcl_StringObj to represent all
PostgreSQL primitive (non-array) types, making no attempt to convert
tuple data into integer, boolean or double objects, is a design decision.

It really can spare a few unnecessary type conversions.

Thanks for your effort.



Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-18 Thread Ashutosh Bapat
On Thu, Feb 18, 2016 at 3:48 PM, Etsuro Fujita 
wrote:

> On 2016/02/16 16:40, Etsuro Fujita wrote:
>
>> On 2016/02/16 16:02, Ashutosh Bapat wrote:
>>
>>> On Tue, Feb 16, 2016 at 12:26 PM, Etsuro Fujita
>>> mailto:fujita.ets...@lab.ntt.co.jp>>
>>> wrote:
>>>
>>
> On 2016/02/16 15:22, Ashutosh Bapat wrote:
>>>
>>
> During join planning, the planner tries multiple combinations of
>>> joining
>>> relations, thus the same base or join relation can be part of
>>> multiple
>>> of combination. Hence remote_conds or joinclauses will get linked
>>> multiple times as they are bidirectional lists, thus breaking
>>> linkages
>>> of previous join combinations tried. E.g. while planning A join
>>> B join C
>>> join D planner will come up with combinations like A(B(CD)) or
>>> (AB)(CD)
>>> or ((AB)C)D etc. and remote_conds from A will first be linked
>>> into
>>> A(B(CD)), then AB breaking the first linkages.
>>>
>>
> Exactly, but I don't think that that needs to be considered because
>>> we have this at the beginning of postgresGetGForeignJoinPaths:
>>>
>>>  /*
>>>   * Skip if this join combination has been considered already.
>>>   */
>>>  if (joinrel->fdw_private)
>>>  return;
>>>
>>
> There will be different joinrels for A(B(CD)) and (AB) where A's
>>> remote_conds need to be pulled up.
>>>
>>
> Agreed.
>>
>
> The check you have mentioned above
>>> only protects us from adding paths multiple times to (AB) when we
>>> encounter it for (AB)(CD) and ((AB)C)D.
>>>
>>
> Sorry, I don't understand this fully.
>>
>
> Another thing I don't really understand is why list_copy is needed in the
> second list_concat for the case of INNER/FULL JOIN or in both list_concats
> for the case of LEFT/RIGHT JOIN, in your patch.  Since list_concat is
> nondestructive of its second argument, I don't think list_copy is needed in
> any such list_concat.  Maybe I'm missing something, though.
>

If the list in the joining relation changes (maybe because we appended
parameterized conditions), we would be breaking links in all the upper
relations in the join tree in an unpredictable manner. The problem may not
show up now, but it's an avenue for hard-to-diagnose bugs. So it's safer
to copy the lists in the state that we want them.
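
To illustrate the hazard with a minimal sketch (PostgreSQL List API;
the values are placeholders): list_concat() extends its first argument
in place, while list_copy() insulates the originals.

List *a = list_make2(makeString(pstrdup("qual1")),
                     makeString(pstrdup("qual2")));
List *b = list_make1(makeString(pstrdup("qual3")));

/* destructive of "a": anything still pointing at "a" now sees 3 cells */
List *joined = list_concat(a, b);

/* non-destructive alternative: both inputs stay untouched */
List *safe = list_concat(list_copy(a), list_copy(b));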

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Andres Freund
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bunch of explanations and
> comments, which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
> 
> When reading the patch 0002 explanations, I had the following comments:
> 
> AFAICS, there are several levels of actions when writing things in pg:
> 
>  0: the thing is written in some internal buffer
> 
>  1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.


>  2: the buffer is actually passed to the OS (write, flush)
> 
>  3: the OS is advised to send the written data to the io subsystem
> (sync_file_range with SYNC_FILE_RANGE_WRITE)
> 
>  4: the OS is required to send the written data to the disk
> (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.


0002 is about something different: it's about the WAL writer, which
writes WAL to disk so individual backends don't have to. It does so in
the background every wal_writer_delay or whenever a transaction
asynchronously commits.  The reason this interacts with checkpoint
flushing is that, when we flush writes at a regular pace, the writes by
the checkpointer happen in between the very frequent writes/fdatasync()
by the WAL writer. That means the disk's caches are flushed on every
fdatasync() - which causes considerable slowdowns.  On a decent SSD the
WAL writer, before this patch, often did 500-1000 fdatasync()s a second;
the regular sync_file_range calls slowed things down too much.

That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.
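
The gist of the new rate limiting looks roughly like this (illustrative
names, not the committed code):

/* Flush WAL only when enough bytes have accumulated since the last
 * flush, or enough time has passed. */
static bool
wal_writer_should_flush(XLogRecPtr write_ptr, XLogRecPtr flush_ptr,
                        TimestampTz last_flush)
{
    if (write_ptr - flush_ptr >=
        (XLogRecPtr) WalWriterFlushAfter * XLOG_BLCKSZ)
        return true;            /* enough dirty WAL accumulated */

    if (TimestampDifferenceExceeds(last_flush, GetCurrentTimestamp(),
                                   WalWriterDelay))
        return true;            /* wal_writer_delay elapsed */

    return false;
}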


> After more consideration, my final understanding is that this behavior only
> occurs with "asynchronous commit", i.e. a situation where COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away - a kind of compromise for performance
> where commits can be lost.

Right.


> Now all this is somewhat alien to me, because the whole point of committing is
> having the data on disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable for every scenario, but in a *lot* of
real-world scenario a sub-second loss window doesn't have any actual
negative implications.


Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Andres Freund
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
> 
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>   potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>   flushing every wal_writer_delay ms or wal_writer_flush_after
>   bytes.

I've pushed these after some more polishing, now working on the next
two.

Greetings,

Andres Freund




Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-18 Thread Etsuro Fujita

On 2016/02/16 16:40, Etsuro Fujita wrote:

On 2016/02/16 16:02, Ashutosh Bapat wrote:

On Tue, Feb 16, 2016 at 12:26 PM, Etsuro Fujita
mailto:fujita.ets...@lab.ntt.co.jp>> wrote:



On 2016/02/16 15:22, Ashutosh Bapat wrote:



During join planning, the planner tries multiple combinations of
joining
relations, thus the same base or join relation can be part of
multiple
of combination. Hence remote_conds or joinclauses will get linked
multiple times as they are bidirectional lists, thus breaking
linkages
of previous join combinations tried. E.g. while planning A join
B join C
join D planner will come up with combinations like A(B(CD)) or
(AB)(CD)
or ((AB)C)D etc. and remote_conds from A will first be linked
into
A(B(CD)), then AB breaking the first linkages.



Exactly, but I don't think that that needs to be considered because
we have this at the beginning of postgresGetGForeignJoinPaths:

 /*
  * Skip if this join combination has been considered already.
  */
 if (joinrel->fdw_private)
 return;



There will be different joinrels for A(B(CD)) and (AB) where A's
remote_conds need to be pulled up.



Agreed.



The check you have mentioned above
only protects us from adding paths multiple times to (AB) when we
encounter it for (AB)(CD) and ((AB)C)D.



Sorry, I don't understand this fully.


Another thing I don't really understand is why list_copy is needed in 
the second list_concat for the case of INNER/FULL JOIN or in both 
list_concats for the case of LEFT/RIGHT JOIN, in your patch.  Since 
list_concat is nondestructive of its second argument, I don't think 
list_copy is needed in any such list_concat.  Maybe I'm missing 
something, though.


Best regards,
Etsuro Fujita






Re: [HACKERS] custom function for converting human readable sizes to bytes

2016-02-18 Thread Dean Rasheed
On 18 February 2016 at 02:00, Vitaly Burovoy  wrote:
>> + else
>> + have_digits = false;
> Is it worth moving it to the declaration and removing that "else" block?
> +   bool    have_digits = false;

OK, that's probably a bit neater.


> Is it important to use:
>> +  ObjectIdGetDatum(InvalidOid),
>> +  Int32GetDatum(-1)));
> instead of "0, -1"? Previously I left immediate constants because I
> found the same thing in jsonb.c (rows 363, 774)...

I think it's preferable to use the macros because they make it clearer
what datatypes are being passed around. I think that's the more common
style, but the JSONB code is not technically wrong either.


>> +  if (pg_strcasecmp(strptr, "bytes") == 0)
>> +  else if
>> +  else if
>> +  else if
> It seems little ugly for me. In that case (not using values from GUC)
> I'd create a static array for it and do a small cycle via it. Would it
> better?

That's a matter of personal preference. I prefer the values inline
because then the values are closer to where they're being used, which
for me makes it easier to follow (see e.g., convert_priv_string()).

In guc.c they're in lookup tables because they're referred to from
multiple places, and because it needs to switch between tables based
on context, neither of which is true here. If there is a need to
re-use these values elsewhere in the future it can be refactored, but
right now that feels like an over-engineered solution for this
problem.
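
For reference, the inline style I mean looks roughly like this
(illustrative names and unit set, not the exact patch code):

int64       multiplier;

if (pg_strcasecmp(strptr, "bytes") == 0)
    multiplier = INT64CONST(1);
else if (pg_strcasecmp(strptr, "kB") == 0)
    multiplier = INT64CONST(1024);
else if (pg_strcasecmp(strptr, "MB") == 0)
    multiplier = INT64CONST(1024) * 1024;
else if (pg_strcasecmp(strptr, "GB") == 0)
    multiplier = INT64CONST(1024) * 1024 * 1024;
else
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid size unit: \"%s\"", strptr)));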


>> + NumericGetDatum(mul_num),
> Two more spaces of indentation.

pgindent pulled that back.


> Also I've found that allowing an exponent gives a weird result
> (we tried to avoid "type numeric" in the error message):
> SELECT pg_size_bytes('123e1000 kB');
> ERROR:  invalid input syntax for type numeric: "123e1000"

OK, I'll add a check for that.
Thanks for testing.

Regards,
Dean




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-02-18 Thread Etsuro Fujita

On 2016/02/12 21:19, Etsuro Fujita wrote:

Comments on specific points follow.

This seems to need minor rebasing in the wake of the join pushdown patch.



Will do.


Done.


+   /* Likewise for ForeignScan that has pushed down
INSERT/UPDATE/DELETE */
+   if (IsA(plan, ForeignScan) &&
+   (((ForeignScan *) plan)->operation == CMD_INSERT ||
+((ForeignScan *) plan)->operation == CMD_UPDATE ||
+((ForeignScan *) plan)->operation == CMD_DELETE))
+   return;

I don't really understand why this is a good idea.  Since target lists
are only displayed with EXPLAIN (VERBOSE), I don't really understand
what is to be gained by suppressing them in any case at all, and
therefore I don't understand why it's a good idea to do it here,
either.  If there is a good reason, maybe the comment should explain
what it is.



I think that displaying target lists would be confusing for users.


Since there seemed to be no objection from you (or anyone), I left that as 
proposed, and added more comments.



+   /* Check point 1 */
+   if (operation == CMD_INSERT)
+   return false;
+
+   /* Check point 2 */
+   if (nodeTag(subplan) != T_ForeignScan)
+   return false;
+
+   /* Check point 3 */
+   if (subplan->qual != NIL)
+   return false;
+
+   /* Check point 4 */
+   if (operation == CMD_UPDATE)

These comments are referring to something in the function header
further up, but you could instead just delete the stuff from the
header and mention the actual conditions here.



Will fix.


Done.

The patch doesn't allow postgres_fdw to push down an UPDATE/DELETE 
on a foreign join, so I added one more condition here to exclude such 
cases.  (I'm planning to propose a patch to handle such cases in the 
next CF.)



- If the first condition is supposed to accept only CMD_UPDATE or
CMD_DELETE, say if (operation != CMD_UPDATE && operation !=
CMD_DELETE) rather than doing it this way.  Is-not-insert may in this
context be functionally equivalent to is-update-or-delete, but it's
better to write the comment and the code so that they exactly match
rather than kinda-match.

- For point 2, use IsA(subplan, ForeignScan).



Will fix.


Done.


+   /*
+* ignore subplan if the FDW pushes down the
command to the remote
+* server
+*/

This comment states what the code does, instead of explaining why it
does it.  Please update it so that it explains why it does it.



Will update.


Done.


+   List   *fdwPushdowns;   /* per-target-table FDW
pushdown flags */

Isn't a list of booleans an awfully inefficient representation?  How
about a Bitmapset?



OK, will do.


Done.


+   /*
+* Prepare remote-parameter expressions for evaluation.
(Note: in
+* practice, we expect that all these expressions will be just
Params, so
+* we could possibly do something more efficient than using
the full
+* expression-eval machinery for this.  But probably there
would be little
+* benefit, and it'd require postgres_fdw to know more than is
desirable
+* about Param evaluation.)
+*/
+   dpstate->param_exprs = (List *)
+   ExecInitExpr((Expr *) fsplan->fdw_exprs,
+(PlanState *) node);

This is an exact copy of an existing piece of code and its associated
comment.  A good bit of the surrounding code is copied, too.  You need
to refactor to avoid duplication, like by putting some of the code in
a new function that both postgresBeginForeignScan and
postgresBeginForeignModify can call.

execute_dml_stmt() has some of the same disease.



Will do.


Done.
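
The shape of the suggested refactoring is roughly this (a sketch; the
helper name is hypothetical, and both postgresBeginForeignScan and
postgresBeginForeignModify would call it instead of repeating the
block quoted above):

/*
 * Initialize the remote-parameter expressions shared by scans and
 * pushed-down DML; in practice these are expected to be just Params.
 */
static List *
prepare_remote_param_exprs(List *fdw_exprs, PlanState *node)
{
	return (List *) ExecInitExpr((Expr *) fdw_exprs, node);
}

The duplicated parameter-binding code in execute_dml_stmt() can be
factored out the same way.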

Other changes:

* I fixed docs as discussed before with Rushabh Lathia and Thom Brown.

* I kept Rushabh's code change so that PlanDMLPushdown is called only 
when all the required APIs are available in the FDW, but for 
CheckValidResultRel I left the code as-is (no changes to that 
function), to match the docs, which say that the FDW needs to provide 
the DML-pushdown callback functions together with the existing 
table-updating functions such as ExecForeignInsert, ExecForeignUpdate 
and ExecForeignDelete.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 1316,1321  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
--- 1316,1384 
  }
  
  /*
+  * deparse remote UPDATE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparsePushedDownUpdateSql(StringInfo buf, PlannerInfo *root,
+ 		   Index rtindex, Relation rel,
+ 		   List	*targetlist,
+ 		   List *targetAttrs,
+ 		   List	*remote_conds,
+ 		   List **params_list,
+ 		   List *returning

Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-02-18 Thread David Rowley
On 23/01/2016 12:42 am, "David Rowley"  wrote:
>
> On 23 January 2016 at 05:36, Tomas Vondra 
wrote:
> > Otherwise I think the patch is ready for committer - I think this is a
> > valuable optimization and I see nothing wrong with the code.
> >
> >
> > Any objections to marking it accordingly?

I've just set this patch back to ready for committer. It was
previously marked as such, but lost that status when it was moved to
the March commitfest.


Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Fabien COELHO


Hello Andres,


0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
 potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
 flushing every wal_writer_delay ms or wal_writer_flush_after
 bytes.


I've looked at these patches, especially the whole bunch of 
explanations and comments, which is a good source for understanding 
what is going on in the WAL writer, a part of pg I'm not familiar with.


When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:

 0: the thing is written in some internal buffer

 1: the buffer is advised to be passed to the OS (hint bits?)

 2: the buffer is actually passed to the OS (write, flush)

 3: the OS is advised to send the written data to the io subsystem
(sync_file_range with SYNC_FILE_RANGE_WRITE)

 4: the OS is required to send the written data to the disk
(fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
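
For concreteness, levels 2-4 map onto Linux calls roughly as follows
(a standalone sketch, not pg code; sync_file_range is Linux-specific):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void
illustrate_write_levels(int fd, const char *buf, size_t nbytes, off_t offset)
{
	/* level 2: hand the data to the OS page cache */
	pwrite(fd, buf, nbytes, offset);

	/* level 3: ask the kernel to start writeback, without waiting */
	sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

	/* level 4: wait until the data has reached stable storage */
	fsync(fd);
}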

It is not clear when reading the text which level is discussed. In 
particular, I'm not sure that "flush" refers to level 2, which is 
misleading. When reading the description, I'm rather under the 
impression that it is about level 4, but then if actual fsyncs were 
performed every 200 ms the tps would be very low...


After further consideration, my final understanding is that this 
behavior only occurs with "asynchronous commit", i.e. a situation 
where COMMIT does not wait for the data to actually be fsynced, but 
the fsync is expected to occur within some bounded delay, so it will 
not be too far behind: a performance compromise under which recent 
commits can be lost.


Now all this is somewhat alien to me, because the whole point of 
committing is getting the data to disk, and I would not consider a 
database safe if commit did not imply fsync; but I understand that 
people may have to compromise for performance.


Is my understanding right?

--
Fabien.




Re: [HACKERS] pg_ctl promote wait

2016-02-18 Thread Simon Riggs
On 18 February 2016 at 08:33, Andres Freund  wrote:

> Hi,
>
> On 2016-02-17 21:47:50 -0500, Peter Eisentraut wrote:
> > It would be nice if pg_ctl promote supported the -w (wait) option.
> >
> > How could pg_ctl determine when the promotion has finished?
>
> How about looking into pg_control? ControlFileData->state ought to have
> the correct information.
>

+1

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] a raft of parallelism-related bug fixes

2016-02-18 Thread Amit Langote
On 2016/02/18 16:38, Craig Ringer wrote:
> I should resurrect Abhijit's patch to allow the isolationtester to talk to
> multiple servers. We'll want that when we're doing tests like "assert that
> this change isn't visible on the replica before it becomes visible on the
> master". (Well, except we violate that one with our funky
> synchronous_commit implementation...)

How much does (or doesn't) that overlap with the recovery test suite
work undertaken by Michael et al? I saw some talk of testing for
patches in the works on the N synchronous standbys thread.

Thanks,
Amit






Re: [HACKERS] pg_ctl promote wait

2016-02-18 Thread Andres Freund
Hi,

On 2016-02-17 21:47:50 -0500, Peter Eisentraut wrote:
> It would be nice if pg_ctl promote supported the -w (wait) option.
> 
> How could pg_ctl determine when the promotion has finished?

How about looking into pg_control? ControlFileData->state ought to have
the correct information.
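
Something like this, presumably (a sketch; get_control_dbstate is a
hypothetical helper that reads and parses pg_control):

#include "postgres_fe.h"
#include "catalog/pg_control.h"

extern DBState get_control_dbstate(void);	/* hypothetical reader */

static void
wait_for_promotion(void)
{
	/* poll pg_control until the server reports normal operation */
	while (get_control_dbstate() != DB_IN_PRODUCTION)
		pg_usleep(1000000);		/* 1s between checks */
}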

Regards,

Andres




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-18 Thread Andres Freund
On 2016-02-17 21:19:08 -0500, Peter Eisentraut wrote:
> On 2/17/16 9:08 PM, Michael Paquier wrote:
> > On Thu, Feb 18, 2016 at 11:02 AM, Peter Eisentraut  wrote:
> >> On 2/17/16 5:20 PM, Josh berkus wrote:
> >>> I have a use-case for this feature, as part of containerized
> >>> PostgreSQL. Right now, there is certain diagnostic information (like
> >>> timeline) which is exposed ONLY in pg_controldata.
> >>
> >> I'm talking about the pg_config() function, not pg_controldata.
> > 
> > Andrew has mentioned a use case he had at the beginning of this thread
> > to enhance a bit the regression tests related to libxml.
> 
> While that idea was useful, I think we had concluded that there are
> better ways to do this and that this way probably wouldn't even work
> (Windows?).

I don't understand why you're so opposed to this. Several people in
the current discussion have said that they're interested in this
information, and it has been requested repeatedly over the years. For
superusers you can already hack together access, but it's darn ugly.

Greetings,

Andres Freund




Re: [HACKERS] pg_ctl promote wait

2016-02-18 Thread Simon Riggs
On 18 February 2016 at 02:47, Peter Eisentraut  wrote:

> It would be nice if pg_ctl promote supported the -w (wait) option.
>
> How could pg_ctl determine when the promotion has finished?
>
> We could query pg_is_in_recovery(), but that would require database
> access, which we got rid of in pg_ctl.
>
> We could check when recovery.conf is gone or recovery.done appears.
> This could work, but I think some people are trying to get rid of these
> files, so building a feature on that might be a problem?  (I think the
> latest news on that was that the files might be different in this
> hypothetical future but there would still be a file to prevent
> re-recovery on restart.)
>
> Other ideas?
>

We use a trigger file under the covers, so a promotion-complete file
would work too, but it's crappy.

Perhaps we should have a Server Status file that shows Standby or Master,
obviously not updated on crash.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services