Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Antonin Houska
Masahiko Sawada  wrote:

> On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska  wrote:
> > Masahiko Sawada  wrote:
> >
> >> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
> >>  wrote:
> >> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
> >> > wrote:
> >> >>
> >> >> Because I don't want to break the current user semantics. that is,
> >> >> currently it's guaranteed that the subsequent reads can see the
> >> >> committed result of previous writes even if the previous transactions
> >> >> were distributed transactions. And it's ensured by writer side. If we
> >> >> can make the reader side ensure it, the backend process don't need to
> >> >> wait for the resolver process.
> >> >>
> >> >> The waiting backend process are released by resolver process after the
> >> >> resolver process tried to resolve foreign transactions. Even if
> >> >> resolver process failed to either connect to foreign server or to
> >> >> resolve foreign transaction the backend process will be released and
> >> >> the foreign transactions are leaved as dangling transaction in that
> >> >> case, which are processed later. Also if resolver process takes a long
> >> >> time to resolve foreign transactions for whatever reason the user can
> >> >> cancel it by Ctl-c anytime.
> >> >>
> >> >
> >> > So, there's no guarantee that the next command issued from the
> >> > connection *will* see the committed data, since the foreign
> >> > transaction might not have committed because of a network glitch
> >> > (say). If we go this route of making backends wait for resolver to
> >> > resolve the foreign transaction, we will have add complexity to make
> >> > sure that the waiting backends are woken up in problematic events like
> >> > crash of the resolver process OR if the resolver process hangs in a
> >> > connection to a foreign server etc. I am not sure that the complexity
> >> > is worth the half-guarantee.
> >> >
> >>
> >> Hmm, maybe I was wrong. I now think that the waiting backends can be
> >> woken up only in following cases;
> >> - The resolver process succeeded to resolve all foreign transactions.
> >> - The user did the cancel (e.g. ctl-c).
> >> - The resolver process failed to resolve foreign transaction for a
> >> reason of there is no such prepared transaction on foreign server.
> >>
> >> In other cases the resolver process should not release the waiters.
> >
> > I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
> > effort is needed to ensure that backend does not keep waiting if the 
> > resolver
> > can't finish it's work in forseable future. But this effort is not worth
> > because by waking the backend up you might prevent the next transaction from
> > seeing the changes the previous one tried to make."
> >
> > On the other hand, your last comments indicate that you try to be even more
> > stringent in letting the backend wait. However even this stringent approach
> > does not guarantee that the next transaction will see the data changes made 
> > by
> > the previous one.
> >
> 
> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user.

I missed the point that the user should not expect atomicity of the commit to
be guaranteed if he has cancelled his request.

The other things are clear to me, including the fact that atomic commit and
atomic visibility will be implemented separately. Thanks.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: ERROR: too many dynamic shared memory segments

2017-11-27 Thread Dilip Kumar
On Tue, Nov 28, 2017 at 4:18 AM, Thomas Munro  wrote:

> On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa 
> wrote:
> > As for the crash. I dug up the initial log and it looks like a
> segmentation
> > fault...
> >
> > 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR:
> too
> > many dynamic shared memory segments
>
> I think there are two failure modes: one of your sessions showed the
> "too many ..." error (that's good, ran out of slots and said so and
> our error machinery worked as it should), and another crashed with a
> segfault, because it tried to use a NULL "area" pointer (bad).  I
> think this is a degenerate case where we completely failed to launch
> parallel query, but we ran the parallel query plan anyway and this
> code thinks that the DSA is available.  Oops.
>

I think BitmapHeapScan should check whether the DSA is valid; if the DSA
is not valid, it should assume it's a non-parallel plan.

The attached patch should fix the issue.
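
(The attached patch is not reproduced here. Purely as a toy sketch of the
guard being described, with hypothetical names: if the shared parallel
state / DSA pointer was never set up, take the serial path instead of
dereferencing it.)

#include <stdio.h>
#include <stddef.h>

struct scan_state
{
    void   *dsa;    /* hypothetical stand-in for the per-query DSA area */
};

static void
scan(struct scan_state *node)
{
    if (node->dsa == NULL)
    {
        /* parallel setup failed: behave as a plain, non-parallel scan */
        printf("running as a non-parallel bitmap heap scan\n");
        return;
    }
    printf("using the shared parallel state\n");
}

int
main(void)
{
    struct scan_state node = {NULL};

    scan(&node);
    return 0;
}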

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


bug_fix_in_pbhs_when_dsa_not_initialized.patch
Description: Binary data


Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-27 Thread Amit Kapila
On Mon, Nov 27, 2017 at 11:26 PM, hubert depesz lubaczewski
 wrote:
> On Mon, Nov 27, 2017 at 05:24:49PM +0530, Amit Kapila wrote:
>> I think it is "actual_time * 1" for anything below Gather.
>
> Well, I think I found another problem.
>
> Please take a look at:
> https://explain.depesz.com/s/Bs8c
>
> Node 15 is Gather with loops = 2974 (because it was ran by nested loop).
>
> There were 4 workers, so 5 threads working, but the loops on parallel
> seq scan is 17.698 million ?!
>

The number of loops displayed on parallel seq scan seems to be wrong, see below.

> The problem is that for explain.depesz.com I'm calculating how much time
> pg actually spent doing given thing.
>
> So, if an operation has actual time of 1ms, but 100 loops, it took
> 100ms.
>
> In case of parallel operations it looks like I can't realistically find
> the information anywhere?
>
> In the explain above, we see that Nested loop took, in total, around
> 2 million miliseconds (step 9).
> Out of which, ~ 7000 ms was hash join and it's subnodes.
> This leaves most of the time for Gather step, which was called 2974
> times with actual time reported as 618ms.
> 2974 * 618 is 1837932 miliseconds, which seems reasonable.
>
> But then - how much of this time was spent in Parallel Seq Scan in step
> #16 ?
>
> 2974 * 609ms?
>

Yeah.

>Why is loops there 17 million?
>

That is wrong and I think you have hit a bug.  It should be 2974 * 5 =
14870, as you have seen in other cases.  The problem is that during a
rescan we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation-related memory that is used to accumulate
the stats.  Fixing that addresses part of the problem: the Parallel node
itself will no longer show wrong values.  However, we also need to ensure
that the per-worker details get accumulated across rescans.  The attached
patch should fix the problem you are seeing.  I think this needs some more
analysis and testing to see if everything works in the desired way.
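
(To illustrate the effect, here is a toy model of that accumulation
pattern. It is not PostgreSQL code, the variable names are made up, and
the numbers will not match the plan exactly, but it shows how re-adding
an un-reset per-rescan buffer inflates the total far beyond 2974 * 5 =
14870.)

#include <stdio.h>

int main(void)
{
    double shared_nloops = 0;   /* per-rescan buffer the workers fill in */
    double accumulated = 0;     /* what the leader ends up reporting */
    int    rescans = 2974;
    int    loops_per_rescan = 5;    /* leader + 4 workers */

    for (int r = 0; r < rescans; r++)
    {
        /* shared_nloops = 0;       <-- the missing reinitialization */
        shared_nloops += loops_per_rescan;  /* this rescan's real count */
        accumulated += shared_nloops;       /* stale totals get re-added */
    }
    printf("reported loops = %.0f, expected = %d\n",
           accumulated, rescans * loops_per_rescan);
    return 0;
}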

Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_accum_instr_parallel_workers_v1.patch
Description: Binary data


Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-27 Thread Rajkumar Raghuwanshi
On Thu, Nov 23, 2017 at 6:38 PM, Jeevan Chalke
 wrote:
> Let me know if I missed any comment to be fixed.

Hi,

I have applied v8 patches on commit id 8735978e7aebfbc499843630131c18d1f7346c79,
and got the observation below; please take a look.

Observation:
"when joining a foreign partition table with local partition table
getting wrong output
with partition_wise_join enabled, same is working fine on PG-head
without aggregates patch."

Test-case:
CREATE EXTENSION postgres_fdw;
CREATE SERVER pwj_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(dbname 'postgres',port '5432',use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER pwj_server;

CREATE TABLE fplt1 (a int,  c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int,  c text);
CREATE TABLE fplt1_p2 (a int,  c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN
('', '0001', '0002', '0003') SERVER pwj_server OPTIONS (TABLE_NAME
'fplt1_p1');
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN
('0004', '0005', '0006', '0007') SERVER pwj_server OPTIONS (TABLE_NAME
'fplt1_p2');
INSERT INTO fplt1_p1 SELECT i, to_char(i%8, 'FM') FROM
generate_series(0, 199, 2) i;
INSERT INTO fplt1_p2 SELECT i, to_char(i%8, 'FM') FROM
generate_series(200, 398, 2) i;

CREATE TABLE lplt2 (a int,  c text) PARTITION BY LIST(c);
CREATE TABLE lplt2_p1 PARTITION OF lplt2 FOR VALUES IN ('',
'0001', '0002', '0003');
CREATE TABLE lplt2_p2 PARTITION OF lplt2 FOR VALUES IN ('0004',
'0005', '0006', '0007');
INSERT INTO lplt2 SELECT i, to_char(i%8, 'FM') FROM
generate_series(0, 398, 3) i;

SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
and t1.a = t2.a)  WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c,
t2.c;
  c   |  c   | count
--+--+---
  |  | 1
 0004 | 0004 | 1
 0006 | 0006 | 1
(3 rows)

SET enable_partition_wise_join = on;
SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
and t1.a = t2.a)  WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c,
t2.c;
  c   |  c   | count
--+--+---
  |  | 1
 0004 | 0004 | 1
(2 rows)


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation



Re: [HACKERS] new function for tsquery creartion

2017-11-27 Thread Victor Drobny

On 2017-11-19 04:30, Tomas Vondra wrote:
Hello,

Hi,

On 09/13/2017 10:57 AM, Victor Drobny wrote:

On 2017-09-09 06:03, Thomas Munro wrote:

Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.

Hello,
Thank you for interest. In the attachment you can find rebased
version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit)



I did a quick review of the patch today. The patch unfortunately no
longer applies, so I had to use an older commit from September. Please
rebase to current master.


Thank you for your time. In the attachment you can find a rebased version
(based on commit e842791b0).


I've only looked on the diff at this point, will do more testing once
the rebase happens.

Some comments:

1) This seems to mix multiple improvements into one patch. There's the
support for alternative query syntax, and then there are the new
operators (AROUND and <n,m>). I propose to split the patch into two or
more parts, each addressing one of those bits.


I agree. I have split it into 3 parts: support for the AROUND operator,
the queryto_tsquery function, and documentation.


I guess there will be two or three parts - first adding the syntax,
second adding <n,m> and third adding the AROUND(n). Seems reasonable?

2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...


Yes, this syntax was not introduced by Google, but, as for me, it was the
easiest way to give a brief description of it. Of course it can be changed,
I just don't know how. Any suggestions are welcome! ;)


3) In the SGML docs, please use <literal> instead of just
quoting the values. So it should be <literal>|</literal> instead of '|'
etc. Just like in the parts describing plainto_tsquery, for example.


Fixed. I hope that I didn't miss anything.


4) Also, I recommend adding a brief explanation what the examples do.
Right now there's just a bunch of queryto_tsquery, and the reader is
expected to understand the output. I suggest adding a sentence or two,
explaining what's happening (just like for plainto_tsquery examples).

5) I'm not sure about negative values in the <n,m> operator. I don't
find it particularly confusing - once you understand that (a <n,m> b)
means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
negative values seem like a fairly straightforward extension.

But I guess the main question is "Is there really a demand for the new
<n,m> operator, or have we just invented it because we can?"


The <n,m> operator is not introduced yet. It's just a concept; these were our
thoughts about implementing the AROUND operator through <n,m> in the future.

6) There seem to be some new constants defined, with names not following
the naming convention. I mean this

#define WAITOPERAND         1
#define WAITOPERATOR        2
#define WAITFIRSTOPERAND    3
#define WAITSINGLEOPERAND   4
#define INSIDEQUOTES        5   <-- the new one

and

#define TSPO_L_ONLY    0x01
#define TSPO_R_ONLY    0x02
#define TSPO_BOTH      0x04
#define TS_NOT_EXAC    0x08 <-- the new one

Perhaps that's OK, but it seems a bit inconsistent.


I agree. I have fixed it.



regards


--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 046f543..375cb5c 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -396,6 +396,7 @@ tsquery_opr_selec(QueryItem *item, char *operand,
 break;
 
 			case OP_PHRASE:
+			case OP_AROUND:
 			case OP_AND:
 s1 = tsquery_opr_selec(item + 1, operand,
 	   lookup, length, minfreq);
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index aba456e..1c852bd 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -239,10 +239,11 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
 			return !result;
 
 		case OP_PHRASE:
+		case OP_AROUND:
 
 			/*
 			 * GIN doesn't contain any information about positions, so treat
-			 * OP_PHRASE as OP_AND with recheck requirement
+			 * OP_PHRASE and OP_AROUND as OP_AND with recheck requirement
 			 */
 			*(gcv->need_recheck) = true;
 			/* Pass down in_phrase == true in case there's a NOT below */
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 5cdfe4d..07d56b0 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -29,7 +29,8 @@ const int	tsearch_op_priority[OP_COUNT] =
 	4,			/* OP_NOT */
 	2,			/* OP_AND */
 	1,			/* OP_OR */
-	3			/* OP_PHRASE */
+	3,			/* OP_PHRASE */
+	3			/* OP_AROUND */
 };
 
 

Re: [HACKERS] [PATCH] Move all am-related reloption code into src/backend/access/[am-name] and get rid of relopt_kind for custom AM

2017-11-27 Thread Nikolay Shaplov
In a message of November 28, 2017 11:23:38, Michael Paquier wrote:
> On Tue, Nov 28, 2017 at 11:22 AM, Michael Paquier
> 
>  wrote:
> > On Tue, Nov 14, 2017 at 4:44 PM, Michael Paquier
> > 
> >  wrote:
> >> On Mon, Oct 30, 2017 at 9:49 AM, Nikolay Shaplov  
wrote:
> >>> In a message of September 3, 2017 11:45:43, Alvaro Herrera
> >>> wrote:
> >>> 
> >>> Yet another git rebase. This patch can be applied to master branch
> >>> again.
> >>> 
> >>> For this patch no review needed now, as I will offer part of it (one
> >>> that
> >>> refuses to set toast reloptions when there is no TOAST table) as a
> >>> separate
> >>> patch.
> >> 
> >> This patch needs a rebase, there are conflicts with HEAD.
> > 
> > Bumping the patch to next CF as no updates have happened for the last two
> > weeks.
> Er, returned with feedback.. This will be my last patch for the day.
Maybe you are right. This patch should be split into smaller patches. I will
do it bit by bit when I have time. But I see no reason for the big one to
remain in the commitfest, as it is not for direct commit.

PS. I will do the rebase soon; this patch overlaps with two recent commits,
so it requires very careful rebasing.


-- 
Do code for fun. Can do it for money (Perl & C/C++ ~10h/week)



Re: [HACKERS] PATCH: pgbench - option to build using ppoll() for larger connection counts

2017-11-27 Thread Fabien COELHO


Hello Michaël,


Could you rebase the v11 patch?


This patch has been waiting for a rebase for more than three weeks as
of today, I am marking it as returned with feedback. It would be a
good idea to reply to Robert's input in
https://www.postgresql.org/message-id/ca+tgmoyybnv-ddhvpmxlb2nx2sqenjirtwhmdeazucgotb2...@mail.gmail.com.


ISTM that this was done: If -c is high enough, pgbench fails without the 
patch.


--
Fabien.

Re: [HACKERS] Useless code in ExecInitModifyTable

2017-11-27 Thread Etsuro Fujita

(2017/11/28 11:18), Michael Paquier wrote:

On Tue, Sep 5, 2017 at 12:41 PM, Ryan Murphy  wrote:

The new status of this patch is: Waiting on Author


This status is misleading, so I switched it back to "needs review"
(please be careful about that!).


I think I forgot to change that status.  Sorry about that.


I can still apply the patch
correctly. Sorry I am not taking the time to have a meaningful opinion
about this patch. The patch passes all regression tests. As I am on
this entry, I am bumping the patch to next CF..


Ok, thanks!

Best regards,
Etsuro Fujita



Re: simplehash: tb->sizemask = 0

2017-11-27 Thread Tom Lane
Tomas Vondra  writes:
> I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:

> if (tb->size == SH_MAX_SIZE)
> tb->sizemask = 0;
> else
> tb->sizemask = tb->size - 1;

> Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0
> (i.e. no bits set)?

Yeah, which is very obviously broken: for one thing, the Asserts
in SH_NEXT/SH_PREV would surely go off.
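
(For illustration, assuming buckets are picked as hash & sizemask as the
surrounding code does, a zero mask sends every hash to bucket 0; a tiny
standalone check:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t sizemask = 0;      /* what SH_COMPUTE_PARAMETERS sets at SH_MAX_SIZE */
    uint32_t hashes[] = {0x1, 0xDEADBEEF, 0xFFFFFFFF};

    for (int i = 0; i < 3; i++)
        printf("hash %08x -> bucket %u\n",
               (unsigned) hashes[i], (unsigned) (hashes[i] & sizemask));
    return 0;                   /* every bucket index prints as 0 */
}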

(Why are those assertions, anyway, and not test-and-elog?
I do not think an assertion failure is a suitable way to
report "hash table full".)

> I don't think we're building hash tables with 2^32 buckets, though.

What this proves is that nobody has ever tested the behavior at
SH_MAX_SIZE.  I would suggest building a test version with that
set small enough to be conveniently reachable, and then exercising
the behavior as the limit is approached and reached.

regards, tom lane



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Ashutosh Bapat
On Tue, Nov 28, 2017 at 3:04 AM, Masahiko Sawada  wrote:
> On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska  wrote:
>> Masahiko Sawada  wrote:
>>
>>> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
>>>  wrote:
>>> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
>>> > wrote:
>>> >>
>>> >> Because I don't want to break the current user semantics. that is,
>>> >> currently it's guaranteed that the subsequent reads can see the
>>> >> committed result of previous writes even if the previous transactions
>>> >> were distributed transactions. And it's ensured by writer side. If we
>>> >> can make the reader side ensure it, the backend process don't need to
>>> >> wait for the resolver process.
>>> >>
>>> >> The waiting backend process are released by resolver process after the
>>> >> resolver process tried to resolve foreign transactions. Even if
>>> >> resolver process failed to either connect to foreign server or to
>>> >> resolve foreign transaction the backend process will be released and
>>> >> the foreign transactions are leaved as dangling transaction in that
>>> >> case, which are processed later. Also if resolver process takes a long
>>> >> time to resolve foreign transactions for whatever reason the user can
>>> >> cancel it by Ctl-c anytime.
>>> >>
>>> >
>>> > So, there's no guarantee that the next command issued from the
>>> > connection *will* see the committed data, since the foreign
>>> > transaction might not have committed because of a network glitch
>>> > (say). If we go this route of making backends wait for resolver to
>>> > resolve the foreign transaction, we will have add complexity to make
>>> > sure that the waiting backends are woken up in problematic events like
>>> > crash of the resolver process OR if the resolver process hangs in a
>>> > connection to a foreign server etc. I am not sure that the complexity
>>> > is worth the half-guarantee.
>>> >
>>>
>>> Hmm, maybe I was wrong. I now think that the waiting backends can be
>>> woken up only in following cases;
>>> - The resolver process succeeded to resolve all foreign transactions.
>>> - The user did the cancel (e.g. ctl-c).
>>> - The resolver process failed to resolve foreign transaction for a
>>> reason of there is no such prepared transaction on foreign server.
>>>
>>> In other cases the resolver process should not release the waiters.
>>
>> I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
>> effort is needed to ensure that backend does not keep waiting if the resolver
>> can't finish it's work in forseable future. But this effort is not worth
>> because by waking the backend up you might prevent the next transaction from
>> seeing the changes the previous one tried to make."
>>
>> On the other hand, your last comments indicate that you try to be even more
>> stringent in letting the backend wait. However even this stringent approach
>> does not guarantee that the next transaction will see the data changes made 
>> by
>> the previous one.
>>
>
> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user. In
> other words, the backend should not be waken up and the resolver
> should continue to resolve at certain intervals even if the resolver
> fails to connect to the foreign server or fails to resolve it. This is
> similar to what synchronous replication guaranteed today. Keeping this
> semantics is very important for users. Note that the reading a
> consistent result by concurrent reads is a separated problem.

The question I have is how we would deal with a foreign server that is
not available for a longer duration due to a crash, an extended network
outage, etc. An example is the foreign server crashing/getting disconnected
after PREPARE but before COMMIT/ROLLBACK was issued. The backend would
remain blocked for a much longer duration without the user having an idea
of what's going on. Maybe we should add some timeout.

>
> The read result including foreign servers can be inconsistent if the
> such transaction is cancelled or the coordinator server crashes during
> two-phase commit processing. That is, if there is in-doubt transaction
> the read result can be inconsistent, even for subsequent reads. But I
> think this behaviour can be accepted by users. For the resolution of
> in-doubt transactions, the resolver process will try to resolve such
> transactions after the coordinator server recovered. On the other
> hand, for the reading a consistent result on such situation by
> subsequent reads, for example, we can disallow backends to inquiry SQL
> to the foreign server if a foreign transaction of the foreign server
> is remained.

+1 for the last sentence. If we do that, we don't need the backend to
be blocked by resolver since a subsequent read accessing 

Re: Commit fest 2017-11

2017-11-27 Thread Michael Paquier
On Wed, Nov 1, 2017 at 11:30 PM, Michael Paquier
 wrote:
> On Wed, Nov 1, 2017 at 9:04 AM, Michael Paquier
>  wrote:
>> Anybody willing to take the hat of the commit fest manager? If nobody,
>> I am fine to take the hat as default choice this time.
>
> And now it is open. Let's the fest begin.

The current commit fest is coming to an end, and as many may have
noticed, I have begun classifying patches depending on their status.
This will likely take a couple of days. As a last push, I would like
to point out that there are 22 patches marked as ready for committer:
https://commitfest.postgresql.org/15/?status=3.
-- 
Michael



Re: [HACKERS] [PATCH] Move all am-related reloption code into src/backend/access/[am-name] and get rid of relopt_kind for custom AM

2017-11-27 Thread Michael Paquier
On Tue, Nov 28, 2017 at 11:22 AM, Michael Paquier
 wrote:
> On Tue, Nov 14, 2017 at 4:44 PM, Michael Paquier
>  wrote:
>> On Mon, Oct 30, 2017 at 9:49 AM, Nikolay Shaplov  wrote:
> >>> In a message of September 3, 2017 11:45:43, Alvaro Herrera wrote:
>>>
>>> Yet another git rebase. This patch can be applied to master branch again.
>>>
>>> For this patch no review needed now, as I will offer part of it (one that
>>> refuses to set toast reloptions when there is no TOAST table) as a separate
>>> patch.
>>
>> This patch needs a rebase, there are conflicts with HEAD.
>
> Bumping the patch to next CF as no updates have happened for the last two 
> weeks.

Er, returned with feedback.. This will be my last patch for the day.
-- 
Michael



Re: [HACKERS] GnuTLS support

2017-11-27 Thread Michael Paquier
On Mon, Nov 27, 2017 at 2:37 PM, Michael Paquier
 wrote:
> On Mon, Nov 27, 2017 at 10:28 AM, Andreas Karlsson  wrote:
>> Hm, after reading more of our MSVC code it seems like building with MSVC
>> does not really use switch, people rather have to create a config.pl. Is
>> that correct? The MSVC scripts also seems to only support uuid-ossp which it
>> just calls $config->{uuid}.
>>
>> If so we could either have:
>>
>> $config->{ssl} = "openssl";
>> $config->{sslpath} = "/path/to/openssl";
>
> Using this one may actually finish by being cleaner as there is no
> need to cross-check for both options defined.

Andreas, as I can see that you are still actively working on this
patch, I am bumping it to next CF.
-- 
Michael



Re: [HACKERS] Useless code in ExecInitModifyTable

2017-11-27 Thread Michael Paquier
On Tue, Sep 5, 2017 at 12:41 PM, Ryan Murphy  wrote:
> The new status of this patch is: Waiting on Author

This status is misleading, so I switched it back to "needs review"
(please be careful about that!). I can still apply the patch
correctly. Sorry I am not taking the time to have a meaningful opinion
about this patch. The patch passes all regression tests. As I am on
this entry, I am bumping the patch to next CF..
-- 
Michael



Re: [HACKERS] Patch: add --if-exists to pg_recvlogical

2017-11-27 Thread Michael Paquier
On Fri, Sep 29, 2017 at 10:06 PM, Peter Eisentraut
 wrote:
> On 9/22/17 15:31, Peter Eisentraut wrote:
>> I suggest you create a patch that focuses on the --create part.
>>
>> You can create a separate follow-on patch for the --start part if you
>> wish, but I think that that patch would be rejected.
>
> I have moved this entry to the next commit fest, awaiting your updated
> patch.

Still waiting for an update, and marked as returned with feedback. It
seems to me that your patch adds a couple of option interactions, so
this should be double-checked with existing options, and a set of TAP
tests should be added as well.
-- 
Michael



Re: [HACKERS] PATCH: pgbench - break out timing data for initialization phases

2017-11-27 Thread Michael Paquier
On Tue, Sep 26, 2017 at 3:07 AM, Fabien COELHO  wrote:
>
> Hello Doug,
>
>> total time: 316.03 s (insert 161.60 s, commit 0.64 s, vacuum 60.77 s,
>> index 93.01 s)
>
>
> Definitely interesting.
>
> There is a "ready for committers" patch in the CF which extensively rework
> the initialization: it becomes customizable, and this approach may not work
> as is after that...
>
> Maybe you could investigate how it may be implemented on top of that?
>
> Either show times when the phases are performed computed, or maybe use some
> auxiliary data structure to keep the information (hmmm...).

The patch needs as well a rebase, I am marking it as returned with
feedback per lack of activity. When sending a new version, please make
sure to:
- reply to the feedback which has been provided previously.
- register a new version of the patch in the commit fest app.
Thanks,
-- 
Michael



Re: [HACKERS] proposal: Support Unicode host variable in ECPG

2017-11-27 Thread Michael Paquier
On Tue, Nov 21, 2017 at 4:02 PM, Michael Paquier
 wrote:
> The patch does not apply anymore and needs a rebase. There are also a
> bunch of whitespace errors, no documentation in the patch, and it
> seems to me that this patch introduces many concepts so it could be
> broken down into many individual steps. Finally, I strongly recommend
> that you review other's patches. You have two patches, including this
> one, registered into this commit fest but your name is showing nowhere
> as a reviewer. The patch you are proposing here is large, the usual
> recommendation being to review one patch of equal difficulty for each
> patch sent to keep the inflow and outflow of patches balanced.

With a high-level look, this patch requires a bit more work. I am
unsure as well if that's a feature which would prove to be useful, but
without documentation it is hard to form an opinion of it. I am
marking the patch as returned with feedback.
-- 
Michael



Re: [HACKERS] [PATCH] A hook for session start

2017-11-27 Thread Michael Paquier
On Tue, Nov 21, 2017 at 4:09 AM, Fabrízio de Royes Mello
 wrote:
> Due to some "Blackfriday" commitments I'll be able to work again with this
> patch on next week.

Okay, this has proved to require broader changes than thought first. I
am marking the patch as returned with feedback.
-- 
Michael



Re: [HACKERS] PATCH: pgbench - option to build using ppoll() for larger connection counts

2017-11-27 Thread Michael Paquier
On Fri, Nov 3, 2017 at 2:29 PM, Fabien COELHO  wrote:
> Could you rebase the v11 patch?

This patch has been waiting for a rebase for more than three weeks as
of today, I am marking it as returned with feedback. It would be a
good idea to reply to Robert's input in
https://www.postgresql.org/message-id/ca+tgmoyybnv-ddhvpmxlb2nx2sqenjirtwhmdeazucgotb2...@mail.gmail.com.
-- 
Michael



Re: [HACKERS] WAL logging problem in 9.4.3?

2017-11-27 Thread Michael Paquier
On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170913.174239.25978735.horiguchi.kyot...@lab.ntt.co.jp>
>> filterdiff seems to did something wrong..
>
> # to did...
>
> The patch is broken by filterdiff so I send a new patch made
> directly by git format-patch. I confirmed that a build completes
> with applying this.

To my surprise this patch still applies but fails recovery tests. I am
bumping it to next CF, for what will be its 8th registration as it is
for a bug fix, switching the status to "waiting on author".
-- 
Michael



Fix a typo in xact.c

2017-11-27 Thread Masahiko Sawada
Hi,

Attached patch for $subject.

s/recoreds/record/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_typo_in_xact_c.patch
Description: Binary data


Re: [HACKERS] Refactoring identifier checks to consistently use strcmp

2017-11-27 Thread Michael Paquier
On Tue, Nov 28, 2017 at 12:11 AM, Daniel Gustafsson  wrote:
>> The patch needs a rebase, and there are a couple of places that need
>> an extra lookup I think:
>> $ git grep defname -- *.c | grep strcasecmp | wc -l
>>  39
>
> Rebased and handled a few more places which I had either missed in the last
> round, or that had been added in the meantime.  “PARALLEL” in aggregatecmds.c
> is intentionally using pg_strcasecmp() due to the old-style syntax which is
> still supported.

This merits a comment; code readers may get confused.

> AFAICS this covers all relevant codepaths from the 39 above.

I was just looking at the tsearch code, which uses pg_strcasecmp, and
those options are defined with makeDefElem(), so you should switch to
strcmp in this case as well, no? If I patch the code myself I would get
an error when double-quoting, making those commands more consistent with
the rest of what you are patching here:
create extension unaccent;
alter text search dictionary unaccent (Rules = 'unaccent'); -- ok
alter text search dictionary unaccent (RuLes = 'unaccent'); -- ok
alter text search dictionary unaccent ("Rules" = 'unaccent'); -- error
-- 
Michael



Re: [PATCH] Atomic pgrename on Windows

2017-11-27 Thread Andres Freund
On 2017-11-28 09:47:45 +0900, Michael Paquier wrote:
> On Mon, Nov 27, 2017 at 3:28 PM, Alexander Korotkov
>  wrote:
> > Attached patch atomic-pgrename-windows-1.patch fixes this problem.  It
> > appears to be possible to atomically replace file on Windows – ReplaceFile()
> > does that.  ReplaceFile() requires target file to exist, this is why we
> > still need to call MoveFileEx() when it doesn't exist.
> 
> Do you think that it could be safer to unlink the target file first
> with pgunlink()? This way you make sure that the target file is
> removed and not locked. This change makes me worrying about the
> introduction of more race conditions.

That seems like a *seriously* bad idea. What if we crash in between the
unlink and the rename?


I'm confused about the need for this. Shouldn't opening all files with
FILE_SHARE_DELETE normally take care of this? See
https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
"Note  Delete access allows both delete and rename operations."

Is there an external process active that doesn't set that flag? Are we
missing setting it somewhere?

Greetings,

Andres Freund



Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread Michael Paquier
On Tue, Nov 28, 2017 at 9:40 AM, David Steele  wrote:
> Perhaps I'm the one who is misunderstanding. If you propose a patch I'll be
> happy to review it, though I doubt there is a lot to be gained even if it
> would be a better implementation.

OK. I'll keep that in mind. Thanks for the input.
-- 
Michael



Re: [PATCH] Atomic pgrename on Windows

2017-11-27 Thread Michael Paquier
On Mon, Nov 27, 2017 at 3:28 PM, Alexander Korotkov
 wrote:
> Attached patch atomic-pgrename-windows-1.patch fixes this problem.  It
> appears to be possible to atomically replace file on Windows – ReplaceFile()
> does that.  ReplaceFile() requires target file to exist, this is why we
> still need to call MoveFileEx() when it doesn't exist.
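
(For orientation, a bare-bones sketch of that ReplaceFile()-with-fallback
approach; this is not the attached patch, and error handling is reduced to
a minimum.)

#include <windows.h>

/*
 * Replace "to" with "from" atomically.  ReplaceFile() needs the target to
 * exist already; if it does not, fall back to MoveFileEx().
 */
int
rename_replace(const char *from, const char *to)
{
    if (ReplaceFileA(to, from, NULL, 0, NULL, NULL))
        return 0;
    if (GetLastError() == ERROR_FILE_NOT_FOUND &&
        MoveFileExA(from, to, MOVEFILE_REPLACE_EXISTING))
        return 0;
    return -1;
}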

Do you think that it could be safer to unlink the target file first
with pgunlink()? This way you make sure that the target file is
removed and not locked. This change makes me worry about the
introduction of more race conditions.

> This patch is based on work of Victor Spirin who was asked by Postgres Pro
> to research this problem.

Victor has no community account so his name cannot be registered as a
co-author of the patch. I have added your name though.
-- 
Michael



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Robert Haas
On Mon, Nov 27, 2017 at 4:35 PM Masahiko Sawada 
wrote:

> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user. In
> other words, the backend should not be waken up and the resolver
> should continue to resolve at certain intervals even if the resolver
> fails to connect to the foreign server or fails to resolve it. This is
> similar to what synchronous replication guaranteed today. Keeping this
> semantics is very important for users. Note that the reading a
> consistent result by concurrent reads is a separated problem.
>
> The read result including foreign servers can be inconsistent if the
> such transaction is cancelled or the coordinator server crashes during
> two-phase commit processing. That is, if there is in-doubt transaction
> the read result can be inconsistent, even for subsequent reads. But I
> think this behaviour can be accepted by users. For the resolution of
> in-doubt transactions, the resolver process will try to resolve such
> transactions after the coordinator server recovered. On the other
> hand, for the reading a consistent result on such situation by
> subsequent reads, for example, we can disallow backends to inquiry SQL
> to the foreign server if a foreign transaction of the foreign server
> is remained.
>
> For the concurrent reads, the reading an inconsistent result can be
> happen even without in-doubt transaction because we can read data on a
> foreign server between PREPARE and COMMIT PREPARED while other foreign
> servers have committed. I think we should deal with this problem by
> other feature on reader side, for example, atomic visibility. If we
> have atomic visibility feature, we also can solve the above problem.


+1 to all of that.

...Robert

>
> --
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Isn't partition drop code seriously at risk of deadlock?

2017-11-27 Thread Amit Langote
On 2017/11/28 9:04, Tom Lane wrote:
> The complaint in bug #14927 that heap_drop_with_catalog is not bothering
> to check for SearchSysCache lookup failure (in code evidently newly added
> for the partition feature) seems to me to be only scratching the surface
> of what's wrong with that code.  In particular, I do not understand how
> it can possibly be deadlock-free to be trying to grab AccessExclusiveLock
> on a partition's parent table when we already have such a lock on the
> partition.  Which we do, or at least had better, long before we get to
> heap_drop_with_catalog.

We do that as of 258cef12540fa1 [1].  The lock on the parent is taken in
RangeVarCallbackForDropRelation() before the partition itself is locked.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=258cef12540fa1




Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread David Steele

On 11/27/17 7:11 PM, Michael Paquier wrote:

On Mon, Nov 27, 2017 at 11:06 PM, David Steele  wrote:

On 11/15/17 10:09 PM, Michael Paquier wrote:


read_backup_label() is a static function in the backend code. With #2
I do not imply to change the order of the elements written in the
backup_label file, just to make the way they are parsed smarter.


My point is that the order *could* be changed and it wouldn't be noticed
if the read function were improved as you propose.

I'm not against the idea, just think we should continue to enforce the
order unless we decide an interface break is OK.


I still don't quite understand here. The order the items are read does
not cause a backward-incompatible change. True that there is no reason
to change that either now.


Perhaps I'm the one who is misunderstanding. If you propose a patch I'll 
be happy to review it, though I doubt there is a lot to be gained even 
if it would be a better implementation.


Regards,
--
-David
da...@pgmasters.net



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Masahiko Sawada
On Tue, Nov 28, 2017 at 4:56 AM, Robert Haas  wrote:
> On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
>  wrote:
>> pg_prune_xid makes sense only for heap pages.  Once we introduce special
>> area for heap pages, we can move pg_prune_xid there and save some bytes in
>> index pages.  However, this is an optimization not directly related to
>> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
>> this optimization as well?  But if we have any doubts in this, it can be
>> removed with no problem.
>
> My first reaction is that changing the page format seems like a
> non-starter, because it would break pg_upgrade.

IIUC the xid-lsn ranges patch[1] doesn't require the page format
conversion. Is there any reason we cannot go that way? FWIW, I've
rebased the patch to current HEAD.

[1] https://www.postgresql.org/message-id/5242F8BF.6010807%40vmware.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Expression based aggregate transition / combine function invocation

2017-11-27 Thread Andres Freund
Hi,

this is part of my work to make expression evaluation JITable. In a lot
of analytics queries the major bottleneck is transition function
invocation (makes sense, hardly anyone wants to see billions of
rows). Therefore for JITing to be really valuable transition function
stuff needs to be JITable.

Excerpt from the preliminary commit message:

  Previously aggregate transition and combination functions were invoked
  by special case code in nodeAgg.c, evaluating input and filters
  separately using the expression evaluation machinery. That turns out
  to not be great for performance for several reasons:
  - repeated expression evaluations have some cost
  - the transition functions invocations are poorly predicted
  - filter and input computation had to be done separately
  - the special case code made it hard to implement JITing of the whole
transition function invocation

  Address this by building one large expression that computes input,
  evaluates filters, and invokes transition functions.

  This leads to moderate speedups in queries bottlenecked by aggregate
  computations, and enables large speedups for similar cases once JITing
  is done.


While this gets rid of a substantial amount of duplication between the
infrastructure for transition and combine functions, it still increases
codesize a bit.

Todo / open Questions:
- Location of transition function building functions. Currently they're
  in execExpr.c. That allows not to expose a bunch of functions local to
  it, but requires exposing some aggregate structs to the world. We
  could go the other way round as well.

- Right now we waste a bunch of time by having to access transition
  states indexed by both grouping set number and the transition state
  offset therein. It'd be nicer if we could cheaply reduce the number of
  indirections, but I can't quite see how without adding additional
  complications.
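
(Purely as a generic illustration of what trading the per-set pointer hop
for plain index arithmetic could look like, ignoring the complications
mentioned above; this is not part of the patch.)

#include <stdlib.h>

typedef struct PerGroupState
{
    double transValue;          /* stand-in for the real per-group data */
} PerGroupState;

int main(void)
{
    int nsets = 4;              /* grouping sets */
    int ntrans = 8;             /* transition states per set */

    /* today: load pergroups[setno] first, then index into it */
    PerGroupState **pergroups = malloc(nsets * sizeof(*pergroups));
    for (int s = 0; s < nsets; s++)
        pergroups[s] = calloc(ntrans, sizeof(PerGroupState));
    PerGroupState *a = &pergroups[2][5];

    /* flattened: one allocation, the index is computed directly */
    PerGroupState *flat = calloc(nsets * ntrans, sizeof(PerGroupState));
    PerGroupState *b = &flat[2 * ntrans + 5];

    (void) a;
    (void) b;
    return 0;
}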


Here's some example tpch Q01 timings:
master: 11628 ms (best of three)
patches: 10330 ms (best of three)

other tpch queries are similar; the aggregate improvement is a factor of
about 1.04.

Greetings,

Andres Freund
>From 3ff1371aae4615c15f6bcdc232170a3a295e8021 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 24 Nov 2017 14:14:35 -0800
Subject: [PATCH 1/3] Simplify representation of aggregate transition values a
 bit.

Previously aggregate transition values for hash and direct (sort + no
group by) aggregates were represented differently. That made upcoming
changes hard, so represent both as grouping set indexed array of
per-group data.

Author: Andres Freund
---
 src/backend/executor/nodeAgg.c | 94 --
 src/include/nodes/execnodes.h  |  6 ++-
 2 files changed, 48 insertions(+), 52 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index da6ef1a94c4..a939ae99fa8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -532,13 +532,13 @@ static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
 static void initialize_aggregates(AggState *aggstate,
-	  AggStatePerGroup pergroup,
-	  int numReset);
+	  AggStatePerGroup *pergroups,
+	  bool isHash, int numReset);
 static void advance_transition_function(AggState *aggstate,
 			AggStatePerTrans pertrans,
 			AggStatePerGroup pergroupstate);
-static void advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup,
-   AggStatePerGroup *pergroups);
+static void advance_aggregates(AggState *aggstate, AggStatePerGroup *sort_pergroups,
+   AggStatePerGroup *hash_pergroups);
 static void advance_combine_function(AggState *aggstate,
 		 AggStatePerTrans pertrans,
 		 AggStatePerGroup pergroupstate);
@@ -793,15 +793,14 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
  * If there are multiple grouping sets, we initialize only the first numReset
  * of them (the grouping sets are ordered so that the most specific one, which
  * is reset most often, is first). As a convenience, if numReset is 0, we
- * reinitialize all sets. numReset is -1 to initialize a hashtable entry, in
- * which case the caller must have used select_current_set appropriately.
+ * reinitialize all sets.
  *
  * When called, CurrentMemoryContext should be the per-query context.
  */
 static void
 initialize_aggregates(AggState *aggstate,
-	  AggStatePerGroup pergroup,
-	  int numReset)
+	  AggStatePerGroup *pergroups,
+	  bool isHash, int numReset)
 {
 	int			transno;
 	int			numGroupingSets = Max(aggstate->phase->numsets, 1);
@@ -812,31 +811,19 @@ initialize_aggregates(AggState *aggstate,
 	if (numReset == 0)
 		numReset = numGroupingSets;
 
-	for (transno = 0; transno < numTrans; transno++)
+	for (setno = 0; setno < numReset; setno++)
 	{
-		AggStatePerTrans pertrans = 

Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread Michael Paquier
On Mon, Nov 27, 2017 at 11:06 PM, David Steele  wrote:
> On 11/15/17 10:09 PM, Michael Paquier wrote:
>> On Thu, Nov 16, 2017 at 9:20 AM, David Steele  wrote:
>>> For this patch at least, I think we should do #1.  Getting rid of the order
>>> dependency is attractive but there may be other programs that are depending
>>> on the order.  I know you are not proposing to change the order now, but it
>>> *could* be changed with #2.
>>
>> read_backup_label() is a static function in the backend code. With #2
>> I do not imply to change the order of the elements written in the
>> backup_label file, just to make the way they are parsed smarter.
>
> My point is that the order *could* be changed and it wouldn't be noticed
> if the read function were improved as you propose.
>
> I'm not against the idea, just think we should continue to enforce the
> order unless we decide an interface break is OK.

I still don't quite understand here. The order the items are read does
not cause a backward-incompatible change. True that there is no reason
to change that either now.

>>> Also, I think DEBUG1 would be a more appropriate log level if any logging is
>>> done.
>>
>> OK, the label and time strings can include spaces. The label string is
>> assumed to not be bigger than 1028 bytes, and the timestamp is assumed
>> that it can get up to 128. So it is possible to use stuff like
>> fscanf("%127[^\n]\n") to avoid any buffer overflow problems on the
>> backend. If a backup_label is generated with strings longer than that,
>> then read_backup_label would fail to parse the next entries but that's
>> not really something to worry about IMO as those are only minor sanity
>> checks. Having a safer coding in the backend is more important, and
>> the new checks should not trigger any hard failures.
>
> I have tested and get an error as expected when I munge the backup_label
> file:
>
> FATAL:  invalid data in file "backup_label"
> DETAIL:  Timeline ID parsed is 2, but expected 1
>
> Everything else looks good so I will mark it ready for committer.

Thanks. This maps what I saw.
-- 
Michael



Isn't partition drop code seriously at risk of deadlock?

2017-11-27 Thread Tom Lane
The complaint in bug #14927 that heap_drop_with_catalog is not bothering
to check for SearchSysCache lookup failure (in code evidently newly added
for the partition feature) seems to me to be only scratching the surface
of what's wrong with that code.  In particular, I do not understand how
it can possibly be deadlock-free to be trying to grab AccessExclusiveLock
on a partition's parent table when we already have such a lock on the
partition.  Which we do, or at least had better, long before we get to
heap_drop_with_catalog.

regards, tom lane



Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Tom Lane
Magnus Hagander  writes:
> What I've been thinking about for that one before is if we could just
> invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
> collector for a single table or index stat. If autovacuum never needs to
> see a consistent view between multiple tables, I would think that's going
> to be a win in a lot of cases.

Perhaps.  Autovac might run through quite a few tables before it finds
one in need of processing, though, so I'm not quite certain this would
yield such great benefits in isolation.

> However, when it comes to the stats system, I'd say that on any busy system
> (which would be the ones to care about), the stats structures are still
> going to be *written* a lot more than they are read.

Uh, what?  The stats collector doesn't write those files at all except
on-demand.

regards, tom lane



Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Tom Lane
Dean Rasheed  writes:
> On 27 November 2017 at 16:35, Tom Lane  wrote:
>> On looking closer, the reason it's like that in Fujita-san's patch
>> is to minimize the API churn seen by FDW AddForeignUpdateTargets
>> functions, specifically whether they see a tlist that's before or
>> after what expand_targetlist() does.  I'm doubtful that the
>> potential savings is worth taking risks there.  In particular,
>> it seems like a good thing that expand_targetlist() verifies the
>> correct tlist ordering *after* the FDW function has acted.
>> So now my inclination is to leave this alone.

> Ah yes, that seems like a worthwhile check to keep. Never mind then.

Pushed with that and some cosmetic fiddling with comments and docs.
Thanks for the discussion!

regards, tom lane



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Alexander Korotkov
On Mon, Nov 27, 2017 at 10:56 PM, Robert Haas  wrote:

> On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
>  wrote:
> > pg_prune_xid makes sense only for heap pages.  Once we introduce special
> > area for heap pages, we can move pg_prune_xid there and save some bytes
> in
> > index pages.  However, this is an optimization not directly related to
> > 64-bit xids.  Idea is that if we anyway change page format, why don't
> apply
> > this optimization as well?  But if we have any doubts in this, it can be
> > removed with no problem.
>
> My first reaction is that changing the page format seems like a
> non-starter, because it would break pg_upgrade.  If we get the heap
> storage API working, then we could have a heap AM that works as it
> does today and a newheap AM with such changes, but I have a bit of a
> hard time imagining a patch that causes a hard compatibility break
> ever being accepted.


Thank you for raising this question.  There was a discussion about 64-bit
xids during PGCon 2017.  A couple of ways to support pg_upgrade were discussed.

1) We have a page layout version in the page (currently number 4).  So, we can
define a new page layout version 5.  Pages with the new layout version would
contain 64-bit base values for xid and multixact.  The question is how to
deal with pages of layout version 4.  If such a page has enough free space
to fit the extra 16 bytes, then it can be upgraded on the fly.  If it doesn't
contain enough space for that, then things become more complicated: we
can't upgrade it to the new format, but we still need to fit a new xmax value
there in case a tuple is updated or deleted.  pg_upgrade requires a
server restart.  Thus, once we set hint bits, a pre-pg_upgrade xmin is not
really meaningful – the corresponding xid is visible for every post-pg_upgrade
snapshot.  So, the idea is to use both the xmin and xmax tuple fields on such
an unupgradable page to store the 64-bit xmax.  This idea was proposed by me,
but was criticized by some session attendees (sorry, I don't recall who
they were) for its complexity and suspected overhead.

2) An alternative idea was to use unused bits in the page header.  If
we look for unused bits in pd_flags (3 bits of 16 are
used), pd_pagesize_version (we can keep 1 bit of 16 to distinguish between
the old and new format) and pd_special (we can keep 1 bit to distinguish
sequence pages), we can scrape together 43 bits.  That would be more than
enough for a single base value, because we definitely don't need all of the
lower 32 bits of the base value (21 bits is more than enough).  But I'm not
sure about two base values: if we leave 2 bits for the lower part of the base
value, that leaves us 19 bits for the high part of the base value.  This
solution would give us 2^51 maximum values for xids and multixacts.  I'm not
sure if that's enough to assume these counters are infinite.  AFAIK, there
are products on the market that have 48-bit transaction identifiers and don't
care about wraparound or anything...

A new heap AM for 64-bit xids is an interesting idea too.  I would even say
that the pluggable storage API being discussed now is excessive for this
particular purpose (but could still fit!), because in most aspects a heap
with 64-bit xids is exactly the same as the current heap (in contrast to a
heap with an undo log, for example).  The best-fitting API for heap with
64-bit xid support would be a pluggable heap page format.  But I don't think
that deserves a separate API, though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Masahiko Sawada
On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska  wrote:
> Masahiko Sawada  wrote:
>
>> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
>>  wrote:
>> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
>> > wrote:
>> >>
>> >> Because I don't want to break the current user semantics. that is,
>> >> currently it's guaranteed that the subsequent reads can see the
>> >> committed result of previous writes even if the previous transactions
>> >> were distributed transactions. And it's ensured by writer side. If we
>> >> can make the reader side ensure it, the backend process don't need to
>> >> wait for the resolver process.
>> >>
>> >> The waiting backend process are released by resolver process after the
>> >> resolver process tried to resolve foreign transactions. Even if
>> >> resolver process failed to either connect to foreign server or to
>> >> resolve foreign transaction the backend process will be released and
>> >> the foreign transactions are leaved as dangling transaction in that
>> >> case, which are processed later. Also if resolver process takes a long
>> >> time to resolve foreign transactions for whatever reason the user can
>> >> cancel it by Ctl-c anytime.
>> >>
>> >
>> > So, there's no guarantee that the next command issued from the
>> > connection *will* see the committed data, since the foreign
>> > transaction might not have committed because of a network glitch
>> > (say). If we go this route of making backends wait for resolver to
>> > resolve the foreign transaction, we will have add complexity to make
>> > sure that the waiting backends are woken up in problematic events like
>> > crash of the resolver process OR if the resolver process hangs in a
>> > connection to a foreign server etc. I am not sure that the complexity
>> > is worth the half-guarantee.
>> >
>>
>> Hmm, maybe I was wrong. I now think that the waiting backends can be
>> woken up only in following cases;
>> - The resolver process succeeded to resolve all foreign transactions.
>> - The user did the cancel (e.g. ctl-c).
>> - The resolver process failed to resolve foreign transaction for a
>> reason of there is no such prepared transaction on foreign server.
>>
>> In other cases the resolver process should not release the waiters.
>
> I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
> effort is needed to ensure that backend does not keep waiting if the resolver
> can't finish it's work in forseable future. But this effort is not worth
> because by waking the backend up you might prevent the next transaction from
> seeing the changes the previous one tried to make."
>
> On the other hand, your last comments indicate that you try to be even more
> stringent in letting the backend wait. However even this stringent approach
> does not guarantee that the next transaction will see the data changes made by
> the previous one.
>

What I'd like to guarantee is that the subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by user. In
other words, the backend should not be waken up and the resolver
should continue to resolve at certain intervals even if the resolver
fails to connect to the foreign server or fails to resolve it. This is
similar to what synchronous replication guaranteed today. Keeping this
semantics is very important for users. Note that the reading a
consistent result by concurrent reads is a separated problem.

A read involving foreign servers can return an inconsistent result if
such a transaction is cancelled or the coordinator server crashes
during two-phase commit processing. That is, if there is an in-doubt
transaction, the read result can be inconsistent even for subsequent
reads. But I think this behaviour can be accepted by users. For the
resolution of in-doubt transactions, the resolver process will try to
resolve such transactions after the coordinator server has recovered.
To get a consistent result for subsequent reads in such a situation,
we could, for example, disallow backends from issuing SQL to a foreign
server while a foreign transaction of that server remains unresolved.

For concurrent reads, an inconsistent result can happen even without
an in-doubt transaction, because we can read data on a foreign server
between PREPARE and COMMIT PREPARED while other foreign servers have
already committed. I think we should deal with this problem with a
separate feature on the reader side, for example atomic visibility,
which would also solve the above problem.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [PATCH] Fix crash in int8_avg_combine().

2017-11-27 Thread Hadi Moshayedi
On Sat, Nov 25, 2017 at 10:47 PM, Andres Freund 
wrote:

> diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
> > index 869c59dc85..2dc59e89cd 100644
> > --- a/src/include/utils/memutils.h
> > +++ b/src/include/utils/memutils.h
> > @@ -189,7 +189,7 @@ extern MemoryContext SlabContextCreate(MemoryContext
> parent,
> >   * Few callers should be interested in this, but tuplesort/tuplestore
> need
> >   * to know it.
> >   */
> > -#define ALLOCSET_SEPARATE_THRESHOLD  8192
> > +#define ALLOCSET_SEPARATE_THRESHOLD  16384
>
> Huh, what's that about in this context?
>

There is the following static assert:

StaticAssertStmt(ALLOC_CHUNK_LIMIT == ALLOCSET_SEPARATE_THRESHOLD, ...)

and ALLOC_CHUNK_LIMIT is defined as:

#define ALLOC_CHUNK_LIMIT (1 << (ALLOCSET_NUM_FREELISTS-1+ALLOC_MINBITS))

and there is this comment:

"CAUTION: ALLOC_MINBITS must be large enough so that 1<

Re: [PATCH] Fix crash in int8_avg_combine().

2017-11-27 Thread Hadi Moshayedi
On Sat, Nov 25, 2017 at 10:55 PM, Tom Lane  wrote:

> I believe we already dealt with this:
>
> Author: Tom Lane 
> Branch: REL_10_STABLE [619a8c47d] 2017-11-14 17:49:49 -0500
> Branch: REL9_6_STABLE [4a15f87d2] 2017-11-14 17:49:49 -0500
> Branch: REL9_5_STABLE [d4e38489f] 2017-11-14 17:49:49 -0500
>
> Prevent int128 from requiring more than MAXALIGN alignment.
>
> Our initial work with int128 neglected alignment considerations, an
> oversight that came back to bite us in bug #14897 from Vincent
> Lachenal.
> It is unsurprising that int128 might have a 16-byte alignment
> requirement;
> what's slightly more surprising is that even notoriously lax Intel
> chips
> sometimes enforce that.
>
> Raising MAXALIGN seems out of the question: the costs in wasted disk
> and
> memory space would be significant, and there would also be an on-disk
> compatibility break.  Nor does it seem very practical to try to allow
> some
> data structures to have more-than-MAXALIGN alignment requirement, as
> we'd
> have to push knowledge of that throughout various code that copies data
> structures around.
>
> The only way out of the box is to make type int128 conform to the
> system's
> alignment assumptions.  Fortunately, gcc supports that via its
> __attribute__(aligned()) pragma; and since we don't currently support
> int128 on non-gcc-workalike compilers, we shouldn't be losing any
> platform
> support this way.
>
> Although we could have just done pg_attribute_aligned(MAXIMUM_ALIGNOF)
> and
> called it a day, I did a little bit of extra work to make the code more
> portable than that: it will also support int128 on compilers without
> __attribute__(aligned()), if the native alignment of their 128-bit-int
> type is no more than that of int64.
>
> Add a regression test case that exercises the one known instance of the
> problem, in parallel aggregation over a bigint column.
>
> Back-patch of commit 751804998.  The code known to be affected only
> exists
> in 9.6 and later, but we do have some stuff using int128 in 9.5, so
> patch
> back to 9.5.
>
> Discussion: https://postgr.es/m/20171110185747.31519.28038@
> wrigleys.postgresql.org
>
> Does that solution not work for you?
>

It works for me. My REL_10_STABLE was out of date. A git pull fixed it.


Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Magnus Hagander
On Mon, Nov 27, 2017 at 7:53 PM, Robert Haas  wrote:

> On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane  wrote:
> >>> Mumble.  It's a property I'm pretty hesitant to give up, especially
> >>> since the stats views have worked like that since day one.  It's
> >>> inevitable that weakening that guarantee would break peoples' queries,
> >>> probably subtly.
> >
> >> You mean, queries against the stats views, or queries in general?  If
> >> the latter, by what mechanism would the breakage happen?
> >
> > Queries against the stats views, of course.
>
> Hmm.  Those are probably rare.  If we only took a snapshot of the
> statistics for the backends that explicitly access those views, that
> probably wouldn't be too crazy.
>
> Sorry if this is a stupid question, but how often and for what purpose
> do regular backends need the stats collector data for purposes other
> than querying the stats views?  I thought that the data was only used
> to decide whether to VACUUM/ANALYZE, and therefore would be accessed
> mostly by autovacuum, and for that you'd actually want the most
> up-to-date view of the stats for a particular table that is available,
> not any older snapshot.
>
>
Autovacuum resets its stats snapshot to make sure it sees fresh data. Autovacuum in particular can
probably be made a lot more efficient, because it only ever looks at one
relation at a time, I think.

What I've been thinking about for that one before is if we could just
invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
collector for a single table or index stat. If autovacuum never needs to
see a consistent view between multiple tables, I would think that's going
to be a win in a lot of cases.

I don't think regular backends use them at all. But anybody looking at the
stats views does, and it is pretty important there.

However, when it comes to the stats system, I'd say that on any busy system
(which would be the ones to care about), the stats structures are still
going to be *written* a lot more than they are read. We certainly don't
read them at the rate of once per transaction. A lot of the reads are also
limited to one database of course.

I wonder if we want to implement some sort of copy-on-read snapshot in the
stats collector itself. So instead of unconditionally publishing
everything, have the backends ask for it. When a backend asks for it, it
gets a "snapshot counter" or something from the stats collector, and on the
next write after that we do a copy-on-write if the snapshot is still
available. (No, I have not thought this through in detail.)

Or -- if we keep a per-database hashtable in dynamic shared memory (which
we can do now): can we copy it into local memory in the backend fast enough
that we can hold a lock and just queue up the stats updates during the
copy? If we can copy the complete structure, that would fix one of the
bigger bottlenecks with it today, which is that we dump and rebuild the
hashtables as we go through the tempfiles.
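To make the snapshot-counter idea above a little more concrete, here is a
very rough single-entry sketch -- all names are invented, and there is no
locking or shared-memory plumbing; it only shows the
copy-on-first-write-after-request behaviour I have in mind:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct TableCounters
{
    int64_t     n_tup_ins;
    int64_t     n_tup_upd;
    int64_t     n_tup_del;
} TableCounters;

typedef struct StatsEntry
{
    TableCounters   live;        /* always-current counters */
    TableCounters  *frozen;      /* lazily taken copy for readers, or NULL */
    uint64_t        req_counter; /* bumped when a reader asks for a snapshot */
    uint64_t        frz_counter; /* request the frozen copy belongs to */
} StatsEntry;

/* reader: register interest in a snapshot, get a counter back */
uint64_t
stats_request_snapshot(StatsEntry *e)
{
    return ++e->req_counter;
}

/* collector: before the first write after a new request, preserve the
 * state that pending readers are supposed to see */
void
stats_count_insert(StatsEntry *e, int64_t ntuples)
{
    if (e->req_counter != e->frz_counter)
    {
        if (e->frozen == NULL)
            e->frozen = malloc(sizeof(TableCounters));
        memcpy(e->frozen, &e->live, sizeof(TableCounters));
        e->frz_counter = e->req_counter;
    }
    e->live.n_tup_ins += ntuples;
}

/* reader: use the frozen copy if one was taken, otherwise the live one
 * (nothing was written since the request, so they are the same) */
const TableCounters *
stats_read(const StatsEntry *e)
{
    return e->frozen ? e->frozen : &e->live;
}

int
main(void)
{
    StatsEntry  e = {0};
    uint64_t    snap;

    stats_count_insert(&e, 10);     /* live = 10 */
    snap = stats_request_snapshot(&e);
    stats_count_insert(&e, 5);      /* freezes 10, then live = 15 */
    (void) snap;
    return stats_read(&e)->n_tup_ins == 10 ? 0 : 1;
}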

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: default range partition and constraint exclusion

2017-11-27 Thread Robert Haas
On Mon, Nov 27, 2017 at 4:04 AM, Kyotaro HORIGUCHI
 wrote:
> This is the story in my understanding.

Thanks, that's helpful.  Sorry I didn't have time yet to study this in
detail myself.  If we're routing tuples to a partition for which the
partition constraint is evaluating to null, that's OK, but if we're
routing tuples to a partition for which the partition constraint is
evaluating to false, that's a bug, and the right solution is to
correct either the tuple routing or the partition constraint.  In this
case, it looks like the tuple routing is working as expected, so that
would say we ought to fix the constraint.

I am out of time for today but will try to look at this some more tomorrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



simplehash: tb->sizemask = 0

2017-11-27 Thread Tomas Vondra
Hi,

I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:

if (tb->size == SH_MAX_SIZE)
tb->sizemask = 0;
else
tb->sizemask = tb->size - 1;

Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0
(i.e. no bits set)? At least that's what I get from

printf("%#x\n", (unsigned int)0);

That would mean SH_INITIAL_BUCKET/SH_NEXT/SH_PREV can only ever return
bucket 0, no?

I don't think we're building hash tables with 2^32 buckets, though.
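Just to illustrate the concern with a standalone snippet (the masking below
mirrors what SH_INITIAL_BUCKET/SH_NEXT effectively compute; the hash values
are made up):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* what SH_COMPUTE_PARAMETERS sets when size == SH_MAX_SIZE */
    uint32_t    sizemask = 0;
    uint32_t    hashes[] = {0xdeadbeefU, 0x12345678U, 0xffffffffU};

    for (int i = 0; i < 3; i++)
        printf("hash %#x -> bucket %u\n",
               (unsigned) hashes[i], (unsigned) (hashes[i] & sizemask));

    /* prints bucket 0 three times: with no mask bits set, every hash
     * collapses onto bucket 0 */
    return 0;
}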


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Peter Geoghegan
On Mon, Nov 27, 2017 at 11:56 AM, Robert Haas  wrote:
> On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
>  wrote:
>> pg_prune_xid makes sense only for heap pages.  Once we introduce special
>> area for heap pages, we can move pg_prune_xid there and save some bytes in
>> index pages.  However, this is an optimization not directly related to
>> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
>> this optimization as well?  But if we have any doubts in this, it can be
>> removed with no problem.
>
> My first reaction is that changing the page format seems like a
> non-starter, because it would break pg_upgrade.  If we get the heap
> storage API working, then we could have a heap AM that works as it
> does today and a newheap AM with such changes, but I have a bit of a
> hard time imagining a patch that causes a hard compatibility break
> ever being accepted.

I actually think that we could use that field in indexes for storing
an epoch. This could be used to avoid having to worry about
anti-wraparound VACUUM for deleted index pages that contain a cached
XID -- the one we compare to RecentGlobalXmin as part of the recycling
interlock. (I suggested this to Sawada-san at one point, in the
context of avoiding vacuuming indexes on large append-mostly tables.)

In any case, we'd hardly go to all that effort to save just 4 bytes in
the page header.

-- 
Peter Geoghegan



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Robert Haas
On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
 wrote:
> pg_prune_xid makes sense only for heap pages.  Once we introduce special
> area for heap pages, we can move pg_prune_xid there and save some bytes in
> index pages.  However, this is an optimization not directly related to
> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
> this optimization as well?  But if we have any doubts in this, it can be
> removed with no problem.

My first reaction is that changing the page format seems like a
non-starter, because it would break pg_upgrade.  If we get the heap
storage API working, then we could have a heap AM that works as it
does today and a newheap AM with such changes, but I have a bit of a
hard time imagining a patch that causes a hard compatibility break
ever being accepted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Robert Haas
On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane  wrote:
>>> Mumble.  It's a property I'm pretty hesitant to give up, especially
>>> since the stats views have worked like that since day one.  It's
>>> inevitable that weakening that guarantee would break peoples' queries,
>>> probably subtly.
>
>> You mean, queries against the stats views, or queries in general?  If
>> the latter, by what mechanism would the breakage happen?
>
> Queries against the stats views, of course.

Hmm.  Those are probably rare.  If we only took a snapshot of the
statistics for the backends that explicitly access those views, that
probably wouldn't be too crazy.

Sorry if this is a stupid question, but how often and for what purpose
do regular backends need the stats collector data for purposes other
than querying the stats views?  I thought that the data was only used
to decide whether to VACUUM/ANALYZE, and therefore would be accessed
mostly by autovacuum, and for that you'd actually want the most
up-to-date view of the stats for a particular table that is available,
not any older snapshot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Robert Haas
On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
 wrote:
> Hmmm. Okay, we must make the stats collector more efficient if we
> want to have additional counters with smaller significance in the
> table stats. Currently sizeof(PgStat_StatTabEntry) is 168
> bytes. The whole of the patchset increases it to 232 bytes. Thus
> the size of a stat file for a database with 10000 tables
> increases from about 1.7MB to 2.4MB.  DSM and shared dynahash is
> not dynamically expandable so placing stats on shared hash
> doesn't seem effective. Stats as a regular table could work but
> it seems too-much.

dshash, which is already committed, is both DSM-based and dynamically
expandable.

> Is it acceptable that adding a new section containing this new
> counters, which is just loaded as a byte sequence and parsing
> (and filling the corresponding hash) is postponed until a counter
> in the section is really requested?  The new counters need to be
> shown in a separate stats view (maybe named pg_stat_vacuum).

Still makes the stats file bigger.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

2017-11-27 Thread Jesper Pedersen

Hi Amit,

On 11/24/2017 12:00 AM, Amit Langote wrote:

On 2017/11/23 3:56, Jesper Pedersen wrote:
EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
t2.c = t1.b WHERE t2.d = 1;

I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
"WHERE" one is.


Did you mean to write ON t2.d = t1.b?  If so, equivalence class mechanism
will give rise to a t1.b = 1 and hence help prune t1's partition as well:



No, I meant 't2.c = t1.b'. If you take the same example but don't
partition, you will get the following plan:


test=# EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER 
JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;

  QUERY PLAN
--
 Nested Loop
   ->  Index Scan using idx_t2_d on t2
 Index Cond: (d = 1)
   ->  Index Only Scan using idx_t1_b_a on t1
 Index Cond: (b = t2.c)
(5 rows)

Maybe "5.10.2. Declarative Partitioning" could be expanded to include 
some general "guidelines" of where partition based plans should be 
checked against their non-partition counterparts (at least the first 
bullet in 5.10.1 says ".. in certain situations .."). Probably a 
separate patch from this.


[snip]


Should pruning of partitions for UPDATEs (where the partition key isn't
updated) and DELETEs be added to the TODO list?


Note that partition pruning *does* work for UPDATE and DELETE, but only if
you use list/range partitioning.  The reason it doesn't work in this case
(t1 is hash partitioned) is that the pruning is still based on constraint
exclusion in the UPDATE/DELETE case and constraint exclusion cannot handle
hash partitioning.



Thanks for your description.



I can see how that seems a bit odd.  If you use hash partitioning,
UPDATE/DELETE do not benefit from partition-pruning, even though SELECT
does.  That's because SELECT uses the new partition-pruning method (this
patch set) which supports hash partitioning, whereas UPDATE and DELETE use
constraint exclusion which doesn't.  It would be a good idea to make even
UPDATE and DELETE use the new method thus bringing everyone on the same
page, but that requires us to make some pretty non-trivial changes to how
UPDATE/DELETE planning works for inheritance/partitioned tables, which we
should undertake separately, imho.



Agreed.

Best regards,
 Jesper



Re: Do we accept doc changes to back branches?

2017-11-27 Thread Stephen Frost
Greetings,

* Joshua D. Drake (j...@commandprompt.com) wrote:
> I don't recall, do we allow non-bug fix (or what constitutes a bug)
> for back branches with docs? I have been reviewing the logical
> replication docs and they could use some love but I need to know
> which branch I should start work on.

The short answer, I believe, is to always start with master because we
certainly want doc fixes to be included in the next release (assuming
that documentation exists in master...  if it doesn't then start with
the latest release it's in).  Following that, patches for prior branches
are certainly welcome where they improve the documentation in a
meaningful way.  A good example is Dean's recent improvements on the RLS
documentation by adding a table which lists out the privileges and
policies matrix that RLS follows.

See commit 87c2a17fee784c7e1004ba3d3c5d8147da676783.

Thanks!

Stephen


signature.asc
Description: Digital signature


Do we accept doc changes to back branches?

2017-11-27 Thread Joshua D. Drake

-hackers,


I don't recall, do we allow non-bug fix (or what constitutes a bug) for 
back branches with docs? I have been reviewing the logical replication 
docs and they could use some love but I need to know which branch I 
should start work on.



Thanks,


JD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.org
* Unless otherwise stated, opinions are my own.   *




Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Dean Rasheed
On 27 November 2017 at 16:35, Tom Lane  wrote:
> I wrote:
>> Dean Rasheed  writes:
>>> A separate point -- it might be marginally more efficient to have the
>>> work of rewriteTargetListUD() done after expand_targetlist() to avoid
>>> the possible renumbering of the resjunk entries.
>
>> Hm.  It wouldn't save a lot, but yeah, doing it in this order seems
>> a bit silly when you put it like that.
>
> On looking closer, the reason it's like that in Fujita-san's patch
> is to minimize the API churn seen by FDW AddForeignUpdateTargets
> functions, specifically whether they see a tlist that's before or
> after what expand_targetlist() does.  I'm doubtful that the
> potential savings is worth taking risks there.  In particular,
> it seems like a good thing that expand_targetlist() verifies the
> correct tlist ordering *after* the FDW function has acted.
> So now my inclination is to leave this alone.
>

Ah yes, that seems like a worthwhile check to keep. Never mind then.

Regards,
Dean



Re: [HACKERS] Custom compression methods

2017-11-27 Thread Tomas Vondra
Hi,

On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade some old database or this bug happened after 
> insertion of all data to new database? I tried using your 'archie' 
> tool to download mailing lists and insert them to database, but
> couldn't catch any errors.
> 

I can trigger it pretty reliably with these steps:

git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
patch -p1 < ~/custom_compression_methods_v6.patch
./configure --enable-debug --enable-cassert \
 CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
 --prefix=/home/postgres/pg-compress
make -s clean && make -s -j4 install
cd contrib/
make -s clean && make -s -j4 install

export PATH=/home/postgres/pg-compress/bin:$PATH
pg_ctl -D /mnt/raid/pg-compress init
pg_ctl -D /mnt/raid/pg-compress -l compress.log start
createdb archie
cd ~/archie/sql/
psql archie < create.sql

~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1


I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed to work fine). If that's the case, I bet there
is a palloc that should have been palloc0, or something like that.
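As an illustration of that class of bug (the struct and field names here
are invented, not taken from the patch):

#include "postgres.h"

typedef struct CompressionScratch
{
    int         nitems;
    void       *items;      /* the field somebody forgot to initialize */
} CompressionScratch;

CompressionScratch *
make_scratch_buggy(void)
{
    CompressionScratch *cs = palloc(sizeof(CompressionScratch));

    cs->nitems = 0;
    /*
     * cs->items is left undefined: without RANDOMIZE_ALLOCATED_MEMORY it
     * often happens to be zero and the bug stays hidden, but with
     * randomized allocations it points to garbage and the first
     * "if (cs->items != NULL)" goes down the wrong path and crashes.
     */
    return cs;
}

CompressionScratch *
make_scratch_fixed(void)
{
    /* palloc0 zeroes the whole struct, so items is reliably NULL */
    return palloc0(sizeof(CompressionScratch));
}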

If you still can't reproduce that, I may give you access to this machine
so that you can debug it there.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Oliver Ford
On Mon, Nov 27, 2017 at 4:40 PM, Erik Rijkers  wrote:
> On 2017-11-27 17:34, Erik Rijkers wrote:
>>
>> On 2017-11-27 16:01, Oliver Ford wrote:
>>>
>>> Attached is it in bare diff form.
>>
>>
>> [0001-window-frame-v3.patch]
>>
>> Thanks, that did indeed fix it:
>>
>> make && make check  now  ok.
>>
>> There were errors in the doc build (unmatched tags); I fixed them in
>> the attached doc-patch (which should go on top of yours).
>
>
> 0001-window-frame-v3-fixtags.diff
>
> now attached, I hope...
>

Cheers here's v4 with the correct docs.


0001-window-frame-v4.patch
Description: Binary data


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-27 Thread Tomas Vondra
Hi,

Attached is an updated version of the patch series, fixing the issues
reported by Mark Dilger:

1) Fix fabs() issue in histogram.c.

2) Do not rely on extra_data being StdAnalyzeData, and instead look up
the LT operator explicitly. This also adds a simple regression test to
make sure ANALYZE on arrays works fine, but perhaps we should invent
some simple queries too.

3) I've removed / clarified some of the comments mentioned by Mark.

4) I haven't changed how the statistics kinds are defined in relation.h,
but I agree there should be a comment explaining how STATS_EXT_INFO_*
relate to StatisticExtInfo.kinds.

5) The most significant change happened histograms. There used to be two
structures for histograms:

  - MVHistogram - expanded (no deduplication etc.), result of histogram
build and never used for estimation

  - MVSerializedHistogram - deduplicated to save space, produced from
MVHistogram before storing in pg_statistic_ext, and used for
estimation

So there wasn't really any reason to expose the "non-serialized" version
outside histogram.c. It was just confusing and unnecessary, so I've
moved MVHistogram to histogram.c (and renamed it to MVHistogramBuild),
and renamed MVSerializedHistogram. And same for the MVBucket stuff.

So now we only deal with MVHistogram everywhere, except in histogram.c.

6) I've also made MVHistogram include a varlena header directly (and
be packed as a bytea), which allows us to store it without having to
call any serialization functions.
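For reference, the shape I mean is roughly the following (a simplified
sketch; the actual fields in the patch differ):

#include "postgres.h"

typedef struct MVHistogram
{
    int32       vl_len_;        /* varlena header, never accessed directly */
    uint32      magic;          /* magic constant, to detect corruption */
    uint32      type;           /* type of histogram */
    uint32      nbuckets;       /* number of buckets */
    uint32      ndimensions;    /* number of dimensions */
    /* deduplicated boundary values and bucket data follow in-line */
} MVHistogram;

/* after building, just SET_VARSIZE(histogram, total_len) and store it */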

I guess we should do (5) and (6) for the MCV lists too, as it seems more
convenient than the current approach. And perhaps even for the
statistics added in 9.6 (it does not change the storage format).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-multivariate-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip


Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Erik Rijkers

On 2017-11-27 17:34, Erik Rijkers wrote:

On 2017-11-27 16:01, Oliver Ford wrote:

Attached is it in bare diff form.


[0001-window-frame-v3.patch]

Thanks, that did indeed fix it:

make && make check  now  ok.

There were errors in the doc build (unmatched tags); I fixed them in
the attached doc-patch (which should go on top of yours).


0001-window-frame-v3-fixtags.diff

now attached, I hope...

--- doc/src/sgml/syntax.sgml.orig	2017-11-27 17:19:30.253810944 +0100
+++ doc/src/sgml/syntax.sgml	2017-11-27 17:20:55.515626931 +0100
@@ -1805,8 +1805,8 @@
 and the optional frame_clause
 can be one of
 
-{ RANGE | ROWS } frame_start [ frame_exclusion_clause ]
-{ RANGE | ROWS } BETWEEN frame_start AND frame_end [ frame_exclusion_clause ]
+{ RANGE | ROWS } frame_start [ frame_exclusion_clause ]
+{ RANGE | ROWS } BETWEEN frame_start AND frame_end [ frame_exclusion_clause ]
 
 where frame_start and frame_end can be
 one of
@@ -1817,7 +1817,7 @@
 value FOLLOWING
 UNBOUNDED FOLLOWING
 
-where the optional frame_exclusion_clause can be one of
+where the optional frame_exclusion_clause can be one of
 
 EXCLUDE CURRENT ROW
 EXCLUDE TIES
@@ -1888,24 +1888,24 @@
 

 The value PRECEDING and
-value FOLLOWING cases, when used
-in ROWS mode, indicate that the frame starts or ends the specified
-number of rows before or after the current row. In ROWS mode,
-value must be an integer expression not containing any variables,
+value FOLLOWING cases, when used
+in ROWS mode, indicate that the frame starts or ends the specified
+number of rows before or after the current row. In ROWS mode,
+value must be an integer expression not containing any variables,
 aggregate functions, or window functions.
-When used in RANGE mode, they indicate that the frame starts or ends when the value of
-each row's ORDER BY column is within the start value and end value bounds. In RANGE mode,
-value can be either an integer expression or a date/time interval.
-In RANGE mode, there must be exactly one ORDER BY column and if the column is an integer column,
-then value must be an integer.
-If it is a date/time column, then value must be an interval. In both modes,
+When used in RANGE mode, they indicate that the frame starts or ends when the value of
+each row's ORDER BY column is within the start value and end value bounds. In RANGE mode,
+value can be either an integer expression or a date/time interval.
+In RANGE mode, there must be exactly one ORDER BY column and if the column is an integer column,
+then value must be an integer.
+If it is a date/time column, then value must be an interval. In both modes,
 the value must not be null or negative; but it can be zero, which just selects the current row.

 

-For the frame_exclusion_clause, EXCLUDE CURRENT ROW
-excludes the current row from the frame. EXCLUDE TIES excludes any peers of the current row from the
-frame. EXCLUDE NO OTHERS does nothing, but is provided in order to optionally document the intention
+For the frame_exclusion_clause, EXCLUDE CURRENT ROW
+excludes the current row from the frame. EXCLUDE TIES excludes any peers of the current row from the
+frame. EXCLUDE NO OTHERS does nothing, but is provided in order to optionally document the intention
 not to exclude any other rows.

 


Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> A separate point -- it might be marginally more efficient to have the
>> work of rewriteTargetListUD() done after expand_targetlist() to avoid
>> the possible renumbering of the resjunk entries.

> Hm.  It wouldn't save a lot, but yeah, doing it in this order seems
> a bit silly when you put it like that.

On looking closer, the reason it's like that in Fujita-san's patch
is to minimize the API churn seen by FDW AddForeignUpdateTargets
functions, specifically whether they see a tlist that's before or
after what expand_targetlist() does.  I'm doubtful that the
potential savings is worth taking risks there.  In particular,
it seems like a good thing that expand_targetlist() verifies the
correct tlist ordering *after* the FDW function has acted.
So now my inclination is to leave this alone.

regards, tom lane



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Erik Rijkers

On 2017-11-27 16:01, Oliver Ford wrote:

Attached is it in bare diff form.


[0001-window-frame-v3.patch]

Thanks, that did indeed fix it:

make && make check  now  ok.

There were errors in the doc build (unmatched tags); I fixed them in the 
attached doc-patch (which should go on top of yours).


(In very limited testing I did not find any problems yet)


thanks,

Erik Rijkers





Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Tom Lane
Dean Rasheed  writes:
> I wonder if, years from now, it might look a bit odd that
> rewriteTargetListUD() is doing part of work of preptlist.c, is only
> called from there, and yet is located in the rewriter.

Yeah, I probably wouldn't have done it like this in a green field,
but maintaining traceability to the existing code is valuable IMO.

> A separate point -- it might be marginally more efficient to have the
> work of rewriteTargetListUD() done after expand_targetlist() to avoid
> the possible renumbering of the resjunk entries.

Hm.  It wouldn't save a lot, but yeah, doing it in this order seems
a bit silly when you put it like that.

regards, tom lane



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Oliver Ford
On Mon, Nov 27, 2017 at 12:06 PM, Oliver Ford  wrote:
> On Fri, Nov 24, 2017 at 3:08 PM, Erikjan Rijkers  wrote:
>>   SELECT pg_get_viewdef('v_window');
>> ! pg_get_viewdef
>> ! --
>> !   SELECT i.i,+
>> !  sum(i.i) OVER (ORDER BY i.i) AS sum_rows+
>>   FROM generate_series(1, 10) i(i);
>>   (1 row)
>>
>> --- 1034,1043 
>>   (10 rows)
>>
>>   SELECT pg_get_viewdef('v_window');
>> ! pg_get_viewdef
>> !
>> ---
>> !   SELECT i.i,
>> +
>> !  sum(i.i) OVER (ORDER BY i.i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
>> AS sum_rows+
>>   FROM generate_series(1, 10) i(i);
>>   (1 row)
>>
>>
>> This small hickup didn't prevent building an instance but obviously I
>> haven't done any real tests yet.
>>
>>
>> thanks,
>>
>>
>> Erik Rijkers
>
> After another clone and rebuild it works alright with the correct
> spacing on mine, so the attached v2 should all pass. I noticed that I
> hadn't added the exclusions clauses to the view defs code, so that's
> also in this patch with extra tests to cover it.

Sorry previous patch was in full-commit form and not just a diff.
Attached is it in bare diff form.


0001-window-frame-v3.patch
Description: Binary data


Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread David Steele
Hi Michael,

On 11/15/17 10:09 PM, Michael Paquier wrote:
> On Thu, Nov 16, 2017 at 9:20 AM, David Steele  wrote:
>> For this patch at least, I think we should do #1.  Getting rid of the order
>> dependency is attractive but there may be other programs that are depending
>> on the order.  I know you are not proposing to change the order now, but it
>> *could* be changed with #2.
> 
> read_backup_label() is a static function in the backend code. With #2
> I do not imply to change the order of the elements written in the
> backup_label file, just to make the way they are parsed smarter.

My point is that the order *could* be changed and it wouldn't be noticed
if the read function were improved as you propose.

I'm not against the idea, just think we should continue to enforce the
order unless we decide an interface break is OK.

>> Also, I think DEBUG1 would be a more appropriate log level if any logging is
>> done.
> 
> OK, the label and time strings can include spaces. The label string is
> assumed to not be bigger than 1028 bytes, and the timestamp is assumed
> that it can get up to 128. So it is possible to use stuff like
> fscanf("%127[^\n]\n") to avoid any buffer overflow problems on the
> backend. If a backup_label is generated with strings longer than that,
> then read_backup_label would fail to parse the next entries but that's
> not really something to worry about IMO as those are only minor sanity
> checks. Having a safer coding in the backend is more important, and
> the new checks should not trigger any hard failures.
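For reference, a minimal standalone sketch of the bounded-scanset parsing
being described (the line layout and buffer sizes here are simplified
assumptions, not the actual read_backup_label() code):

#include <stdio.h>

int
main(void)
{
    FILE       *fp = fopen("backup_label", "r");
    char        tstamp[128];
    char        label[1028];

    if (fp == NULL)
        return 1;

    /*
     * The field widths cap how many bytes fscanf may store, so an
     * oversized line cannot overflow the buffers; at worst the later
     * entries fail to parse and the sanity checks complain.
     */
    if (fscanf(fp, "START TIME: %127[^\n]\n", tstamp) == 1 &&
        fscanf(fp, "LABEL: %1027[^\n]\n", label) == 1)
        printf("backup \"%s\" started at %s\n", label, tstamp);

    fclose(fp);
    return 0;
}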

I have tested and get an error as expected when I munge the backup_label
file:

FATAL:  invalid data in file "backup_label"
DETAIL:  Timeline ID parsed is 2, but expected 1

Everything else looks good so I will mark it ready for committer.

Regards,
-- 
-David
da...@pgmasters.net



Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-11-27 Thread Юрий Соколов
2017-11-26 19:51 GMT+03:00 Юрий Соколов :
>
> 2017-11-06 18:07 GMT+03:00 Sokolov Yura :
> >
> > On 2017-10-20 11:54, Sokolov Yura wrote:
> >>
> >> Hello,
> >>
> >> On 2017-10-19 19:46, Andres Freund wrote:
> >>>
> >>> On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> 
>  > > +   init_local_spin_delay();
>  >
>  > The way you moved this around has the disadvantage that we now do
this -
>  > a number of writes - even in the very common case where the lwlock
can
>  > be acquired directly.
> 
>  Excuse me, I don't understand fine.
>  Do you complain against init_local_spin_delay placed here?
> >>>
> >>>
> >>> Yes.
> >>
> >>
> >> I could place it before perform_spin_delay under `if (!spin_inited)`
if you
> >> think it is absolutely must.
> >
> >
> > I looked at assembly, and remembered, that last commit simplifies
> > `init_local_spin_delay` to just two-three writes of zeroes (looks
> > like compiler combines 2*4byte write into 1*8 write). Compared to
> > code around (especially in LWLockAcquire itself), this overhead
> > is negligible.
> >
> > Though, I found that there is benefit in calling LWLockAttemptLockOnce
> > before entering loop with calls to LWLockAttemptLockOrQueue in the
> > LWLockAcquire (in there is not much contention). And this way, `inline`
> > decorator for LWLockAttemptLockOrQueue could be omitted. Given, clang
> > doesn't want to inline this function, it could be the best way.
>
> In attach version with LWLockAcquireOnce called before entering loop
> in LWLockAcquire.
>

Oh... there was a stupid error in the previous file.
Attached is a fixed version.


lwlock_v6.patch.gz
Description: GNU Zip compressed data


Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-27 Thread Amit Kapila
On Mon, Nov 27, 2017 at 2:45 PM, hubert depesz lubaczewski
 wrote:
> On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
>> > For example, check step 13 in https://explain.depesz.com/s/gNBd
>> >
>> > It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
>> >
>> > But with parallel execution it seems to be no longer the case.
>> >
>> > For example:
>> > https://explain.depesz.com/s/LTMp
>> > or
>> > https://explain.depesz.com/s/QHRi
>> > Should I, for explain.depesz.com, when dealing
>> > with partial* and parallel* nodes, use "loops=1" for calculation of
>> > exclusive/inclusive time? always? some other nodes?
>> >
>>
>> I am not sure what exactly inclusive or exclusive means, but for
>> parallel nodes, total stats are accumulated so you are seeing loops as
>> 'worker nodes + 1'.  Now, as presumably workers run parallelly, so I
>> think the actual time will be what will be shown in the node not
>> actual time * nloops.
>
> Please check the plans:
> https://explain.depesz.com/s/gNBd (step 13)
> and https://explain.depesz.com/s/LTMp (step 3)
>
> Inclusive time is basically "loops * actual time", so for Index Scan,
> which had 1873 loops and actual time of 3.002..3.016, we got 1873
> * 3.016 = 5648.968ms.
>
> In case of parallel workers it looks like the inclusive time is
> basically the upper value from actual time.
>
> The question now is: how can I tell which nodes should use "actual_time
> * 1" and which "actual_time * loops" time?
>
> Anything "below" "Gather"?
>

I think it is "actual_time * 1" for anything below Gather.
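Roughly, the rule in code form (the function and flag are invented for the
explain.depesz.com side of things):

/*
 * Nodes under a Gather run concurrently in the workers, so their
 * per-loop time should not be multiplied by the loop count.
 */
double
inclusive_time_ms(double actual_total_ms, double loops, int below_gather)
{
    return below_gather ? actual_total_ms : actual_total_ms * loops;
}

So the Index Scan from the first plan still gives 1873 * 3.016 ~= 5649 ms,
while a parallel node below Gather just contributes its actual total time
once.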

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transform for pl/perl

2017-11-27 Thread Anthony Bykov
On Wed, 15 Nov 2017 08:58:54 +
Aleksander Alekseev  wrote:

> The following review has been posted through the commitfest
> application: make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
> 
> Hello Anthony,
> 
> Great patch! Everything is OK and I almost agree with Pavel.
> 
> The only thing that I would like to suggest is to add a little more
> tests for various corner cases. For instance:
> 
> 1. Converting in both directions (Perl <-> JSONB) +/- infinity, NaN,
> MAX_INT, MIN_INT.
> 
> 2. Converting in both directions strings that contain non-ASCII
> (Russian / Japanese / etc) characters and special characters like \n,
> \t, \.
> 
> 3. Make sure that converting Perl objects that are not representable
> in JSONB (blessed hashes, file descriptors, regular expressions, ...)
> doesn't crash everything and shows a reasonable error message.
> 
> The new status of this patch is: Waiting on Author

Hello Aleksander,
Thank you for your review.

I've added more tests, and I had to change the behavior of the transform
when working with objects not representable in JSONB - now it will
throw an exception. You can find an example in the tests.

Please, find the 4-th version of patch in attachments.

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/contrib/Makefile b/contrib/Makefile
index 8046ca4..53d44fe 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -75,9 +75,9 @@ ALWAYS_SUBDIRS += sepgsql
 endif
 
 ifeq ($(with_perl),yes)
-SUBDIRS += hstore_plperl
+SUBDIRS += hstore_plperl jsonb_plperl
 else
-ALWAYS_SUBDIRS += hstore_plperl
+ALWAYS_SUBDIRS += hstore_plperl jsonb_plperl
 endif
 
 ifeq ($(with_python),yes)
diff --git a/contrib/jsonb_plperl/Makefile b/contrib/jsonb_plperl/Makefile
new file mode 100644
index 000..cd86553
--- /dev/null
+++ b/contrib/jsonb_plperl/Makefile
@@ -0,0 +1,40 @@
+# contrib/jsonb_plperl/Makefile
+
+MODULE_big = jsonb_plperl
+OBJS = jsonb_plperl.o $(WIN32RES)
+PGFILEDESC = "jsonb_plperl - jsonb transform for plperl"
+
+PG_CPPFLAGS = -I$(top_srcdir)/src/pl/plperl
+
+EXTENSION = jsonb_plperlu jsonb_plperl
+DATA = jsonb_plperlu--1.0.sql jsonb_plperl--1.0.sql
+
+REGRESS = jsonb_plperl jsonb_plperl_relocatability jsonb_plperlu jsonb_plperlu_relocatability
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/jsonb_plperl
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We must link libperl explicitly
+ifeq ($(PORTNAME), win32)
+# these settings are the same as for plperl
+override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment
+# ... see silliness in plperl Makefile ...
+SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
+else
+rpathdir = $(perl_archlibexp)/CORE
+SHLIB_LINK += $(perl_embed_ldflags)
+endif
+
+# As with plperl we need to make sure that the CORE directory is included
+# last, probably because it sometimes contains some header files with names
+# that clash with some of ours, or with some that we include, notably on
+# Windows.
+override CPPFLAGS := $(CPPFLAGS) $(perl_embed_ccflags) -I$(perl_archlibexp)/CORE
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
new file mode 100644
index 000..b5c0980
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -0,0 +1,239 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing required extension "plperl"
+-- test hash -> jsonb
+CREATE FUNCTION testHVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = {a => 1, b => 'boo', c => undef};
+return $val;
+$$;
+SELECT testHVToJsonb();
+  testhvtojsonb  
+-
+ {"a": 1, "b": "boo", "c": null}
+(1 row)
+
+-- test array -> jsonb
+CREATE FUNCTION testAVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = [{a => 1, b => 'boo', c => undef}, {d => 2}];
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+-- test scalar -> jsonb
+CREATE FUNCTION testSVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = 1;
+return $val;
+$$;
+SELECT testSVToJsonb();
+ testsvtojsonb 
+---
+ 1
+(1 row)
+
+-- test blessed scalar -> jsonb
+CREATE FUNCTION testBlessedToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+my $class = shift;
+my $tmp = { a=>"a", 1=>"1" };
+bless $tmp, $class;
+return $tmp;
+$$;
+SELECT testBlessedToJsonb();
+  testblessedtojsonb  
+--
+ {"1": "1", "a": "a"}
+(1 row)
+
+-- test blessed scalar -> jsonb
+CREATE FUNCTION 

Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-27 Thread hubert depesz lubaczewski
On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
> > For example, check step 13 in https://explain.depesz.com/s/gNBd
> >
> > It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
> >
> > But with parallel execution it seems to be no longer the case.
> >
> > For example:
> > https://explain.depesz.com/s/LTMp
> > or
> > https://explain.depesz.com/s/QHRi
> > Should I, for explain.depesz.com, when dealing
> > with partial* and parallel* nodes, use "loops=1" for calculation of
> > exclusive/inclusive time? always? some other nodes?
> >
> 
> I am not sure what exactly inclusive or exclusive means, but for
> parallel nodes, total stats are accumulated so you are seeing loops as
> 'worker nodes + 1'.  Now, as presumably workers run parallelly, so I
> think the actual time will be what will be shown in the node not
> actual time * nloops.

Please check the plans:
https://explain.depesz.com/s/gNBd (step 13)
and https://explain.depesz.com/s/LTMp (step 3)

Inclusive time is basically "loops * actual time", so for Index Scan,
which had 1873 loops and actual time of 3.002..3.016, we got 1873
* 3.016 = 5648.968ms.

In case of parallel workers it looks like the inclusive time is
basically the upper value from actual time.

The question now is: how can I tell which nodes should use "actual_time
* 1" and which "actual_time * loops" time?

Anything "below" "Gather"?
Anything starting with "Partial?"
Anything starting with "Parallel"?
Anything with "Worker" in node "description" in explain?

depesz



RE: How is the PostgreSQL debuginfo file generated

2017-11-27 Thread Rui Hai Jiang
Thank you for your help.



I have tried many of the ways you suggested.  I was not familiar with this
area, so I spent a lot of time trying.



At last, I found the easiest way to do this is to build a new SRPM package,
and then build the other RPMs from the new SRPM.



Thanks a lot,

Ruihai




From: Tom Lane 
Sent: Friday, November 24, 2017 12:50:37 AM
To: Rui Hai Jiang
Cc: pgsql-hack...@postgresql.org
Subject: Re: How is the PostgreSQL debuginfo file generated

Rui Hai Jiang  writes:
> I’m wondering how to build the debuginfo package from  the PostgreSQL source.
> For example to generate something like this one :   
> postgresql91-debuginfo.x86_64.

On platforms where such things exist, that's handled by the packaging
system, not by PG proper.  You should proceed by making a new SRPM
and building RPMs from that.

regards, tom lane


Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Amit Kapila
On Fri, Nov 24, 2017 at 4:03 PM, Alexander Korotkov
 wrote:
>> > 0002-heap-page-special-1.patch
>> > Putting xid and multixact bases into PageHeaderData would take extra 16
>> > bytes on index pages too.  That would be waste of space for indexes.
>> > This
>> > is why I decided to put bases into special area of heap pages.
>> > This patch adds special area for heap pages contaning prune xid and
>> > magic
>> > number.  Magic number is different for regular heap page and sequence
>> > page.
>> >
>>
>> uint16 pd_pagesize_version;
>> - TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
>>   ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
>>   } PageHeaderData;
>>
>> Why have you moved pd_prune_xid from page header?
>
>
> pg_prune_xid makes sense only for heap pages.  Once we introduce special
> area for heap pages, we can move pg_prune_xid there and save some bytes in
> index pages.  However, this is an optimization not directly related to
> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
> this optimization as well?
>

Sure, but I think this patch could have been proposed on top of your
main patch, not the other way around.  Another similar thing I have noticed is
below:

*** typedef struct CheckPoint
*** 39,45 
  TimeLineID PrevTimeLineID; /* previous TLI, if this record begins a new
   * timeline (equals ThisTimeLineID otherwise) */
  bool fullPageWrites; /* current full_page_writes */
- uint32 nextXidEpoch; /* higher-order bits of nextXid */
  TransactionId nextXid; /* next free XID */
  Oid nextOid; /* next free OID */
  MultiXactId nextMulti; /* next free MultiXactId */


I think if we have 64-bit transaction IDs, then we might not need the
epoch at all, but I think this can also be a separate change from the
main patch.  The point is that the main patch is already big enough
that we should not try to squeeze other related changes into it.

>  But if we have any doubts in this, it can be
> removed with no problem.
>
>>
>> > 0003-64bit-xid-1.patch
..
>> >
>>
>> It seems there is no README or some detailed explanation of how all
>> this works like how the value of pd_xid_base is maintained.  I don't
>> think there are enough comments in the patch to explain the things.  I
>> think it will be easier to understand and review the patch if you
>> provide some more details either in email or in the patch.
>
>
> OK.  I'm going to write README and include it into the patch.
>

Thanks.  Few assorted comments on the main patch:

1.
+ /*
+  * Ensure that given xid fits base of given page.
+  */
+ bool
+ heap_page_prepare_for_xid(Relation relation, Buffer buffer,
+   TransactionId xid, bool multi)

I couldn't find any use of the return value of this function.
Basically, if the specified xid can already update the page, it returns
false; otherwise it updates the tuples on the page and the base xid on
the page such that the new xid can update the tuple. So either you
should make this function return void, or split it into two parts such
that the first function checks whether the new xid can update the
tuple and, if so, proceeds with updating the tuple; otherwise, it
updates the base xid in the page and xmin/xmax in the tuples.

2.
heap_page_prepare_for_xid()
{
..
+ /* Find minimum and maximum xids in the page */
+ found = heap_page_xid_min_max(page, multi, , );
+
+ /* No items on the page? */
+ if (!found)
+ {
+ int64 delta;
+
+ if (!multi)
+ delta = (xid - FirstNormalTransactionId) - pageSpecial->pd_xid_base;
+ else
+ delta = (xid - FirstNormalTransactionId) - pageSpecial->pd_multi_base;
+
+ heap_page_shift_base(RelationNeedsWAL(relation) ? buffer : InvalidBuffer,
+ page, multi, delta);
+ MarkBufferDirty(buffer);
+ return false;
+ }
..
}

When there are no items on the page, what is the need to traverse all
items again (via heap_page_shift_base)? You can ideally shift the base
directly to (xid - FirstNormalTransactionId) in this case.

3.
heap_page_prepare_for_xid()
{
..
+ /* Can we just shift base on the page */
+ if (xid < base + FirstNormalTransactionId)
+ {
+ int64 freeDelta = MaxShortTransactionId - max,
+ requiredDelta = (base + FirstNormalTransactionId) - xid;
+
+ if (requiredDelta <= freeDelta)
+ {
+ heap_page_shift_base(RelationNeedsWAL(relation) ? buffer : InvalidBuffer,
+ page, multi, - (freeDelta + requiredDelta) / 2);
+ MarkBufferDirty(buffer);
+ return true;
+ }
+ }
+ else
+ {
+ int64 freeDelta = min - FirstNormalTransactionId,
+ requiredDelta = xid - (base + MaxShortTransactionId);
+
+ if (requiredDelta <= freeDelta)
+ {
+ heap_page_shift_base(RelationNeedsWAL(relation) ? buffer : InvalidBuffer,
+ page, multi, (freeDelta + requiredDelta) / 2);
+ MarkBufferDirty(buffer);
+ return true;
+ }
+ }
..
}

I think the above code doesn't follow the guidelines for writing WAL.
Currently, it (a) modifies the page, (b) writes WAL, and (c) marks the
buffer dirty. You need to perform step (c) before step (b).  Also, all
these steps need to be performed in a critical section.
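For reference, the usual order (per the transam README) looks roughly like
the sketch below; the function, rmgr id and info byte are placeholders, not
what the patch should literally use:

#include "postgres.h"

#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* placeholder sketch of a WAL-logged page modification */
void
shift_base_logged(Relation rel, Buffer buffer, bool multi, int64 delta)
{
    Page        page = BufferGetPage(buffer);

    START_CRIT_SECTION();

    /* (a) modify the page -- apply the shift by "delta" (and "multi") here */

    /* (c) mark the buffer dirty before writing WAL */
    MarkBufferDirty(buffer);

    /* (b) then emit the WAL record and stamp the page with its LSN */
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        /* rmgr id and info byte stand in for the patch's own record type */
        recptr = XLogInsert(RM_HEAP2_ID, 0);
        PageSetLSN(page, recptr);
    }

    END_CRIT_SECTION();
}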


Re: Typo in config_default.pl comment

2017-11-27 Thread Magnus Hagander
On Mon, Nov 27, 2017 at 2:21 AM, Andreas Karlsson  wrote:

> Hi,
>
> There is a --with-tls in the comments in config_default.pl which should
> be --with-tcl.
>

Oops. Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] UPDATE of partition key

2017-11-27 Thread Amit Khandekar
On 24 November 2017 at 10:52, Amit Langote
 wrote:
> On 2017/11/23 21:57, Amit Khandekar wrote:
>> If we collect the partition keys in expand_partitioned_rtentry(), we
>> need to pass the root relation also, so that we can convert the
>> partition key attributes to root rel descriptor. And the other thing
>> is, may be, we can check beforehand (in expand_inherited_rtentry)
>> whether the rootrte's updatedCols is empty, which I think implies that
>> it's not an UPDATE operation. If yes, we can just skip collecting the
>> partition keys.
>
> Yeah, it seems like a good idea after all to check in
> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
> and if so check if any of the updatedCols are partition keys.  If we find
> some, then it will suffice to just set a simple flag in the
> PartitionedChildRelInfo that will be created for the root table.  That
> should be done *after* we have visited all the tables in the partition
> tree including some that might be partitioned and hence will provide their
> partition keys.  The following block in expand_inherited_rtentry() looks
> like a good spot:
>
> if (rte->inh && partitioned_child_rels != NIL)
> {
> PartitionedChildRelInfo *pcinfo;
>
> pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

I am also working on your suggestion of moving the
convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Re: proposal - psql: possibility to specify sort for describe commands, when size is printed

2017-11-27 Thread Alexander Korotkov
On Tue, Nov 14, 2017 at 3:26 PM, Magnus Hagander 
wrote:

> On Mon, Nov 13, 2017 at 3:17 PM, Dean Rasheed 
> wrote:
>
>> On 28 October 2017 at 13:46, Pavel Stehule 
>> wrote:
>> > I though about Alexander proposal, and I am thinking so it can be
>> probably
>> > best if we respect psql design. I implemented two command suffixes
>> > (supported only when it has sense) "s" sorted by size and "d" as descent
>> >
>> > so list of tables can be sorted with commands:
>> >
>> > \dt+sd (in this case, the order is not strict), so command
>> > \dtsd+ is working too (same \disd+ or \di+sd)
>> >
>> > These two chars are acceptable. Same principle is used for \l command
>> >
>> > \lsd+ or \l+sd
>> >
>> > What do you think about it?
>> >
>>
>> I really hate that syntax. This is going to turn into an
>> incomprehensible mess, and isn't easily extended to support other
>> options.
>>
>
> +1. While useful in itself, I think it's definitely a dangerous pattern to
> go down, as it'll only get worse.
>
>
> I agree with people who have said they would prefer this to be
>> available as a per-command option rather than as a variable that you
>> have to set, but it needs a clearer syntax. I actually like Stephen's
>> idea of using a user-defined SQL snippet, because that's a familiar
>> syntax to people, and it avoids adding an ever-increasing number of
>> options to these commands. Instead, the syntax could simply be:
>>
>
> +1 here as well. And anybody who is actually going to need this level of
> control definitely will know SQL...
>
> And if one wants to save some "standard patterns", it should be doable to
> save the pattern itself in a variable and then use it with something like
> "\dt :mysort" and have it expand the normal way there.
>

+1
I agree, that would look better, especially with "standard patterns", which
could help with SQL snippets that are too long to type each time.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company