Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Amit Kapila
On Sun, Oct 11, 2015 at 10:09 AM, Tom Lane  wrote:
> Dmitry Vasilyev  writes:
> > The log you can see below:
> > ...
> > 2015-10-10 19:00:32 AST DEBUG:  cleaning up dynamic shared memory
control segment with ID 851401618
> > 2015-10-10 19:00:32 AST DEBUG:  invoking IpcMemoryCreate(size=290095104)
> > 2015-10-10 19:00:42 AST FATAL:  pre-existing shared memory block is
still in use
> > 2015-10-10 19:00:42 AST HINT:  Check if there are any old server
processes still running, and terminate them.
>
..
>
> If I had to guess, on the basis of no evidence, I'd wonder whether the
> DSM code broke it; there is evidently at least one DSM segment in play
> in your use-case.  But that's only a guess.
>

There is some possibility, based on the above DEBUG messages, that
DSM could cause this problem, but I think the last message ("pre-existing
shared memory block is still in use") won't be logged for DSM.  We create
the new DSM segment in the code path dsm_postmaster_startup() ->
dsm_impl_op() -> dsm_impl_windows():

dsm_impl_windows()
{
..
if (op == DSM_OP_CREATE)
..
}

Basically, in this path we try to recreate the DSM segment with a
different name if creation fails with an ERROR_ALREADY_EXISTS error.

To diagnose the cause of the problem, I think we can write a diagnostic
patch that does the following two things:

1. Increase the loop count below in win32_shmem.c from 10 to 50 or 100
(or, instead of the loop count, we can increase the sleep time):
PGSharedMemoryCreate()
{
..
for (i = 0; i < 10; i++)
..
if (GetLastError() == ERROR_ALREADY_EXISTS)
{
..
Sleep(1000);
continue;
}
..
}

2. Add more log messages, both in win32_shmem.c and in the DSM-related
code, to help us narrow down the problem.

If you find this a reasonable approach to diagnosing the root cause of
the problem, I can work on writing a diagnostic patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Tom Lane
Dmitry Vasilyev  writes:
> On Сб, 2015-10-10 at 10:55 -0500, Tom Lane wrote:
>> and (b) you still haven't convinced me that you had an actual service
>> stop, and not just that the recovery time was longer than psql would
>> wait before retrying the connection.

> The log you can see below:
> ...
> 2015-10-10 19:00:32 AST DEBUG:  cleaning up dynamic shared memory control 
> segment with ID 851401618
> 2015-10-10 19:00:32 AST DEBUG:  invoking IpcMemoryCreate(size=290095104)
> 2015-10-10 19:00:42 AST FATAL:  pre-existing shared memory block is still in 
> use
> 2015-10-10 19:00:42 AST HINT:  Check if there are any old server processes 
> still running, and terminate them.

Thanks for providing some detail!  It's clear from the above log excerpt
that we're timing out after 10 seconds in win32_shmem.c's version of
PGSharedMemoryCreate, because CreateFileMapping is still reporting that
the old shared memory segment still exists.  When we last discussed this
sort of problem in
http://www.postgresql.org/message-id/flat/49fa3b6f.6080...@dunslane.net
there was no evidence that such a failure could persist for longer than a
second or two.  Now it seems that on your machine the failure state can
persist for at least 10 seconds, but I don't know why.

If I had to guess, on the basis of no evidence, I'd wonder whether the
DSM code broke it; there is evidently at least one DSM segment in play
in your use-case.  But that's only a guess.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Michael Paquier
On Sun, Oct 11, 2015 at 8:54 AM, Ali Akbar  wrote:
> C:\Windows\system32>taskkill /F /PID 2080
> SUCCESS: The process with PID 2080 has been terminated.

taskkill /f *forcefully* terminates the targeted process [1]. Isn't
that equivalent to a kill -9? If you headshot a backend process on
Linux with kill -9, the instance won't restart either.
[1]: 
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/taskkill.mspx?mfr=true
-- 
Michael




Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-10-10 Thread Peter Geoghegan
On Fri, Sep 25, 2015 at 2:39 PM, Jeff Janes  wrote:
> This needs a rebase, there are several conflicts in 
> src/backend/executor/nodeAgg.c

I attached a revised version of the second patch in the series, fixing
this bitrot.

I also noticed a bug in tuplesort's TSS_SORTEDONTAPE case with the
previous patch, in the case where no final on-the-fly merge step is
required (no merge step is required at all, because replacement
selection managed to produce only one run). The function mergeruns()
previously "abandoned" abbreviated keys only ahead of a merge step, if
there was one. When there was only one run (requiring no merge), it
happened to keep its state consistent with abbreviated keys still
being in use. That didn't matter before, because abbreviated keys were
only for tuplesort to compare, but that's different now.

That bug is fixed in this revision by reordering things within
mergeruns(). The previous order of the two things that were switched
is not at all significant (I should know, I wrote that code).

Thanks
-- 
Peter Geoghegan
From a82f0f723e1f5206d9c19b1b277acc0625aa54b1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 6 Jul 2015 13:37:26 -0700
Subject: [PATCH 2/2] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c|  2 +-
 src/backend/executor/nodeAgg.c | 20 ++---
 src/backend/executor/nodeSort.c|  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 ++-
 src/backend/utils/sort/tuplesort.c | 74 +++---
 src/include/utils/tuplesort.h  |  4 +-
 6 files changed, 93 insertions(+), 42 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..bb61018 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3024,7 +3024,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-  &ts_val, &ts_isnull);
+  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			indexcursor = (ItemPointer) DatumGetPointer(ts_val);
 		}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..2972180 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -539,7 +539,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true,
+	aggstate->sort_slot, NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -894,8 +895,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -913,6 +914,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (pertrans->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -932,7 +935,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-			  true, newVal, isNull))
+			  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -950,6 +953,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(&pertrans->equalfns[0],
 		 oldVal, *newVal)
 		{
@@ -965,6 +969,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -1002,6 +1007,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = pertrans->uniqslot;
 	int			numTransInputs = per

Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-10-10 Thread Peter Geoghegan
On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
 wrote:
> My test cases are:

Note that my text caching and unsigned integer comparison patches have
moved the baseline down quite noticeably. I think that my mobile
processor out-performs the Xeon you used for this, which seems a
little odd even taking the change in baseline performance into account.

> set work_mem ='1GB';
> create table t1 as select md5(random()::text) from
> generate_series(1,1000);
>
> Times are in milliseconds. Median and average over 10 runs.
>
> Test 1
> select count(distinct md5) from t1;
>
>            Master      Patched
>  Median    10,965.77   10,986.30 (99.81%)
>  Average   10,983.63   11,013.55 (99.73%)

> Are you seeing any speedup from any of these on your hardware?

I gather that 10,965.77 here means 10,965 milliseconds, since that
roughly matches what I get.

For the sake of simplicity, I will focus on your test 1 as a baseline.
Note that I ran VACUUM FREEZE before any testing was performed, just
in case.

On my laptop:

postgres=# \dt+ t1
  List of relations
 Schema | Name | Type  | Owner |  Size  | Description
--------+------+-------+-------+--------+-------------
 public | t1   | table | pg| 651 MB |
(1 row)

Times for "Test 1", "select count(distinct md5) from t1":

Patch:

Time: 10076.870 ms
Time: 10094.873 ms
Time: 10125.253 ms  <-- median
Time: 10222.042 ms
Time: 10269.247 ms

Master:

Time: 10641.142 ms
Time: 10706.181 ms
Time: 10708.860 ms  <-- median
Time: 10725.426 ms
Time: 10781.398 ms

So, to answer your question: Yes, I can see a benefit for this query
on my test hardware (laptop), although it is not spectacular. It may
still be quite worthwhile.

I attach a revised version of the patch tested here, following
feedback from Andres. This should not make any difference to the
performance.

It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:

~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
-ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt

Then diff the file prefetch_disassembly.txt for each build to see what
the differences are in practice. Consider an extract of the output on
my system:

...
   0x28ee <+926>: callq  0x28f3 
   0x28f3 <+931>: nopl   0x0(%rax,%rax,1)
   0x28f8 <+936>: sub$0x1,%eax
   0x28fb <+939>: test   %eax,%eax
   0x28fd <+941>: mov%eax,0xd0(%rdi)
   0x2903 <+947>: jne0x25ce 
   0x2909 <+953>: jmpq   0x2710 
   0x290e <+958>: xchg   %ax,%ax
   0x2910 <+960>: mov0x58(%rdi),%rsi
   0x2914 <+964>: lea(%rax,%rax,2),%rax
   0x2918 <+968>: lea(%rsi,%rax,8),%rax
   0x291c <+972>: mov0x30(%rax),%rax
   0x2920 <+976>: prefetchnta (%rax)
   0x2923 <+979>: mov$0x1,%eax
   0x2928 <+984>: jmpq   0x2712 
   0x292d <+989>: nopl   (%rax)
...

Notably, there is a prefetchnta instruction here.

Note that I'm going away on vacation in about a week. I wanted to give
people feedback on various things before then, since it was overdue.
FYI, after Thursday I will be very unlikely to answer e-mail for a
couple of weeks.

-- 
Peter Geoghegan
From ed1cf0675470c3c198ff140f71f3a40adcd4d02a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sort-intensive
operations considerably.  For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(i.e.  The operation following actually "performing" the sort, but
before a tuplesort_end() call is made as a B-Tree spool is destroyed.)

Similarly, ordered set aggregates (all cases except the datumsort case
with a pass-by-value type), and regular heap tuplesorts benefit to about
the same degree.  The optimization is only used when sorts fit in
memory, though.

Also, prefetch a few places ahead within the analogous "fetching" point
in tuplestore.c.  This appears to offer similar benefits in certain
cases.  For example, queries involving large common table expressions
significantly benefit.
---
 config/c-compiler.m4| 17 +
 configure   | 31 +++
 configure.in|  1 +
 src/backend/utils/sort/tuplesort.c  | 21 +
 src/backend/utils/sort/tuplestore.c | 13 +
 src/include/c.h | 14 ++
 src/include/pg_config.h.in  |  3 +++
 src/include/pg_config.h.win32   |  3 +++
 src/i

Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Ali Akbar
Greetings,

2015-10-11 0:18 GMT+07:00 Pavel Stehule :

>
> 2015-10-10 18:04 GMT+02:00 Dmitry Vasilyev :
>
>>
>> On Сб, 2015-10-10 at 10:55 -0500, Tom Lane wrote:
>> > Dmitry Vasilyev  writes:
> > > I have written that the service stopped. This action is repeatable.
> > > You can run the command 'psql -c "do $$ unpack p,1x8 $$ language
> > > plperlu;"'
> > > and after this the Windows service will stop.
>> >
>>
>
> So it is expected behavior: after any client backend crashes
> unexpectedly, the server is restarted.
>

I can confirm this too. In Linux (I use Fedora 22), this is what happens
when a backend process is killed:

=== 1. before:
$ sudo systemctl status postgresql.service
postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled)
   Active: active (running) since Jum 2015-10-09 16:25:43 WIB; 1 day 14h ago
  Process: 778 ExecStart=/usr/bin/pg_ctl start -D ${PGDATA} -s -o -p
${PGPORT} -w -t 300 (code=exited, status=0/SUCCESS)
  Process: 747 ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA}
(code=exited, status=0/SUCCESS)
 Main PID: 783 (postgres)
   CGroup: /system.slice/postgresql.service
   ├─  783 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
   ├─  812 postgres: logger process
   ├─  821 postgres: checkpointer process
   ├─  822 postgres: writer process
   ├─  823 postgres: wal writer process
   ├─  824 postgres: autovacuum launcher process
   ├─  825 postgres: stats collector process
   └─17181 postgres: postgres test [local] idle


=== 2. killing and attempt to reconnect:
$ sudo kill 17181

test=# select 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.


=== 3. service status after:
$ sudo systemctl status postgresql.service
postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled)
   Active: active (running) since Jum 2015-10-09 16:25:43 WIB; 1 day 14h ago
  Process: 778 ExecStart=/usr/bin/pg_ctl start -D ${PGDATA} -s -o -p
${PGPORT} -w -t 300 (code=exited, status=0/SUCCESS)
  Process: 747 ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGDATA}
(code=exited, status=0/SUCCESS)
 Main PID: 783 (postgres)
   CGroup: /system.slice/postgresql.service
   ├─  783 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
   ├─  812 postgres: logger process
   ├─  821 postgres: checkpointer process
   ├─  822 postgres: writer process
   ├─  823 postgres: wal writer process
   ├─  824 postgres: autovacuum launcher process
   ├─  825 postgres: stats collector process
   └─17422 postgres: postgres test [local] idle

===

The service status is still active (running), and a new process (17422)
handles the client.


But this is what happens on Windows (Windows 7 32-bit, PostgreSQL 9.4):

=== 1. before:
C:\Windows\system32>sc queryex postgresql-9.4

SERVICE_NAME: postgresql-9.4
TYPE   : 10  WIN32_OWN_PROCESS
STATE  : 4  RUNNING
(STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
WIN32_EXIT_CODE: 0  (0x0)
SERVICE_EXIT_CODE  : 0  (0x0)
CHECKPOINT : 0x0
WAIT_HINT  : 0x0
PID: 3716
FLAGS  :



=== 2. killing & attempt to reconnect:
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           2080
(1 row)

C:\Windows\system32>taskkill /F /PID 2080
SUCCESS: The process with PID 2080 has been terminated.

postgres=# select 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

=== 3. service status after:
C:\Windows\system32>sc query postgresql-9.4

SERVICE_NAME: postgresql-9.4
TYPE   : 10  WIN32_OWN_PROCESS
STATE  : 1  STOPPED
WIN32_EXIT_CODE: 0  (0x0)
SERVICE_EXIT_CODE  : 0  (0x0)
CHECKPOINT : 0x0
WAIT_HINT  : 0x0

===

The client cannot reconnect. The service is dead. This is nasty, because
any client can exploit a segfault bug like the one in Perl that Dmitry
mentioned upthread, and the PostgreSQL service goes down.

Note: killing the backend process with pg_terminate_backend() does not
cause this behavior. The client reconnects normally, and the service
keeps running.

Regards,
Ali Akbar


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-10-10 Thread Michael Paquier
On Sat, Oct 10, 2015 at 10:52 PM, Amir Rohan  wrote:
> On 10/10/2015 04:32 PM, Michael Paquier wrote:
> I was arguing that it's an on-going task that would do
> better if it had a TODO list, instead of "ideas for tests"
> being scattered across 50-100 messages spanning a year or
> more in one thread or another. You may disagree.

Let's be clear. I am fully in line with your point.
-- 
Michael




Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-10-10 Thread Peter Geoghegan
On Fri, Oct 9, 2015 at 5:54 PM, Robert Haas  wrote:
> I think that is true.  I spent some time thinking about whether the
> way you used INT_MIN as a sentinel value should be changed around
> somehow, but ultimately I decided that it wasn't too bad and that
> suggesting something else would be pointless kibitzing.  I also tried
> to think of scenarios in which this would lose, and I'm not totally
> convinced that there aren't any, but I'm convinced that, if they
> exist, I don't know what they are.  Since the patch did deliver a
> small improvement on my test cases and on yours, I think we might as
> well have it in the tree.  If some pathological scenario shows up
> where it turns out to hurt, we can always fix it then, or revert it if
> need be.

That seems very reasonable.

I noticed that there is still one comment that I really should have
removed as part of this work. The comment didn't actually add any new
information for 9.5, but is now obsolete. Attached patch removes it
entirely.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d545c34..3978b1e 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2056,10 +2056,6 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 	int			len;
 	uint32		hash;
 
-	/*
-	 * Abbreviated key representation is a pass-by-value Datum that is treated
-	 * as a char array by the specialized comparator bttextcmp_abbrev().
-	 */
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
 	memset(pres, 0, sizeof(Datum));



Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Pavel Stehule
2015-10-10 18:04 GMT+02:00 Dmitry Vasilyev :

> Hello Tom!
>
> On Сб, 2015-10-10 at 10:55 -0500, Tom Lane wrote:
> > Dmitry Vasilyev  writes:
> > > I have written that the service stopped. This action is repeatable.
> > > You can run the command 'psql -c "do $$ unpack p,1x8 $$ language
> > > plperlu;"'
> > > and after this the Windows service will stop.
> >
> > Well, (a) that probably means that your plperl installation is
> > broken,
> > and (b) you still haven't convinced me that you had an actual service
> > stop, and not just that the recovery time was longer than psql would
> > wait before retrying the connection.  Can you start a fresh psql
> > session after waiting a few seconds?
> >
> >   regards, tom lane
>
> This is a known bug in Perl:
>
> perl -e ' unpack p,1x8'
> Segmentation fault (core dumped)
>

So it is expected behavior: after any client backend crashes
unexpectedly, the server is restarted.

Regards

Pavel


>
> The Postgres backend crashes, and the Windows service stops:
>
> C:\Users\vadv>sc query postgresql-X64-9.4 | findstr /i "STATE"
> STATE  : 1  STOPPED
>
>
> The log you can see below:
>
> 2015-10-10 19:00:13 AST LOG:  database system was interrupted; last
> known up at 2015-10-10 18:54:47 AST
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 2 to 2
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 2 to 2
> 2015-10-10 19:00:13 AST DEBUG:  checkpoint record is at 0/16A01C8
> 2015-10-10 19:00:13 AST DEBUG:  redo record is at 0/16A01C8; shutdown
> TRUE
> 2015-10-10 19:00:13 AST DEBUG:  next transaction ID: 0/678; next OID:
> 16393
> 2015-10-10 19:00:13 AST DEBUG:  next MultiXactId: 1; next
> MultiXactOffset: 0
> 2015-10-10 19:00:13 AST DEBUG:  oldest unfrozen transaction ID: 667, in
> database 1
> 2015-10-10 19:00:13 AST DEBUG:  oldest MultiXactId: 1, in database 1
> 2015-10-10 19:00:13 AST DEBUG:  transaction ID wrap limit is
> 2147484314, limited by database with OID 1
> 2015-10-10 19:00:13 AST DEBUG:  MultiXactId wrap limit is 2147483648,
> limited by database with OID 1
> 2015-10-10 19:00:13 AST DEBUG:  starting up replication slots
> 2015-10-10 19:00:13 AST LOG:  database system was not properly shut
> down; automatic recovery in progress
> 2015-10-10 19:00:13 AST DEBUG:  resetting unlogged relations: cleanup 1
> init 0
> 2015-10-10 19:00:13 AST LOG:  redo starts at 0/16A0230
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
> 2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
> 1663/12135/12057; tid 0/3
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
> 2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
> 1663/12135/12059; tid 1/3
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
> 2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
> 1663/12135/12060; tid 1/2
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
> 2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
> 1663/12135/11979; tid 31/63
> 2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
> 2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
> 1663/12135/11984; tid 16/34
>

Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Dmitry Vasilyev
Hello Tom!

On Сб, 2015-10-10 at 10:55 -0500, Tom Lane wrote:
> Dmitry Vasilyev  writes:
> > I have written that the service stopped. This action is repeatable.
> > You can run the command 'psql -c "do $$ unpack p,1x8 $$ language
> > plperlu;"'
> > and after this the Windows service will stop.
> 
> Well, (a) that probably means that your plperl installation is
> broken,
> and (b) you still haven't convinced me that you had an actual service
> stop, and not just that the recovery time was longer than psql would
> wait before retrying the connection.  Can you start a fresh psql
> session after waiting a few seconds?
> 
>   regards, tom lane

This is a known bug in Perl:

perl -e ' unpack p,1x8'
Segmentation fault (core dumped)

The Postgres backend crashes, and the Windows service stops:

C:\Users\vadv>sc query postgresql-X64-9.4 | findstr /i "STATE"
STATE  : 1  STOPPED


The log you can see below:

2015-10-10 19:00:13 AST LOG:  database system was interrupted; last
known up at 2015-10-10 18:54:47 AST
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 5 to 13
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 2 to 2
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 2 to 2
2015-10-10 19:00:13 AST DEBUG:  checkpoint record is at 0/16A01C8
2015-10-10 19:00:13 AST DEBUG:  redo record is at 0/16A01C8; shutdown
TRUE
2015-10-10 19:00:13 AST DEBUG:  next transaction ID: 0/678; next OID:
16393
2015-10-10 19:00:13 AST DEBUG:  next MultiXactId: 1; next
MultiXactOffset: 0
2015-10-10 19:00:13 AST DEBUG:  oldest unfrozen transaction ID: 667, in
database 1
2015-10-10 19:00:13 AST DEBUG:  oldest MultiXactId: 1, in database 1
2015-10-10 19:00:13 AST DEBUG:  transaction ID wrap limit is
2147484314, limited by database with OID 1
2015-10-10 19:00:13 AST DEBUG:  MultiXactId wrap limit is 2147483648,
limited by database with OID 1
2015-10-10 19:00:13 AST DEBUG:  starting up replication slots
2015-10-10 19:00:13 AST LOG:  database system was not properly shut
down; automatic recovery in progress
2015-10-10 19:00:13 AST DEBUG:  resetting unlogged relations: cleanup 1
init 0
2015-10-10 19:00:13 AST LOG:  redo starts at 0/16A0230
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/12057; tid 0/3
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/12059; tid 1/3
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/12060; tid 1/2
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/11979; tid 31/63
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/11984; tid 16/34
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/11889; tid 67/5
2015-10-10 19:00:13 AST DEBUG:  mapped win32 error code 80 to 17
2015-10-10 19:00:13 AST CONTEXT:  xlog redo insert: rel
1663/12135/11894; tid 9/132
2015-10-10 19:00:13 AST DEBUG:  mapped win32

Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Tom Lane
Dmitry Vasilyev  writes:
> I have written that the service stopped. This action is repeatable.
> You can run the command 'psql -c "do $$ unpack p,1x8 $$ language plperlu;"'
> and after this the Windows service will stop.

Well, (a) that probably means that your plperl installation is broken,
and (b) you still haven't convinced me that you had an actual service
stop, and not just that the recovery time was longer than psql would
wait before retrying the connection.  Can you start a fresh psql
session after waiting a few seconds?

regards, tom lane




Re: [HACKERS] Improve the concurrency of vacuum full table and select statement on the same relation

2015-10-10 Thread Tom Lane
Jinyu  writes:
> Proposal: vacuum full table takes an ExclusiveLock on the relation instead of
> an AccessExclusiveLock at start. It can't block select statements before
> calling "finish_heap_swap", and select statements are safe because vacuum full
> table copies tuples from the old relation to the new relation before calling
> "finish_heap_swap". But it must take an AccessExclusiveLock on the relation
> when calling "finish_heap_swap" in order to block select statements on the
> same relation.

> This solution can improve the concurrency. The following shows the reasons.

What it's more likely to do is cause the vacuum full to fail altogether,
after doing a lot of work.  Lock upgrade is a bad thing because it tends
to result in deadlocks.
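To make the deadlock hazard concrete, here is a toy Python sketch (not PostgreSQL code; the lock model is deliberately simplified, with both transactions treated as plain share-holders). Two transactions each hold a shared lock and then try to upgrade to exclusive; neither upgrade can complete while the other's share is held, so both stall:

```python
import threading
import time

class ShareUpgradeLock:
    """Toy shared/exclusive lock that supports upgrade attempts with a timeout."""

    def __init__(self):
        self.cond = threading.Condition()
        self.sharers = set()

    def acquire_shared(self, who):
        with self.cond:
            self.sharers.add(who)

    def try_upgrade(self, who, timeout):
        # An upgrade can complete only once no *other* holder shares the lock.
        deadline = time.monotonic() + timeout
        with self.cond:
            while self.sharers - {who}:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False        # would wait forever: upgrade deadlock
                self.cond.wait(remaining)
            return True

    def release(self, who):
        with self.cond:
            self.sharers.discard(who)
            self.cond.notify_all()

lock = ShareUpgradeLock()
lock.acquire_shared("vacuum_full")   # stands in for VACUUM FULL's initial lock
lock.acquire_shared("select_txn")    # stands in for a concurrent reader's lock

results = {}

def upgrade(who):
    # Each side waits for the other to release first, so neither progresses.
    results[who] = lock.try_upgrade(who, timeout=0.3)

t1 = threading.Thread(target=upgrade, args=("vacuum_full",))
t2 = threading.Thread(target=upgrade, args=("select_txn",))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)   # both upgrade attempts time out
```

With a real lock manager there is no timeout escape: the deadlock detector has to abort one side, which for VACUUM FULL means discarding all the copy work already done.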

regards, tom lane




Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Dmitry Vasilyev
I have written that the service stopped. This action is repeatable.
You can run the command 'psql -c "do $$ unpack p,1x8 $$ language plperlu;"'
and after this the Windows service will stop.

On Sat, 2015-10-10 at 10:23 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Fri, Oct 9, 2015 at 5:52 AM, Dmitry Vasilyev
> > > postgres=# select 1;
> > > server closed the connection unexpectedly
> > > This probably means the server terminated abnormally
> > > before or while processing the request.
> > > The connection to the server was lost. Attempting reset: Failed.
> 
> > Hmm.  I'd expect that to cause a crash-and-restart cycle, just like
> > a
> > SIGQUIT would cause a crash-and-restart cycle on Linux.  But I
> > would
> > expect the server to end up running again at the end, not stopped.
> 
> It *is* a crash and restart cycle, or at least no evidence to the
> contrary has been provided.
> 
> Whether psql's attempt to do an immediate reconnect succeeds or not
> is
> very strongly timing-dependent, on both Linux and Windows.  It's easy
> for it to attempt the reconnection before crash recovery is complete,
> and then you get the above symptom.  Personally I get a "Failed"
> result
> more often than not, regardless of platform.
> 
>   regards, tom lane




[HACKERS] Improve the concurency of vacuum full table and select statement on the same relation

2015-10-10 Thread Jinyu
Currently, vacuum full table takes an AccessExclusiveLock on the relation at
start, and a select statement takes an AccessShareLock on the relation. So
'vacuum full table' blocks select statements on the same table until it is
committed, and select statements block 'vacuum full table' until they finish.
The concurrency is very bad.

Proposal: vacuum full table takes an ExclusiveLock on the relation instead of
an AccessExclusiveLock at start. It can't block select statements before
calling "finish_heap_swap", and select statements are safe because vacuum full
table copies tuples from the old relation to the new relation before calling
"finish_heap_swap". But it must take an AccessExclusiveLock on the relation
when calling "finish_heap_swap" in order to block select statements on the
same relation.

This solution can improve the concurrency. The following shows the reasons.
1. The function "copy_heap_data", which copies tuples from the old relation to
the new relation, takes most of the elapsed time of vacuum full table. And it
holds only an ExclusiveLock on the relation while "copy_heap_data" runs. So
select statements on the same relation are not blocked during most of the
elapsed time of vacuum full table.
2. The elapsed time of "finish_heap_swap" is very short, so the blocking time
window is very short.

This proposal can also improve the concurrency of cluster table and select
statements, because the execution steps of cluster table are similar to vacuum
full table. Select statements are safe before cluster table calls
"finish_heap_swap".

Please let me know if I am missing something.

Jinyu Zhang
thanks

Re: [HACKERS] Postgres service stops when I kill client backend on Windows

2015-10-10 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 9, 2015 at 5:52 AM, Dmitry Vasilyev
>> postgres=# select 1;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.

> Hmm.  I'd expect that to cause a crash-and-restart cycle, just like a
> SIGQUIT would cause a crash-and-restart cycle on Linux.  But I would
> expect the server to end up running again at the end, not stopped.

It *is* a crash and restart cycle, or at least no evidence to the
contrary has been provided.

Whether psql's attempt to do an immediate reconnect succeeds or not is
very strongly timing-dependent, on both Linux and Windows.  It's easy
for it to attempt the reconnection before crash recovery is complete,
and then you get the above symptom.  Personally I get a "Failed" result
more often than not, regardless of platform.
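The timing window Tom describes can be reproduced with any TCP server. A small Python sketch (plain sockets, nothing PostgreSQL-specific; the 0.5-second delay stands in for crash recovery): a client that gives up immediately sees the connection refused, while one that keeps retrying succeeds once "recovery" finishes.

```python
import socket
import threading
import time

def wait_for_server(host, port, timeout_s=10.0, interval_s=0.1):
    """Retry a TCP connection until it succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(interval_s)
    return False

# Demo: the "server" starts listening only after a short recovery delay.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
port = listener.getsockname()[1]

def delayed_listen():
    time.sleep(0.5)          # simulated crash-recovery window
    listener.listen(1)

threading.Thread(target=delayed_listen, daemon=True).start()

immediate_ok = wait_for_server("127.0.0.1", port, timeout_s=0.1)  # gives up too early
patient_ok = wait_for_server("127.0.0.1", port, timeout_s=5.0)    # retries until up
print(immediate_ok, patient_ok)
```

This mirrors psql's "Attempting reset: Failed." symptom: the single immediate reconnect loses the race against recovery, while waiting a few seconds and trying again succeeds.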

regards, tom lane




Re: [HACKERS] WIP: Rework access method interface

2015-10-10 Thread Petr Jelinek

On 2015-10-05 19:25, Alexander Korotkov wrote:

On Sun, Oct 4, 2015 at 4:27 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

On Sat, Oct 3, 2015 at 5:07 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:

On 2015-10-03 08:27, Amit Kapila wrote:

On Fri, Oct 2, 2015 at 8:14 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
  >
  >
  > I agree about staying with one SQL-visible function.


Okay, this does not necessarily mean there should be only one
validation function in the C struct though. I wonder if it would
be more future proof to name the C interface as something else
than the current generic amvalidate. Especially considering that
it basically only does opclass validation at the moment (It's
IMHO saner in terms of API evolution to expand the struct with
more validator functions in the future compared to adding
arguments to the existing function).


I also agree with you that adding more arguments in future might
not be a good idea for an exposed API.  I don't know how much improvement
we can get if we use a structure and then keep on adding more members
to it based on future need, but at least that way it will be less
prone to breakage.

I think adding multiple validator functions is another option, but that
also doesn't sound like a good way as it can pose difficulty in
understanding the right version of API to be used.

I think the major priority is to keep compatibility. For now, a user can
easily define an invalid opclass and will just get an error at runtime.
Thus, the opclass validation looks like an improvement which is not
strictly needed. We can add new validator functions in the future but
make them not required. Thus, old access methods wouldn't lose
compatibility from this.


Yes, that was what I was thinking as well. We don't want to break
anything in this patch as it's mainly an API change, but we want to have
an API that can potentially evolve. I think evolving the API by adding more
interfaces in the *Routine struct has so far worked well for the FDW, for
example, and given that those structs are nodes, the individual pointers
get initialized to NULL automatically, so it's easy to add optional
interfaces (like validators) without breaking anything. Besides, it's
not unreasonable to expect that custom AM authors will have to check if
their implementation is compatible with a new major version.
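The NULL-initialized-pointer pattern described above can be sketched in miniature; Python stands in for C here, and the field names (aminsert, amvalidate_opclass) are illustrative, not the actual struct members:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class IndexAmRoutine:
    """Toy *Routine struct: optional callbacks default to None, so adding a
    new validator later does not break access methods written before it."""
    aminsert: Callable[[str], None]
    amvalidate_opclass: Optional[Callable[[str], bool]] = None

def validate(routine: IndexAmRoutine, opclass: str) -> bool:
    # An AM that predates the validator simply skips the check.
    if routine.amvalidate_opclass is None:
        return True
    return routine.amvalidate_opclass(opclass)

# An "old" AM built before the validator existed, and a "new" one after.
old_am = IndexAmRoutine(aminsert=lambda tup: None)
new_am = IndexAmRoutine(aminsert=lambda tup: None,
                        amvalidate_opclass=lambda oc: oc == "int4_ops")

print(validate(old_am, "anything"))     # old AM: no validator, treated as valid
print(validate(new_am, "bogus_ops"))    # new AM: validator rejects it
```

This is why adding new optional members to the struct keeps compatibility, whereas adding arguments to an existing function signature breaks every existing implementation at once.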


But this is also why I don't think it's a good idea to call the opclass
validator just "amvalidate" in the IndexAmRoutine struct, because it
implies it is the only validate function we'll ever have.



Other than the above gripe and the following spurious change, the patch 
seems good to me now.



 RelationInitIndexAccessInfo(Relation relation)
 {
HeapTuple   tuple;
-   Form_pg_am  aform;
Datum   indcollDatum;
Datum   indclassDatum;
Datum   indoptionDatum;
@@ -1178,6 +1243,7 @@ RelationInitIndexAccessInfo(Relation relation)
MemoryContext oldcontext;
int natts;
uint16  amsupport;
+   Form_pg_am  aform;



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-10-10 Thread Amir Rohan
On 10/10/2015 04:32 PM, Michael Paquier wrote:
> On Sat, Oct 10, 2015 at 9:04 PM, Amir Rohan wrote:
>> Now that v9 fixes the problem, here's a summary from going over the
>> entire thread one last time:
> 
> Thanks a lot for the summary of the events.
> 
>> # Windows and TAP sets
>> Noah (2015-03) mentioned TAP doesn't work on windows, and hoped
>> this would include some work on that.
>> IIUC, the new facilities and tests do run on Windows, but the focus was
>> on those and not on the preexisting TAP suite.
> 
> They do work on Windows, see 13d856e.
> 

Thanks, I did not know that.

>> # Test coverage (in the future)
>> Andres wanted a test for xid/multixid wraparound which also raises
>> the question of the tests that will need to be written in the future.
> 
> I recall that this would have needed extra functions on the backend...
> 
>> The patch focuses on providing facilities, while providing new coverage
>> for several features. There should be a TODO list on the wiki (bug
>> tracker, actually), where the list of tests to be written can be managed.
>> Some were mentioned in the thread (multi/xid wraparound
>> hot_standby_feedback, max_standby_archive_delay and
>> max_standby_streaming_delay? recovery_target_action? some in your
>> original list?), but threads
>> are precisely where these things get lost in the cracks.
> 
> Sure, that's an on-going task.
> 
>> # Directory structure
>> I suggested keeping backup/log/PGDATA per instance, rejected.
> 
> I guess that I am still flexible on this one, the node information
> (own PGDATA, connection string, port, etc.) is logged as well so this
> is not a big deal to me...
> 
>> # Parallel tests and port collisions
>> Lots about this. Final result is no port races are possible because
>> dedicated dirs are used per test, per instance. And because tcp
>> isn't used for connections on any platform (can you confirm that's
>> true on windows as well? I'm not familiar with sspi and what OSI
>> layer it lives on)
> 
> On Windows you remain with the problem that all nodes initialized
> using TestLib.pm will listen to 127.0.0.1, sspi being used to ensure
> that the connection at user level is secure (additional entries in
> pg_hba.conf are added).
> 
>> # decouple cleanup from node shutdown
>> Added (in latest patches?)
> 
> Yes this was added.
> 
>> Michael, is there anything else to do here or shall I mark this for
>> committer review?
> 
> I have nothing else. Thanks a lot!
> 

Ok, marked for committer, I hope I'm following "correct" cf procedure.

Regards,
Amir






Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-10-10 Thread Amir Rohan
On 10/10/2015 04:32 PM, Michael Paquier wrote:
> On Sat, Oct 10, 2015 at 9:04 PM, Amir Rohan wrote:
>> The patch focuses on providing facilities, while providing new coverage
>> for several features. There should be a TODO list on the wiki (bug
>> tracker, actually), where the list of tests to be written can be managed.
>> Some were mentioned in the thread (multi/xid wraparound
>> hot_standby_feedback, max_standby_archive_delay and
>> max_standby_streaming_delay? recovery_target_action? some in your
>> original list?), but threads
>> are precisely where these things get lost in the cracks.
> 
> Sure, that's an on-going task.
>  

I was arguing that it's an on-going task that would do
better if it had a TODO list, instead of "ideas for tests"
being scattered across 50-100 messages spanning a year or
more in one thread or another. You may disagree.

Amir





Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-10-10 Thread Michael Paquier
On Sat, Oct 10, 2015 at 9:04 PM, Amir Rohan wrote:
> Now that v9 fixes the problem, here's a summary from going over the
> entire thread one last time:

Thanks a lot for the summary of the events.

> # Windows and TAP sets
> Noah (2015-03) mentioned TAP doesn't work on windows, and hoped
> this would include some work on that.
> IIUC, the new facilities and tests do run on Windows, but the focus was
> on those and not on the preexisting TAP suite.

They do work on Windows, see 13d856e.

> # Test coverage (in the future)
> Andres wanted a test for xid/multixid wraparound which also raises
> the question of the tests that will need to be written in the future.

I recall that this would have needed extra functions on the backend...

> The patch focuses on providing facilities, while providing new coverage
> for several features. There should be a TODO list on the wiki (bug
> tracker, actually), where the list of tests to be written can be managed.
> Some were mentioned in the thread (multi/xid wraparound
> hot_standby_feedback, max_standby_archive_delay and
> max_standby_streaming_delay? recovery_target_action? some in your
> original list?), but threads
> are precisely where these things get lost in the cracks.

Sure, that's an on-going task.

> # Directory structure
> I suggested keeping backup/log/PGDATA per instance, rejected.

I guess that I am still flexible on this one, the node information
(own PGDATA, connection string, port, etc.) is logged as well so this
is not a big deal to me...

> # Parallel tests and port collisions
> Lots about this. Final result is no port races are possible because
> dedicated dirs are used per test, per instance. And because tcp
> isn't used for connections on any platform (can you confirm that's
> true on windows as well? I'm not familiar with sspi and what OSI
> layer it lives on)

On Windows you remain with the problem that all nodes initialized
using TestLib.pm will listen to 127.0.0.1, sspi being used to ensure
that the connection at user level is secure (additional entries in
pg_hba.conf are added).

> # decouple cleanup from node shutdown
> Added (in latest patches?)

Yes this was added.

> Michael, is there anything else to do here or shall I mark this for
> committer review?

I have nothing else. Thanks a lot!
-- 
Michael




Re: [HACKERS] Dangling Client Backend Process

2015-10-10 Thread Amit Kapila
On Sat, Oct 10, 2015 at 3:42 PM, Rajeev rastogi wrote:

> I observed one strange behavior today: if the postmaster process gets
> crashed/killed, it kills all background processes but not the client
> backend process.
>
> Moreover, it is still possible to execute queries on the connected client
> session without any other background process running.
>
> But if I try to execute some command (like checkpoint) from the client
> session which requires a background task to perform, it fails because it
> cannot find the corresponding background process (like the checkpoint process).
>
>
>
> I am not sure if this is already known behavior, but I found it a little
> awkward. This may lead to some unknown behavior in user applications.
>
>
>

This is a known behaviour and there was some discussion on this
topic [1] previously as well.  I think that thread didn't reach a
conclusion, but a couple of other reasons were also discussed there
in favour of the behaviour you are proposing here.



> Currently, all background processes keep checking whether the postmaster is
> alive while they wait for any event, but for client backend processes there
> is no such mechanism.
>
>
>
> One way to handle this issue would be to check whether the postmaster is
> alive after every command read, but it would add extra cost for each query
> execution.
>
>
I don't think that is a good idea: if there is no command execution,
the backend will still stay as it is, and doing such an operation on each
command doesn't sound like a good idea even though the overhead might not
be big.  There are some other ideas discussed in that thread [2] to achieve
this behaviour, but I think we need to find a portable way to achieve it.
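One POSIX mechanism in this family (similar in spirit to the "death pipe" that PostgreSQL's background processes wait on) lets a child notice parent death while blocked, rather than polling after each command. An illustrative Python/POSIX sketch, not PostgreSQL code:

```python
import os
import select
import time

# The parent keeps the write end of a pipe; the child watches the read end.
# When the parent exits (here simulated by closing the write end), the read
# end becomes readable with EOF, so the child notices parent death while
# blocked in select() instead of checking getppid() after every command.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                      # child ("backend")
    os.close(w)                   # keep only the read end
    ready, _, _ = select.select([r], [], [], 5.0)
    # readable here means EOF: the parent is gone
    os._exit(0 if ready else 1)
else:                             # parent ("postmaster")
    os.close(r)
    time.sleep(0.2)               # ... normal operation ...
    os.close(w)                   # simulate the postmaster going away
    _, status = os.waitpid(pid, 0)
    child_noticed = (os.waitstatus_to_exitcode(status) == 0)
    print("child noticed parent death:", child_noticed)
```

The attraction is that the check is free while idle: the child pays nothing per command, and wakes up exactly when the parent disappears. The portability question is whether an equivalent exists on every supported platform.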


[1] - http://www.postgresql.org/message-id/26217.1371851...@sss.pgh.pa.us
[2] -
http://www.postgresql.org/message-id/20130622174922.gd1...@alap2.anarazel.de

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-10-10 Thread Amir Rohan
On 10/10/2015 02:43 PM, Michael Paquier wrote:
> On Fri, Oct 9, 2015 at 8:53 PM, Michael Paquier
>  wrote:
>> On Fri, Oct 9, 2015 at 8:47 PM, Amir Rohan wrote:
>>> Ok, I've put myself down as reviewer in cfapp. I don't think I can
>>> provide any more useful feedback that would actually result in changes
>>> at this point, but I'll read through the entire discussion once last
>>> time and write down final comments/notes. After that I have no problem
>>> marking this for a committer to look at.
>>
>> OK. If you have any comments or remarks, please do not hesitate at all!
> 
> So, to let everybody know the issue, Amir has reported me offlist a
> bug in one of the tests that can be reproduced more easily on a slow
> machine:
> 

Yeah, I usually stick to the list for discussion, but I ran an earlier
version without issues and thought this might be a problem with my
system as I've changed things a bit this week.

Now that v9 fixes the problem, here's a summary from going over the
entire thread one last time:

# Windows and TAP sets
Noah (2015-03) mentioned TAP doesn't work on windows, and hoped
this would include some work on that.

IIUC, the new facilities and tests do run on Windows, but the focus was
on those and not on the preexisting TAP suite.

# Test coverage (in the future)
Andres wanted a test for xid/multixid wraparound which also raises
the question of the tests that will need to be written in the future.

The patch focuses on providing facilities, while providing new coverage
for several features. There should be a TODO list on the wiki (bug
tracker, actually), where the list of tests to be written can be managed.

Some were mentioned in the thread (multi/xid wraparound
hot_standby_feedback, max_standby_archive_delay and
max_standby_streaming_delay? recovery_target_action? some in your
original list?), but threads
are precisely where these things get lost in the cracks.

# Interactive use vs. TAP tests

Early on the goal was also to provide something for interactive use
in order to test scenarios. The shift has focused to the TAP tests
and some of the choices in the API reflect that. Interactive use
is possible, but wasn't a central requirement.

# Directory structure

I suggested keeping backup/log/PGDATA per instance, rejected.

# Parallel tests and port collisions

Lots about this. Final result is no port races are possible because
dedicated dirs are used per test, per instance. And because tcp
isn't used for connections on any platform (can you confirm that's
true on windows as well? I'm not familiar with sspi and what OSI
layer it lives on)

# Allow test to specify shutdown mode

Added

# decouple cleanup from node shutdown

Added (in latest patches?)

# Conveniences for test writing vs. running

My suggestions weren't picked up, but for one thing setting CLEANUP=0
in the lib (which means editing it...) can be useful for writers.

# blocking until server ready

pg_isready wrapper added.

# Multiple masters

back and forth, but supported in latest version.

That's it. I've run the latest (v9) tests; they work and passed on my
system (Fedora, 64-bit) and also under docker with --cpu-quota=1, which
simulates a slow machine.

Michael, is there anything else to do here or shall I mark this for
committer review?

Regards,
Amir











Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-10-10 Thread Michael Paquier
On Fri, Oct 9, 2015 at 8:53 PM, Michael Paquier
 wrote:
> On Fri, Oct 9, 2015 at 8:47 PM, Amir Rohan wrote:
>> Ok, I've put myself down as reviewer in cfapp. I don't think I can
>> provide any more useful feedback that would actually result in changes
>> at this point, but I'll read through the entire discussion once last
>> time and write down final comments/notes. After that I have no problem
>> marking this for a committer to look at.
>
> OK. If you have any comments or remarks, please do not hesitate at all!

So, to let everybody know the issue, Amir has reported to me offlist a
bug in one of the tests that can be reproduced more easily on a slow
machine:

> Amir wrote:
> Before posting the summary, I ran the latest v8 patch on today's git
> master (9c42727) and got some errors:
> t/004_timeline_switch.pl ...
> 1..1
> # ERROR:  invalid input syntax for type pg_lsn: ""
> # LINE 1: SELECT ''::pg_lsn <= pg_last_xlog_replay_location()
> #^
> # No tests run!

And here is my reply:
This is a timing issue and can happen when standby1 (the promoted
standby which standby2 reconnects to, to check that recovery works with
a timeline jump) is still in recovery after being restarted. There is
a small window where this is possible, and it gets easier to
reproduce on slow machines (I did so on a VM). So the issue was in test
004. I have updated the script to check pg_is_in_recovery() to be sure
that the node exits recovery before querying it with
pg_current_xlog_location().
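The fix follows the standard retry-until-true shape used throughout these TAP tests: poll a boolean condition (here pg_is_in_recovery() returning false) until it holds or a timeout expires. The generic shape, as a Python sketch (the real helper in the test suite is Perl's poll_query_until):

```python
import time

def poll_until(predicate, timeout_s=30.0, interval_s=1.0):
    """Retry predicate() until it returns True or the timeout expires.
    Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)

# Example: a condition that becomes true on the third check, like a
# standby that needs a few moments to exit recovery after restart.
calls = {"n": 0}
def exited_recovery():
    calls["n"] += 1
    return calls["n"] >= 3

ok = poll_until(exited_recovery, timeout_s=2.0, interval_s=0.01)
print(ok, calls["n"])
```

Synchronizing on the condition itself, rather than sleeping for a fixed time, is what keeps such tests stable across fast and slow machines.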

It is worth noticing that the following change has saved me a lot of pain:
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -259,6 +259,7 @@ sub psql
my ($stdout, $stderr);
print("# Running SQL command: $sql\n");
run [ 'psql', '-X', '-A', '-t', '-q', '-d', $dbname, '-f',
'-'], '<', \$sql, '>', \$stdout, '2>', \$stderr or die;
+   print "# Error output: $stderr\n" if $stderr ne "";
Perhaps we should consider backpatching it; it helped me find the
issue I faced.

Attached is an updated patch fixing 004.
Regards,
-- 
Michael
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index a4c1737..ea219d7 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -125,38 +125,6 @@ sub check_query
 	}
 }
 
-# Run a query once a second, until it returns 't' (i.e. SQL boolean true).
-sub poll_query_until
-{
-	my ($query, $connstr) = @_;
-
-	my $max_attempts = 30;
-	my $attempts = 0;
-	my ($stdout, $stderr);
-
-	while ($attempts < $max_attempts)
-	{
-		my $cmd = [ 'psql', '-At', '-c', "$query", '-d', "$connstr" ];
-		my $result = run $cmd, '>', \$stdout, '2>', \$stderr;
-
-		chomp($stdout);
-		$stdout =~ s/\r//g if $Config{osname} eq 'msys';
-		if ($stdout eq "t")
-		{
-			return 1;
-		}
-
-		# Wait a second before retrying.
-		sleep 1;
-		$attempts++;
-	}
-
-	# The query result didn't change in 30 seconds. Give up. Print the stderr
-	# from the last attempt, hopefully that's useful for debugging.
-	diag $stderr;
-	return 0;
-}
-
 sub append_to_file
 {
 	my ($filename, $str) = @_;
diff --git a/src/test/Makefile b/src/test/Makefile
index b713c2c..d6e51eb 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -17,7 +17,7 @@ SUBDIRS = regress isolation modules
 # We don't build or execute examples/, locale/, or thread/ by default,
 # but we do want "make clean" etc to recurse into them.  Likewise for ssl/,
 # because the SSL test suite is not secure to run on a multi-user system.
-ALWAYS_SUBDIRS = examples locale thread ssl
+ALWAYS_SUBDIRS = examples locale thread ssl recovery
 
 # We want to recurse to all subdirs for all standard targets, except that
 # installcheck and install should not recurse into the subdirectory "modules".
diff --git a/src/test/perl/RecoveryTest.pm b/src/test/perl/RecoveryTest.pm
new file mode 100644
index 000..b60bf5c
--- /dev/null
+++ b/src/test/perl/RecoveryTest.pm
@@ -0,0 +1,412 @@
+package RecoveryTest;
+
+# Set of common routines for recovery regression tests for a PostgreSQL
+# cluster. This includes global variables and methods that can be used
+# by the various set of tests present to set up cluster nodes and
+# configure them according to the test scenario wanted.
+#
+# Cluster nodes can be freely created using initdb or using the existing
+# base backup of another node, with minimum configuration done when the
+# node is created for the first time like having a proper port number.
+# It is then up to the test to decide what to do with the newly-created
+# node.
+#
+# Environment configuration of each node is available through a set
+# of global variables provided by this package, hashed depending on the
+# port number of a node:
+# - connstr_nodes connection string to connect to this node
+# - datadir_nodes to get the data folder of a given node
+# - archive_nodes for the location of the WAL archives of a node
+# - backup_nodes for the location of base backups of a node
+# - applname_nodes, application_nam

[HACKERS] Dangling Client Backend Process

2015-10-10 Thread Rajeev rastogi
I observed one strange behavior today: if the postmaster process gets
crashed/killed, it kills all background processes but not the client
backend processes.
Moreover, it is still possible to execute queries on the connected client
session without any other background process running.
But if I try to execute some command (like checkpoint) from the client session
which requires a background task to perform, it fails because it cannot find
the corresponding background process (like the checkpoint process).

I am not sure if this is already known behavior, but I found it a little
awkward. This may lead to some unknown behavior in user applications.

Currently, all background processes keep checking whether the postmaster is
alive while they wait for any event, but for client backend processes there
is no such mechanism.

One way to handle this issue would be to check whether the postmaster is
alive after every command read, but it would add extra cost for each query
execution.

Any comments?

Thanks and Regards,
Kumar Rajeev Rastogi