Re: some more error location support

2018-08-28 Thread Fabien COELHO




Even if there is some under-the-hood garbage collection, I'd suggest
adding a free after the call to ComputePartitionAttrs.


Hmm, I didn't know about free_parsestate().  It doesn't seem to be used
consistently.  I suppose you'll want to use it when you have a target
relation that will be closed by it, but otherwise, for DDL commands,
it's not all that useful.


Probably.

The majority rule (34 make & 22 free) suggests that it is used more often
than not. I'd suggest sticking to that for consistency & homogeneity.


--
Fabien.



Re: Copy function for logical replication slots

2018-08-28 Thread Masahiko Sawada
On Mon, Aug 20, 2018 at 2:53 PM, Michael Paquier  wrote:
> On Tue, Aug 14, 2018 at 01:38:23PM +0900, Masahiko Sawada wrote:
>> On Thu, Jul 12, 2018 at 9:28 PM, Masahiko Sawada  
>> wrote:
>>> Attached is a new version of the patch incorporating all the comments I
>>> got from Michael-san.
>>>
>>> To prevent the WAL segment file at the origin slot's restart_lsn from
>>> being removed while creating the target slot, I've chosen to copy the
>>> new slot while holding the origin one. One problem with implementing it
>>> this way is that the current replication slot code doesn't allow us to
>>> do that straightforwardly; the code assumes that the process creating a
>>> new slot is not holding another slot. So I've changed the copy functions
>>> to temporarily save MyReplicationSlot and then restore and release it
>>> after creating the target slot. To ensure that both the origin and
>>> target slots are released properly in failure cases I used
>>> PG_ENSURE_ERROR_CLEANUP(). That way, we can keep the code changes in the
>>> logical decoding code to a minimum. I thought about changing the logical
>>> decoding code so that it can assume that a process can hold more than
>>> one slot at once, but that seems like overkill to me.
>
> Yeah, we may be able to live with this trick.  For other processes, the
> process doing the copy would be seen as holding the slot so the
> checkpointer would not advance the oldest LSN to retain.
>
>> The previous patch conflicts with current HEAD. Attached is an updated
>> version of the patch.

Thank you for reviewing this patch.

>
> +-- Now we have maximum 8 replication slots. Check slots are properly
> +-- released even when raise error during creating the target slot.
> +SELECT 'copy' FROM pg_copy_logical_replication_slot('orig_slot1',
> 'failed'); -- error
> +ERROR:  all replication slots are in use
>
> installcheck is going to fail on an instance which does not use exactly
> max_replication_slots = 8.  That lacks flexibility, and you could have
> the same coverage by copying, then immediately dropping each new slot.

Will fix.

> +to a physical replicaiton slot named 
> dst_slot_name.
> s/replicaiton/replication/.
>
> +The copied physical slot starts to reserve WAL from the same
> LSN as the
> +source slot if the source slot already reserves WAL.
> Or should we restrict this case?  How is it useful to allow a copy from a
> slot which has done nothing yet?  This would also simplify the acquire
> and release logic for the source slot.
>

I think copying from a slot that has already reserved WAL would be
helpful for backup cases (maybe you suggested that?). Also, either way we
need safe logic for acquiring and releasing the source slot in the
logical slot case. Or did you mean to restrict the case of copying a slot
that doesn't reserve WAL?

> +   /* Check type of replication slot */
> +   if (SlotIsLogical(MyReplicationSlot))
> +   {
> +   ReplicationSlotRelease();
> +   ereport(ERROR,
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +(errmsg("cannot copy a logical replication slot to a
> physical replication slot";
> +   }
> No need to call ReplicationSlotRelease for an ERROR code path.
>

Will fix.


> Does it actually make sense to allow copy of temporary slots or change
> their persistence?  Those don't live across sessions so they'd need to
> be copied in the same session which created them.

I think copying temporary slots would be an impracticable feature, but
changing their persistence might be helpful in some cases, especially
copying from persistent to temporary.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
Hi,

Tomas provided me with a machine where the problem was reproducible
(Thanks again!). I first started to make sure a54e1f158 is unrelated -
and indeed, the problem appears independently.

I've a few leads that I'm currently testing out. One observation I think
might be crucial is that the problem, in Tomas' testcase with just
VACUUM FULL of pg_class and INSERTs into another rel, appears to only
happen if autovacuum is enabled. The crashes are always immediately
after autovacuum did an analyze of the relation.

What I see on the apply side, I think, is that invalidation records for
the contents of pg_class and for pg_class itself are intermixed.

I'll continue looking into this.

Greetings,

Andres Freund



pg_verify_checksums failure with hash indexes

2018-08-28 Thread Peter Eisentraut
This is reproducible with PG11 and PG12:

initdb -k data
postgres -D data

make installcheck
# shut down postgres with Ctrl-C

pg_verify_checksums data

pg_verify_checksums: checksum verification failed in file
"data/base/16384/28647", block 65: calculated checksum DC70 but expected 0
pg_verify_checksums: checksum verification failed in file
"data/base/16384/28649", block 65: calculated checksum 89D8 but expected 0
pg_verify_checksums: checksum verification failed in file
"data/base/16384/28648", block 65: calculated checksum 9636 but expected 0
Checksum scan completed
Data checksum version: 1
Files scanned:  2493
Blocks scanned: 13172
Bad checksums:  3

The files in question correspond to

hash_i4_index
hash_name_index
hash_txt_index

Discuss. ;-)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Would it be possible to have parallel archiving?

2018-08-28 Thread hubert depesz lubaczewski
On Tue, Aug 28, 2018 at 08:33:11AM +0200, Alexander Kukushkin wrote:
> There is the archive_status directory in pg_wal, and if there are
> files with suffixes ".ready", you can archive not only the file which
> was requested, but quite a few more if additional ".ready" files are
> available. After that you have to rename ".ready" to ".done". Postgres
> will not call archive_command for files which are already marked as
> ".done".

Ah, I was missing this information. Thanks a lot.

Best regards,

depesz




Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-08-28 Thread Alexander Korotkov
On Thu, Jul 26, 2018 at 8:39 PM Andrey Borodin  wrote:
> I'm not sure from an architectural point of view: supporting two ways (list and 
> heap) to store the result seems like it may be a bit heavy, but OK. At least it 
> has meaningful benefits.

It seems that Nikita reworked it that way after my suspicion that
switching regular scans to a pairing heap *might* cause a regression.
However, I didn't insist that we need to support both ways.  For
instance, if we can prove that there is no regression then it's fine
to use just the heap...

Links
1. 
https://www.postgresql.org/message-id/CAPpHfdu6Wm4DSAp8Pvwq0uo7fCSzsbrNy7x2v5EKK_g4Nkjx1Q%40mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Stored procedures and out parameters

2018-08-28 Thread Peter Eisentraut
On 22/08/2018 20:22, Dave Cramer wrote:
> I also agree with David that driver writers made the best out of the
> situation with functions and we are now asking for the server to
> dual-purpose the CALL command.
> 
> Is there a technical reason why this is not possible?

There are several areas of concern.  These might not be grave issues
now, but they would impede future development in these areas.

First of all, what do you want to do with the function return value
when you use CALL?  CALL doesn't have the capability to process
arbitrary shapes of return values, such as sets.  It could perhaps be
implemented, but it's not.  So right now, CALL could not be a general
replacement for all function invocations.

And would you expect a function that is invoked via CALL to have a
non-atomic execution context, that is, allow transactions?  If not,
why not?  If yes, how would this interact with set returning
functions?  I don't think the implementation can support this.
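
For context, the current asymmetry looks like this (a minimal PG11 sketch,
names made up): a procedure invoked with CALL at the top level may commit,
while the same statement in a function is rejected.

CREATE PROCEDURE p_tx() LANGUAGE plpgsql AS $$ BEGIN COMMIT; END $$;
CALL p_tx();      -- ok: CALL at top level runs in a non-atomic context

CREATE FUNCTION f_tx() RETURNS void LANGUAGE plpgsql AS $$ BEGIN COMMIT; END $$;
SELECT f_tx();    -- rejected at run time: functions run in an atomic context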

Similar questions arise if we implement SQL standard dynamic result
sets.  What would you do if a function invoked by CALL runs across one
of those?

Output parameter handling is not compatible between function calls and
procedure calls.  Our implementation of output parameters in functions
is an extension of the SQL standard, and while it's been useful, it's
nonstandard, and I would like to make the output parameter handling in
CALL compatible with the SQL standard.  For example, if you have a
function f1(IN a int, OUT b int), you would call it as SELECT f1(x)
and the "b" would somehow be the return value.  But a procedure call
would be CALL p1(x, y), where x and y could be, say, PL/pgSQL
variables.  So if you want to allow invoking functions using the CALL
statement, you're going to have a hard time defining semantics that
are not wildly confusing.  Moreover, if the intention is to switch the
JDBC driver or similar drivers to use the CALL command always from
PG11 on, then the meaning of {call f1(a, b)} will have changed and a
lot of things will break in dangerous ways.
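
Roughly, the asymmetry looks like this (a sketch in PG11 terms, using INOUT
for the procedure, since that is how procedures hand results back today):

CREATE FUNCTION f1(IN a int, OUT b int) LANGUAGE sql AS 'SELECT a + 1';
SELECT f1(10);                -- b comes back as the function's result: 11

CREATE PROCEDURE p1(IN a int, INOUT b int) LANGUAGE plpgsql
    AS $$ BEGIN b := a + 1; END $$;
CALL p1(10, NULL);            -- b = 11 is returned in the CALL's result row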

Always using CALL to invoke a function would also leave performance on
the table.  CALL has to do certain additional work in case a
transaction commit happens in the middle of the procedure, such as
expanding TOAST values.  You don't necessarily want to do that if you
don't have to.

There is also the semi-open question of whether functions and
procedures should be in separate namespaces.  For PostgreSQL 11 we
have settled that they are in the same namespace, for simplicity and
because we ran out of time, but that choice should perhaps not be set
in stone for all times.  In Oracle and DB2, functions and procedures
are in different namespaces, so SELECT x() and CALL x() invoke
different objects.  Whether we ever want to do that is a different
question, but we shouldn't design ourselves into an incompatible
corner in haste.
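
For reference, a sketch of the PG11 single-namespace behavior (names made
up):

CREATE FUNCTION x() RETURNS int LANGUAGE sql AS 'SELECT 1';
CREATE PROCEDURE x() LANGUAGE plpgsql AS $$ BEGIN NULL; END $$;
-- the second statement should fail: x() already exists with the same
-- argument types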

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problem while setting the fpw with SIGHUP

2018-08-28 Thread Kyotaro HORIGUCHI
Hello.

At Sat, 25 Aug 2018 14:50:53 +0530, Amit Kapila  wrote 
in 
> On Wed, Aug 1, 2018 at 12:56 PM Kyotaro HORIGUCHI
>  wrote:
> >
> > Thank you, Amit, Michael.
> >
> 
> Can you verify the first patch that I have posted above [1]?  We can
> commit it separately.

Thanks for prompting. The difference is in a comment and I'm fine
with the change.


> > It's a long time ago.. Let me have a bit of time to blow dust off..
> >
> > https://www.postgresql.org/message-id/20180420.173354.43303926.horiguchi.kyot...@lab.ntt.co.jp
> >
> > ...done..i working..
> >
> > The original problem here was that UpdateFullPageWrites() can call
> > RecoveryInProgress(), which does palloc in a critical
> > section. This happens when a standby is commanded to reload with a
> > change of full_page_writes during recovery.
> >
> 
> AFAIU from the original problem reported by Dilip, it can happen
> during checkpoint without standby as well.

Yes, standby is not needed but archive (streaming) recovery is
required to start checkpointer.

> > While exploring it, I found that the fullPageWrites flag is
> > updated concurrently, against its design. A race condition happens
> > in the following steps in my diagnosis.
> >
> > 1. While the startup process is running recovery, we didn't
> >  consider that the checkpointer may be running concurrently, but
> >  that happens on a standby.
> >
> > 2. The checkpointer considers itself (or is designed) as the *only*
> >  updater of shared config including fullPageWrites. In reality
> >  the startup process is another concurrent updater on a standby. In
> >  addition to that, the checkpointer assumes that it is started in a
> >  WAL-emitting state, that is, that InitXLogInsert() has been called
> >  elsewhere, but that is not the case on a standby.
> >
> >  Note that checkpointer mustn't update FPW flag before recovery
> >  ends because startup will override the flag.
> >
> > 3. As a result, when a standby gets full_page_writes changed and a
> >  SIGHUP during recovery, the checkpointer tries to update the FPW
> >  flag and calls RecoveryInProgress() on the way. Bang!
> >
> >
> > With the 0002-step1.patch, checkpointer really becomes the only
> > updater of the FPW flag after recovery ends. StartupXLOG()
> > updates it just once before checkpointer starts to update it.
> >
> 
> - * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
> - * record before resource manager writes cleanup WAL records or checkpoint
> - * record is written.
> + * Update full_page_writes in shared memory with the lastest value before
> + * resource manager writes cleanup WAL records or checkpoint record is
> + * written. We don't need to write XLOG_FPW_CHANGE since this just
> + * reflects the status at the last redo'ed record. No lock is required
> + * since startup is the only updator of the flag at this
> + * point. Checkpointer will take over after SharedRecoveryInProgress is
> + * turned to false.
>   */
>   Insert->fullPageWrites = lastFullPageWrites;
> - LocalSetXLogInsertAllowed();
> - UpdateFullPageWrites();
> - LocalXLogInsertAllowed = -1;
> 
> lastFullPageWrites will contain the latest value among the replayed
> WAL records.  It can still be different from fullPageWrites, which is
> updated by UpdateFullPageWrites().  So, I don't think it is advisable
> to remove it and rely on checkpointer to update it.  I think when it
> is called from this code path, it will anyway not write
> XLOG_FPW_CHANGE because of the !RecoveryInProgress() check.

Unfortunately the startup process doesn't process reloads on SIGHUP. So
just letting the startup process set the shared FPW flag doesn't work
correctly. I don't think a reload during the redo loop is obviously
safe. Forcibly reloading the config without SIGHUP just before
UpdateFullPageWrites() in StartupXLOG breaks the reload
semantics.

# The comment for the startup process's SIGHUP handler is wrong in that sense..

> > Under the normal(slow?) promotion mode, checkpointer is woken up
> > explicitly to make the config setting effective as soon as
> > possible. (This might be unnecessary.)
> >
> 
> I am not sure this is the right approach.  If we are worried about
> concurrency issues, then we can use a lock to protect it.

Since only the checkpointer knows the right current value of the
flag, it is responsible for the final (just after recovery ends)
setup of the flag. Actually we don't need to wake up the checkpointer
as soon as recovery ends, but it must call
UpdateFullPageWrites() before the first checkpoint starts, even
without receiving SIGHUP.


> > In the checkpointer, RecoveryInProgress() is called prior to
> > UpdateFPW() to disable updating FPW during recovery, so the crash
> > that is the issue here is fixed.
> >
> 
> It seems to me that you are trying to resolve two different problems
> in the same patch.  I think we can have a patch to deal with
> the concurrency issue, if any, and a separate patch to call
> RecoveryInProgress outside the critical section.

Hmm. Perhaps right. The change to UpdateSharedMemoryConfig alone
resolves the initial issue and others are needed to have t


Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-28 Thread Andrew Gierth
contrib/cube has an arbitrary limit of 100 on the number of dimensions
in a cube, but it actually enforces that only in cube_in and
cube_enlarge, with the other cube creation functions happy to create
cubes of more dimensions.

I haven't actually tested, but this implies that one can create cubes
that will break dump/restore.
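
An untested sketch of the suspected hazard (the 100-dimension check lives
in cube_in, which is what a restore goes through):

CREATE TEMP TABLE big AS
  SELECT cube(array(SELECT random() FROM generate_series(1, 150))) AS c;
SELECT c::text::cube FROM big;   -- the round trip through cube_in should
                                 -- reject the 150-dimensional value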

Should this limit be kept, and if so what should it be? (There's
obviously a limit on the size of indexable cubes)

(Noticed because an irc user was trying to use cubes with 512
dimensions with partial success)

-- 
Andrew (irc:RhodiumToad)



Re: FETCH FIRST clause PERCENT option

2018-08-28 Thread Surafel Temesgen
On Tue, Aug 21, 2018 at 3:50 PM Andres Freund  wrote:

>
> Imagine volatile functions being used. That can be problematic because
> of repeated side-effects, but also will lead to plainly wrong
> results. Consider what happens with your approach where somebody does
> something like WHERE value < random().
>
Thanks for explaining. The attached patch uses a tuplestore instead.
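
To illustrate the problem Andres describes (a quick sketch, not from his
mail): with a volatile qual, every rescan of the child plan can produce a
different row set, so a percent limit that rescans would be unstable, while
buffering the rows in a tuplestore evaluates the qual only once.

create table v as select i from generate_series(1, 1000) as f(i);
select count(*) from v where i < 1000 * random();  -- run twice: counts differ
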
Regards
Surafel
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 4db8142afa..8491b7831a 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1397,7 +1397,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY
 
 In this syntax, the start
 or count value is required by
@@ -1407,7 +1407,8 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
-ROW
+with PERCENT count specifies the maximum number of rows to return 
+in percentage.ROW
 and ROWS as well as FIRST
 and NEXT are noise words that don't influence
 the effects of these clauses.
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index bb28cf7d1d..0dd514fafb 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -52,6 +52,7 @@ ExecLimit(PlanState *pstate)
 	 */
 	direction = node->ps.state->es_direction;
 	outerPlan = outerPlanState(node);
+	slot = node->ps.ps_ResultTupleSlot;
 
 	/*
 	 * The main logic is a simple state machine.
@@ -60,6 +61,23 @@ ExecLimit(PlanState *pstate)
 	{
 		case LIMIT_INITIAL:
 
+			if (node->limitOption == PERCENTAGE)
+			{
+
+			/*
+			 * Find all rows the plan will return.
+			 */
+for (;;)
+{
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		break;
+	}
+	tuplestore_puttupleslot(node->totalTuple, slot);
+}
+			}
+
 			/*
 			 * First call for this node, so compute limit/offset. (We can't do
 			 * this any earlier, because parameters from upper nodes will not
@@ -87,24 +105,45 @@ ExecLimit(PlanState *pstate)
 return NULL;
 			}
 
-			/*
-			 * Fetch rows from subplan until we reach position > offset.
-			 */
-			for (;;)
+			if (node->limitOption == PERCENTAGE)
 			{
-slot = ExecProcNode(outerPlan);
-if (TupIsNull(slot))
+for (;;)
 {
-	/*
-	 * The subplan returns too few tuples for us to produce
-	 * any output at all.
-	 */
-	node->lstate = LIMIT_EMPTY;
-	return NULL;
+	if (!tuplestore_gettupleslot(node->totalTuple, true, false, slot))
+	{
+		node->lstate = LIMIT_EMPTY;
+		return NULL;
+	}
+	else
+	{
+		node->subSlot = slot;
+		if (++node->position > node->offset)
+		break;
+	}
+}
+			}
+			else if (node->limitOption == EXACT_NUMBER)
+			{
+
+/*
+ * Fetch rows from subplan until we reach position > offset.
+ */
+for (;;)
+{
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		/*
+		 * The subplan returns too few tuples for us to produce
+		 * any output at all.
+		 */
+		node->lstate = LIMIT_EMPTY;
+		return NULL;
+	}
+	node->subSlot = slot;
+	if (++node->position > node->offset)
+		break;
 }
-node->subSlot = slot;
-if (++node->position > node->offset)
-	break;
 			}
 
 			/*
@@ -144,18 +183,34 @@ ExecLimit(PlanState *pstate)
 
 	return NULL;
 }
+if (node->limitOption == PERCENTAGE)
+{
+	if (tuplestore_gettupleslot(node->totalTuple, true, false, slot))
+	{
+		node->subSlot = slot;
+		node->position++;
+	}
+	else
+	{
+		node->lstate = LIMIT_SUBPLANEOF;
+		return NULL;
+	}
+}
+else if (node->limitOption == EXACT_NUMBER)
+{
 
-/*
- * Get next tuple from subplan, if any.
- */
-slot = ExecProcNode(outerPlan);
-if (TupIsNull(slot))
-{
-	node->lstate = LIMIT_SUBPLANEOF;
-	return NULL;
+	 /*
+	 * Get next tuple from subplan, if any.
+	 */
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		node->lstate = LIMIT_SUBPLANEOF;
+		return NULL;
+	}
+	node->subSlot = slot;
+	node->position++;
 }
-node->subSlot = slot;
-node->position++;
 			}
 			else
 			{
@@ -168,15 +223,29 @@ ExecLimit(PlanState *pstate)
 	no

Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Stephen Frost
Greetings,

* hubert depesz lubaczewski (dep...@depesz.com) wrote:
> I'm in a situation where we quite often generate more WAL than we can
> archive. The thing is - archiving takes long(ish) time but it's
> multi-step process and includes talking to remote servers over network.
> 
> I tested that simply by running archiving in parallel I can easily get
> 2-3 times higher throughput.
> 
> But - I'd prefer to keep postgresql knowing what is archived, and what
> not, so I can't do the parallelization on my own.
> 
> So, the question is: is it technically possible to have parallel
> archivization, and would anyone be willing to work on it (sorry, my
> c skills are basically none, so I can't realistically hack it myself)

Not entirely sure what the concern is around "postgresql knowing what is
archived", but pgbackrest already does exactly this parallel archiving
for environments where the WAL volume is larger than a single thread can
handle, and we've been rewriting it in C specifically to make it fast
enough to be able to keep PG up-to-date regarding what's been pushed
already.

Happy to discuss it further, as well as other related topics and how
backup software could be given better APIs to tell PG what's been
archived, etc.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: pg_verify_checksums failure with hash indexes

2018-08-28 Thread Michael Banck
Hi,

On Tue, Aug 28, 2018 at 11:21:34AM +0200, Peter Eisentraut wrote:
> This is reproducible with PG11 and PG12:
> 
> initdb -k data
> postgres -D data
> 
> make installcheck
> # shut down postgres with Ctrl-C
> 
> pg_verify_checksums data
> 
> pg_verify_checksums: checksum verification failed in file
> "data/base/16384/28647", block 65: calculated checksum DC70 but expected 0
> pg_verify_checksums: checksum verification failed in file
> "data/base/16384/28649", block 65: calculated checksum 89D8 but expected 0
> pg_verify_checksums: checksum verification failed in file
> "data/base/16384/28648", block 65: calculated checksum 9636 but expected 0
> Checksum scan completed
> Data checksum version: 1
> Files scanned:  2493
> Blocks scanned: 13172
> Bad checksums:  3
> 
> The files in question correspond to
> 
> hash_i4_index
> hash_name_index
> hash_txt_index
> 
> Discuss. ;-)

I took a look at hash_name_index, assuming the others are similar.

Page 65 is the last page, pageinspect barfs on it as well:

regression=# SELECT get_raw_page('hash_name_index', 'main', 65);
WARNING:  page verification failed, calculated checksum 18066 but expected 0
ERROR:  invalid page in block 65 of relation base/16384/28638

The pages before that one from page 35 on are empty:

regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 1));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/422D890 |     8807 |     0 |   664 |  5616 |    8176 |     8192 |       4 |         0
(1 Zeile)
[...]
regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 34));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/422C690 |    18153 |     0 |   580 |  5952 |    8176 |     8192 |       4 |         0
(1 Zeile)
regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 35));
 lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
 0/0 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
[...]
regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 64));
 lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
 0/0 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
regression=# SELECT * FROM page_header(get_raw_page('hash_name_index', 'main', 
65));
WARNING:  page verification failed, calculated checksum 18066 but expected 0
ERROR:  invalid page in block 65 of relation base/16384/28638

Running pg_filedump on the last two pages results in the following (I'm
not sure whether the "Invalid header information." errors are legit, nor
about the checksum failure on block 64):

mba@fock:~/[...]postgresql/build/src/test/regress$ ~/tmp/bin/pg_filedump -R 64 
65 -k -f tmp_check/data/base/16384/28638 

--8<--
***
* PostgreSQL File/Block Formatted Dump Utility - Version 10.1
*
* File: tmp_check/data/base/16384/28638
* Options used: -R 64 65 -k -f 
*
* Dump created on: Tue Aug 28 14:53:37 2018
***

Block   64 
 -
 Block Offset: 0x0008 Offsets: Lower   0 (0x)
 Block: Size0  Version0Upper   0 (0x)
 LSN:  logid  0 recoff 0x  Special 0 (0x)
 Items:0  Free Space:0
 Checksum: 0x  Prune XID: 0x  Flags: 0x ()
 Length (including item array): 24

 Error: Invalid header information.

 Error: checksum failure: calculated 0xc66a.

  :      
  0010:  

 -- 
 Empty block - no items listed 

 -
 Error: Invalid special section encountered.
 Error: Special section points off page. Unable to dump contents.

Block   65 
 -
 Block Offset: 0x00082000 Offsets: Lower  24 (0x0018)
 Block: Size 8192  Version4Upper8176 (0x1ff0)
 LSN:  logid  0 recoff 0x04229c20  Special  8176 (0x1ff0)
 Items:0  Free Space: 8152
 Checksum: 0x  Prune XID: 0x  Flags: 0x ()
 Length (including item array): 24

 Error: checksum failure: calculated 0x4692.

  :  209c2204  1800f01f   .".
  0010: f01f0420 ... 

 -- 
 Empty block - no items listed 

 -
 Hash Index Section:
  Flags: 0x ()

Re: pg_verify_checksums failure with hash indexes

2018-08-28 Thread Michael Paquier
On Tue, Aug 28, 2018 at 11:21:34AM +0200, Peter Eisentraut wrote:
> The files in question correspond to
> 
> hash_i4_index
> hash_name_index
> hash_txt_index

The hash index code has been largely refactored in v10, so I would
imagine that you can see the problem as well there.

[... digging digging ...]

And indeed I can see the problem in 10 as well with my own pg_checksums,
and I can see hash_f8_index with a problem on top of what Peter has
reported.

Amit?
--
Michael


signature.asc
Description: PGP signature


Re: typcache.c typos

2018-08-28 Thread Michael Paquier
On Tue, Aug 28, 2018 at 01:57:12PM +0900, Amit Langote wrote:
> On 2018/08/28 13:42, David Rowley wrote:
>> On 28 August 2018 at 16:40, David Rowley  
>> wrote:
>>> I've attached a patch to fix a typo in typcache.c. I ended up also
>>> rephrasing the sentence since "information about data types that is",
>>> almost made me also want to change "is" to "are", since "types" is
>>> plural. That would have been a mistake since it's talking about the
>>> information and not the types.

Your new phrasing looks correct to me, as information about the type is
what is referenced, and not the data type itself.

> - * The type cache exists to speed lookup of certain information about data
> 
> [ ... ]
> 
> + * The type cache exists to speedup lookups of certain data type information
> 
> Sorry if I'm being ignorant, but shouldn't it be "to speed up or to
> speed-up" instead of "to speedup?

I know "speed up" as a correct verb, and speedup can be used as a noun.
I don't know about "speed-up" though.
--
Michael


signature.asc
Description: PGP signature


Re: Stored procedures and out parameters

2018-08-28 Thread David G. Johnston
On Tuesday, August 28, 2018, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> First of all, what do you want to do with the function return value
> when you use CALL?


Place it in the result set.


> And would you expect a function that is invoked via CALL to have a
> non-atomic execution context, that is, allow transactions?  If not,
> why not?


No, because functions cannot exercise transaction control.

> Similar questions arise if we implement SQL standard dynamic result
> sets.  What would you do if a function invoked by CALL runs across one
> of those?


A runtime error saying that record-returning results are not supported for
CALL-based execution.


> Moreover, if the intention is to switch the
> JDBC driver or similar drivers to use the CALL command always from
> PG11 on, then the meaning of {call f1(a, b)} will have changed and a
> lot of things will break in dangerous ways.


This seems like the main blocker.


> Always using CALL to invoke a function would also leave performance on
> the table.  CALL has to do certain additional work in case a
> transaction commit happens in the middle of the procedure, such as
> expanding TOAST values.  You don't necessarily want to do that if you
> don't have to.


This seems solvable since we know whether the invoked object is a function
or a procedure.  And functions used as procedures are likely less sensitive
to performance concerns than ones performing calculations in tlists.  Not
implementing this optimization in pg11 but supporting functions via CALL is
something I could live with.


> There is also the semi-open question of whether functions and
> procedures should be in separate namespaces.


I have no problem calling this a closed question, answered per the initial
implementation choice.  That is sufficient justification even if the
original choice was made out of convenience.

David J.


Re: Copy function for logical replication slots

2018-08-28 Thread Michael Paquier
On Tue, Aug 28, 2018 at 04:14:04PM +0900, Masahiko Sawada wrote:
> I think the copying from a slot that already reserved WAL would be
> helpful for backup cases (maybe you suggested?). Also, either way we
> need to make a safe logic of acquring and releasing the source slot
> for logical slots cases. Or you meant  to restrict the case where the
> copying a slot that doesn't reserve WAL?

I mean the latter, i.e. there is no actual point in being able to copy a
slot which does *not* reserve WAL.

>> Does it actually make sense to allow copy of temporary slots or change
>> their persistence?  Those don't live across sessions so they'd need to
>> be copied in the same session which created them.
> 
> I think the copying of temporary slots would be an impracticable
> feature but the changing their persistence might be helpful for some
> cases, especially copying from persistent to temporary.

The session copying a permanent slot to a temporary slot has to be the one
that also consumes it, as the session which created the slot owns it, and
the slot is dropped when the session ends.  For
logical slots perhaps you have something in mind?  Like copying a slot
which is not active to check where it is currently replaying, and using
the copy for sanity checks?
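
For instance, something along these lines (just a sketch, using the
two-argument form from the regression test quoted upthread; slot names made
up):

SELECT pg_copy_logical_replication_slot('main_slot', 'check_slot');
SELECT * FROM pg_logical_slot_peek_changes('check_slot', NULL, NULL);
SELECT pg_drop_replication_slot('check_slot');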
--
Michael


signature.asc
Description: PGP signature


Re: Fix help option of contrib/oid2name

2018-08-28 Thread Michael Paquier
On Mon, Aug 27, 2018 at 07:03:18PM +0900, Michael Paquier wrote:
> Thanks, I have looked at the patch set.

I have been through the set once again, and pushed both things.  Thanks
a lot Yamada-san.
--
Michael


signature.asc
Description: PGP signature


Re: Why hash OIDs?

2018-08-28 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
>> Yeah, it would be a terrible idea as a general hash function for use
>> in contexts where the "avalanche effect" assumption is made about
>> information being spread out over the bits (the HJ batching code
>> wouldn't work for example).  I was wondering specifically about the
>> limited case of hash tables that are used to look things up in caches.

> I don't understand why that'd be ok there? With a simple 1:1 hash
> function, which you seem to advocate, many hash-tables would be much
> fuller in the 1000-3000 (where pg_class, etc all reside) than in any
> other range.  A lot of our tables are populated on-demand, so you'd
> often end up with most of the data in one or two buckets, and a larger
> number largely empty.

Hmm ... but what we actually care about, for dynahash tables, is just
the low order bits of the hash; the original range of the values
isn't interesting.  Some desultory experimentation with queries like

select oid::int % 1024, count(*) from pg_proc
group by 1 order by 2 desc;

select (hashoid(oid)+2^32)::int8 % 1024, count(*) from pg_proc
group by 1 order by 2 desc;

offers some support for Thomas' idea: the hashoid results are
certainly not better distributed than the raw OIDs, and in some
cases noticeably worse.

I'd supposed that we needed to get the OIDs' high-order bits to
participate in the hash in order to get decent results, but
looking at these numbers makes me wonder.

OTOH, I'm also not convinced it's worth messing with.  In principle
we could shave some cycles with a no-op hash function, but I've not
seen anything to make me think that hashoid() is a measurable
bottleneck.

regards, tom lane



Re: typcache.c typos

2018-08-28 Thread Tom Lane
David Rowley  writes:
> On 28 August 2018 at 16:40, David Rowley  wrote:
>> I've attached a patch to fix a typo in typcache.c. I ended up also
>> rephrasing the sentence since "information about data types that is",
>> almost made me also want to change "is" to "are", since "types" is
>> plural. That would have been a mistake since it's talking about the
>> information and not the types.

> Oops. I mistakenly attached the incorrect patch. The correct one is
> attached to this email.

I don't see any typos here; it looks to me more like a difference of
opinion about whether "information" is singular or plural.  I don't
really feel a need to change the existing wording.

regards, tom lane



Re: pg_verify_checksums failure with hash indexes

2018-08-28 Thread Amit Kapila
On Tue, Aug 28, 2018 at 6:43 PM Michael Paquier  wrote:
>
> On Tue, Aug 28, 2018 at 11:21:34AM +0200, Peter Eisentraut wrote:
> > The files in question correspond to
> >
> > hash_i4_index
> > hash_name_index
> > hash_txt_index
>
> The hash index code has been largely refactored in v10, so I would
> imagine that you can see the problem as well there.
>
> [... digging digging ...]
>
> And indeed I can see the problem in 10 as well with my own pg_checksums,
> and I can see hash_f8_index with a problem on top of what Peter has
> reported.
>
> Amit?
>

I will look into it tomorrow, hope that's okay.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Why hash OIDs?

2018-08-28 Thread Robert Haas
On Mon, Aug 27, 2018 at 10:12 PM, Andres Freund  wrote:
> Huh? Oids between, say, 1 and FirstNormalObjectId, are vastly more
> common than the rest. And even after that, individual tables get large
> clusters of sequential values from the global oid counter.

Sure, but if you get a large cluster of sequential values, a straight
mod-N bucket mapping works just fine.   I think the bigger problem is
that you might get a large cluster of values separated by exactly a
power of 2.  For instance, say you have one serial column and one
index:

rhaas=# create table a (x serial primary key);
CREATE TABLE
rhaas=# create table b (x serial primary key);
CREATE TABLE
rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
  oid  |  oid
---+---
 16422 | 16430
(1 row)

If you have a lot of tables like that, bad things are going to happen
to your hash table.
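
A quick way to see the clustering (a sketch reusing the 8-OID spacing from
the example above): 1000 OIDs spaced 8 apart land in only 128 of 1024
power-of-two buckets under a straight mod mapping.

select count(distinct (16422 + 8 * g) % 1024) as buckets_used
  from generate_series(0, 999) as g;
-- buckets_used = 128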

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Would it be possible to have parallel archiving?

2018-08-28 Thread David Steele
On 8/28/18 8:32 AM, Stephen Frost wrote:
> 
> * hubert depesz lubaczewski (dep...@depesz.com) wrote:
>> I'm in a situation where we quite often generate more WAL than we can
>> archive. The thing is - archiving takes long(ish) time but it's
>> multi-step process and includes talking to remote servers over network.
>>
>> I tested that simply by running archiving in parallel I can easily get
>> 2-3 times higher throughput.
>>
>> But - I'd prefer to keep postgresql knowing what is archived, and what
>> not, so I can't do the parallelization on my own.
>>
>> So, the question is: is it technically possible to have parallel
>> archivization, and would anyone be willing to work on it (sorry, my
>> c skills are basically none, so I can't realistically hack it myself)
> 
> Not entirely sure what the concern is around "postgresql knowing what is
> archived", but pgbackrest already does exactly this parallel archiving
> for environments where the WAL volume is larger than a single thread can
> handle, and we've been rewriting it in C specifically to make it fast
> enough to be able to keep PG up-to-date regarding what's been pushed
> already.

To be clear, pgBackRest uses the .ready files in archive_status to
parallelize archiving but still notifies PostgreSQL of completion via
the archive_command mechanism.  We do not modify .ready files to .done
directly.

However, we have optimized the C code to provide ~200
notifications/second (3.2GB/s of WAL transfer at the default 16MB
segment size), which is enough to keep up with the workloads we have
seen.  Larger WAL segment sizes in PG11 (up to 1GB) will theoretically
increase this to 200GB/s, though in practice the CPU needed for
compression will become a major bottleneck, not to mention network, etc.

Regards,
-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Catalog corruption

2018-08-28 Thread Mariel Cherkassky
Hi,
I already sent an email about this topic to pgsql-admins, but I think it
might be more relevant to this mailing list. I'm trying to investigate a
corruption that happened on a machine of one of our clients.

A little background:
- OS: CentOS 6.6
- PostgreSQL v9.2.5
- I was asked to understand why the corruption happened and how we can fix
it. I copied the data dir to my local PC and started investigating it. They
realized that there was a problem when they tried to back up the database
via pg_dump. The error they got:

pg_dump: query returned 2 rows instead of one: SELECT tableoid, oid,
(SELECT rolname FROM pg_catalog.pg_roles WHERE oid = datdba) AS dba,
pg_encoding_to_char(encoding) AS encoding, datcollate, datctype,
datfrozenxid, (SELECT spcname FROM pg_tablespace t WHERE t.oid =
dattablespace) AS tablespace, shobj_description(oid, 'pg_database') AS
description FROM pg_database WHERE datname = 'db1'

So I checked, and indeed there were duplicate rows in the pg_roles /
pg_database tables. Moreover, there were duplicate entries in all the system
catalogs!
The first thing I did, as Tom Lane suggested, was to upgrade to v9.2.24.
Afterwards I vacuumed all the databases, but nothing helped. I tried to
reindex the databases but got the following error:

2018-08-28 21:50:03 +08 db2 24360  ERROR:  could not access status of
transaction 32212695
2018-08-28 21:50:03 +08 db2 24360  DETAIL:  Could not open file
"pg_subtrans/01EB": No such file or directory.

So I tried to analyze all the system catalogs manually, one by one:

select 'VACUUM ANALYZE pg_catalog.'||table_name from
information_schema.tables where table_schema = 'pg_catalog' and table_type
<> 'VIEW';
 ?column?
---
 VACUUM ANALYZE pg_catalog.pg_statistic
 VACUUM ANALYZE pg_catalog.pg_type
 VACUUM ANALYZE pg_catalog.pg_authid
 VACUUM ANALYZE pg_catalog.pg_proc
 VACUUM ANALYZE pg_catalog.pg_class
 VACUUM ANALYZE pg_catalog.pg_user_mapping
.

and I got an error only on the following system table:

afa=#  ANALYZE pg_catalog.pg_shdepend;
ERROR:  could not access status of transaction 32212695
DETAIL:  Could not open file "pg_subtrans/01EB": No such file or directory.


When I try to reindex the table I'm getting the same error.

Solutions I tried:

- vacuum all the databases
- reindex system db_name -> raises the same error (could not access
status...)
- upgrade to v9.2.24 (still got corruption)
- use zero_damaged_pages=0
- generate an empty subtransaction file with dd
- reindex system indexes in single-user mode -> raises the same error

* Tried to delete the duplicate rows in some of the system tables by their
ctid values, but nothing is deleted.


Any idea how I can continue?


Re: Something's busted in plpgsql composite-variable handling

2018-08-28 Thread Jonathan S. Katz

> On Aug 26, 2018, at 4:09 PM, Tom Lane  wrote:
> 
> I wrote:
>> [ dropping and recreating a composite type confuses plpgsql ]
>> That's not very nice.  What's worse is that it works cleanly in v10,
>> making this a regression, no doubt caused by the hacking I did on
>> plpgsql's handling of composite variables.
> 
> So I'm now inclined to withdraw this as an open item.  On the other
> hand, it is a bit worrisome that I happened to hit on a case that
> worked better before.  Maybe I'm wrong to judge this unlikely to
> happen in the field.
> 
> Thoughts?

Typically if you’re creating a composite type, you’re planning to store
data in that type, so you’re probably not going to just drop it without
an appropriate migration strategy around it, which would (hopefully)
prevent the above case from happening.

I wouldn’t let this block the release, so +1 for removing from open
items.

Jonathan



signature.asc
Description: Message signed with OpenPGP


Re: typcache.c typos

2018-08-28 Thread Chapman Flack
On 08/28/18 09:21, Michael Paquier wrote:
> On Tue, Aug 28, 2018 at 01:57:12PM +0900, Amit Langote wrote:
>> - * The type cache exists to speed lookup of certain information about data
>> [ ... ]
>> + * The type cache exists to speedup lookups of certain data type information
>>
>> Sorry if I'm being ignorant, but shouldn't it be "to speed up or to
>> speed-up" instead of "to speedup?
> 
> I know "speed up" as a correct verb, and speedup can be used as a noun.
> I don't know about "speed-up" though.

Merriam-Webster has no problem with the original "to speed" (transitive
verb : to increase the speed of : accelerate), and it didn't bother me
either.

-Chap



Re: Something's busted in plpgsql composite-variable handling

2018-08-28 Thread Pavel Stehule
2018-08-28 16:38 GMT+02:00 Jonathan S. Katz :

>
> > On Aug 26, 2018, at 4:09 PM, Tom Lane  wrote:
> >
> > I wrote:
> >> [ dropping and recreating a composite type confuses plpgsql ]
> >> That's not very nice.  What's worse is that it works cleanly in v10,
> >> making this a regression, no doubt caused by the hacking I did on
> >> plpgsql's handling of composite variables.
> >
> > So I'm now inclined to withdraw this as an open item.  On the other
> > hand, it is a bit worrisome that I happened to hit on a case that
> > worked better before.  Maybe I'm wrong to judge this unlikely to
> > happen in the field.
> >
> > Thoughts?
>
> Typically if you’re creating a composite type, you’re planning to store
> data in that type, so you’re probably not going to just drop it without
> an appropriate migration strategy around it, which would (hopefully)
> prevent the above case from happening.
>
> I wouldn’t let this block the release, so +1 for removing from open
> items.
>

That depends - the question is what the reason for this issue is, and how to
fix it.

It is not a severe issue, but it is an issue that breaks zero-downtime
deployment.

regards

Pavel


> Jonathan
>
>


Re: pg_verify_checksums failure with hash indexes

2018-08-28 Thread Bernd Helmle
On Tuesday, 2018-08-28 at 11:21 +0200, Peter Eisentraut wrote:
> This is reproducible with PG11 and PG12:
> 
> initdb -k data
> postgres -D data
> 
> make installcheck
> # shut down postgres with Ctrl-C
> 

I tried to reproduce this, and by accident I had a block size of 4 (kB) in
my configure script, and I immediately got failed installcheck results.
They seem hash-index related and can easily be reproduced:

SHOW block_size ;
 block_size 

 4096

CREATE TABLE foo(val text);
INSERT INTO foo VALUES('bernd');

CREATE INDEX ON foo USING hash(val);
ERROR:  index "foo_val_idx" contains corrupted page at block 0
HINT:  Please REINDEX it.

I have no idea whether this could be related, but I thought it wouldn't
harm to share it here.

Bernd





Re: Something's busted in plpgsql composite-variable handling

2018-08-28 Thread Jonathan S. Katz

> On Aug 28, 2018, at 10:45 AM, Pavel Stehule  wrote:
> 
> 
> 
> 2018-08-28 16:38 GMT+02:00 Jonathan S. Katz  >:
> 
> > On Aug 26, 2018, at 4:09 PM, Tom Lane  > > wrote:
> >
> > I wrote:
> >> [ dropping and recreating a composite type confuses plpgsql ]
> >> That's not very nice.  What's worse is that it works cleanly in v10,
> >> making this a regression, no doubt caused by the hacking I did on
> >> plpgsql's handling of composite variables.
> >
> > So I'm now inclined to withdraw this as an open item.  On the other
> > hand, it is a bit worrisome that I happened to hit on a case that
> > worked better before.  Maybe I'm wrong to judge this unlikely to
> > happen in the field.
> >
> > Thoughts?
> 
> Typically if you’re creating a composite type, you’re planning to store
> data in that type, so you’re probably not going to just drop it without
> an appropriate migration strategy around it, which would (hopefully)
> prevent the above case from happening.
> 
> I wouldn’t let this block the release, so +1 for removing from open
> items.
> 
> That depends - the question is what the reason for this issue is, and how to
> fix it.

Tom explained the cause and a proposed a fix earlier in the thread, and
cautioned that it could involve a performance hit.

> It is not a severe issue, but it is an issue that breaks zero-downtime
> deployment.

Have you encountered this issue in the field? It is a bug, but it seems to
be an edge case based on normal usage of PostgreSQL, and I still don’t
see a reason why it needs to be fixed prior to the release of 11. If there’s
an easier solution for solving it, yes, we could go ahead, but it sounds like
there’s a nontrivial amount of work + testing to do.

I do think it should be fixed for 12 now that we’ve identified it. We could move
it from the “Open Items” to the “Live Issues” list for what it’s worth.

Jonathan


signature.asc
Description: Message signed with OpenPGP


Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-28 Thread Andrey Borodin
Hi!

> On 28 Aug 2018, at 8:29, Andrew Gierth wrote:
> 
> contrib/cube has an arbitrary limit of 100 on the number of dimensions
> in a cube, but it actually enforces that only in cube_in and
> cube_enlarge, with the other cube creation functions happy to create
> cubes of more dimensions.
> 
> I haven't actually tested, but this implies that one can create cubes
> that will break dump/restore.
> 
> Should this limit be kept, and if so what should it be? (There's
> obviously a limit on the size of indexable cubes)
> 
> (Noticed because an irc user was trying to use cubes with 512
> dimensions with partial success)
+1
This can cause very unpleasant failures, like:

postgres=# create table y as  select cube(array(SELECT random() as a FROM 
generate_series(1,1000))) from generate_series(1,1e3,1); 
SELECT 1000
postgres=# create index on y using gist(cube );
ERROR:  index row size 8016 exceeds maximum 8152 for index "y_cube_idx"

postgres=# create table y as  select cube(array(SELECT random() as a FROM 
generate_series(1,800))) from generate_series(1,1e3,1); 
SELECT 1000
postgres=# create index on y using gist(cube );
ERROR:  failed to add item to index page in "y_cube_idx"

I believe cube construction from an array (or arrays) should check the size of the arrays.

Also there is some unexpected cube dimensionality reduction, like in
cube_enlarge:
if (n > CUBE_MAX_DIM)
n = CUBE_MAX_DIM;
You wanted a larger cube, but got a cube of a different dimension.

I think we should do something like this:

diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c
index dfa8465d74..38739b1df2 100644
--- a/contrib/cube/cube.c
+++ b/contrib/cube/cube.c
@@ -151,6 +151,12 @@ cube_a_f8_f8(PG_FUNCTION_ARGS)
 errmsg("cannot work with arrays containing 
NULLs")));
 
dim = ARRNELEMS(ur);
+   if (dim > CUBE_MAX_DIM)
+   ereport(ERROR,
+   (errcode(ERRCODE_ARRAY_ELEMENT_ERROR),
+errmsg("A cube cannot have more than %d 
dimensions.",
+  CUBE_MAX_DIM)));
+
if (ARRNELEMS(ll) != dim)
ereport(ERROR,
(errcode(ERRCODE_ARRAY_ELEMENT_ERROR),
@@ -208,6 +214,11 @@ cube_a_f8(PG_FUNCTION_ARGS)
 errmsg("cannot work with arrays containing 
NULLs")));
 
dim = ARRNELEMS(ur);
+   if (dim > CUBE_MAX_DIM)
+   ereport(ERROR,
+   (errcode(ERRCODE_ARRAY_ELEMENT_ERROR),
+errmsg("A cube cannot have more than %d 
dimensions.",
+  CUBE_MAX_DIM)));
 
dur = ARRPTR(ur);

Best regards, Andrey Borodin.


Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-08-28 Thread Jonathan S. Katz

> On Aug 24, 2018, at 8:38 AM, Etsuro Fujita  
> wrote:
> 
> (2018/08/24 11:47), Michael Paquier wrote:
>> On Thu, Aug 23, 2018 at 10:00:49PM +0900, Etsuro Fujita wrote:
>>> I tried this today, but doing git behind the corporate firewall doesn't
>>> work.  I don't know the clear cause of that, so I'll investigate that
>>> tomorrow.
>> 
>> You may be able to tweak that by using https as origin point or proper
>> git proxy settings?
> 
> Yeah, my proxy settings were not correct.  With the help of my colleagues 
> Horiguchi-san and Yamada-san, I corrected them but still can't clone the 
> master repository.  Running git with GIT_CURL_VERBOSE shows that there is 
> another issue in my terminal environment, so I'm trying to resolve that.

Are there any updates on getting this patch committed?

Thanks,

Jonathan


signature.asc
Description: Message signed with OpenPGP


logical decoding: ABI break in 10.5 et al

2018-08-28 Thread Alvaro Herrera
Hello

In the process of fixing a bug, I inadvertently broke ABI compatibility
in all released branches that sport logical decoding.

[TL;DR: I propose to set up abidiff.postgresql.org.]

This is the commit:

Author: Alvaro Herrera 
Branch: master Release: REL_11_BR [f49a80c48] 2018-06-26 16:48:10 -0400
Branch: REL_10_STABLE Release: REL_10_5 [b767b3f2e] 2018-06-26 16:38:34 -0400
Branch: REL9_6_STABLE Release: REL9_6_10 [da10d6a8a] 2018-06-26 16:38:34 -0400
Branch: REL9_5_STABLE Release: REL9_5_14 [4cb6f7837] 2018-06-26 16:38:34 -0400
Branch: REL9_4_STABLE Release: REL9_4_19 [962313558] 2018-06-26 16:38:34 -0400

Fix "base" snapshot handling in logical decoding
[long commit message elided]
https://git.postgresql.org/pg/commitdiff/f49a80c48

This commit made modules dependent on structs ReorderBuffer and
ReorderBufferTXN compiled for prior minor releases no longer work with
the new minors, because some new struct members were inserted in the
middle of the struct.  Third-party modules such as pglogical, BDR,
wal2json are affected and need to be recompiled; failing to do so can
make them misbehave or crash.

Obviously, there's nothing to be done about it at this point; users must
recompile.  Modules have the option of providing code improvements that
will enable them to work with either API definition, so upgrading the
module immediately enables them to run before the new minors are
installed.  Reverting to the original struct definition in Postgres is
of course no longer an option.

In order to forestall this from happening again, I'm proposing a new
community service that will let us know of ABI incompatibilities
introduced by commits, hopefully before they become full-fledged minor
releases, so that we can apply fixes.  Initially, I'll just use the
tools abi-dumper and abi-compliance-checker from https://abi-laboratory.pro/
and set up a website that displays relevant reports for each branch.
I'm not yet sure about sending email alerts for failures; I'll
experiment a bit first.

(FWIW I ran the tools to compare 9.6.9 and 9.6.10 and it did indeed
detect the problem.)

I'm thinking of naming this abidiff.postgresql.org.  Bikeshedding
opportunity if you don't like that name -- this week only.

-- 
Álvaro Herrera



Re: Reopen logfile on SIGHUP

2018-08-28 Thread Alexander Korotkov
On Tue, Aug 21, 2018 at 4:48 AM Kyotaro HORIGUCHI
 wrote:
>
> At Fri, 10 Aug 2018 15:33:26 +0300, Alexander Kuzmenkov 
>  wrote in 
> <5142559a-82e6-b3e4-d6ed-8fd2d240c...@postgrespro.ru>
> > On 08/09/2018 10:33 AM, Kyotaro HORIGUCHI wrote:
> > >
> > > - Since I'm not sure unlink is signal-handler safe on all
> > >supported platforms, I moved unlink() call out of
> > >checkLogrotateSignal() to SysLoggerMain, just before rotating
> > >log file.
> >
> > Which platforms specifically do you have in mind? unlink() is required
> > to be async-signal-safe by POSIX.1-2001, which is required by UNIX 03,
> > and these are rather old.
>
> I suspect that something bad can happen on Windows. Another
> reason for the movement I didn't mention was that it is not necessary
> for it to be there. So I applied the principle that a signal handler
> should be as small and simple as possible.  As a result the
> flow of logrotate signal handling becomes similar to that for
> the promote signal.

I went through this thread.  It started from a discussion about changing
signal handling in the syslogger, which spotted a set of problems.
However, now there is a patch which adds a new pg_ctl command, which
issues SIGUSR1 to the syslogger.  It seems that nobody in the thread
objects to this feature.

I've revised this patch a bit.  It appears to me that only the postmaster
cares about the logrotate file, while the syslogger just handles SIGUSR1 as
it did before.  So, I've moved the code that deletes the logrotate file into
postmaster.c.

Also I found that this new pg_ctl command isn't covered by tests at all.  So
I've added very simple TAP tests, which ensure that when the log file is
renamed, it reappears again after pg_ctl logrotate.  I wonder how
that would work on Windows.  Thankfully commitfest.cputube.org has a
Windows-checking facility now.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_ctl_logrotate_v6.patch
Description: Binary data


Re: Something's busted in plpgsql composite-variable handling

2018-08-28 Thread Pavel Stehule
2018-08-28 17:04 GMT+02:00 Jonathan S. Katz :

>
> On Aug 28, 2018, at 10:45 AM, Pavel Stehule 
> wrote:
>
>
>
> 2018-08-28 16:38 GMT+02:00 Jonathan S. Katz :
>
>>
>> > On Aug 26, 2018, at 4:09 PM, Tom Lane  wrote:
>> >
>> > I wrote:
>> >> [ dropping and recreating a composite type confuses plpgsql ]
>> >> That's not very nice.  What's worse is that it works cleanly in v10,
>> >> making this a regression, no doubt caused by the hacking I did on
>> >> plpgsql's handling of composite variables.
>> >
>> > So I'm now inclined to withdraw this as an open item.  On the other
>> > hand, it is a bit worrisome that I happened to hit on a case that
>> > worked better before.  Maybe I'm wrong to judge this unlikely to
>> > happen in the field.
>> >
>> > Thoughts?
>>
>> Typically if you’re creating a composite type, you’re planning to store
>> data in that type, so you’re probably not going to just drop it without
>> an appropriate migration strategy around it, which would (hopefully)
>> prevent the above case from happening.
>>
>> I wouldn’t let this block the release, so +1 for removing from open
>> items.
>>
>
> That depends - the question is what the reason for this issue is, and how
> to fix it.
>
>
> Tom explained the cause and a proposed a fix earlier in the thread, and
> cautioned that it could involve a performance hit.
>
> It is not a severe issue, but it is an issue that breaks zero-downtime
> deployment.
>
>
> Have you encountered this issue in the field? It is a bug, but it seems to
> be an edge case based on normal usage of PostgreSQL, and I still don’t
> see a reason why it needs to be fixed prior to the release of 11. If
> there’s
> an easier solution for solving it, yes, we could go ahead, but it sounds
> like
> there’s a nontrivial amount of work + testing to do.
>
> I do think it should be fixed for 12 now that we’ve identified it. We
> could move
> it from the “Open Items” to the “Live Issues” list for what it’s worth.
>

+1

Regards

Pavel


> Jonathan
>


Re: FETCH FIRST clause PERCENT option

2018-08-28 Thread Erik Rijkers

On 2018-08-28 14:14, Surafel Temesgen wrote:
On Tue, Aug 21, 2018 at 3:50 PM Andres Freund  
wrote:




Imagine volatile functions being used. That can be problematic because
of repeated side-effects, but also will lead to plainly wrong
results. Consider what happens with your approach where somebody does
something like WHERE value < random().


Thanks for explaining. The attached patch uses a tuplestore instead.



[fetch-wth-percent-v2.patch]


I had a quick look at this.  Apply, compile, make check are ok.

In straightforward usage seems to work ok.


But I also ran across this crash:

create table if not exists t as
   select i from generate_series(1, 100) as f(i);

select * from t
  offset 60 rows
  fetch next 3 percent rows only
  FOR UPDATE
;

TRAP: FailedAssertion("!(slot != ((void *)0))", File: "execTuples.c", 
Line: 428)





Erik Rijkers


Re: logical decoding: ABI break in 10.5 et al

2018-08-28 Thread Christoph Berg
Re: Alvaro Herrera 2018-08-28 <20180828153806.fgfnul2imeltzmib@alvherre.pgsql>
> In order to forestall this from happening again, I'm proposing a new
> community service that will let us know of ABI incompatibilities
> introduced by commits, hopefully before they become full-fledged minor
> releases, so that we can apply fixes.  Initially, I'll just use the
> tools abidumper and abi-compliance-checker from https://abi-laboratory.pro/
> and set up a website that displays relevant reports for each branch.
> I'm not yet sure about sending email alerts for failures; I'll
> experiment a bit first.

Fwiw, there's already a setup, but it's restricted to the shared
libraries, and of course only for released versions:

https://abi-laboratory.pro/index.php?view=objects_report&l=postgresql&v1=10.4&v2=10.5

A big +1 on setting up an instance.

Christoph



Re: csv format for psql

2018-08-28 Thread Daniel Verite
Fabien COELHO wrote:

> My point was more about the documentation which should be clear about what 
> is the EOL. I understand from your point above that the EOL is the 
> platform-specific one, which is yet again fine with me, but should be said 
> clearly in the documentation?

Okay, I've added a bit in the doc.

>   + else if (strcmp(param, "fieldsep_csv") == 0)
>   + return pset_quoted_string(popt->topt.fieldSepCsv
>   +   ? popt->topt.fieldSepCsv
>   +   : "");
> 
> It is unclear to me when this "" is triggered. Never? If so, maybe a 
> comment should say so?

Currently popt->topt.fieldSepCsv can't be NULL so I've simplified this
to just return pset_quoted_string(popt->topt.fieldSepCsv).

> Why removing "-C"? As we already have "-A" an "-H", I was fine with it.

It was a leftover from v3. Participants in the thread don't seem to
want the short option, to my surprise. Pavel argued first against -C
upthread, I argued quite a bit in favor of it, the "for" had 0 upvotes, and
"against" had at least 4 I think, including yours in [1].


> It seems that you changed the indentation in "psql-ref.sgml":
> 
>
>   -
>   + 

Fixed.

>   -  {"unaligned", "aligned", "wrapped", "html", "asciidoc",
>   -   "latex", "latex-longtable", "troff-ms", NULL};
>   +  {"aligned", "asciidoc", "csv", "html", "latex", "latex-longtable",
>   +   "unaligned", "troff-ms", "wrapped", NULL};
> 
> If you want alphabetical, 'u' > 't'.

Fixed.

> 
> While testing I found a small issue if "fieldsep_csv" is set to a strange 
> value:
> 
>\pset format_csv ',,'
>SELECT ',', ',';
>-- gives the ambiguous:
>
> 
> The rule to decide whether to quote should be made safer/smarter. I'd 
> suggest that if the string contains any of the characters used in the
> csv format, it should be quoted.

You meant \pset fieldsep_csv ',,'
If you do that, even SELECT 'foo', 'bar' comes out wrong because it looks
like a 3-field row: foo,,bar
If we want to prevent people from shooting themselves in the foot with that
sort of thing, I've added a couple of tests: no double quote, no LF or
CR, and a single character (but multibyte allowed) for the separator.
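
For illustration, the check is roughly along these lines (a sketch with
hypothetical names, assuming psql's client encoding is at hand and using
libpq's PQmblen() for the multibyte length; this is not the patch's actual
code):

#include <string.h>
#include <stdbool.h>
#include "libpq-fe.h"       /* for PQmblen() */

/*
 * Sketch only: validate a CSV field separator along the lines described
 * above -- no double quote, no CR/LF, exactly one (possibly multibyte)
 * character.  "encoding" stands in for psql's client encoding.
 */
static bool
csv_separator_is_valid(const char *sep, int encoding)
{
    int         len = (int) strlen(sep);

    if (len == 0)
        return false;           /* empty separator makes no sense */
    if (strchr(sep, '"') || strchr(sep, '\n') || strchr(sep, '\r'))
        return false;           /* would clash with CSV quoting rules */
    if (PQmblen(sep, encoding) != len)
        return false;           /* more than one character */
    return true;
}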


[1]
https://www.postgresql.org/message-id/alpine.DEB.2.20.1803081004241.2916%40lancre


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 13a8b68..98147ef 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -672,6 +672,10 @@ COPY count
 
   
CSV Format
+   
+CSV
+in COPY
+   
 

 This format option is used for importing and exporting the Comma
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index eb9d93a..7617c5e 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -152,6 +152,20 @@ EOF
 
 
 
+  --csv
+  
+  
+  
+   CSV
+   in psql
+  
+  Switches to CSV output mode. This is equivalent
+  to \pset format csv.
+  
+  
+
+
+
   -d dbname
   --dbname=dbname
   
@@ -2557,6 +2571,19 @@ lo_import 152801
   
 
   
+  fieldsep_csv
+  
+  
+  Specifies the field separator to be used in the
+  CSV format. When the separator appears in a field
+  value, that field is output inside double quotes according to
+  CSV rules. To set a tab as field separator, type
+  \pset fieldsep_csv '\t'. The default is a comma.
+  
+  
+  
+
+  
   fieldsep_zero
   
   
@@ -2584,12 +2611,11 @@ lo_import 152801
   format
   
   
-  Sets the output format to one of unaligned,
-  aligned, wrapped,
-  html, asciidoc,
+  Sets the output format to one of aligned,
+  asciidoc, csv, 
html,
   latex (uses tabular),
-  latex-longtable, or
-  troff-ms.
+  latex-longtable, troff-ms,
+  unaligned, or wrapped.
   Unique abbreviations are allowed.  (That would mean one letter
   is enough.)
   
@@ -2597,14 +2623,27 @@ lo_import 152801
   unaligned format writes all columns of a 
row on one
   line, separated by the currently active field separator. This
   is useful for creating output that might be intended to be read
-  in by other programs (for example, tab-separated or comma-separated
-  format).
+  in by other programs.
   
 
   aligned format is the standard, 
human-readable,
   nicely formatted text output;  this is the default.
   
 
+ csv format writes columns separated by
+ commas, applying the 

Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 8/28/18 8:32 AM, Stephen Frost wrote:
> > 
> > * hubert depesz lubaczewski (dep...@depesz.com) wrote:
> >> I'm in a situation where we quite often generate more WAL than we can
> >> archive. The thing is - archiving takes long(ish) time but it's
> >> multi-step process and includes talking to remote servers over network.
> >>
> >> I tested that simply by running archiving in parallel I can easily get
> >> 2-3 times higher throughput.
> >>
> >> But - I'd prefer to keep postgresql knowing what is archived, and what
> >> not, so I can't do the parallelization on my own.
> >>
> >> So, the question is: is it technically possible to have parallel
> >> archiving, and would anyone be willing to work on it (sorry, my
> >> C skills are basically none, so I can't realistically hack it myself)
> > 
> > Not entirely sure what the concern is around "postgresql knowing what is
> > archived", but pgbackrest already does exactly this parallel archiving
> > for environments where the WAL volume is larger than a single thread can
> > handle, and we've been rewriting it in C specifically to make it fast
> > enough to be able to keep PG up-to-date regarding what's been pushed
> > already.
> 
> To be clear, pgBackRest uses the .ready files in archive_status to
> parallelize archiving but still notifies PostgreSQL of completion via
> the archive_command mechanism.  We do not modify .ready files to .done
> directly.

Right, we don't recommend mucking around with that directory of files.
Even if that works today (which you'd need to test extensively...),
there's no guarantee that it'll work and do what you want in the
future...

> However, we have optimized the C code to provide ~200
> notifications/second (3.2GB/s of WAL transfer) which is enough to keep
> up with the workloads we have seen.  Larger WAL segment sizes in PG11
> will theoretically increase this to 200GB/s, though in practice CPU to
> do the compression will become a major bottleneck, not to mention
> network, etc.

Agreed.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: TupleTableSlot abstraction

2018-08-28 Thread Ashutosh Bapat
On Fri, Aug 24, 2018 at 6:46 AM, Andres Freund  wrote:
> Hi,
>
> On 2018-08-20 19:51:33 +0530, Ashutosh Bapat wrote:
>> Sorry, forgot about that. Here's the patch set with that addressed.
>
> Btw, you attach files as tar.zip, but they're actually gzip
> compressed...

I am using "tar -zcvf" to create the files where -z indicates gzip.
But if I use .gz as extension for some reason it's not able to
recognize the format. So, I just keep .zip. Do you see any problem
with that extension? The attached file has extension .tar.gz.

>
>
>> From 838a463646a048b3dccff95079a514fdc86effb3 Mon Sep 17 00:00:00 2001
>> From: Ashutosh Bapat 
>> Date: Mon, 13 Aug 2018 11:27:57 +0530
>> Subject: [PATCH 01/11] Split ExecStoreTuple into ExecStoreHeapTuple and
>>  ExecStoreBufferHeapTuple
>>
>> ExecStoreTuple() accepts a heap tuple either from a buffer or constructed
>> on-the-fly. In the first case the caller passes a valid buffer and in
>> the latter case it passes InvalidBuffer. In the first case,
>> ExecStoreTuple() pins the given buffer and in the latter case it
>> records the shouldFree flag. The function has some extra checks to
>> differentiate between the two cases.  The use cases never overlap, so
>> spending extra cycles on checks is wasteful. Hence, separate these
>> use cases into separate functions: ExecStoreHeapTuple() to store an
>> on-the-fly tuple and ExecStoreBufferHeapTuple() to store an on-disk
>> tuple from a buffer. This allows shaving some extra cycles while
>> storing a tuple in the slot.
>
> It doesn't *yet* allow shaving extra cycles, no?

By segregating those two functions we are already saving some cycles
by not checking for a valid buffer for an on-the-fly tuple. That's not
huge, but it's something. More savings will come with the complete abstraction.

>
>> *  SLOT ACCESSORS
>>   *   ExecSetSlotDescriptor   - set a slot's tuple descriptor
>> - *   ExecStoreTuple  - store a physical tuple in 
>> the slot
>> + *   ExecStoreHeapTuple  - store an on-the-fly heap 
>> tuple in the slot
>> + *   ExecStoreBufferHeapTuple - store an on-disk heap tuple in the 
>> slot
>>   *   ExecStoreMinimalTuple   - store a minimal physical tuple in 
>> the slot
>>   *   ExecClearTuple  - clear contents of a slot
>>   *   ExecStoreVirtualTuple   - mark slot as containing a virtual
>>   *   tuple
>
> I'd advocate for a separate patch ripping these out, they're almost
> always out of date.

Done. Added as 0001 patch.

>
>> + * Return value is just the passed-in slot pointer.
>> + * 
>> + */
>> +TupleTableSlot *
>> +ExecStoreHeapTuple(HeapTuple tuple,
>> +TupleTableSlot *slot,
>> +bool shouldFree)
>> +{
>> +
>> + return slot;
>> +}
>
> Uh, there could very well be a buffer previously stored in the slot, no?
> This can't currently be applied independently afaict.

Sorry, I missed that when splitting the function. Added code to unpin
any buffer that was pinned for an earlier tuple.
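
Roughly, the shape that takes is the following (a sketch only, not the exact
patch; field and function names are those of the existing TupleTableSlot
code):

/*
 * Sketch only: store a heap tuple constructed on the fly.  This variant
 * never pins a buffer itself, but it must still drop any pin left behind
 * by an on-disk tuple previously stored in the same slot.
 */
TupleTableSlot *
ExecStoreHeapTuple(HeapTuple tuple, TupleTableSlot *slot, bool shouldFree)
{
    /* free any old physical tuple belonging to the slot */
    if (slot->tts_shouldFree)
        heap_freetuple(slot->tts_tuple);
    if (slot->tts_shouldFreeMin)
        heap_free_minimal_tuple(slot->tts_mintuple);

    /* store the new tuple */
    slot->tts_isempty = false;
    slot->tts_shouldFree = shouldFree;
    slot->tts_shouldFreeMin = false;
    slot->tts_tuple = tuple;
    slot->tts_mintuple = NULL;
    slot->tts_nvalid = 0;

    /* drop the pin taken for a previously stored on-disk tuple, if any */
    if (BufferIsValid(slot->tts_buffer))
        ReleaseBuffer(slot->tts_buffer);
    slot->tts_buffer = InvalidBuffer;

    return slot;
}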

>
>> From c4c55bc0d501400ffca2e7393039b2d38660cd2d Mon Sep 17 00:00:00 2001
>> From: Ashutosh Bapat 
>> Date: Mon, 13 Aug 2018 11:27:57 +0530
>> Subject: [PATCH 02/11] tts_nvalid in TupleTableSlot is AttrNumber
>
>> @@ -125,7 +125,7 @@ typedef struct TupleTableSlot
>>   MemoryContext tts_mcxt; /* slot itself is in this context */
>>   Buffer  tts_buffer; /* tuple's buffer, or 
>> InvalidBuffer */
>>  #define FIELDNO_TUPLETABLESLOT_NVALID 9
>> - int tts_nvalid; /* # of valid values 
>> in tts_values */
>> + AttrNumber  tts_nvalid; /* # of valid values in 
>> tts_values */
>>  #define FIELDNO_TUPLETABLESLOT_VALUES 1
>>   Datum  *tts_values; /* current per-attribute values */
>>  #define FIELDNO_TUPLETABLESLOT_ISNULL 11
>
> Shouldn't this be adapting at least a *few* more things, like
> slot_getattr's argument?

That's not something I had in mind for this patch. There are many
other places where we declare attribute number variables as "int"
instead of "AttrNumber". I don't think that's what we should do in
this patch set. The idea of this patch is to reduce the size of
TupleTableSlot by 2 bytes. Changing the signatures of the functions,
though it has some readability benefit, doesn't achieve any such
saving.

Even so, I went ahead and changed the signatures of those functions, but
a server built with LLVM crashed. So I have reverted all those
changes. I think we should either regenerate the JIT code automatically when
function signatures or structure definitions change, or at least make
compilation fail, indicating the difference between the JIT code and the
normal code. Investigating a crash takes a hell of a lot of time and is not
easy without a custom LLVM build. I doubt that every PostgreSQL
developer would want to do that.

>
>>   /*
>> - * Fill in missing values for a TupleTableSlot.
>> - *

Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-28 Thread Alexander Korotkov
Hi!

On Tue, Aug 28, 2018 at 6:21 PM Andrey Borodin  wrote:
> I believe cube construction from an array (or arrays) should check the size of the arrays.

Makes sense to me.

> Also there is some unexpected cube dimensionality reduction, like in
> cube_enlarge:
> if (n > CUBE_MAX_DIM)
> n = CUBE_MAX_DIM;
> You wanted a larger cube, but got a cube of another dimension.
>
> I think we should do something like this

OK, but I think cube_c_f8() and cube_c_f8_f8() also need to be
revised.  Also, I think this behavior should be covered by regression
tests.
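
Roughly the kind of check in question (sketch only; the error wording is
illustrative, and "ur" stands for the input float8 array, as in cube_a_f8()):

    int     dim = ARRNELEMS(ur);    /* dimensions requested by the input array */

    if (dim > CUBE_MAX_DIM)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("array is too long"),
                 errdetail("A cube cannot have more than %d dimensions.",
                           CUBE_MAX_DIM)));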

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Some pgq table rewrite incompatibility with logical decoding?

2018-08-28 Thread Jeremy Finzel
>
> Jeremy, are you able to reproduce the issue locally, using pgq? That would
>> be very valuable.
>
>
Tomas et al:

We have hit this error again, and we plan to snapshot the database so as to be
able to do whatever troubleshooting we can.  If someone could provide me
guidance as to what exactly you would like me to do, please let me know.  I
am able to provide an xlog dump and also debugging information upon request.

This is actually a different database system that also uses skytools, and
the exact same table (pgq.event_58_1) is again the cause of the relfilenode
error.  I did a point-in-time recovery to a point after this relfilenode
appears using pg_xlogdump, and verified this was the table that appeared,
then disappeared.

Here is the exact issue again I am having:

Provider logs:
2018-08-28 08:29:30.334
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,2,"authentication",2018-08-28
08:29:30 CDT,289/78900643,0,LOG,0,"replication connection authorized:
user=foo_logical SSL enabled (protocol=TLSv1.2,
cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)",""
2018-08-28 08:29:30.366
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,3,"idle",2018-08-28
08:29:30 CDT,289/0,0,LOG,0,"starting logical decoding for slot
""pgl_foo_prod_providerb97b25d_foo336ddc1""","streaming transactions
committing after C0A5/EEBBD4E8, reading WAL from
C0A5/EC723AC8""bar_foo_foo_a"
2018-08-28 08:29:30.366
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,4,"idle",2018-08-28
08:29:30 CDT,289/0,0,LOG,0,"logical decoding found consistent point at
C0A5/EC723AC8","There are no running transactions.""bar_foo_foo_a"
2018-08-28 08:29:30.448
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,5,"idle",2018-08-28
08:29:30 CDT,289/0,0,ERROR,XX000,"could not map filenode
""base/16418/2800559918"" to relation OID","bar_foo_foo_a"
2018-08-28 08:29:30.463
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,6,"idle",2018-08-28
08:29:30 CDT,289/0,0,LOG,08006,"could not receive data from client:
Connection reset by peer","bar_foo_foo_a"
2018-08-28 08:29:30.463
CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,7,"idle",2018-08-28
08:29:30 CDT,,0,LOG,0,"disconnection: session time: 0:00:00.135
user=foo_logical database=foo_prod host=0.0.0.0
port=48314","bar_foo_foo_a"


Subscriber logs:
2018-08-28 08:32:10.040 CDT,,,71810,,5b854eda.11882,1,,2018-08-28 08:32:10
CDT,7/0,0,LOG,0,"starting apply for subscription
bar_foo_foo_a","pglogical apply 16418:862837778"
2018-08-28 08:32:10.175 CDT,,,71810,,5b854eda.11882,2,,2018-08-28 08:32:10
CDT,7/242099,123886525,ERROR,XX000,"data stream ended","pglogical
apply 16418:862837778"
2018-08-28 08:32:10.175 CDT,,,71810,,5b854eda.11882,3,,2018-08-28 08:32:10
CDT,7/0,0,LOG,0,"apply worker [71810] at slot 1 generation 366 exiting
with error","pglogical apply 16418:862837778"
2018-08-28 08:32:10.179 CDT,,,27173,,5b61d336.6a25,373,,2018-08-01 10:35:18
CDT,,0,LOG,0,"worker process: pglogical apply 16418:862837778 (PID
71810) exited with exit code 1",""


Thanks,
Jeremy


Re: pg_dump test instability

2018-08-28 Thread Tom Lane
Stephen Frost  writes:
> Parallel *restore* can be done from either a custom-format dump or from
> a directory-format dump.  I agree that we should seperate the concerns
> and perform independent sorting on the restore side of things based on
> the relative sizes of tables in the dump (be it custom format or
> directory format).  While compression might make us not exactly correct
> on the restore side, I expect that we'll generally be close enough to
> avoid most cases where a single worker gets stuck working on a large
> table at the end after all the other work is done.

Here's a proposed patch for this.  It removes the hacking of the TOC list
order, solving Peter's original problem, and instead sorts-by-size
in the actual parallel dump or restore control code.  There are a
number of ensuing performance benefits:

* The BLOBS entry, if any, gets to participate in the ordering decision
during parallel dumps.  As the code stands, all the effort to avoid
scheduling a long job last is utterly wasted if you've got a lot of
blobs, because that entry stayed at the end.  I didn't work real hard
on that, just gave it a large size so it would go first not last.  If
you just have a few blobs, that's not necessary, but I doubt it hurts
either.

* During restore, we insert actual size numbers into the BLOBS and
TABLE DATA items, and then anything that depends on a TABLE DATA item
inherits its size.  This results in size-based prioritization not just
for simple indexes as before, but also for constraint indexes (UNIQUE
or PRIMARY KEY), foreign key verifications, delayed CHECK constraints,
etc.  It also means that stuff like triggers and rules get reinstalled
in size-based order, which doesn't really help, but again I don't
think it hurts.

* Parallel restore scheduling by size works for custom dumps as well
as directory ones (as long as the dump file was seekable when created,
but you'll be hurting anyway if it wasn't).

I have not really tried to demonstrate performance benefits, because
the results would depend a whole lot on your test case; but at least
in principle this should result in far more intelligent scheduling
of parallel restores.
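
Concretely, the ready-list ordering amounts to something like this comparator
(a sketch, not necessarily the patch's exact code):

/*
 * Sketch only: qsort comparator putting larger-by-dataLength TOC entries
 * first, so that the longest jobs are dispatched earliest.
 */
static int
TocEntrySizeCompare(const void *p1, const void *p2)
{
    const TocEntry *te1 = *(const TocEntry *const *) p1;
    const TocEntry *te2 = *(const TocEntry *const *) p2;

    /* sort by decreasing dataLength */
    if (te1->dataLength > te2->dataLength)
        return -1;
    if (te1->dataLength < te2->dataLength)
        return 1;

    /* tie-break by dump ID, just to get a stable ordering */
    if (te1->dumpId < te2->dumpId)
        return -1;
    if (te1->dumpId > te2->dumpId)
        return 1;
    return 0;
}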

While I haven't done so here, I'm rather tempted to rename the
par_prev/par_next fields and par_list_xxx functions to pending_prev,
pending_next, pending_list_xxx, since they now have only one use.
(BTW, I tried really hard to get rid of par_prev/par_next altogether,
in favor of keeping the pending entries in the unused space in the
"ready" TocEntry* array.  But it didn't work out well --- seems like
a list really is the natural data structure for that.)

regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 42cf441..ba79821 100644
*** a/src/bin/pg_dump/pg_backup.h
--- b/src/bin/pg_dump/pg_backup.h
*** extern void ConnectDatabase(Archive *AH,
*** 252,269 
  extern void DisconnectDatabase(Archive *AHX);
  extern PGconn *GetConnection(Archive *AHX);
  
- /* Called to add a TOC entry */
- extern void ArchiveEntry(Archive *AHX,
- 			 CatalogId catalogId, DumpId dumpId,
- 			 const char *tag,
- 			 const char *namespace, const char *tablespace,
- 			 const char *owner, bool withOids,
- 			 const char *desc, teSection section,
- 			 const char *defn,
- 			 const char *dropStmt, const char *copyStmt,
- 			 const DumpId *deps, int nDeps,
- 			 DataDumperPtr dumpFn, void *dumpArg);
- 
  /* Called to write *data* to the archive */
  extern void WriteData(Archive *AH, const void *data, size_t dLen);
  
--- 252,257 
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 45a391b..fce014f 100644
*** a/src/bin/pg_dump/pg_backup_archiver.c
--- b/src/bin/pg_dump/pg_backup_archiver.c
*** typedef struct _outputContext
*** 49,54 
--- 49,72 
  	int			gzOut;
  } OutputContext;
  
+ /*
+  * State for tracking TocEntrys that are ready to process during a parallel
+  * restore.  (This used to be a list, and we still call it that, though now
+  * it's really an array so that we can apply qsort to it.)
+  *
+  * tes[] is sized large enough that we can't overrun it.
+  * The valid entries are indexed first_te .. last_te inclusive.
+  * We periodically sort the array to bring larger-by-dataLength entries to
+  * the front; "sorted" is true if the valid entries are known sorted.
+  */
+ typedef struct _parallelReadyList
+ {
+ 	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
+ 	int			first_te;		/* index of first valid entry in tes[] */
+ 	int			last_te;		/* index of last valid entry in tes[] */
+ 	bool		sorted;			/* are valid entries currently sorted? */
+ } ParallelReadyList;
+ 
  /* translator: this is a module name */
  static const char *modulename = gettext_noop("archiver");
  
*** static void restore_toc_entries_postfork
*** 98,107 
  static void par_list_header_init(TocEntry *l);
  static void par_list_append(TocEntry *l, TocEntry *te);
  s

Re: csv format for psql

2018-08-28 Thread Fabien COELHO



Bonjour Daniel,


Currently popt->topt.fieldSepCsv can't be NULL so I've simplified this
to just return pset_quoted_string(popt->topt.fieldSepCsv).


Ok.


While testing I found a small issue if "fieldsep_csv" is set to a strange
value:

   \pset format_csv ',,'


You meant \pset fieldsep_csv ',,'


Indeed.


If you do that even SELECT 'foo', 'bar' comes out wrong because it looks
like a 3-field row: foo,,bar


Yes and no. I asked for ",," as a separator, so probably I really want 
that and I'd be okay with the result.



If we want to prevent people from shooting themselves in the foot with that
sort of thing, I've added a couple of tests: no double quote, no LF or
CR, and a single character (but multibyte allowed) for the separator.


Ok, why not.

Patch applies cleanly, compiles, "make check" ok.

I tried "\pset fieldsep_csv '\0'", which could be seen as one character, 
but it does not accept it. I'm okay with this behavior.


I'd suggest adding a test for rejected fieldsep_csv values, one that raises 
both errors.


--
Fabien.



Re: More parallel pg_dump bogosities

2018-08-28 Thread Tom Lane
... just when you thought it was safe to go back in the water ...

Doesn't pg_backup_archiver.c's identify_locking_dependencies() need to
treat POLICY and ROW SECURITY items as requiring exclusive lock on
the referenced table?  Those commands definitely acquire
AccessExclusiveLock in a quick test.

I haven't looked hard, but I'm suspicious that other recently-added
dump object types may have been missed here too, and even more
suspicious that we'll forget this again in future.  I wonder if we
shouldn't invert the logic, so that instead of a blacklist of object
types that we assume need exclusive lock, we keep a whitelist of
object types that are known not to (which might be just INDEX,
not sure).  That way, we'd at least be failing in a safe direction.
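
A sketch of what the inverted test could look like (illustrative only,
assuming INDEX is the lone whitelisted type):

/*
 * Sketch only: invert identify_locking_dependencies()'s test so that
 * unknown object types are assumed to need exclusive lock on the
 * referenced table, rather than assumed not to.
 */
static bool
dump_object_needs_exclusive_lock(const TocEntry *te)
{
    /*
     * Whitelist of object types known to take only a shared lock on the
     * table they reference; everything else is treated as exclusive, so
     * that an omission fails in the safe direction.
     */
    if (strcmp(te->desc, "INDEX") == 0)
        return false;

    return true;
}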

regards, tom lane



Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-28 Thread Andrey Borodin


> 28 Aug 2018, at 14:18, Alexander Korotkov wrote:
> 
> OK, but I think cube_c_f8() and cube_c_f8_f8() also need to be
> revised.  Also, I think this behavior should be covered by regression
> tests.
True. Also there's one case in cube_subset.

Best regards, Andrey Borodin.


cube_check_dim.diff
Description: Binary data


Re: More parallel pg_dump bogosities

2018-08-28 Thread Alvaro Herrera
On 2018-Aug-28, Tom Lane wrote:

> ... just when you thought it was safe to go back in the water ...
> 
> Doesn't pg_backup_archiver.c's identify_locking_dependencies() need to
> treat POLICY and ROW SECURITY items as requiring exclusive lock on
> the referenced table?  Those commands definitely acquire
> AccessExclusiveLock in a quick test.
> 
> I haven't looked hard, but I'm suspicious that other recently-added
> dump object types may have been missed here too,

I hadn't come across this locking dependency before, so it's pretty
likely that partitioned index attachment has a problem here.

> and even more suspicious that we'll forget this again in future.

... yeah, it seems easy to overlook the need to edit this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Catalog corruption

2018-08-28 Thread Asim R P
On Tue, Aug 28, 2018 at 7:36 AM, Mariel Cherkassky
 wrote:
> Afterwards I vacuumed all the databases, but nothing helped. I tried to
> reindex the databases, but I got the following error:
>
> 2018-08-28 21:50:03 +08 db2 24360  ERROR:  could not access status of
> transaction 32212695
> 2018-08-28 21:50:03 +08 db2 24360  DETAIL:  Could not open file
> "pg_subtrans/01EB": No such file or directory.
>

Are you sure you created a checkpoint before copying the data
directory over to your PC?



pg_dump --load-via-partition-root vs. parallel restore

2018-08-28 Thread Tom Lane
Parallel pg_restore generally assumes that the archive file is telling it
the truth about data dependencies; in particular, that a TABLE DATA item
naming a particular target table is loading data into exactly that table.
--load-via-partition-root creates a significant probability that that
assumption is wrong, at least in scenarios where the data really does get
redirected into other partitions than the original one.  This can result
in inefficiencies (e.g., index rebuild started before a table's data is
really all loaded) or outright failures (foreign keys or RLS policies
applied before the data is all loaded).  I suspect that deadlock failures
during restore are also possible, since identify_locking_dependencies
is not going to be nearly close to the truth about which operations
might hold which locks.

This could possibly be fixed by changing around the dependencies shown
in the archive file so that POST_DATA objects that're nominally dependent
on any one of a partitioned table's members are shown as dependent on all
of them.  I'm not particularly eager to write that patch though.

For the moment I'm inclined to just document the problem, e.g. "It's
recommended that parallel restore not be used with archives generated
with this option."

regards, tom lane



Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Andrey Borodin


> 28 Aug 2018, at 14:08, Stephen Frost wrote:
> 
> Greetings,
> 
> * David Steele (da...@pgmasters.net ) wrote:
>> On 8/28/18 8:32 AM, Stephen Frost wrote:
>> To be clear, pgBackRest uses the .ready files in archive_status to
>> parallelize archiving but still notifies PostgreSQL of completion via
>> the archive_command mechanism.  We do not modify .ready files to .done
>> directly.
> 
> Right, we don't recommend mucking around with that directory of files.
> Even if that works today (which you'd need to test extensively...),
> there's no guarantee that it'll work and do what you want in the
> future...
WAL-G modifies archive_status files.
This path was chosen to limit the state preserved between WAL-G runs (archiving 
to S3) and to further push archiving performance.
Indeed, it was very hard to test. Also, this makes it impossible to use two 
archiving systems simultaneously during a transition period.

Best regards, Andrey Borodin.

Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Stephen Frost
Greetings,

* Andrey Borodin (x4...@yandex-team.ru) wrote:
> > 28 Aug 2018, at 14:08, Stephen Frost wrote:
> > * David Steele (da...@pgmasters.net ) wrote:
> >> On 8/28/18 8:32 AM, Stephen Frost wrote:
> >> To be clear, pgBackRest uses the .ready files in archive_status to
> >> parallelize archiving but still notifies PostgreSQL of completion via
> >> the archive_command mechanism.  We do not modify .ready files to .done
> >> directly.
> > 
> > Right, we don't recommend mucking around with that directory of files.
> > Even if that works today (which you'd need to test extensively...),
> > there's no guarantee that it'll work and do what you want in the
> > future...
> WAL-G modifies archive_status files.

Frankly, I've heard far too many concerns and issues with WAL-G to
consider anything it does at all sensible.

> This path was chosen to limit state preserved between WAL-G runs (archiving 
> to S3) and further push archiving performance.

I still don't think it's a good idea and I specifically recommend
against making changes to the archive status files- those are clearly
owned and managed by PG and should not be whacked around by external
processes.

> Indeed, it was very hard to test. Also, this makes impossible to use two 
> archiving system simultaneously for transit period.

The testing in WAL-G seems to be rather lacking from what I've seen.

It's not clear to me what you're suggesting wrt 'two archiving systems
simultaneously during a transition period', but certainly whacking the archive
status files around doesn't make it easier to have multiple archive
commands trying to simultaneously archive WAL files.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Andrey Borodin



> 28 Aug 2018, at 17:07, Stephen Frost wrote:
> 
> Greetings,
> 
> * Andrey Borodin (x4...@yandex-team.ru) wrote:
>>> 28 Aug 2018, at 14:08, Stephen Frost wrote:
>>> * David Steele (da...@pgmasters.net ) wrote:
 On 8/28/18 8:32 AM, Stephen Frost wrote:
 To be clear, pgBackRest uses the .ready files in archive_status to
 parallelize archiving but still notifies PostgreSQL of completion via
 the archive_command mechanism.  We do not modify .ready files to .done
 directly.
>>> 
>>> Right, we don't recommend mucking around with that directory of files.
>>> Even if that works today (which you'd need to test extensively...),
>>> there's no guarantee that it'll work and do what you want in the
>>> future...
>> WAL-G modifies archive_status files.
> 
> Frankly, I've heard far too many concerns and issues with WAL-G to
> consider anything it does at all sensible.
Umm.. very interesting. What kind of issues? There are a few on the GitHub repo, 
all of which will be addressed. Do you have some other reports? Can you share them?
I'm open to discuss any concerns.

> 
>> This path was chosen to limit state preserved between WAL-G runs (archiving 
>> to S3) and further push archiving performance.
> 
> I still don't think it's a good idea and I specifically recommend
> against making changes to the archive status files- those are clearly
> owned and managed by PG and should not be whacked around by external
> processes.
If you do not write to archive_status, you basically have two options:
1. On every archive_command invocation, recheck that the file to be archived is
identical to the file that is already archived. This hurts performance.
2. Hope that the files match. This does not add any safety compared to whacking
archive_status. This approach is as prone to core changes as writes are.

Well, PostgreSQL clearly has a problem which can be solved by a good parallel
archiving API. Anything else is whacking around; just reading archive_status
is no better than reading and writing.

> 
>> Indeed, it was very hard to test. Also, this makes impossible to use two 
>> archiving system simultaneously for transit period.
> 
> The testing in WAL-G seems to be rather lacking from what I've seen.
Indeed, WAL-G still lacks automated integration tests; I hope that some 
dockerized tests will be added soon.
For now I'm doing automated QA in Yandex infrastructure.

Best regards, Andrey Borodin.


Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Stephen Frost
Greetings,

* Andrey Borodin (x4...@yandex-team.ru) wrote:
> > 28 Aug 2018, at 17:07, Stephen Frost wrote:
> > I still don't think it's a good idea and I specifically recommend
> > against making changes to the archive status files- those are clearly
> > owned and managed by PG and should not be whacked around by external
> > processes.
> If you do not write to archive_status, you basically have two options:
> 1. On every archive_command invocation, recheck that the file to be archived
> is identical to the file that is already archived. This hurts performance.

It's absolutely important to make sure that the files PG is asking to
archive have actually been archived, yes.

> 2. Hope that the files match. This does not add any safety compared to
> whacking archive_status. This approach is as prone to core changes as writes are.

This blindly assumes that PG won't care about some other process
whacking around archive status files and I don't think that's a good
assumption to be operating under, and certainly not under the claim
that it's simply a 'performance' improvement.

> Well, PostgreSQL clearly has a problem which can be solved by a good
> parallel archiving API. Anything else is whacking around; just reading
> archive_status is no better than reading and writing.

Pushing files which are indicated by archive status as being ready is
absolutely an entirely different thing from whacking around the status
files themselves which PG is managing itself.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Some pgq table rewrite incompatibility with logical decoding?

2018-08-28 Thread Jeremy Finzel
>
> We have hit this error again, and we plan to snapshot the database so as to
> be able to do whatever troubleshooting we can.
>

I am happy to report that we were able to get replication working again by
running snapshots of the systems in question on servers running the latest
point release 9.6.10, and replication simply works and skips over these
previously erroring relfilenodes.  So whatever fixes were made in this
point release to logical decoding seems to have fixed the issue.

Thanks,
Jeremy


Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Andrey Borodin



> 28 Aug 2018, at 17:41, Stephen Frost wrote:
> 
> Pushing files which are indicated by archive status as being ready is
> absolutely an entirely different thing from whacking around the status
> files themselves which PG is managing itself.
I disagree.
Returning the archive_command exit code based on a prior read of archive_status
is not safe.
"Absolutely an entirely different thing" is speculation, just like "jumping
out of the 5th floor is safer than jumping out of the 10th". If the archive is
not going to be monitored properly, do not whack archive_status at all.
pgBackRest is no safer than WAL-G in this respect; they are prone to the same
conditions, and changing the behavior of archive_status will affect them both.
I'm aware of the issue and monitor PG changes in this area. I do not pretend
that there cannot be any problem at all, or that this method will stay safe
forever. But for now it is.

Best regards, Andrey Borodin.


some pg_dump query code simplification

2018-08-28 Thread Peter Eisentraut
There is a lot of version dependent code in pg_dump now, especially
per-version queries.  The way they were originally built was that we
just have an entirely separate query per version.  But as the number of
versions and the size of the query grows, this becomes unwieldy.

So I tried, as an example, to write the queries in getTableAttrs()
differently.  Instead of repeating the almost same large query in each
version branch, use one query and add a few columns to the SELECT list
depending on the version.  This saves a lot of duplication and is easier
to deal with.  I think this patch is useful in its own right, but there
is also the larger question of whether people think this is a better
style going forward, for pg_dump and psql.  (It's already in use in psql
in some places.)  It won't always work, if the query structure is very
different between versions, but as this example shows, it can be useful.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 7aad405482cc70958cdba41e20ff1c2e4ffd3baf Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 28 Aug 2018 22:59:07 +0200
Subject: [PATCH] pg_dump: Reorganize getTableAttrs()

Instead of repeating the almost same large query in each version branch,
use one query and add a few columns to the SELECT list depending on the
version.  This saves a lot of duplication.
---
 src/bin/pg_dump/pg_dump.c | 189 ++
 1 file changed, 47 insertions(+), 142 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 6763a899d8..9a2ca37475 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8162,150 +8162,56 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int 
numTables)
 
resetPQExpBuffer(q);
 
+   appendPQExpBuffer(q,
+ "SELECT\n"
+ "a.attnum,\n"
+ "a.attname,\n"
+ "a.atttypmod,\n"
+ "a.attstattarget,\n"
+ "a.attstorage,\n"
+ "t.typstorage,\n"
+ "a.attnotnull,\n"
+ "a.atthasdef,\n"
+ "a.attisdropped,\n"
+ "a.attlen,\n"
+ "a.attalign,\n");
if (fout->remoteVersion >= 11)
-   {
-   /* atthasmissing and attmissingval are new in 11 */
-   appendPQExpBuffer(q, "SELECT a.attnum, a.attname, 
a.atttypmod, "
- "a.attstattarget, 
a.attstorage, t.typstorage, "
- "a.attnotnull, 
a.atthasdef, a.attisdropped, "
- "a.attlen, 
a.attalign, a.attislocal, "
- 
"pg_catalog.format_type(t.oid,a.atttypmod) AS atttypname, "
- 
"array_to_string(a.attoptions, ', ') AS attoptions, "
- "CASE WHEN 
a.attcollation <> t.typcollation "
- "THEN a.attcollation 
ELSE 0 END AS attcollation, "
- "a.attidentity, "
- 
"pg_catalog.array_to_string(ARRAY("
- "SELECT 
pg_catalog.quote_ident(option_name) || "
- "' ' || 
pg_catalog.quote_literal(option_value) "
- "FROM 
pg_catalog.pg_options_to_table(attfdwoptions) "
- "ORDER BY option_name"
- "), E',\n') AS 
attfdwoptions ,"
+   appendPQExpBuffer(q,
  "CASE WHEN 
a.atthasmissing AND NOT a.attisdropped "
- "THEN a.attmissingval 
ELSE null END AS attmissingval "
- "FROM 
pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_type t "
- "ON a.atttypid = 
t.oid "
- "WHERE a.attrelid = 
'%u'::pg_catalog.oid "
- "AND 

Re: Would it be possible to have parallel archiving?

2018-08-28 Thread David Steele
On 8/28/18 4:34 PM, Andrey Borodin wrote:
>>
>> I still don't think it's a good idea and I specifically recommend
>> against making changes to the archive status files- those are clearly
>> owned and managed by PG and should not be whacked around by external
>> processes.
> If you do not write to archive_status, you basically have two options:
> 1. On every archive_command invocation, recheck that the file to be archived
> is identical to the file that is already archived. This hurts performance.
> 2. Hope that the files match. This does not add any safety compared to
> whacking archive_status. This approach is as prone to core changes as writes are.

Another option is to maintain the state of what has been safely archived
(and what has errored) locally.  This allows pgBackRest to rapidly
return the status to Postgres without rechecking against the repository,
which as you note would be very slow.

This allows more than one archive_command to be safely run since all
archive commands must succeed before Postgres will mark the segment as done.

It's true that reading archive_status is susceptible to core changes but
the less interaction the better, I think.
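
For clarity, that read-only interaction amounts to scanning archive_status for
.ready markers to build a work queue, while completion is still reported only
through archive_command.  A rough sketch (names and path handling are
illustrative; the directory lives under pg_xlog in 9.x and pg_wal in 10+):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch only: enumerate WAL segments the server has flagged as ready to
 * archive, by reading (never writing) the .ready marker files.
 */
static void
list_ready_segments(const char *archive_status_dir)
{
    DIR        *dir = opendir(archive_status_dir);
    struct dirent *de;

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        size_t      len = strlen(de->d_name);

        if (len > 6 && strcmp(de->d_name + len - 6, ".ready") == 0)
        {
            /* strip the ".ready" suffix to get the WAL file name */
            printf("%.*s\n", (int) (len - 6), de->d_name);
        }
    }
    closedir(dir);
}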

Regards,
-- 
-David
da...@pgmasters.net



Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Stephen Frost
Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 8/28/18 4:34 PM, Andrey Borodin wrote:
> >>
> >> I still don't think it's a good idea and I specifically recommend
> >> against making changes to the archive status files- those are clearly
> >> owned and managed by PG and should not be whacked around by external
> >> processes.
> > If you do not write to archive_status, you basically have two options:
> > 1. On every archive_command invocation, recheck that the file to be archived
> > is identical to the file that is already archived. This hurts performance.
> > 2. Hope that the files match. This does not add any safety compared to
> > whacking archive_status. This approach is as prone to core changes as writes are.
> 
> Another option is to maintain the state of what has been safely archived
> (and what has errored) locally.  This allows pgBackRest to rapidly
> return the status to Postgres without rechecking against the repository,
> which as you note would be very slow.
> 
> This allows more than one archive_command to be safely run since all
> archive commands must succeed before Postgres will mark the segment as done.
> 
> It's true that reading archive_status is susceptible to core changes but
> the less interaction the better, I think.

Absolutely.  External processes shouldn't be changing the files written
out and managed by the core system.  pgbackrest is *much* safer than
alternatives which hack around files inside of PGDATA.  We've been
working to move things forward to the point where pgbackrest is able to
be run as a user who hasn't even got access to modify those files (which
has now landed in PG11) and for good reason: it's outright dangerous to
do.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Would it be possible to have parallel archiving?

2018-08-28 Thread Michael Paquier
On Tue, Aug 28, 2018 at 05:20:10PM -0400, Stephen Frost wrote:
> Absolutely.  External processes shouldn't be changing the files written
> out and managed by the core system.

+1.  Archiving commands should not mess up directly with the contents
the backend is managing.
--
Michael


signature.asc
Description: PGP signature


Re: some pg_dump query code simplification

2018-08-28 Thread Tom Lane
Peter Eisentraut  writes:
> There is a lot of version dependent code in pg_dump now, especially
> per-version queries.  The way they were originally built was that we
> just have an entirely separate query per version.  But as the number of
> versions and the size of the query grows, this becomes unwieldy.

Agreed, it's becoming a bit of a mess.

> So I tried, as an example, to write the queries in getTableAttrs()
> differently.  Instead of repeating the almost same large query in each
> version branch, use one query and add a few columns to the SELECT list
> depending on the version.  This saves a lot of duplication and is easier
> to deal with.

I think I had this discussion already with somebody, but ... I do not
like this style at all:

tbinfo->attidentity[j] = (i_attidentity >= 0 ? *(PQgetvalue(res, j, 
i_attidentity)) : '\0');

It's basically throwing away all the cross-checking that exists now to
ensure that you didn't forget to deal with a column, misspell the column's
AS name in one place or the other, etc etc.  The code could be completely
wrong and it'd still fail silently, producing a default result that might
well be the right answer, making it hard to debug.  I think the way to do
this is to continue to require the query to produce the same set of
columns regardless of server version.  So the version-dependent bits would
look like, eg,

if (fout->remoteVersion >= 11)
appendPQExpBufferStr(q,
 "CASE WHEN a.atthasmissing AND NOT 
a.attisdropped "
 "THEN a.attmissingval ELSE null END AS 
attmissingval,\n");
else
appendPQExpBufferStr(q,
 "null AS attmissingval,\n");

and the data extraction code would remain the same as now.

Other than that nit, I agree that this shows a lot of promise.
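
(For concreteness, the extraction side would then stay unconditional, roughly
like this sketch rather than the exact pg_dump code; res, ntups and tbinfo are
assumed to be in scope as in getTableAttrs():)

    /*
     * Sketch only: because every server version's query emits the column,
     * possibly as "null AS attmissingval", the extraction code can look it
     * up by name unconditionally, so a misspelled alias fails visibly in
     * testing instead of being silently defaulted.
     */
    int     i_attmissingval = PQfnumber(res, "attmissingval");
    int     j;

    for (j = 0; j < ntups; j++)
        tbinfo->attmissingval[j] = pg_strdup(PQgetvalue(res, j, i_attmissingval));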

regards, tom lane



Re: Some pgq table rewrite incompatibility with logical decoding?

2018-08-28 Thread Tomas Vondra
Hi Jeremy,

On 08/28/2018 10:46 PM, Jeremy Finzel wrote:
> We have hit this error again, and we plan to snapshot the database
> as to be able to do whatever troubleshooting we can. 
> 
> 
> I am happy to report that we were able to get replication working again
> by running snapshots of the systems in question on servers running the
> latest point release 9.6.10, and replication simply works and skips over
> these previously erroring relfilenodes.  So whatever fixes were made in
> this point release to logical decoding seems to have fixed the issue.
> 

Interesting.

So you were running 9.6.9 before, it triggered the issue (and was not
able to recover). You took a filesystem snapshot, started a 9.6.10 on
the snapshot, and it recovered without hitting the issue?

I quickly went through the commits in 9.6 branch between 9.6.9 and
9.6.10, looking for stuff that might be related, and these three commits
seem possibly related (usually because of invalidations, vacuum, ...):

  6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
  Fix bugs in vacuum of shared rels, by keeping their relcache entries
  current.

  da10d6a8a94eec016fa072d007bced9159a28d39
  Fix "base" snapshot handling in logical decoding

  0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
  Add table relcache invalidation to index builds.

But it's hard to say if/which of those commits did the trick, without
more information.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some pg_dump query code simplification

2018-08-28 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Peter Eisentraut  writes:
> > There is a lot of version dependent code in pg_dump now, especially
> > per-version queries.  The way they were originally built was that we
> > just have an entirely separate query per version.  But as the number of
> > versions and the size of the query grows, this becomes unwieldy.
> 
> Agreed, it's becoming a bit of a mess.
> 
> > So I tried, as an example, to write the queries in getTableAttrs()
> > differently.  Instead of repeating the almost same large query in each
> > version branch, use one query and add a few columns to the SELECT list
> > depending on the version.  This saves a lot of duplication and is easier
> > to deal with.
> 
> I think I had this discussion already with somebody, but ... I do not
> like this style at all:
> 
> tbinfo->attidentity[j] = (i_attidentity >= 0 ? *(PQgetvalue(res, 
> j, i_attidentity)) : '\0');
> 
> It's basically throwing away all the cross-checking that exists now to
> ensure that you didn't forget to deal with a column, misspell the column's
> AS name in one place or the other, etc etc.  The code could be completely
> wrong and it'd still fail silently, producing a default result that might
> well be the right answer, making it hard to debug.  I think the way to do
> this is to continue to require the query to produce the same set of
> columns regardless of server version.  So the version-dependent bits would
> look like, eg,
> 
> if (fout->remoteVersion >= 11)
> appendPQExpBufferStr(q,
>  "CASE WHEN a.atthasmissing AND NOT 
> a.attisdropped "
>  "THEN a.attmissingval ELSE null END AS 
> attmissingval,\n");
> else
> appendPQExpBufferStr(q,
>  "null AS attmissingval,\n");
> 
> and the data extraction code would remain the same as now.
> 
> Other than that nit, I agree that this shows a lot of promise.

+1 to Tom's thoughts on this- seems like a much better approach.

Overall, I definitely agree that it'd be nice to reduce the amount of
duplication happening today.  Working out some way to actually test all
of those queries would be awful nice too.

I wonder- what if we had an option to pg_dump to explicitly tell it what
the server's version is and then have TAP tests to run with different
versions?  There may be some cases where that doesn't work, of course,
and we'd need to address that, but having those tests might reduce the
number of times we end up with something committed which doesn't work at
all against some older version of PG because it wasn't tested...

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Some pgq table rewrite incompatibility with logical decoding?

2018-08-28 Thread Tomas Vondra
On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
> Jeremy, are you able to reproduce the issue locally, using pgq?
> That would be very valuable.
> 
> 
> Tomas et al:
> 
> We have hit this error again, and we plan to snapshot the database so as to
> be able to do whatever troubleshooting we can.  If someone could provide
> me guidance as to what exactly you would like me to do, please let me
> know.  I am able to provide an xlog dump and also debugging information
> upon request.
> 
> This is actually a different database system that also uses skytools,
> and the exact same table (pgq.event_58_1) is again the cause of the
> relfilenode error.  I did a point-in-time recovery to a point after this
> relfilenode appears using pg_xlogdump, and verified this was the table
> that appeared, then disappeared.
> 

Great!

Can you attach to the decoding process using gdb, and set a breakpoint
to the elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN /
record it fails?

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/replication/logical/reorderbuffer.c#L1599

If it fails too fast, making it difficult to attach gdb before the
crash, adding the LSN to the log message might be easier.
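
For example, a hypothetical tweak along those lines (assuming the change's
LSN is in scope as change->lsn at that point):

    /* hypothetical: include the change's LSN in the existing error message */
    elog(ERROR, "could not map filenode \"%s\" to relation OID at %X/%X",
         relpathperm(change->data.tp.relnode, MAIN_FORKNUM),
         (uint32) (change->lsn >> 32), (uint32) change->lsn);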

Once we have the LSN, it would be useful to see the pg_xlogdump
before/around that position.

Another interesting piece of information would be to know the contents
of the relmapper cache (which essentially means stepping through
RelationMapFilenodeToOid or something like that).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some pg_dump query code simplification

2018-08-28 Thread Tom Lane
Stephen Frost  writes:
> I wonder- what if we had an option to pg_dump to explicitly tell it what
> the server's version is and then have TAP tests to run with different
> versions?

Uh ... telling it what the version is doesn't make that true, so I'd
have no confidence in a test^H^H^H^Hkluge done that way.  The way
to test is to point it at an *actual* back-branch server.

Andrew has a buildfarm module that does precisely that, although
I'm not sure what its test dataset is --- probably the regression
database from each branch.  I also have a habit of doing such testing
manually whenever I touch version-sensitive parts of pg_dump.

Dunno about the idea of running the pg_dump TAP tests against back
branches.  I find that code sufficiently unreadable that maintaining
several more copies of it doesn't sound like fun at all.

regards, tom lane



Re: some pg_dump query code simplification

2018-08-28 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I wonder- what if we had an option to pg_dump to explicitly tell it what
> > the server's version is and then have TAP tests to run with different
> > versions?
> 
> Uh ... telling it what the version is doesn't make that true, so I'd
> have no confidence in a test^H^H^H^Hkluge done that way.  The way
> to test is to point it at an *actual* back-branch server.

I certainly agree that this would be ideal, but nonetheless, I've seen
multiple cases where just trying to run the query, even against a
current version, would have shown that it's malformed or has some issue
which needs fixing and today we haven't even got that.

> Andrew has a buildfarm module that does precisely that, although
> I'm not sure what its test dataset is --- probably the regression
> database from each branch.  I also have a habit of doing such testing
> manually whenever I touch version-sensitive parts of pg_dump.

I've gotten better about doing that back-branch testing myself and
certainly prefer it, but I think we should also have buildfarm coverage.
I don't think we have the full matrix covered, or, really, anything
anywhere near it, so I'm looking for other options to at least get that
code exercised.

> Dunno about the idea of running the pg_dump TAP tests against back
> branches.  I find that code sufficiently unreadable that maintaining
> several more copies of it doesn't sound like fun at all.

I hadn't been thinking we'd need more copies of it, simply a few more
runs which have different version values specified, though even just
doing a pg_dump of the regression suite for each combination would be
something.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: More parallel pg_dump bogosities

2018-08-28 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Aug-28, Tom Lane wrote:
>> Doesn't pg_backup_archiver.c's identify_locking_dependencies() need to
>> treat POLICY and ROW SECURITY items as requiring exclusive lock on
>> the referenced table?  Those commands definitely acquire
>> AccessExclusiveLock in a quick test.

> I hadn't come across this locking dependency before, so it's pretty
> likely that partitioned index attachment has a problem here.

Hm, it looks like

ALTER INDEX public.at_partitioned_a_idx ATTACH PARTITION public.at_part_1_a_idx;

takes these locks:

       relation        |        mode
-----------------------+---------------------
 at_part_1             | AccessShareLock
 at_partitioned        | AccessShareLock
 at_part_1_a_idx       | AccessExclusiveLock
 at_partitioned_a_idx  | AccessExclusiveLock

I'm not aware of exactly what this does to catalog entries, but is it
*really* safe to have only AccessShareLock on the tables?  That sounds
like a bit of wishful thinking :-(.

In any case, the exclusive locks on the indexes are likely sufficient
to block other operations on the tables (maybe leading to deadlocks),
so I'm inclined to think that yeah, parallel restore should refrain
from running this in parallel with other DDL on either table.

regards, tom lane



Re: some pg_dump query code simplification

2018-08-28 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Stephen Frost  writes:
>>> I wonder- what if we had an option to pg_dump to explicitly tell it what
>>> the server's version is and then have TAP tests to run with different
>>> versions?

>> Uh ... telling it what the version is doesn't make that true, so I'd
>> have no confidence in a test^H^H^H^Hkluge done that way.  The way
>> to test is to point it at an *actual* back-branch server.

> I certainly agree that this would be ideal, but nonetheless, I've seen
> multiple cases where just trying to run the query, even against a
> current version, would have shown that it's malformed or has some issue
> which needs fixing and today we haven't even got that.

Yeah, but cases where we need to touch column C in one version and column
D in another can't be made to pass when pg_dump is operating under a false
assumption about the server version.  So this seems like a nonstarter.

regards, tom lane



Re: some pg_dump query code simplification

2018-08-28 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Stephen Frost  writes:
> >>> I wonder- what if we had an option to pg_dump to explicitly tell it what
> >>> the server's version is and then have TAP tests to run with different
> >>> versions?
> 
> >> Uh ... telling it what the version is doesn't make that true, so I'd
> >> have no confidence in a test^H^H^H^Hkluge done that way.  The way
> >> to test is to point it at an *actual* back-branch server.
> 
> > I certainly agree that this would be ideal, but nonetheless, I've seen
> > multiple cases where just trying to run the query, even against a
> > current version, would have shown that it's malformed or has some issue
> > which needs fixing and today we haven't even got that.
> 
> Yeah, but cases where we need to touch column C in one version and column
> D in another can't be made to pass when pg_dump is operating under a false
> assumption about the server version.  So this seems like a nonstarter.

Yes, there are cases that wouldn't work.

I'd be much happier with a complete solution, were that forthcoming.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Why hash OIDs?

2018-08-28 Thread Thomas Munro
On Wed, Aug 29, 2018 at 1:37 AM Tom Lane  wrote:
> Andres Freund  writes:
> > On 2018-08-28 14:45:49 +1200, Thomas Munro wrote:
> >> Yeah, it would be a terrible idea as a general hash function for use
> >> in contexts where the "avalanche effect" assumption is made about
> >> information being spread out over the bits (the HJ batching code
> >> wouldn't work for example).  I was wondering specifically about the
> >> limited case of hash tables that are used to look things up in caches.
>
> > I don't understand why that'd be ok there? With a simple 1:1 hash
> > function, which you seem to advocate, many hash-tables would be much
> > fuller in the 1000-3000 (where pg_class, etc all reside) than in any
> > other range.  A lot of our tables are populated on-demand, so you'd
> > often end up with most of the data in one or two buckets, and a larger
> > number largely empty.
>
> Hmm ... but what we actually care about, for dynahash tables, is just
> the low order bits of the hash; the original range of the values
> isn't interesting.  Some desultory experimentation with queries like
>
> select oid::int % 1024, count(*) from pg_proc
> group by 1 order by 2 desc;
>
> select (hashoid(oid)+2^32)::int8 % 1024, count(*) from pg_proc
> group by 1 order by 2 desc;
>
> offers some support for Thomas' idea: the hashoid results are
> certainly not better distributed than the raw OIDs, and in some
> cases noticeably worse.

Yeah, I noticed that too.  In that example we'd presumably be using
4096 buckets for 2128 procedures, but the results are similar.  Of
course a bad example can also be contrived (as it can for any
hash function, with varying amounts of work).  The question is just
whether such a situation is likely to be encountered unintentionally
with trivial identity hashes.  An argument against this idea is that
it's one of those situations where even though you can say "oh, that's
really unlikely to happen", if it does happen in your schema for
whatever reason it's persistent (cf https://xkcd.com/221/).

> I'd supposed that we needed to get the OIDs' high-order bits to
> participate in the hash in order to get decent results, but
> looking at these numbers makes me wonder.
>
> OTOH, I'm also not convinced it's worth messing with.  In principle
> we could shave some cycles with a no-op hash function, but I've not
> seen anything to make me think that hashoid() is a measurable
> bottleneck.

I do think there is probably a little bit more performance to squeeze
out of syscache though.  For example, I ran Andres's
pgbench-many-cols.sql[1] and saw a ~3% TPS increase just by sticking
pg_attribute_always_inline in front of SearchCatCacheInternal()
(single threaded pgbench, clang, -O2).  That may be in the range of
random performance changes caused by code movement, but the logical
extreme of inlining/specialisation seems like it should be a win: the
fast path of a syscache lookup could be inlined by the caller,
including the hash and eq functions (rather than going through
function pointers).  If we had identity OID hashing, I suppose you'd
be left with just a few instructions: mask the OID, index into the
bucket array, compare pointer, unlikely jump, compare the OID,
unlikely jump.  That said, a few extra instructions for murmur hash
hoop jumping might not make much difference compared to the benefits
of inlining/specialisation.

My real interest here is something similar to OIDs but not the same:
numbered undo logs.  I'll have more to say about that soon, but
basically I went from looking things up by index in arrays in DSM
segments in my previous version to a new system that doesn't require
DSM segments anymore but does require a hash table.  I found myself
wondering why I need to even use a hash function for my integer keys,
which seem to be pretty close to 'perfect' hashes and the generated
code is pleasingly close to the array index lookup I had before.
Apparently it's good enough for the designers of C++ standard
libraries, the Java standard library, etc etc to use trivial identity
hashing for integers, while not knowing anything at all about key
distribution.  I suspect it probably works out pretty well in a lot of
cases as long as you don't have a sequence generator that makes gaps
with big power of two strides, don't play tricks that read arbitrary
subsets of the bits, and use a strong hash_combine for compound keys.
OIDs seem to come close...
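
For compound keys, what I have in mind is nothing fancier than the sketch
below (illustration only: hash_identity and hash_combine are invented
names, and the mixing constants are the classic boost-style ones, used
here as an assumption rather than a concrete proposal).

/*
 * Illustration only: trivial identity hashing for single integer keys,
 * plus a boost-style hash_combine for compound keys.
 */
#include <stdint.h>
#include <stdio.h>

static inline uint32_t
hash_identity(uint32_t key)
{
    return key;                 /* the whole "hash function" */
}

static inline uint32_t
hash_combine(uint32_t a, uint32_t b)
{
    /* spread b's bits into a so field order and gaps still matter */
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}

int
main(void)
{
    /* a compound key, e.g. (undo log number, segment number) */
    uint32_t    h = hash_combine(hash_identity(42), hash_identity(7));

    printf("%08x\n", h);
    return 0;
}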

[1] 
https://www.postgresql.org/message-id/20170922061545.wkbzt7b6p6x47bzg%40alap3.anarazel.de

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Why hash OIDs?

2018-08-28 Thread Thomas Munro
On Wed, Aug 29, 2018 at 2:09 AM Robert Haas  wrote:
> On Mon, Aug 27, 2018 at 10:12 PM, Andres Freund  wrote:
> > Huh? Oids between, say, 1 and FirstNormalObjectId, are vastly more
> > common than the rest. And even after that, individual tables get large
> > clusters of sequential values to the global oid counter.
>
> Sure, but if you get a large cluster of sequential values, a straight
> mod-N bucket mapping works just fine.   I think the bigger problem is
> that you might get a large cluster of values separated by exactly a
> power of 2.  For instance, say you have one serial column and one
> index:
>
> rhaas=# create table a (x serial primary key);
> CREATE TABLE
> rhaas=# create table b (x serial primary key);
> CREATE TABLE
> rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
>   oid  |  oid
> ---+---
>  16422 | 16430
> (1 row)
>
> If you have a lot of tables like that, bad things are going to happen
> to your hash table.

Right.  I suppose that might happen accidentally when creating a lot
of partitions.

Advance the OID generator by some prime number after every CREATE TABLE?

/me ducks
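
To put a number on Robert's scenario, here is a throwaway program (just
arithmetic, not anything from the tree) counting how many buckets
identity-hashed keys with a power-of-two stride land in when the bucket
number is taken from the low bits.  The stride of 8 and the starting OID
16422 come from Robert's example; the key and bucket counts are picked
for the illustration.

/*
 * Toy demonstration of the power-of-two-stride problem: keys spaced
 * 8 apart, bucketed by masking the low bits, use only 1/8 of the buckets.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUCKETS 1024           /* power of two, as in dynahash */
#define NKEYS    1000
#define STRIDE   8              /* e.g. one serial + one pkey index per table */

int
main(void)
{
    bool        used[NBUCKETS] = {false};
    int         nused = 0;

    for (int i = 0; i < NKEYS; i++)
    {
        unsigned    key = 16422 + (unsigned) i * STRIDE;
        unsigned    bucket = key & (NBUCKETS - 1);

        if (!used[bucket])
        {
            used[bucket] = true;
            nused++;
        }
    }
    printf("%d keys landed in %d of %d buckets\n", NKEYS, nused, NBUCKETS);
    /* prints: 1000 keys landed in 128 of 1024 buckets */
    return 0;
}

Whether that degree of clustering actually matters in practice is a
separate question, of course.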

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Postmaster doesn't send SIGTERM to bgworker during fast shutdown when pmState == PM_STARTUP

2018-08-28 Thread Michael Paquier
On Mon, Aug 27, 2018 at 07:34:55PM -0700, Michael Paquier wrote:
> That seems like a good catch and a correct fix to me.  The handling of
> SIGINT is inconsistent with SIGTERM in pmdie().  I would just add a
> comment to mention that at this stage only the startup process is
> running, and that it has been signaled already.  I'll commit that
> tomorrow.

I have been studying your patch, but it seems to me that this is not
complete as other processes could have been started before switching
from PM_STARTUP to PM_RECOVERY.  I am talking here about the bgwriter
and the checkpointer as well.  Shouldn't we switch pmState to
PM_WAIT_BACKENDS?  Your patch is missing that.
--
Michael


signature.asc
Description: PGP signature


Re: Why hash OIDs?

2018-08-28 Thread Thomas Munro
On Wed, Aug 29, 2018 at 11:05 AM Thomas Munro
 wrote:
> On Wed, Aug 29, 2018 at 2:09 AM Robert Haas  wrote:
> > rhaas=# create table a (x serial primary key);
> > CREATE TABLE
> > rhaas=# create table b (x serial primary key);
> > CREATE TABLE
> > rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
> >   oid  |  oid
> > ---+---
> >  16422 | 16430
> > (1 row)
> >
> > If you have a lot of tables like that, bad things are going to happen
> > to your hash table.
>
> Right.  I suppose that might happen accidentally when creating a lot
> of partitions.
>
> Advance the OID generator by some prime number after every CREATE TABLE?
>
> /me ducks

Erm, s/prime/random/.   Or use a different OID generator for each
catalogue so that attributes etc don't create gaps in pg_class OIDs.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Why hash OIDs?

2018-08-28 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Aug 29, 2018 at 11:05 AM Thomas Munro
>  wrote:
>> On Wed, Aug 29, 2018 at 2:09 AM Robert Haas  wrote:
>>> rhaas=# create table a (x serial primary key);
>>> CREATE TABLE
>>> rhaas=# create table b (x serial primary key);
>>> CREATE TABLE
>>> rhaas=# select 'a'::regclass::oid, 'b'::regclass::oid;
>>> oid  |  oid
>>> ---+---
>>> 16422 | 16430
>>> (1 row)
>>> If you have a lot of tables like that, bad things are going to happen
>>> to your hash table.

>> Right.  I suppose that might happen accidentally when creating a lot
>> of partitions.
>> Advance the OID generator by some prime number after every CREATE TABLE?

> Erm, s/prime/random/.   Or use a different OID generator for each
> catalogue so that attributes etc don't create gaps in pg_class OIDs.

I think this argument is a red herring TBH.  The example Robert shows is
of *zero* interest for dynahash or catcache, unless it's taking only the
low order 3 bits of the OID for the bucket number.  But actually we'll
increase the table size proportionally to the number of entries, so
that you can't have say 1000 table entries without at least 10 bits
being used for the bucket number.  That means that you'd only have
trouble if those 1000 tables all had OIDs exactly 1K (or some multiple
of that) apart.  Such a case sounds quite contrived from here.

regards, tom lane



Re: Reopen logfile on SIGHUP

2018-08-28 Thread Kyotaro HORIGUCHI
At Tue, 21 Aug 2018 11:27:46 +0900, Michael Paquier  wrote 
in <20180821022745.ge2...@paquier.xyz>
> On Tue, Aug 21, 2018 at 09:26:54AM +0900, Kyotaro HORIGUCHI wrote:
> > I suspect that something bad can happen on Windows.
> 
> [troll mode]
> More and even worse things than that could happen on Windows.
> [/troll mode]

I don't have a candy for you just now:p

Well, I take that to mean that removing files in signal handlers is not
a problem on Windows.

thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Copy function for logical replication slots

2018-08-28 Thread Masahiko Sawada
On Tue, Aug 28, 2018 at 10:34 PM, Michael Paquier  wrote:
> On Tue, Aug 28, 2018 at 04:14:04PM +0900, Masahiko Sawada wrote:
>> I think the copying from a slot that already reserved WAL would be
>> helpful for backup cases (maybe you suggested?). Also, either way we
>> need to make a safe logic of acquring and releasing the source slot
>> for logical slots cases. Or you meant  to restrict the case where the
>> copying a slot that doesn't reserve WAL?
>
> I mean the latter, as-known-as there is no actual point in being able to
> copy WAL which does *not* reserve WAL.

Agreed. I'll restrict that case in the next version.

>
>>> Does it actually make sense to allow copy of temporary slots or change
>>> their persistence?  Those don't live across sessions so they'd need to
>>> be copied in the same session which created them.
>>
>> I think the copying of temporary slots would be an impracticable
>> feature but the changing their persistence might be helpful for some
>> cases, especially copying from persistent to temporary.
>
> The session doing the copy of a permanent slot to the temporary slot has
> to be the one also consuming it as the session which created the slot
> owns it, and the slot would be dropped when the session ends.  For
> logical slots perhaps you have something in mind?  Like copying a slot
> which is not active to check where it is currently replaying, and using
> the copy for sanity checks?

Yeah, I imagined such a case. If we want to do a backup or logical
decoding from the same point as the source permanent slot, we might want
to use a temporary slot so that it is sure to be dropped afterwards.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

2018-08-28 Thread Michael Paquier
Hi all,

Here is a summary of what has happened since this thread has been
created.  Three problems reported on this thread have been solved and
resulted in different commits for early lock lookups:
- VACUUM FULL, patched on 12~:
https://www.postgresql.org/message-id/2018081142.ga6...@paquier.xyz
Commit a556549: Improve VACUUM and ANALYZE by avoiding early lock queue
- TRUNCATE, patched on 12~:
https://www.postgresql.org/message-id/20180806165816.ga19...@paquier.xyz
Commit f841ceb: Improve TRUNCATE by avoiding early lock queue
- REINDEX, patched on 11~:
https://www.postgresql.org/message-id/20180805211059.ga2...@paquier.xyz
Commit 661dd23: Restrict access to reindex of shared catalogs for
non-privileged users

Please note that I have been very conservative with the different fixes
as v11 is getting very close to release.  The patch for REINDEX is a
behavior change, so it will not be backpatched any further anyway.  It
would still be nice to take a second look at the code and check whether
there are other suspicious calls of relation_open or the like which could
allow non-privileged users to pile up locks and cause more DOS problems.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: Reopen logfile on SIGHUP

2018-08-28 Thread Kyotaro HORIGUCHI
At Tue, 28 Aug 2018 18:50:31 +0300, Alexander Korotkov 
 wrote in 

> On Tue, Aug 21, 2018 at 4:48 AM Kyotaro HORIGUCHI
>  wrote:
> >
> > At Fri, 10 Aug 2018 15:33:26 +0300, Alexander Kuzmenkov 
> >  wrote in 
> > <5142559a-82e6-b3e4-d6ed-8fd2d240c...@postgrespro.ru>
> > > On 08/09/2018 10:33 AM, Kyotaro HORIGUCHI wrote:
> > > >
> > > > - Since I'm not sure unlink is signal-handler safe on all
> > > >supported platforms, I moved unlink() call out of
> > > >checkLogrotateSignal() to SysLoggerMain, just before rotating
> > > >log file.
> > >
> > > Which platforms specifically do you have in mind? unlink() is required
> > > to be async-signal-safe by POSIX.1-2001, which is required by UNIX 03,
> > > and these are rather old.
> >
> > I suspect that something bad can happen on Windows. Another
> > reason for the movement I didn't mention was it is not necesarry
> > to be there. So I applied the principle that a signal handler
> > should be as small and simple as possible.  As the result the
> > flow of logrotate signal handling becomes similar to that for
> > promote signal.
> 
> I went through this thread.  It started from discussion about changing
> signal handling in syslogger, which has spotted set of problems.
> However, now there is a patch which add new pg_ctl command, which
> issues SIGUSR1 to syslogger.  It seems that nobody in the thread
> object against this feature.

Agreed.

> I've revised this patch a bit.  It appears to me that only postmaster
> cares about logrotate file, while syslogger just handles SIGUSR1 as it
> did before.  So, I've moved code that deletes logrotate file into
> postmaster.c.

As I replied to Michael's comment, I agree with the change.

> Also I found that this new pg_ctl isn't covered with tests at all.  So
> I've added very simple tap tests, which ensures that when log file was
> renamed, it reappeared again after pg_ctl logrotate.  I wonder how
> that would work on Windows.  Thankfully commitfest.cputube.org have
> Windows checking facility now.

Thanks for the test. The documentation and help message look fine,
including the changed ordering. (A 180-second retry may be a bit too
long, but I'm fine with it.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
Hi Everyone,


Tom, I think this could use your eyes.


On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> I've a few leads that I'm currently testing out. One observation I think
> might be crucial is that the problem, in Tomas' testcase with just
> VACUUM FULL of pg_class and INSERTs into another rel, appears to only
> happen if autovacuum is enabled. The crashes are always immediately
> after autovacuum did an analyze of the relation.

I've run this with autovacuum disabled for quite a while, without the
problem re-occurring. It does, however, re-occur with pgbench running an
ANALYZE on the user defined table.


> What I see on the apply side, I think, that invalidation records for
> contents of pg_class and pg_class itself are intermixed.

I've not 100% figured this out yet, and I need to switch to another task
for a bit, but what I think is happening here is that we have a
relatively fundamental issue with how rewrites of catalog tables are
handled.  A slightly instrumented version of postgres shows the
following:

2018-08-29 04:30:17.515 CEST [19551] LOG:  statement: INSERT INTO t(b,c) SELECT 
i,i FROM generate_series(1,59) s(i);
...
2018-08-29 04:30:17.559 CEST [19554] LOG:  statement: vacuum full pg_class;
2018-08-29 04:30:17.571 CEST [19557] LOG:  swapping mapped 1259/1281520 w 
1281526/1281526
2018-08-29 04:30:17.571 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.572 CEST [19557] LOG:  reindexing 2662 to 1281529 on 
1259/1281526
2018-08-29 04:30:17.572 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.576 CEST [19557] LOG:  reindexing 2663 to 1281530 on 
1259/1281526
2018-08-29 04:30:17.576 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.578 CEST [19557] LOG:  reindexing 3455 to 1281531 on 
1259/1281526
2018-08-29 04:30:17.578 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.582 CEST [19557] LOG:  releasing lock on 1259
2018-08-29 04:30:17.582 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.582 CEST [19557] LOG:  duration: 87.389 ms
2018-08-29 04:30:17.582 CEST [19557] LOG:  statement: vacuum full pg_class;
2018-08-29 04:30:17.583 CEST [19563] LOG:  duration: 630.481 ms

2018-08-29 04:30:17.584 CEST [19563] LOG:  statement: analyze t;
2018-08-29 04:30:17.597 CEST [19555] LOG:  swapping mapped 1259/1281526 w 
1281532/1281532
2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.597 CEST [19555] LOG:  reindexing 2662 to 1281535 on 
1259/1281532
2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.599 CEST [19555] LOG:  reindexing 2663 to 1281536 on 
1259/1281532
2018-08-29 04:30:17.599 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.601 CEST [19555] LOG:  reindexing 3455 to 1281537 on 
1259/1281532
2018-08-29 04:30:17.601 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.605 CEST [19555] LOG:  releasing lock on 1259
2018-08-29 04:30:17.605 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.605 CEST [19551] WARNING:  could not read block 3 in file 
"base/16384/1281529": read only 0 of 8192 bytes
(it's a warning because the process SIGSTOPs itself afterwards, for
easier debugging)

So what we can see is that a first VACUUM FULL of pg_class builds a new
pg_class_oid_index/2662 index under relfilenode 1281526.  Then a
*second* VACUUM FULL reindexes 2662 again, this time under 1281532.

But *after* the VACUUM FULL has finished and released the lock (that's
printed just before the UnlockRelationIdForSession in vacuum_rel()), we
can see that the INSERT from pid 19551 still accesses the OLD index,
from *before* the second VACUUM FULL.  Which at that point doesn't exist
anymore.

I think the reason for that is that an invalidation for the relmapper
changes due to the second VACUUM has been queued, but currently the
invalidation processing hasn't gotten to that entry yet.  Instead, the
entry for the user-defined table is currently being rebuilt, which then
tries to access pg_class, but under its "old" relfilenode.  Which
then fails.  The fact that index scans trigger this problem isn't
relevant - the heap entry for pg_class also refers to the old entry, as
I've verified using gdb.


I'm unclear on how we can fix this, as long as we rebuild cache entries
during invalidation, rather than during access. For most cache entries
it's not a problem to rebuild them relying on other outdated content,
we'll just rebuild them again later.  But for things like the relation
mapper that's not true, unfortunately.
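
To restate the ordering problem in code form, here is a purely schematic
mock-up (none of these names or structures are the real ones; the only
thing taken from reality is the pair of filenode numbers in the log
above): the relcache message for t sits in the queue ahead of the
relmapper message, so an eager rebuild of t's entry reads pg_class
through the old mapping.

/*
 * Schematic mock-up of the ordering hazard, NOT PostgreSQL code.
 * The queue holds a relcache inval for table "t" followed by a relmapper
 * inval that moves pg_class from filenode 1281526 to 1281532.  Draining
 * the queue with eager rebuilds means the rebuild of t still sees the
 * old mapping.
 */
#include <stdio.h>

static unsigned pg_class_filenode = 1281526;    /* mapping as currently known */

typedef enum
{
    INVAL_RELCACHE_T,           /* "rebuild the entry for t" */
    INVAL_RELMAP_PG_CLASS       /* "pg_class moved to 1281532" */
} InvalKind;

static void
rebuild_entry_for_t(void)
{
    /* a rebuild has to scan pg_class, through the current mapping */
    printf("rebuilding t: reading pg_class at filenode %u%s\n",
           pg_class_filenode,
           pg_class_filenode == 1281532 ? "" : "  <-- stale, file already rewritten");
}

int
main(void)
{
    InvalKind   queue[] = {INVAL_RELCACHE_T, INVAL_RELMAP_PG_CLASS};

    for (int i = 0; i < 2; i++)
    {
        if (queue[i] == INVAL_RELCACHE_T)
            rebuild_entry_for_t();          /* eager rebuild: happens too early */
        else
            pg_class_filenode = 1281532;    /* relmapper update applied too late */
    }
    return 0;
}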


Greetings,

Andres Freund



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 19:56:58 -0700, Andres Freund wrote:
> Hi Everyone,
> 
> 
> Tom, I think this could use your eyes.
> 
> 
> On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> > I've a few leads that I'm currently testing out. One observation I think
> > might be crucial is that the problem, in Tomas' testcase with just
> > VACUUM FULL of pg_class and INSERTs into another rel, appears to only
> > happen if autovacuum is enabled. The crashes are always immediately
> > after autovacuum did an analyze of the relation.
> 
> I've run this with autovacuum disabled for quite a while, without the
> problem re-occuring. It does however re-occur with pgbench running an
> ANALYZE on the user defined table.
> 
> 
> > What I see on the apply side, I think, that invalidation records for
> > contents of pg_class and pg_class itself are intermixed.
> 
> I've not 100% figured this out yet, and I need to switch to another task
> for a bit, but what I think is happening here is that we have a
> relatively fundamental issue with how rewrites of catalog tables are
> handled.  A slightly instrumented version of postgres shows the
> following:
> 
> 2018-08-29 04:30:17.515 CEST [19551] LOG:  statement: INSERT INTO t(b,c) 
> SELECT i,i FROM generate_series(1,59) s(i);
> ...
> 2018-08-29 04:30:17.559 CEST [19554] LOG:  statement: vacuum full pg_class;
> 2018-08-29 04:30:17.571 CEST [19557] LOG:  swapping mapped 1259/1281520 w 
> 1281526/1281526
> 2018-08-29 04:30:17.571 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.572 CEST [19557] LOG:  reindexing 2662 to 1281529 on 
> 1259/1281526
> 2018-08-29 04:30:17.572 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.576 CEST [19557] LOG:  reindexing 2663 to 1281530 on 
> 1259/1281526
> 2018-08-29 04:30:17.576 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.578 CEST [19557] LOG:  reindexing 3455 to 1281531 on 
> 1259/1281526
> 2018-08-29 04:30:17.578 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  releasing lock on 1259
> 2018-08-29 04:30:17.582 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  duration: 87.389 ms
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  statement: vacuum full pg_class;
> 2018-08-29 04:30:17.583 CEST [19563] LOG:  duration: 630.481 ms
> 
> 2018-08-29 04:30:17.584 CEST [19563] LOG:  statement: analyze t;
> 2018-08-29 04:30:17.597 CEST [19555] LOG:  swapping mapped 1259/1281526 w 
> 1281532/1281532
> 2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.597 CEST [19555] LOG:  reindexing 2662 to 1281535 on 
> 1259/1281532
> 2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.599 CEST [19555] LOG:  reindexing 2663 to 1281536 on 
> 1259/1281532
> 2018-08-29 04:30:17.599 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.601 CEST [19555] LOG:  reindexing 3455 to 1281537 on 
> 1259/1281532
> 2018-08-29 04:30:17.601 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.605 CEST [19555] LOG:  releasing lock on 1259
> 2018-08-29 04:30:17.605 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.605 CEST [19551] WARNING:  could not read block 3 in file 
> "base/16384/1281529": read only 0 of 8192 bytes
> (it's a warning because the process SIGSTOPs itself afterwards, for
> easier debugging)
> 
> So what we can see is that a first VACUUM FULL of pg_class builds a new
> pg_class_oid_index/2662 index under filefilenode 1281526.  Then a
> *second* VACUUM FULL again reindexes 2663, this time to 1281532.
> 
> But *after* the VACUUM FULL has finished and released the lock (that's
> printed just before the UnlockRelationIdForSession in vacuum_rel()), we
> can see that the INSERT from pid 19551 still accesses the OLD index,
> from *before* the second VACUUM FULL.  Which at that point doesn't exist
> anymore.
> 
> I think the reason for that is that an invalidation for the relmapper
> changes due to the second VACUUM has been queued, but currently the
> invalidation processing hasn't yet gotten to that entry yet. Instead
> currently the entry for the user defined table is being rebuilt, which
> then tries to access pg_class, but under its "old" relfilenode. Which
> then fails.  The fact that inex scans trigger this problem isn't
> relevant - the heap entry for pg_class also refers to the old entry, as
> I've verified using gdb.
> 
> 
> I'm unclear on how can fix this, as long as we rebuild cache entries
> during invalidation, rather than during access. For most cache entries
> it's not a problem to rebuild them relying on other outdated content,
> we'll just rebuild them again later.  But for things like the relation
> mapper that's not true, unfortunately.

Fwiw, if I use gdb to issue InvalidateSystemCaches(),
RelationInitPhysicalAddr(sysscan->irel) makes the relcache entry have
the correct relfilenode, and restarting th

Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Tom Lane
Andres Freund  writes:
> Tom, I think this could use your eyes.

I've had no luck reproducing it locally ... do you have a recipe
for that?

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 23:18:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Tom, I think this could use your eyes.
> 
> I've had no luck reproducing it locally ... do you have a recipe
> for that?

It can be reproduced reliably with the three scripts attached:

psql -c' drop table if exists t; create table t (a serial primary key, b int, c 
int);' && pgbench -n -c 4 -T300 -f /tmp/insert.sql
pgbench -n -c 4 -T300 -f /tmp/vacuum.sql
pgbench -n -c 4 -T300 -f /tmp/analyze.sql

Locally that triggers the problem within usually a few seconds.

Greetings,

Andres Freund


insert.sql
Description: application/sql


vacuum.sql
Description: application/sql


analyze.sql
Description: application/sql


Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> Locally that triggers the problem within usually a few seconds.

FWIW, it does so including versions as old as 9.2.

Now I need to look for power for my laptop and some for me ;)



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
>> Locally that triggers the problem within usually a few seconds.

> FWIW, it does so including versions as old as 9.2.

Interesting.  One thing I'd like to know is why this only started
showing up in the buildfarm a few weeks ago.  Maybe that's just a
timing effect, but it seems odd.

> Now I need to look for power for my laptop and some for me ;)

Yeah, I'm not going to look harder till tomorrow either.

regards, tom lane



Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

2018-08-28 Thread Andres Freund
On 2018-08-28 23:32:51 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> >> Locally that triggers the problem within usually a few seconds.
> 
> > FWIW, it does so including versions as old as 9.2.

9.0 as well, so it's definitely not some recently backpatched change.


> Interesting.  One thing I'd like to know is why this only started
> showing up in the buildfarm a few weeks ago.  Maybe that's just a
> timing effect, but it seems odd.

I suspect it's largely timing and invalidation traffic.  It's been
really hard to reproduce with core regression tests after all.  If my
theory is right - and I'm getting more and more certain - that we're
trying to access a remapped relation during invalidation, before reading
the relevant relmapper invalidation, that also makes sense: You need to
be unlucky enough that there's a relmapper invalidation *after* an
invalidation for a currently "open" relation (otherwise we'd not do
inval processing). Most of the time we'll likely just be "too fast" to
process all the invalidations (thereby not getting to the point that
pg_class has been remapped later in the queue).

I think it might just be related to some modified tests changing when
exactly autovacuum is triggered, which then in turn triggers the
invalidation.

Unfortunately the only "real" fix I can think of is to change the
relcache invalidation logic to only ever mark entries as 'invalid', and
not do any eager rebuilding of the entries.  That might be a good idea
for performance reasons anyway, but it is quite scary to imagine doing in a
minor release.
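
Roughly, the shape I mean is the boring lazy pattern sketched below (a
generic toy, nothing to do with the actual relcache data structures;
CacheEntry, invalidate and cache_get are invented names): invalidation
only flips a flag, and the rebuild happens on the next access, after the
whole inval queue has been drained.

/*
 * Generic sketch of "mark invalid, rebuild on next access" -- a toy
 * cache entry, not the relcache.  The point is only that invalidation
 * does no work beyond flipping a flag.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct CacheEntry
{
    bool        valid;
    int         payload;        /* stands in for the rebuilt contents */
} CacheEntry;

static int
expensive_rebuild(void)
{
    /* in the real case this is where other, possibly newer, state is read */
    return 42;
}

static void
invalidate(CacheEntry *e)
{
    e->valid = false;           /* cheap and safe at any point in the queue */
}

static int
cache_get(CacheEntry *e)
{
    if (!e->valid)
    {
        e->payload = expensive_rebuild();   /* deferred until actually needed */
        e->valid = true;
    }
    return e->payload;
}

int
main(void)
{
    CacheEntry  e = {false, 0};

    invalidate(&e);                 /* processing the inval queue */
    printf("%d\n", cache_get(&e));  /* rebuild happens here, afterwards */
    return 0;
}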

Greetings,

Andres Freund



Re: Refactor textToQualifiedNameList()

2018-08-28 Thread Yugo Nagata
On Tue, 28 Aug 2018 11:49:26 +0900 (Tokyo Standard Time)
Kyotaro HORIGUCHI  wrote:

> At Fri, 24 Aug 2018 20:44:12 +0900, Yugo Nagata  wrote 
> in <20180824204412.150979ae6b283ddb639f9...@sraoss.co.jp>
> > When working on other patch[1], I found there are almost same
> > functions, texttoQualifiedNameList() and stringToQualifiedNameList().
> > The only difference is the argument type, text or char*. I don't know
> > why these functions are defined seperately, but I think the former 
> > function can be rewritten using the latter code as the attached patch.
> > Is this reasonable fix?
> 
> The functions were introduced within a month for different
> objectives in March and April, 2002. I supppose that they are
> intentionally added as separate functions for simplicitly since
> the second one is apparent CnP'ed from the first one.
> 
> commit 5f4745adf4fb2a1f933b25d7a2bc72b39fa9edfd
> commit 52200befd04b9fa71da83231c808764867079226

Thank you for your comments. I also looked at the commit history.
stringToQNL was originally added after textToQNL as a static function,
but it is now public and referred to from other modules, too.  So, I
propose to move it from regproc.c to varlena.c and rewrite textToQNL
to call stringToQNL.  This is just for reducing the redundancy of the code.
Attached is the updated patch.
 
> Returning to the patch, the downside of it is that textToQNL
> makes an extra and unused copy of the parameter string. (It's a
> kind of bug that it is forgetting to free rawname.)

I also fixed textToQNL to free rawname.

Regards,

-- 
Yugo Nagata 
diff --git a/src/backend/utils/adt/regproc.c b/src/backend/utils/adt/regproc.c
index a0079821fe..ebc087ed3f 100644
--- a/src/backend/utils/adt/regproc.c
+++ b/src/backend/utils/adt/regproc.c
@@ -1680,43 +1680,6 @@ text_regclass(PG_FUNCTION_ARGS)
 }
 
 
-/*
- * Given a C string, parse it into a qualified-name list.
- */
-List *
-stringToQualifiedNameList(const char *string)
-{
-	char	   *rawname;
-	List	   *result = NIL;
-	List	   *namelist;
-	ListCell   *l;
-
-	/* We need a modifiable copy of the input string. */
-	rawname = pstrdup(string);
-
-	if (!SplitIdentifierString(rawname, '.', &namelist))
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_NAME),
- errmsg("invalid name syntax")));
-
-	if (namelist == NIL)
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_NAME),
- errmsg("invalid name syntax")));
-
-	foreach(l, namelist)
-	{
-		char	   *curname = (char *) lfirst(l);
-
-		result = lappend(result, makeString(pstrdup(curname)));
-	}
-
-	pfree(rawname);
-	list_free(namelist);
-
-	return result;
-}
-
 /*
  *	 SUPPORT ROUTINES		 *
  */
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index a5e812d026..1846ddcae8 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -3219,26 +3219,19 @@ name_text(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(NameStr(*s)));
 }
 
-
 /*
- * textToQualifiedNameList - convert a text object to list of names
- *
- * This implements the input parsing needed by nextval() and other
- * functions that take a text parameter representing a qualified name.
- * We split the name at dots, downcase if not double-quoted, and
- * truncate names if they're too long.
+ * Given a C string, parse it into a qualified-name list.
  */
 List *
-textToQualifiedNameList(text *textval)
+stringToQualifiedNameList(const char *string)
 {
 	char	   *rawname;
 	List	   *result = NIL;
 	List	   *namelist;
 	ListCell   *l;
 
-	/* Convert to C string (handles possible detoasting). */
-	/* Note we rely on being able to modify rawname below. */
-	rawname = text_to_cstring(textval);
+	/* We need a modifiable copy of the input string. */
+	rawname = pstrdup(string);
 
 	if (!SplitIdentifierString(rawname, '.', &namelist))
 		ereport(ERROR,
@@ -3263,6 +3256,31 @@ textToQualifiedNameList(text *textval)
 	return result;
 }
 
+
+/*
+ * textToQualifiedNameList - convert a text object to list of names
+ *
+ * This implements the input parsing needed by nextval() and other
+ * functions that take a text parameter representing a qualified name.
+ * We split the name at dots, downcase if not double-quoted, and
+ * truncate names if they're too long.
+ */
+List *
+textToQualifiedNameList(text *textval)
+{
+	char	   *rawname;
+	List	   *result = NIL;
+
+	/* Convert to C string (handles possible detoasting). */
+	/* Note we rely on being able to modify rawname below. */
+	rawname = text_to_cstring(textval);
+
+	result = stringToQualifiedNameList(rawname);
+	pfree(rawname);
+
+	return result;
+}
+
 /*
  * SplitIdentifierString --- parse a string containing identifiers
  *
diff --git a/src/include/utils/regproc.h b/src/include/utils/regproc.h
index 5b9a8cbee8..ace74ed49f 100644
--- a/src/include/utils/regproc.h
+++ b/src/include/utils/regproc.h

Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-08-28 Thread Peter Eisentraut
Do you plan to submit this patch to the upcoming commit fest perhaps?  I
have done some testing on it and it seems worth pursuing further.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services