Re: POC: Cleaning up orphaned files using undo logs

2019-07-23 Thread Rushabh Lathia
Hi,

I have started reviewing
0008-Provide-interfaces-to-store-and-fetch-undo-records.patch; here are a few
quick comments.

1) README.undointerface should provide more information like API details or
the sequence in which API should get called.

2) Information about the APIs in the undoaccess.c file header block would be
good.  For reference, please look at heapam.c.

3) typo

+ * Later, during insert phase we will write actual records into thse
buffers.
+ */

%s/thse/these

4) The UndoRecordUpdateTransInfo() comments say that it must be called inside
a critical section, but undo_xlog_apply_progress() seems to call it outside of
one.  If that is an intentional exception, it should be documented in a
comment -- or am I missing something?


5) In function UndoBlockGetFirstUndoRecord() below code:

/* Calculate the size of the partial record. */
partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
   phdr->tuple_len + phdr->payload_len -
   phdr->record_offset;

can directly use UndoPagePartialRecSize().

6)

+static int
+UndoGetBufferSlot(UndoRecordInsertContext *context,
+  RelFileNode rnode,
+  BlockNumber blk,
+  ReadBufferMode rbm)
+{
+int i;

In the above code the variable "i" means "block index".  It would be good
to give the variable a more meaningful name, maybe "blockIndex"?

7)

* We will also keep a previous undo record pointer to the first and last
undo
 * record of the transaction in the previous log.  The last undo record
 * location is used find the previous undo record pointer during rollback.


%s/used find/used to find

8)

/*
 * Defines the number of times we try to wait for rollback hash table to get
 * initialized.  After these many attempts it will return error and the user
 * can retry the operation.
 */
#define ROLLBACK_HT_INIT_WAIT_TRY  60

%s/error/an error

9)

 * we can get the exact size of partial record in this page.
 */

%s/of partial/of the partial

10)

* urecptr - current transaction's undo record pointer which need to be set
in
* the previous transaction's header.

%s/need/needs

11)

/*
 * If we are writing first undo record for the page the we can set the
 * compression so that subsequent records from the same transaction can
 * avoid including common information in the undo records.
 */


%s/the page the/the page then

12)

/*
 * If the transaction's undo records are split across the undo logs.  So
 * we need to  update our own transaction header in the previous log.
 */

double space between "to" and "update"

13)

* The undo record should be freed by the caller by calling
ReleaseUndoRecord.
 * This function will old the pin on the buffer where we read the previous
undo
 * record so that when this function is called repeatedly with the same
context

%s/old/hold

I will continue further review of the same patch.

Regards,
-- 
Rushabh Lathia
www.EntepriseDB.com


Re: On the stability of TAP tests for LDAP

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 5:26 PM Michael Paquier  wrote:
> > Does this look about right?
>
> Some comments from here.  I have not tested the patch.
>
> I would recommend using TestLib::system_log instead of plain system().
> The command should be a list of arguments with one element per
> argument (see the call of system_log in PostgresNode.pm for example).  The
> indentation is incorrect, and I would make the retry longer, as I
> got the feeling that on slow machines we could still have issues.  We
> also usually tend to increase the timeout up to 5 minutes, and the
> sleep phases make use of Time::HiRes::usleep.

Thanks, here's v2.


--
Thomas Munro
https://enterprisedb.com


wait-for-slapd-v2.patch
Description: Binary data


Re: On the stability of TAP tests for LDAP

2019-07-23 Thread Michael Paquier
On Wed, Jul 24, 2019 at 04:41:05PM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 3:52 PM Thomas Munro  wrote:
> > I guess we should do that too.  I don't know how to write Perl but I'll 
> > try...
> 
> Does this look about right?

Some comments from here.  I have not tested the patch.

I would recommend using TestLib::system_log instead of plain system().
The command should be a list of arguments with one element per
argument (see the call of system_log in PostgresNode.pm for example).  The
indentation is incorrect, and I would make the retry longer, as I
got the feeling that on slow machines we could still have issues.  We
also usually tend to increase the timeout up to 5 minutes, and the
sleep phases make use of Time::HiRes::usleep.
--
Michael


signature.asc
Description: PGP signature


Re: Change atoi to strtol in same place

2019-07-23 Thread Michael Paquier
On Wed, Jul 24, 2019 at 04:57:42PM +1200, David Rowley wrote:
> I'd like to put my vote not to add this complex code to each option
> validation that requires an integer number. I'm not sure there
> currently is a home for it, but if there was, wouldn't it be better
> writing a function that takes a lower and upper bound and sets some
> output param with the value and returns a bool to indicate if it's
> within range or not?

Perhaps.  When I see this patch calling strtol basically only with 10
as the base, it reminds me of Fabien Coelho's patch to refactor all the
strtoint routines we have in the code:
https://commitfest.postgresql.org/23/2099/

The conclusion we are reaching on that thread is to remove more of the
dependencies on strtol that we have in the code, and replace them with
our own, more performant wrappers.  This makes me wonder whether we had
better wait before doing this move.
--
Michael


signature.asc
Description: PGP signature


Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane  wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection.  We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the
history.  I forget whether Dunstan or Frost is the person to ask
for access, but there's no reason you shouldn't have it.

> Do you have an example to hand?  Is this
> failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those.  To wit, I find that
string in these recent failures (checked all runs in the past 3 months):

  sysname  |    branch     |      snapshot
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability
problems, and these other two critters run different compilers
on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got
some lurking generic problem.  This pattern looks way more
like a platform-specific issue.  Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.

regards, tom lane




Re: range_agg

2019-07-23 Thread Paul A Jungwirth
On Tue, Jul 23, 2019 at 3:32 PM Alvaro Herrera  wrote:
> Just checking if you've had a chance to make progress on this.

Not a lot. :-) But I should have more time for it the next few weeks
than I did the last few. I do have some code for creating concrete
multirange types (used when you create a concrete range type) and
filling in a TypeCacheEntry based on the range type oid---which I know
is all very modest progress. I've been working on a multirange_in
function and mostly just learning about Postgres varlena and TOASTed
objects by reading the code for range_in & array_in.

Here is something from my multirangetypes.h:

/*
 * Multiranges are varlena objects, so must meet the varlena convention that
 * the first int32 of the object contains the total object size in bytes.
 * Be sure to use VARSIZE() and SET_VARSIZE() to access it, though!
 */
typedef struct
{
    int32       vl_len_;            /* varlena header (do not touch directly!) */
    Oid         multirangetypid;    /* multirange type's own OID */

    /*
     * Following the OID are the range objects themselves.  Note that ranges
     * are varlena too, depending on whether they have lower/upper bounds
     * and because even their base types can be varlena.  So we can't really
     * index into this list.
     */
} MultirangeType;

I'm working on parsing a multirange much like we parse an array,
although it's a lot simpler because it's a single dimension and there
are no nulls.

I know that's not much to go on, but let me know if any of it worries you. :-)

Paul




Re: Change atoi to strtol in same place

2019-07-23 Thread David Rowley
On Wed, 24 Jul 2019 at 16:02, Joe Nelson  wrote:
> > In general, I think it's a good idea to fix those places, but I wonder
> > if we need to change the error messages too.
>
> I'll leave that decision for the community to debate. I did, however,
> remove the newlines for the new error messages being passed to
> pg_log_error().

I'd like to put my vote not to add this complex code to each option
validation that requires an integer number. I'm not sure there
currently is a home for it, but if there was, wouldn't it be better
writing a function that takes a lower and upper bound and sets some
output param with the value and returns a bool to indicate if it's
within range or not?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: On the stability of TAP tests for LDAP

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 3:52 PM Thomas Munro  wrote:
> I guess we should do that too.  I don't know how to write Perl but I'll try...

Does this look about right?

-- 
Thomas Munro
https://enterprisedb.com


wait-for-slapd.patch
Description: Binary data


Re: POC: Cleaning up orphaned files using undo logs

2019-07-23 Thread Dilip Kumar
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar  wrote:
>
Please find my review comments for
0013-Allow-foreground-transactions-to-perform-undo-action

+ /* initialize undo record locations for the transaction */
+ for (i = 0; i < UndoLogCategories; i++)
+ {
+ s->start_urec_ptr[i] = InvalidUndoRecPtr;
+ s->latest_urec_ptr[i] = InvalidUndoRecPtr;
+ s->undo_req_pushed[i] = false;
+ }

Can't we just memset this memory?
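
Something along these lines, say (just a sketch; it assumes InvalidUndoRecPtr
is zero, so a plain memset leaves the arrays in the same state as the loop):

/* sketch of the memset alternative; assumes InvalidUndoRecPtr == 0 */
memset(s->start_urec_ptr, 0, sizeof(s->start_urec_ptr));
memset(s->latest_urec_ptr, 0, sizeof(s->latest_urec_ptr));
memset(s->undo_req_pushed, 0, sizeof(s->undo_req_pushed));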


+ * We can't postpone applying undo actions for subtransactions as the
+ * modifications made by aborted subtransaction must not be visible even if
+ * the main transaction commits.
+ */
+ if (IsSubTransaction())
+ return;

I am not completely sure, but is it possible for the outer functions
CommitTransactionCommand/AbortCurrentTransaction to avoid
calling this function in the switch case, based on the current state,
so that under a subtransaction it is never called?


+ /*
+ * Prepare required undo request info so that it can be used in
+ * exception.
+ */
+ ResetUndoRequestInfo();
+ urinfo.dbid = dbid;
+ urinfo.full_xid = fxid;
+ urinfo.start_urec_ptr = start_urec_ptr[per_level];
+

I see that we are preparing urinfo before execute_undo_actions so that
in case of an error we can use it in the CATCH block to insert into the
queue.  But can't we just initialize urinfo right there, before inserting
into the queue?  We have all the information at that point.
Am I missing something?

+
+ /*
+ * We need the locations of the start and end undo record pointers when
+ * rollbacks are to be performed for prepared transactions using undo-based
+ * relations.  We need to store this information in the file as the user
+ * might rollback the prepared transaction after recovery and for that we
+ * need it's start and end undo locations.
+ */
+ UndoRecPtr start_urec_ptr[UndoLogCategories];
+ UndoRecPtr end_urec_ptr[UndoLogCategories];

it's -> its

+ bool undo_req_pushed[UndoLogCategories]; /* undo request pushed
+ * to worker? */
+ bool performUndoActions;
+
  struct TransactionStateData *parent; /* back link to parent */

We should have some comments explaining how performUndoActions is used
and where it is set.  If it is explained somewhere else, then we can
add a reference to that code.

+ for (i = 0; i < UndoLogCategories; i++)
+ {
+ if (s->latest_urec_ptr[i])
+ {
+ s->performUndoActions = true;
+ break;
+ }
+ }

I think we should check UndoRecPtrIsValid(s->latest_urec_ptr[i])
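
i.e. something like this (just restating the loop above as a sketch):

for (i = 0; i < UndoLogCategories; i++)
{
    if (UndoRecPtrIsValid(s->latest_urec_ptr[i]))
    {
        s->performUndoActions = true;
        break;
    }
}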

+ PG_TRY();
+ {
+ /*
+ * Prepare required undo request info so that it can be used in
+ * exception.
+ */
+ ResetUndoRequestInfo();
+ urinfo.dbid = dbid;
+ urinfo.full_xid = fxid;
+ urinfo.start_urec_ptr = start_urec_ptr[per_level];
+
+ /* for subtransactions, we do partial rollback. */
+ execute_undo_actions(urinfo.full_xid,
+ end_urec_ptr[per_level],
+ start_urec_ptr[per_level],
+ !isSubTrans);
+ }
+ PG_CATCH();

Wouldn't it be good to explain in a comment why we are not rethrowing
the error in PG_CATCH: because we don't want the main transaction to
get an error if applying the undo actions for it fails, and the
transaction will instead be aborted in the caller of this function?

+tables are only accessible in the backend that has created them.  We can't
+postpone applying undo actions for subtransactions as the modifications
+made by aborted subtransaction must not be visible even if the main transaction
+commits.

I think we need to give detailed reasoning for why the subtransaction's
changes will be visible if we don't apply its undo and the main
transaction commits, by mentioning that we don't use a separate
transaction id for the subtransaction, so all changes made under that
transaction id become visible when it commits.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread David Rowley
On Wed, 24 Jul 2019 at 15:05, David Rowley  wrote:
> To be able to reduce the threshold down again we'd need to make a
> hash_get_num_entries(LockMethodLocalHash) call before performing the
> guts of LockReleaseAll(). We could then weight that onto some running
> average counter with a weight of, say... 10, so we react to changes
> fairly quickly, but not instantly. We could then have some sort of
> logic like "rebuild the hash table if running average 4 times less
> than max_bucket"
>
> I've attached a spreadsheet of that idea and the algorithm we could
> use to track the running average.  Initially, I've mocked it up a
> series of transactions that use 1000 locks, then at row 123 dropped
> that to 10 locks. If we assume the max_bucket is 1000, then it takes
> until row 136 for the running average to drop below the max_bucket
> count, i.e 13 xacts. There we'd reset there at the init size of 16. If
> the average went up again, then we'd automatically expand the table as
> we do now.  To make this work we'd need an additional call to
> hash_get_num_entries(), before we release the locks, so there is more
> overhead.

Here's a patch with this implemented. I've left a NOTICE in there to
make it easier for people to follow along at home and see when the
lock table is reset.

There will be a bit of additional overhead to the reset detection
logic over the v7 patch. Namely: additional hash_get_num_entries()
call before releasing the locks, and more complex and floating-point
maths instead of more simple integer maths in v7.
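
To make that a bit more concrete, the reset check amounts to something like
this (a rough sketch only; the names and exact constants here are made up,
not lifted from the patch):

/*
 * Sketch of the reset heuristic: keep an exponentially weighted running
 * average of locks released per transaction and rebuild the local lock
 * hash table when its max bucket is well above what we appear to need.
 */
#define RUNNING_AVG_WEIGHT	10.0	/* react to changes fairly quickly */
#define BUCKET_RESET_RATIO	4.0		/* "4 times less than max_bucket" */

static double running_avg_locks = 0.0;

static bool
LockTableNeedsReset(long nlocks_this_xact, long max_bucket)
{
	/* weighted running average, as in the attached spreadsheet */
	running_avg_locks +=
		(nlocks_this_xact - running_avg_locks) / RUNNING_AVG_WEIGHT;

	return max_bucket > running_avg_locks * BUCKET_RESET_RATIO;
}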

Here's a demo with the debug NOTICE in there to show us what's going on.

-- setup
create table a (a int) partition by range (a);
select 'create table a'||x||' partition of a for values from('||x||')
to ('||x+1||');' from generate_Series(1,1000) x;
\gexec

$ psql postgres
NOTICE:  max_bucket = 15, threshold = 64.00, running_avg_locks 0.10 Reset? No
psql (13devel)
# \o /dev/null
# select * from a where a < 100;
NOTICE:  max_bucket = 101, threshold = 64.00, running_avg_locks 10.09 Reset? Yes
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 76.324005, running_avg_locks 19.081001 Reset? Yes
# select * from a where a < 100;

A couple of needless resets there... Maybe we can get rid of those by
setting the initial running average up to something higher than 0.0.

NOTICE:  max_bucket = 99, threshold = 108.691605, running_avg_locks 27.172901 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 137.822449, running_avg_locks 34.455612 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 164.040207, running_avg_locks 41.010052 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 187.636185, running_avg_locks 46.909046 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 208.872559, running_avg_locks 52.218140 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 227.985306, running_avg_locks 56.996326 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 245.186768, running_avg_locks 61.296692 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 260.668091, running_avg_locks 65.167023 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 274.601288, running_avg_locks 68.650322 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 287.141174, running_avg_locks 71.785294 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 298.427063, running_avg_locks 74.606766 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 308.584351, running_avg_locks 77.146088 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 317.725922, running_avg_locks 79.431480 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 325.953339, running_avg_locks 81.488335 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 333.358002, running_avg_locks 83.339500 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 340.022217, running_avg_locks 85.005554 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 346.019989, running_avg_locks 86.504997 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 351.417999, running_avg_locks 87.854500 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 356.276184, running_avg_locks 89.069046 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 360.648560, running_avg_locks 90.162140 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 364.583710, running_avg_locks 91.145927 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 368.125336, running_avg_locks 92.031334 Reset? No
# select * from a 

Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-23 Thread Ian Barwick

On 7/23/19 5:10 PM, Michael Paquier wrote:

On Mon, Jul 22, 2019 at 12:58:40PM -0400, Alvaro Herrera wrote:

Maybe it's just me, but it seems weird to try to forestall a problem
that cannot occur by definition.  I would rather remove the escaping,
and add a one-line comment explaining why we don't do it?


No objections with doing that either, really.  Perhaps you would
prefer pushing a patch among those lines by yourself?

One argument that I got in mind to justify the escaping would be if we
add a new feature in pg_basebackup to write a new set of recovery
options on an existing data folder, which does not require an option.
In this case, if the escaping does not exist, starting the server
would fail with a confusing parsing error if a quote is added to the
slot name.  But if the escaping is done, then we would get a correct
error that the replication slot value includes an incorrect character.
If such an hypothetical option is added, most likely this would be
noticed anyway, so that's mainly nannyism from my side.


It'd be better if such a hypothetical option validated the provided
slot name anyway, to prevent later surprises.

Revised patch attached, which as Alvaro suggests removes the escaping
and adds a comment explaining why the raw value can be passed as-is.


Regards

Ian Barwick


--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
new file mode 100644
index 77a7c14..9c910cb
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
*** GenerateRecoveryConf(PGconn *conn)
*** 1716,1724 
  
  	if (replication_slot)
  	{
! 		escaped = escape_quotes(replication_slot);
  		appendPQExpBuffer(recoveryconfcontents, "primary_slot_name = '%s'\n", replication_slot);
- 		free(escaped);
  	}
  
  	if (PQExpBufferBroken(recoveryconfcontents) ||
--- 1716,1727 
  
  	if (replication_slot)
  	{
! 		/*
! 		 * The slot name does not need escaping as it can only consist of [a-zA-Z0-9_].
! 		 * The validity of the name has already been proven, as a slot must have been
! 		 * successfully created with that name to reach this point.
! 		 */
  		appendPQExpBuffer(recoveryconfcontents, "primary_slot_name = '%s'\n", replication_slot);
  	}
  
  	if (PQExpBufferBroken(recoveryconfcontents) ||


Re: block-level incremental backup

2019-07-23 Thread vignesh C
Thanks Jeevan.

1) If a relation file has changed due to truncate or vacuum,
the new files will be copied during the incremental backup.
There is a chance that both the old file and the new file
will be present. I'm not sure whether cleaning up the
old file is handled.
2) Just a small thought on building the bitmap:
could the bitmap be built and maintained as
and when the changes are happening in the system?
If we build the bitmap while doing the incremental backup,
scanning through each file might take more time.
This could be a configurable parameter: by default the system can run
without capturing this information, but users who take incremental
backups frequently could enable this configuration so that the
modified blocks are tracked.

What is your thought on this?
-- 
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


On Tue, Jul 23, 2019 at 11:19 PM Jeevan Ladhe
 wrote:
>
> Hi Vignesh,
>
> This backup technology is extending the pg_basebackup itself, which means we 
> can
> still take online backups. This is internally done using pg_start_backup and
> pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is
> used in the recovery process while starting the cluster from a backup image. 
> What
> incremental backup will just modify (as compared to traditional pg_basebackup)
> is - After doing the checkpoint, instead of copying the entire relation files,
> it takes an input LSN and scan all the blocks in all relation files, and store
> the blocks having LSN >= InputLSN. This means it considers all the changes
> that are already written into relation files including insert/update/delete 
> etc
> up to the checkpoint performed by pg_start_backup internally, and as Jeevan 
> Chalke
> mentioned upthread the incremental backup will also contain copy of WAL files.
> Once this incremental backup is combined with the parent backup by means of 
> new
> combine process (that will be introduced as part of this feature itself) 
> should
> ideally look like a full pg_basebackup. Note that any changes done by these
> insert/delete/update operations while the incremental backup was being taken
> will be still available via WAL files and as normal restore process, will be
> replayed from the checkpoint onwards up to a consistent point.
>
> My two cents!
>
> Regards,
> Jeevan Ladhe
>
> On Sat, Jul 20, 2019 at 11:22 PM vignesh C  wrote:
>>
>> Hi Jeevan,
>>
>> The idea is very nice.
>> When Insert/update/delete and truncate/drop happens at various
>> combinations, How the incremental backup handles the copying of the
>> blocks?
>>
>>
>> On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
>>  wrote:
>> >
>> >
>> >
>> > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed  wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke 
>> >>  wrote:
>> >>>
>> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  
>> >>> wrote:
>> 
>> 
>>  At what stage you will apply the WAL generated in between the 
>>  START/STOP backup.
>> >>>
>> >>>
>> >>> In this design, we are not touching any WAL related code. The WAL files 
>> >>> will
>> >>> get copied with each backup either full or incremental. And thus, the 
>> >>> last
>> >>> incremental backup will have the final WAL files which will be copied 
>> >>> as-is
>> >>> in the combined full-backup and they will get apply automatically if that
>> >>> the data directory is used to start the server.
>> >>
>> >>
>> >> Ok, so you keep all the WAL files since the first backup, right?
>> >
>> >
>> > The WAL files will anyway be copied while taking a backup (full or 
>> > incremental),
>> > but only last incremental backup's WAL files are copied to the combined
>> > synthetic full backup.
>> >
>> >>>
>> 
>>  --
>>  Ibrar Ahmed
>> >>>
>> >>>
>> >>> --
>> >>> Jeevan Chalke
>> >>> Technical Architect, Product Development
>> >>> EnterpriseDB Corporation
>> >>>
>> >>
>> >>
>> >> --
>> >> Ibrar Ahmed
>> >
>> >
>> >
>> > --
>> > Jeevan Chalke
>> > Technical Architect, Product Development
>> > EnterpriseDB Corporation
>> >
>>
>>
>> --
>> Regards,
>> vignesh
>>
>>
>>




Re: Change atoi to strtol in same place

2019-07-23 Thread Joe Nelson
> Surafel Temesgen wrote:
> > we use atoi for user argument processing in some places, which returns
> > zero for both invalid input and an input value of zero. [...] in
> > places where we accept zero as a valid input, this causes a problem by
> > treating invalid input as an input value of zero.  strtol can
> > handle invalid input

Not only that, but atoi causes Undefined Behavior on erroneous input.
The C99 standard says this:

7.20.1 Numeric conversion functions
  The functions atof, atoi, atol, and atoll need not affect the value of the
  integer expression errno on an error. If the value of the result cannot be
  represented, the behavior is undefined.
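
For reference, the replacement idiom the patch applies everywhere boils down
to this (sketched here outside of any particular option; it needs <stdlib.h>,
<errno.h>, <limits.h> and <stdio.h>):

	char	   *endptr;
	long		val;
	int			option_value;

	errno = 0;
	val = strtol(optarg, &endptr, 10);
	if (endptr == optarg || *endptr != '\0' ||
		errno == ERANGE || val < 0 || val > INT_MAX)
	{
		fprintf(stderr, "value must be in range 0..%d\n", INT_MAX);
		exit(1);
	}
	option_value = (int) val;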

Tomas Vondra wrote:
> This seems to have bit-rotted (due to minor changes to pg_basebackup).
> Can you fix that and post an updated version?

I adjusted the patch to apply cleanly on a0555ddab9.

> In general, I think it's a good idea to fix those places, but I wonder
> if we need to change the error messages too.

I'll leave that decision for the community to debate. I did, however,
remove the newlines for the new error messages being passed to
pg_log_error(). 

As discussed in message [0], the logging functions in common/logging.c
now contain an assertion that messages do not end in newline:

  Assert(fmt[strlen(fmt) - 1] != '\n');

(in pg_log_error via pg_log_generic via pg_log_generic_v)

I also added limits.h in some places where it was missing, so the patch
would build.

0: https://postgr.es/m/6a609b43-4f57-7348-6480-bd022f924...@2ndquadrant.com
diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index 23f706b21d..2bcacbfbb5 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -638,6 +639,14 @@ sigquit_handler(int sig)
 int
 main(int argc, char **argv)
 {
+	char	   *keepfilesendptr;
+	char	   *maxretriesendptr;
+	char	   *sleeptimeendptr;
+	char	   *maxwaittimeendptr;
+	long		numkeepfiles;
+	long		nummaxretries;
+	long		numsleeptime;
+	long		nummaxwaittime;
 	int			c;
 
 	progname = get_progname(argv[0]);
@@ -688,12 +697,17 @@ main(int argc, char **argv)
 debug = true;
 break;
 			case 'k':			/* keepfiles */
-keepfiles = atoi(optarg);
-if (keepfiles < 0)
+errno = 0;
+numkeepfiles = strtol(optarg, &keepfilesendptr, 10);
+
+if (keepfilesendptr == optarg || *keepfilesendptr != '\0' ||
+	numkeepfiles < 0 || numkeepfiles > INT_MAX ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -k keepfiles must be >= 0\n", progname);
+	fprintf(stderr, "%s: -k keepfiles must be in range %d..%d\n", progname, 0, INT_MAX);
 	exit(2);
 }
+keepfiles = (int) numkeepfiles;
 break;
 			case 'l':			/* Use link */
 
@@ -707,31 +721,46 @@ main(int argc, char **argv)
 #endif
 break;
 			case 'r':			/* Retries */
-maxretries = atoi(optarg);
-if (maxretries < 0)
+errno = 0;
+nummaxretries = strtol(optarg, &maxretriesendptr, 10);
+
+if (maxretriesendptr == optarg || *maxretriesendptr != '\0' ||
+	nummaxretries < 0 || nummaxretries > INT_MAX ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -r maxretries must be >= 0\n", progname);
+	fprintf(stderr, "%s: -r maxretries must be in range %d..%d\n", progname, 0, INT_MAX);
 	exit(2);
 }
+maxretries = (int) nummaxretries;
 break;
 			case 's':			/* Sleep time */
-sleeptime = atoi(optarg);
-if (sleeptime <= 0 || sleeptime > 60)
+errno = 0;
+numsleeptime = strtol(optarg, &sleeptimeendptr, 10);
+
+if (sleeptimeendptr == optarg || *sleeptimeendptr != '\0' ||
+	numsleeptime <= 0 || numsleeptime > 60 ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -s sleeptime incorrectly set\n", progname);
+	fprintf(stderr, "%s: -s sleeptime must be in range %d..%d\n", progname, 1, 60);
 	exit(2);
 }
+sleeptime = (int) numsleeptime;
 break;
 			case 't':			/* Trigger file */
 triggerPath = pg_strdup(optarg);
 break;
 			case 'w':			/* Max wait time */
-maxwaittime = atoi(optarg);
-if (maxwaittime < 0)
+errno = 0;
+nummaxwaittime = strtol(optarg, &maxwaittimeendptr, 10);
+
+if (maxwaittimeendptr == optarg || *maxwaittimeendptr != '\0' ||
+	nummaxwaittime < 0 || nummaxwaittime > INT_MAX ||
+	errno == ERANGE)
 {
-	fprintf(stderr, "%s: -w maxwaittime incorrectly set\n", progname);
+	fprintf(stderr, "%s: -w maxwaittime must be in range %d..%d\n", progname, 0, INT_MAX);
 	exit(2);
 }
+maxwaittime = (int) nummaxwaittime;
 break;
 			default:
 fprintf(stderr, "Try \"%s --help\" for more information.\n", progname);
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 77a7c148ba..6ed523fdd6 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2197,6 +2198,10 @@ main(int argc, char **argv)

Re: On the stability of TAP tests for LDAP

2019-07-23 Thread Thomas Munro
On Fri, Jul 19, 2019 at 3:30 PM Michael Paquier  wrote:
> # Running: /usr/sbin/slapd -f
> /home/bf/bfr/root/HEAD/pgsql.build/src/test/ldap/tmp_check/slapd.conf
>   -h ldap://localhost:55306 ldaps://localhost:55307
> # loading LDAP data
> # Running: ldapadd -x -y
> /home/bf/bfr/root/HEAD/pgsql.build/src/test/ldap/tmp_check/ldappassword
> -f authdata.ldif
> ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
>
> And FWIW, when running a parallel check-world to make my laptop busy
> enough, it is rather usual to face failures with this test, which is
> annoying.  Shouldn't we have at least a number of retries with
> intermediate sleeps for the commands run in 001_auth.pl?  As far as I
> recall, I think that we can run into failures when calling ldapadd and
> ldappasswd.

Yeah, it seems we need to figure out a way to wait for it to be ready
to accept connections.  I wondered how other people do this, and found
one example that polls for the .pid file:

https://github.com/tiredofit/docker-openldap/blob/master/install/etc/cont-init.d/10-openldap#L347

That looks nice and tidy but I'm not sure it can be trusted, given
that OpenLDAP's own tests poll a trivial ldapsearch (also based on a
cursory glance at slapd/main.c which appears to write the .pid file a
bit too early, though I may have misread):

https://github.com/openldap/openldap/blob/master/tests/scripts/test039-glue-ldap-concurrency#L59

I guess we should do that too.  I don't know how to write Perl but I'll try...

-- 
Thomas Munro
https://enterprisedb.com




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread David Rowley
Thanks for having a look at this.

On Wed, 24 Jul 2019 at 04:13, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'm pretty happy with v7 now. If anyone has any objections to it,
> > please speak up very soon.  Otherwise, I plan to push it about this
> > time tomorrow.
>
> I dunno, this seems close to useless in this form.  As it stands,
> once hash_get_max_bucket has exceeded the threshold, you will
> arbitrarily reset the table 1000 transactions later (since the
> max bucket is certainly not gonna decrease otherwise).  So that's
> got two big problems:
>
> 1. In the assumed-common case where most of the transactions take
> few locks, you wasted cycles for 999 transactions.
>
> 2. You'll reset the table independently of subsequent history,
> even if the session's usage is pretty much always over the
> threshold.  Admittedly, if you do this only once per 1K
> transactions, it's probably not a horrible overhead --- but you
> can't improve point 1 without making it a bigger overhead.

This is true, but I think you might be overestimating just how much
effort is wasted with #1. We're only seeing this overhead in small
very fast to execute xacts. In the test case in [1], I was getting
about 10k TPS unpatched and about 15k patched. This means that, on
average, 1 unpatched xact takes 100 microseconds and 1 average patched
xact takes 66 microseconds, so the additional time spent doing the
hash_seq_search() must be about 34 microseconds. So we'll waste a
total of 34 *milliseconds* if we wait for 1000 xacts before we reset
the table.  With 10k TPS we're going to react to the change in 0.1
seconds.

I think you'd struggle to measure that 1 xact is taking 34
microseconds longer without running a few thousand queries. My view is
that nobody is ever going to notice that it takes 1k xacts to shrink
the table, and I've already shown that the overhead of the shrink
every 1k xact is tiny. I mentioned 0.34% in [1] using v6. This is
likely a bit smaller in v7 due to swapping the order of the if
condition to put the less likely case first. Since the overhead of
rebuilding the table was 7% when done every xact, then it stands to
reason that this has become 0.007% doing it every 1k xacts and that the
additional overhead to make up that 0.34% is from testing if the reset
condition has been met (or noise).  That's not something we can remove
completely with any solution that resets the hash table.

> I did complain about the cost of tracking the stats proposed by
> some earlier patches, but I don't think the answer to that is to
> not track any stats at all.  The proposed hash_get_max_bucket()
> function is quite cheap, so I think it wouldn't be out of line to
> track the average value of that at transaction end over the
> session's lifespan, and reset if the current value is more than
> some-multiple of the average.
>
> The tricky part here is that if some xact kicks that up to
> say 64 entries, and we don't choose to reset, then the reading
> for subsequent transactions will be 64 even if they use very
> few locks.  So we probably need to not use a plain average,
> but account for that effect somehow.  Maybe we could look at
> how quickly the number goes up after we reset?
>
> [ thinks for awhile... ]  As a straw-man proposal, I suggest
> the following (completely untested) idea:
>
> * Make the table size threshold variable, not constant.
>
> * If, at end-of-transaction when the table is empty,
> the table bucket count exceeds the threshold, reset
> immediately; but if it's been less than K transactions
> since the last reset, increase the threshold (by say 10%).
>
> I think K can be a constant; somewhere between 10 and 100 would
> probably work.  At process start, we should act as though the last
> reset were more than K transactions ago (i.e., don't increase the
> threshold at the first reset).
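
If I've understood the straw-man correctly, it amounts to something like the
following sketch (the names and starting values are mine, not from any patch):

#define LOCK_TABLE_RESET_MIN_XACTS 50	/* "K", somewhere in 10..100 */

static long reset_threshold = 16;		/* variable, not constant */
static long xacts_since_reset = LOCK_TABLE_RESET_MIN_XACTS;	/* so the first
																 * reset doesn't
																 * raise it */

static void
MaybeResetLockTable(long bucket_count)
{
	xacts_since_reset++;
	if (bucket_count <= reset_threshold)
		return;

	/* resets coming too close together: raise the bar by about 10% */
	if (xacts_since_reset < LOCK_TABLE_RESET_MIN_XACTS)
		reset_threshold += reset_threshold / 10;

	/* ... destroy and recreate the local lock hash table here ... */
	xacts_since_reset = 0;
}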

I think the problem with this idea is that once the threshold has been
enlarged, there is no way to recover from that and work well again for
workloads that require very few locks. If we end up with some
large value for the variable threshold, there's no way to decrease
that again.  All this method stops is the needless hash table resets
if the typical case requires many locks. The only way to know if we
can reduce the threshold again is to count the locks released during
LockReleaseAll().  Some version of the patch did that, and you
objected.

> The main advantage this has over v7 is that we don't have the
> 1000-transaction delay before reset, which ISTM is giving up
> much of the benefit of the whole idea.  Also, if the session
> is consistently using lots of locks, we'll adapt to that after
> awhile and not do useless table resets.

True, but you neglected to mention the looming and critical drawback,
which pretty much makes that idea useless. All we'd need to do is give
this a workload that throws that variable threshold up high so that it
can't recover. It would be pretty simple then to show that
LockReleaseAll() is still slow with 

Re: pgbench tests vs Windows

2019-07-23 Thread Michael Paquier
On Tue, Jul 23, 2019 at 07:13:51PM -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
>> I propose to prepare a patch along these lines. Alternatively we could
>> just drop it - I don't think the test matters all that hugely.
> 
> I'd say try that, but if it doesn't work right away, just skip the
> test on Windows.

+1.  I don't see exactly why we should drop it either.
--
Michael


signature.asc
Description: PGP signature


Re: pg_receivewal documentation

2019-07-23 Thread Michael Paquier
On Tue, Jul 23, 2019 at 08:00:41AM -0400, Jesper Pedersen wrote:
> Sure.

Thanks.  Applied down to 9.6 where remote_apply has been introduced,
with tweaks for 9.6 as the tool is named pg_receivexlog there.
--
Michael


signature.asc
Description: PGP signature


pg_basebackup delays closing of stdout

2019-07-23 Thread Jeff Janes
Ever since pg_basebackup was created, it had a comment like this:

 * End of chunk. If requested, and this is the base tablespace
 * write configuration file into the tarfile. When done, close the
 * file (but not stdout).

But, why make the exception for output going to stdout?  If we are done
with it, why not close it?

After a massive maintenance operation, I want to re-seed a streaming
standby, which I start to do by:

pg_basebackup -D - -Ft -P -X none  | pxz > base.tar.xz

But the archiver is way behind, so when it finishes the basebackup part, I
get:

NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to
be archived
WARNING:  pg_stop_backup still waiting for all required WAL segments to be
archived (60 seconds elapsed)
...

The base backup file is not finalized, because pg_basebackup has not closed
its stdout while waiting for the WAL segment to be archived. The file is
incomplete due to data stuck in buffers, so I can't copy it to where I want
and bring up a new streaming replica (which bypasses the WAL archive, so
would otherwise work). Also, if pg_basebackup gets interrupted somehow while
it is waiting for WAL archiving, the backup will be invalid, as it won't
flush the last bit of data.  Of course if it gets interrupted, I would have
to test the backup to make sure it is valid.  But testing it and finding
that it is valid is better than testing it and finding that it is not.

I think it makes sense for pg_basebackup to wait for the WAL to be
archived, but there is no reason for it to hold the base.tar.xz file
hostage while it does so.

If I simply remove the test for strcmp(basedir, "-"), as in the attached, I
get the behavior I desire, and nothing bad seems to happen.  Meaning, "make
check -C src/bin/pg_basebackup/" still passes (but only tested on Linux).

Is there a reason not to do this?

Cheers,

Jeff


pg_basebackup_close_stdout.patch
Description: Binary data


Re: Race conditions with TAP test for syncrep

2019-07-23 Thread Michael Paquier
On Tue, Jul 23, 2019 at 05:04:32PM +0900, Michael Paquier wrote:
> Thanks Noah for the review.  I have reviewed the thread and improved a
> couple of comments based on Alvaro's previous input.  Attached is v2.
> If there are no objections, I would be fine to commit it.

Applied and back-patched down to 9.6 where it applies.  Thanks for the
reviews.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-23 Thread Peter Geoghegan
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan  wrote:
> Hmm. So, the attached test case fails amcheck verification for me with
> the latest version of the patch:

Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.

I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.

Other changes:

* We now support system catalog indexes. There is no reason not to support them.

* Removed unnecessary code from _bt_buildadd().

* Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which
I used to fix that bug I mentioned. I agree that we should keep the
DEBUG4 traces around until the overall design settles down. I found
the ones that you added helpful, too.

* Added quite a few new assertions. For example, we need to still
support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot
let them use compression -- new defensive assertions were added to
make this break loudly.

* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?

* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.

* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.


I also have some new concerns about the code in the patch that I will
point out now (though only as something to think about a solution on
-- I am unsure myself):

* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, all existing calls to PageAddItem() in
Postgres are never expected to fail -- if they do fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.

Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)
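
To spell out the convention I mean, every insertion pads the tuple size up
front, so the size nbtree reasons about and the size bufpage.c charges for
are one and the same.  A fragmentary sketch (not code from the patch):

	/* nbtree-style insertion bookkeeping (sketch) */
	Size		itemsz = IndexTupleSize(itup);

	itemsz = MAXALIGN(itemsz);	/* what PageAddItem() will effectively consume */
	if (PageAddItem(page, (Item) itup, itemsz, newitemoff,
					false, false) == InvalidOffsetNumber)
		elog(PANIC, "failed to add item to index page");	/* "can't happen" */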

I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.

* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be commitable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to make big
INSERT queries run as fast as or faster than before when inserting into
unlogged tables, which is a very good start.

* Since we can now split a posting list in two, we may also have to
reconsider BTMaxItemSize, or some similar mechanism that worries about
extreme cases where it becomes impossible to split because even two
pages are not enough to fit everything. Think of what happens when
there is a tuple with a single large datum, that gets split in two
(the tuple is split, not the page), with each half receiving its own
copy of the datum. I haven't proven to myself that this is broken, but
that may just be because I haven't spent any time on it. OTOH, maybe
you already have it right, in which case it seems like it should be

Re: Memory Accounting

2019-07-23 Thread Melanie Plageman
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis  wrote:

> Previous discussion:
> https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
>
> This patch introduces a way to ask a memory context how much memory it
> currently has allocated. Each time a new block (not an individual
> chunk, but a new malloc'd block) is allocated, it increments a struct
> member; and each time a block is free'd, it decrements the struct
> member. So it tracks blocks allocated by malloc, not what is actually
> used for chunks allocated by palloc.
>
>
Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.

Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc, and then later, were we to try to check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?

I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.


> The purpose is for Memory Bounded Hash Aggregation, but it may be
> useful in more cases as we try to better adapt to and control memory
> usage at execution time.
>
>
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?

-- 
Melanie Plageman


RE: [PATCH] Speedup truncates of relation forks

2019-07-23 Thread Jamison, Kirk
Hi,

I repeated the recovery performance test from before, and found out that I had
made a wrong measurement.
Using the same steps indicated in the previous email (24GB shared_buffers in
my case), the recovery time is still significantly improved compared to head:
from "13 minutes" down to "4 minutes 44 seconds"  // not 30 seconds as
previously reported.
That is expected, because the measurement of vacuum execution time (no
failover) which I reported in the first email is in the same ballpark
(although 5 minutes):
> HEAD results
> 3) 24GB shared_buffers = 14 min 13.598 s
> PATCH results
> 3) 24GB shared_buffers = 5 min 35.848 s


Reattaching the patch here again. The V5 of the patch fixed the compile error
mentioned before and mainly addressed the comments/advice of Sawada-san.
- updated more accurate comments describing only current behavior, not history
- updated function name: visibilitymap_truncate_prepare()
- moved the setting of values for smgr_{fsm,vm}_nblocks inside the 
smgrtruncate()

I'd be grateful if anyone could provide comments, advice, or insights.
Thank you again in advance.

Regards,
Kirk Jamison


v5-0001-Speedup-truncates-of-relation-forks.patch
Description: v5-0001-Speedup-truncates-of-relation-forks.patch


Re: Fetching timeline during recovery

2019-07-23 Thread Michael Paquier
On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:
> Please, find in attachment a first trivial patch to support pg_walfile_name()
> and pg_walfile_name_offset() on a standby.
> The previous restriction on these functions seems related to ThisTimeLineID not
> being safe on a standby. This patch is fetching the timeline from
> WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
> this is updated each time some data are flushed to the WAL.

FWIW, I don't have any objections to lift a bit the restrictions on
those functions if we can make that reliable enough.  Now during
recovery you cannot rely on ThisTimeLineID as you say, per mostly the
following bit in xlog.c (the comment block a little bit up also has
explanations):
   /*
* ThisTimeLineID is normally not set when we're still in recovery.
* However, recycling/preallocating segments above needed ThisTimeLineID
* to determine which timeline to install the segments on. Reset it now,
* to restore the normal state of affairs for debugging purposes.
*/
if (RecoveryInProgress())
ThisTimeLineID = 0;

Your patch does not account for the case of archive recovery, where
there is no WAL receiver; as the shared memory state of the WAL
receiver is not set, the timeline returned would be 0.  The replay
timeline is something we could use here instead, via GetXLogReplayRecPtr().
CreateRestartPoint actually takes the latest WAL receiver or replayed
point for its end LSN position, whichever is newer.
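
In other words, something along the lines of what GetStandbyFlushRecPtr()
does in walsender.c could be used here (untested, simplified sketch):

	XLogRecPtr	receivePtr;
	XLogRecPtr	replayPtr;
	TimeLineID	receiveTLI;
	TimeLineID	replayTLI;
	TimeLineID	tli;

	receivePtr = GetWalRcvWriteRecPtr(NULL, &receiveTLI);
	replayPtr = GetXLogReplayRecPtr(&replayTLI);

	/* prefer whichever position is newer, as CreateRestartPoint() does */
	tli = (receivePtr > replayPtr) ? receiveTLI : replayTLI;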

> Last, I plan to produce an extension to support this on older release. Is
> it something that could be integrated in official source tree during a minor
> release or should I publish it on eg. pgxn?

Unfortunately no.  This is a behavior change so it cannot find its way
into back branches.  The WAL receiver state is in shared memory and
published, so that's easy enough to get.  We don't do that for XLogCtl
unfortunately.  I think that there are arguments for being more
flexible with it, and perhaps have a system-level view to be able to
look at some of its fields.

There is also a downside with get_controlfile(), which is that it
fetches directly the data from the on-disk pg_control, and
post-recovery this only gets updated at the first checkpoint.
--
Michael


signature.asc
Description: PGP signature


Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby  wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when 
> > this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.

I meant it wasn't a trivial error on my part, like failing to drop the
previously loaded DB instance.  It occurred to me to check inodes, which can
also cause ENOSPC.  This is mkfs -T largefile, so running out of inodes is not
an impossibility.  But that seems an unlikely culprit, unless something made
tens of thousands of (small) files.

[pryzbyj@alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem   Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres
  65536  5605 599319% /var/lib/pgsql

> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different.  Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log?  You mentioned that you couldn't see any signs of the OOM
> killer.  Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?

$ command time -v python -c "'x'*49" |wc
Traceback (most recent call last):
  File "", line 1, in 
MemoryError
Command exited with non-zero status 1
...
Maximum resident set size (kbytes): 4276

$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, 
file-rss:1624kB

I wouldn't burn too much more time on it until I can reproduce it.  The
failures were all during pg_restore, so the checkpointer would've been very
busy.  It seems possible for it to notice ENOSPC before the workers... which
would be fsyncing WAL, whereas the checkpointer is fsyncing data.

> Admittedly it is quite hard to distinguish between a web browser
> and a program designed to eat memory as fast as possible...

Browsers making lots of progress here but still clearly 2nd place.

Justin




Re: SQL/JSON path issues/questions

2019-07-23 Thread Steven Pousty
Ok I have the toolset.
Where do I find the PR for the doc on this work? I only feel qualified to
review the doc.
Thanks
Steve

On Sat, Jul 20, 2019 at 11:48 AM Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Hi Steven,
>
> On Fri, Jul 19, 2019 at 9:53 PM Steven Pousty 
> wrote:
> > I would like to help review this documentation. Can you please point me
> in the right direction?
>
> Thank you for your interest.  You're welcome to do review.
>
> Please take a look at instruction for reviewing a patch [1] and
> working with git [2].  Also, in order to build a doc you will need to
> setup a toolset first [3].
>
> Links
>
> 1. https://wiki.postgresql.org/wiki/Reviewing_a_Patch
> 2. https://wiki.postgresql.org/wiki/Working_with_git#Testing_a_patch
> 3. https://www.postgresql.org/docs/devel/docguide-toolsets.html
>
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


Re: Parallel Append subplan order instability on aye-aye

2019-07-23 Thread Thomas Munro
On Tue, Jul 16, 2019 at 12:21 PM Tom Lane  wrote:
> In the meantime, we've had *lots* of buildfarm failures in the
> added pg_stat_all_tables query, which indicate that indeed the
> stats collector mechanism isn't terribly reliable.  But that
> doesn't directly prove anything about the original problem,
> since the planner doesn't look at stats collector data.

I noticed that if you look at the list of failures of this type, there
are often pairs of animals belonging to Andres that failed at the same
time.  I wonder if he might be running a bunch of animals on one
kernel, and need to increase net.core.rmem_max and
net.core.rmem_default (or maybe the write side variants, or both, or
something like that).

-- 
Thomas Munro
https://enterprisedb.com




Re: stress test for parallel workers

2019-07-23 Thread Alvaro Herrera
On 2019-Jul-23, Justin Pryzby wrote:

> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

I'm not sure that this proves much, since I expect temporary files to be
deleted on failure; by the time you run 'df' the condition might have
already been cleared.  You'd need to be capturing diskspace telemetry
with sufficient granularity ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > *I suspect that the only thing implicating parallelism in this failure
> > is that parallel leaders happen to print out that message if the
> > postmaster dies while they are waiting for workers; most other places
> > (probably every other backend in your cluster) just quietly exit.
> > That tells us something about what's happening, but on its own doesn't
> > tell us that parallelism plays an important role in the failure mode.
>
> I agree that there's little evidence implicating parallelism directly.
> The reason I'm suspicious about a possible OOM kill is that parallel
> queries would appear to the OOM killer to be eating more resources
> than the same workload non-parallel, so that we might be at more
> hazard of getting OOM'd just because of that.
>
> A different theory is that there's some hard-to-hit bug in the
> postmaster's processing of parallel workers that doesn't apply to
> regular backends.  I've looked for one in a desultory way but not
> really focused on it.
>
> In any case, the evidence from the buildfarm is pretty clear that
> there is *some* connection.  We've seen a lot of recent failures
> involving "postmaster exited during a parallel transaction", while
> the number of postmaster failures not involving that is epsilon.

I don't have access to the build farm history in searchable format
(I'll go and ask for that).  Do you have an example to hand?  Is this
failure always happening on Linux?

--
Thomas Munro
https://enterprisedb.com




Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby  wrote:
> I ought to have remembered that it *was* in fact out of space this AM when 
> this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

Ok, cool, so the ENOSPC thing we understand, and the postmaster death
thing is probably something entirely different.  Which brings us to
the question: what is killing your postmaster or causing it to exit
silently and unexpectedly, but leaving no trace in any operating
system log?  You mentioned that you couldn't see any signs of the OOM
killer.  Are you in a situation to test an OOM failure so you can
confirm what that looks like on your system?  You might try typing
this into Python:

x = [42]
for i in range(1000):
  x = x + x

On my non-Linux system, it ran for a while and then was killed, and
dmesg showed:

pid 15956 (python3.6), jid 0, uid 1001, was killed: out of swap space
pid 40238 (firefox), jid 0, uid 1001, was killed: out of swap space

Admittedly it is quite hard for me to distinguish between a web browser
and a program designed to eat memory as fast as possible...  Anyway on
Linux you should see stuff about killed processes and/or OOM in one of
dmesg, syslog, messages.

> But that gives me an idea: is it possible there's an issue with files being
> held open by worker processes?  Including by parallel workers?  Probably
> WALs, even after they're rotated?  If there were worker processes holding
> open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers don't do anything with WAL files, but they can create
temporary files.  If you're building humongous indexes with parallel
workers, you'll get some of those, but I don't think it'd be more than
you'd get without parallelism.  If you were using up all of your disk
space with temporary files, wouldn't this be reproducible?  I think
you said you were testing this repeatedly, so if that were the problem
I'd expect to see some non-panicky out-of-space errors when the temp
files blow out your disk space, and only rarely a panic if a
checkpoint happens to run exactly at a moment where the create index
hasn't yet written the byte that breaks the camel's back, but the
checkpoint pushes it over the edge in one of these places where it panics
on failure.

-- 
Thomas Munro
https://enterprisedb.com




Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Justin Pryzby  writes:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.

The fact that you're not finding log output matching what was reported
to the client seems to me to be a mighty strong indication that there
*was* an ENOSPC problem.  Can you reconfigure to put the postmaster
log on a different volume?

> But that gives me an idea: is it possible there's an issue with files being
> held open by worker processes?  Including by parallel workers?  Probably
> WALs, even after they're rotated?  If there were worker processes holding
> open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
> obvious after they die, since the space would then be freed.

Parallel workers aren't ever allowed to write, in the current
implementation, so it's not real obvious why they'd have any
WAL log files open at all.

regards, tom lane




Re: pgbench tests vs Windows

2019-07-23 Thread Tom Lane
Andrew Dunstan  writes:
> In commit ed8a7c6fcf9 we added some extra tests to pgbench, including
> this snippet:
> \setshell two\
>   expr \
>     1 + :one
> Unfortunately, this isn't portable, as I've just discovered at the cost
> of quite a bit of time. In particular, you can't assume expr is present
> and in the path on Windows.

Ugh.

> The Windows equivalent would be something like:
> \setshell two\
>   @set /a c = 1 + :one  && echo %c%

I wonder how universal that is, either.

> I propose to prepare a patch along these lines. Alternatively we could
> just drop it - I don't think the test matters all that hugely.

I'd say try that, but if it doesn't work right away, just skip the
test on Windows.

regards, tom lane




Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby  wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby  
> > > wrote:
> > > > #2  0x0085ddff in errfinish (dummy=) at 
> > > > elog.c:555
> > > > edata = 
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw... unless someone knows a trick, it's "optimized out".
> 
> How about something like this:
> 
> print errordata[errordata_stack_depth]

Clever.

(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127, 
show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 
0x27b3790 "< %m %u >", lineno = 41745456, 
  funcname = 0x3030313335 , domain = 0x0, 
context_domain = 0x27cff90 "postgres", sqlerrcode = 0, message = 0xe880001 
, 
  detail = 0x297a , detail_log = 0x0, hint = 
0xe88 , context = 0x297a , message_id = 0x0, schema_name = 0x0, 
  table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 
0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, 
assoc_context = 0x0}
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false, 
show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x9c4030 
"origin.c", lineno = 591, 
funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810 
"postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293, 
message = 0x27b0e40 "could not write to file 
\"pg_logical/replorigin_checkpoint.tmp\": No space left on device", detail = 
0x0, detail_log = 0x0, hint = 0x0, context = 0x0, 
message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...

I ought to have remembered that it *was* in fact out of space this AM when this
core was dumped (due to having not touched it since scheduling transition to
this VM last week).

I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
failing to find log output, I ran df right after the failure.

But that gives me an idea: is it possible there's an issue with files being
held open by worker processes?  Including by parallel workers?  Probably
WALs, even after they're rotated?  If there were worker processes holding
open lots of rotated WALs, that could cause ENOSPC, but that wouldn't be
obvious after they die, since the space would then be freed.

Justin




Re: Support for jsonpath .datetime() method

2019-07-23 Thread Nikita Glukhov

On 23.07.2019 16:44, Peter Eisentraut wrote:


I think the best way forward here is to focus first on patch 0002 and
get the additional format templates in, independent of any surrounding
JSON functionality.

In particular, remove parse_datetime() and all the related API changes,
then it becomes much simpler.

The codes FF1..FF6 that you added appear to be correct, but reading the
spec I find there is more missing, specifically

- RRRR and RR


It seems that our YY works like RR should:

SELECT to_date('69', 'YY');
  to_date
------------
 2069-01-01
(1 row)

SELECT to_date('70', 'YY');
  to_date
------------
 1970-01-01
(1 row)

But by the standard, the first two digits of the current year should be used
for YY.


Oracle follows the standard but its implementation has a different
rounding algorithm:

SELECT TO_CHAR(TO_DATE('99', 'YY'), 'YYYY') from dual;
2099

SELECT TO_CHAR(TO_DATE('49', 'RR'), 'YYYY') from dual;
2049

SELECT TO_CHAR(TO_DATE('50', 'RR'), 'YYYY') from dual;
1950


So it's unclear what we should do:
 - implement YY and RR strictly following the standard only in .datetime()
 - fix YY implementation in to_date()/to_timestamp() and implement RR
 - use our non-standard templates in .datetime()


- SSSSS (currently only SSSS is supported, but that's not standard)


The SSSSS template can be easily added as an alias to SSSS.


Also in some cases we allow timestamps with seven digits of fractional
precision, so perhaps FF7 should be supported as well.  I'm not quite
sure about the details here.  Your tests only cover 6 and 9 digits.  It
would be good to cover 7 and perhaps 8 as well, since those are the
boundary cases.


FF7-FF9 were present in earlier versions of the jsonpath patches, but they
had been removed (see [1]) because they were not completely supported due
to the limited precision of timestamp.


Some concrete pieces of review:

+   
+FF1
+decisecond (0-9)
+   

Let's not use such weird terms as "deciseconds".  We could say
"fractional seconds, 1 digit" etc. or something like that.

And what about "tenths of seconds", "hundredths of seconds"?

+/* Return flags for DCH_from_char() */
+#define DCH_DATED  0x01
+#define DCH_TIMED  0x02
+#define DCH_ZONED  0x04

I think you mean do_to_timestamp() here.  These terms "dated" etc. are
from the SQL standard text, but they should be explained somewhere for
the readers of the code.


[1] 
https://www.postgresql.org/message-id/885de241-5a51-29c8-a6b3-f1dda22aba13%40postgrespro.ru


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby  wrote:
> On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby  wrote:
> > > #2  0x0085ddff in errfinish (dummy=) at 
> > > elog.c:555
> > > edata = 
> >
> > If you have that core, it might be interesting to go to frame 2 and
> > print *edata or edata->saved_errno.
>
> As you saw... unless someone knows a trick, it's "optimized out".

How about something like this:

print errordata[errordata_stack_depth]

If you can't find errordata_stack_depth, maybe look at the whole array
and try to find the interesting bit?


--
Thomas Munro
https://enterprisedb.com




Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby  wrote:
> > #2  0x0085ddff in errfinish (dummy=) at 
> > elog.c:555
> > edata = 
> 
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno.

As you saw... unless someone knows a trick, it's "optimized out".

> Could it have been fleetingly full due to some other thing happening on the
> system that rapidly expands and contracts?

It's not impossible, especially while loading data, and data_dir is only 64GB;
it may have happened that way sometimes; but it's hard to believe that's been
the case 5-10 times now.  As long as I don't forget to drop the previously
loaded old database when loading old/historic data, it should have ~40GB free
on data_dir, and no clients connected other than pg_restore.

$ df -h /var/lib/pgsql
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/data-postgres
   64G   26G   38G  41% /var/lib/pgsql

> | ereport(PANIC,  
> | (errcode_for_file_access(),
> |  errmsg("could not write to file \"%s\": %m",
> | tmppath)));
> 
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

I realized this probably is *not* an issue with zfs, since it's failing to log
(for one reason or another) to /var/lib/pgsql (ext4).

> Perhaps it would be clearer what's going on if you could put the PostgreSQL
> log onto a different filesystem, so we get a better chance of collecting
> evidence?

I didn't mention it but last weekend I'd left a loop around the restore process
running overnight, and had convinced myself the issue didn't recur since their
faulty blade was taken out of service...  My plan was to leave the server
running in the foreground with logging_collector=no, which I hope is enough,
unless logging is itself somehow implicated.  I'm trying to stress test that
way now.

> But then... the parallel leader process was apparently able
> to log something -- maybe it was just lucky, but you said this
> happened this way more than once.  I'm wondering how it could be that
> you got some kind of IO failure and weren't able to log the PANIC
> message AND your postmaster was killed, and you were able to log a
> message about that.  Perhaps we're looking at evidence from two
> unrelated failures.

The messages from the parallel leader (building indices) were visible to the
client, not via the server log.  I was loading their data and the errors were
visible when pg_restore failed.

On Wed, Jul 24, 2019 at 09:10:41AM +1200, Thomas Munro wrote:
> Just by the way, parallelism in CREATE INDEX is controlled by
> max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

Thank you.

Justin




Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro  wrote:
> > edata = 

> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno. ...

Rats.  We already saw that it's optimised out so unless we can find
that somewhere else in a variable that's present in the core, we
probably can't find out what the operating system said.  So my other
idea for getting this information next time is to try putting the
PostgreSQL logs somewhere that's more likely to be still working when
that thing fails.

--
Thomas Munro
https://enterprisedb.com




Re: range_agg

2019-07-23 Thread Alvaro Herrera
Hi Paul,

Just checking if you've had a chance to make progress on this.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgbench tests vs Windows

2019-07-23 Thread Andrew Dunstan


In commit ed8a7c6fcf9 we added some extra tests to pgbench, including
this snippet:


\setshell two\
  expr \
    1 + :one

Unfortunately, this isn't portable, as I've just discovered at the cost
of quite a bit of time. In particular, you can't assume expr is present
and in the path on Windows. The Windows equivalent would be something like:


\setshell two\
  @set /a c = 1 + :one  && echo %c%


I propose to prepare a patch along these lines. Alternatively we could
just drop it - I don't think the test matters all that hugely.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: pgbench - allow to create partitioned tables

2019-07-23 Thread Simon Riggs
On Tue, 23 Jul 2019 at 19:26, Fabien COELHO  wrote:

>
> Hello devs,
>
> While doing some performance tests and reviewing patches, I needed to
> create partitioned tables. Given the current syntax this is time
> consuming.
>

Good idea. I wonder why we didn't have it already.


> The attached patch adds two options to create a partitioned "account"
> table in pgbench.
>
> It allows one to quickly answer simple questions, e.g. "what is the overhead of
> hash partitioning on a simple select on my laptop"? Answer:
>
>   # N=0..?
>   sh> pgbench -i -s 1 --partition-number=$N --partition-type=hash
>

Given current naming of options, I would call this
--partitions=number-of-partitions and --partition-method=hash


>   # then run
>   sh> pgbench -S -M prepared -P 1 -T 10
>
>   # and look at latency:
>   # no parts = 0.071 ms
>   #   1 hash = 0.071 ms (did someone optimize this case?!)
>   #   2 hash ~ 0.126 ms (+ 0.055 ms)
>   #  50 hash ~ 0.155 ms
>   # 100 hash ~ 0.178 ms
>   # 150 hash ~ 0.232 ms
>   # 200 hash ~ 0.279 ms
>   # overhead ~ (0.050 + [0.0005-0.0008] * nparts) ms
>

Is it linear?

-- 
Simon Riggs                http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Thomas Munro  writes:
> *I suspect that the only thing implicating parallelism in this failure
> is that parallel leaders happen to print out that message if the
> postmaster dies while they are waiting for workers; most other places
> (probably every other backend in your cluster) just quietly exit.
> That tells us something about what's happening, but on its own doesn't
> tell us that parallelism plays an important role in the failure mode.

I agree that there's little evidence implicating parallelism directly.
The reason I'm suspicious about a possible OOM kill is that parallel
queries would appear to the OOM killer to be eating more resources
than the same workload non-parallel, so that we might be at more
hazard of getting OOM'd just because of that.

A different theory is that there's some hard-to-hit bug in the
postmaster's processing of parallel workers that doesn't apply to
regular backends.  I've looked for one in a desultory way but not
really focused on it.

In any case, the evidence from the buildfarm is pretty clear that
there is *some* connection.  We've seen a lot of recent failures
involving "postmaster exited during a parallel transaction", while
the number of postmaster failures not involving that is epsilon.

regards, tom lane




Re: make \d pg_toast.foo show its indices ; and, \d toast show its main table ; and \d relkind=I show its partitions

2019-07-23 Thread Fabien COELHO




Pushed with minor fiddling with the toast-table code, and rather
more significant hacking on the partitioned-index code.  Notably,
0003 had broken output of Tablespace: footers for everything except
indexes.


Argh, sorry for the review miss.


It's possibly not Justin's fault that that wasn't noticed,
because we had no regression tests covering it :-(.  We do now.


Thanks.

--
Fabien.




Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby  wrote:
> #2  0x0085ddff in errfinish (dummy=) at 
> elog.c:555
> edata = 
> elevel = 22
> oldcontext = 0x27e15d0
> econtext = 0x0
> __func__ = "errfinish"
> #3  0x006f7e94 in CheckPointReplicationOrigin () at origin.c:588
> save_errno = 
> tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
> path = 0x9c4300 "pg_logical/replorigin_checkpoint"
> tmpfd = 64
> i = 
> magic = 307747550
> crc = 4294967295
> __func__ = "CheckPointReplicationOrigin"

> Supposedly it's trying to do this:
>
> |   ereport(PANIC,
> |   (errcode_for_file_access(),
> |errmsg("could not write to file \"%s\": %m",
> |   tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno.  If the errno is EIO, it's a bit
strange if that's not showing up in some form in kernel logs or dmesg
or something; if it's ENOSPC I guess it'd be normal that it doesn't
show up anywhere and there is nothing in the PostgreSQL logs if
they're on the same full filesystem, but then you would probably
already have mentioned that your filesystem was out of space.  Could
it have been fleetingly full due to some other thing happening on the
system that rapidly expands and contracts?

I'm confused by the evidence, though.  If this PANIC is the origin of
the problem, how do we get to postmaster-death based exit in a
parallel leader*, rather than quickdie() (the kind of exit that
happens when the postmaster sends SIGQUIT to every process and they
say "terminating connection because of crash of another server
process", because some backend crashed or panicked).  Perhaps it would
be clearer what's going on if you could put the PostgreSQL log onto a
different filesystem, so we get a better chance of collecting
evidence?  But then... the parallel leader process was apparently able
to log something -- maybe it was just lucky, but you said this
happened this way more than once.  I'm wondering how it could be that
you got some kind of IO failure and weren't able to log the PANIC
message AND your postmaster was killed, and you were able to log a
message about that.  Perhaps we're looking at evidence from two
unrelated failures.

*I suspect that the only thing implicating parallelism in this failure
is that parallel leaders happen to print out that message if the
postmaster dies while they are waiting for workers; most other places
(probably every other backend in your cluster) just quietly exit.
That tells us something about what's happening, but on its own doesn't
tell us that parallelism plays an important role in the failure mode.


--
Thomas Munro
https://enterprisedb.com




Re: getting ERROR "relation 16401 has no triggers" with partition foreign key alter

2019-07-23 Thread Alvaro Herrera
On 2019-Jul-22, Alvaro Herrera wrote:

> After looking at the code some more, I think calling the new function in
> the Prep phase is correct.  The attached patch is pretty much final form
> for this bugfix.  I decided to unwrap a couple of error messages (I did
> get bitten while grepping because of this), and reordered one of the new
> Identity command cases in ATPrepCmd since it appeared in inconsistent
> order in that one place of four.

Pushed to all three branches.

Thanks for reporting

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-23 Thread Tom Lane
Martijn van Oosterhout  writes:
> On Tue, 23 Jul 2019 at 19:21, Tom Lane  wrote:
>> Martijn van Oosterhout  writes:
>>> 2. Add a field to AsyncQueueEntry which points to the next listening
>>> backend. This would allow the loops over all listening backends to
>>> complete much faster, especially in the normal case where there are
>>> not many listeners relative to the number of backends. The downside is
>>> this requires an exclusive lock to remove listeners, but that doesn't
>>> seem a big problem.

>> I don't understand how that would work?  The sending backend doesn't
>> know what the "next listening backend" is.  Having to scan the whole
>> queue when a listener unlistens seems pretty awful too, especially
>> if you need exclusive lock while doing so.

> I mean tracking the listening backends specifically, so you can
> replace the loops:
> for (i=0; i < MaxBackends; i++)
> with
> for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = QUEUE_NEXT_LISTENING_BACKEND(i))

Ah ... but surely you would not put such info in AsyncQueueEntry,
where there'd be a separate copy for each message.  I think you
meant to add the info to AsyncQueueControl.

It might be better to redefine the backends[] array as being mostly
contiguous (ie, a new backend takes the first free slot not the one
indexed by its own BackendId), at the price of needing to store
BackendId in each slot explicitly instead of assuming it's equal to
the array index.  I suspect the existing data structure is putting too
much of a premium on making sizeof(QueueBackendStatus) a power of 2.
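
For illustration, a rough sketch of that layout; all type, field and helper
names here are invented for the example and are not the existing async.c
definitions:

typedef struct QueuePosition
{
    int         page;           /* SLRU page number */
    int         offset;         /* byte offset within the page */
} QueuePosition;

typedef struct QueueBackendStatus
{
    BackendId   backendId;      /* owner of this slot, stored explicitly */
    int32       pid;            /* PID, for signalling */
    Oid         dboid;          /* backend's database */
    QueuePosition pos;          /* how far this backend has read the queue */
} QueueBackendStatus;

typedef struct AsyncQueueControl
{
    QueuePosition head;         /* next entry is written here */
    QueuePosition tail;         /* oldest entry still needed by someone */
    int         numListeners;   /* slots [0 .. numListeners-1] are in use */
    QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;

/* a sender would then visit only the listening backends, not MaxBackends */
static void
signal_listeners(AsyncQueueControl *ctl)
{
    for (int i = 0; i < ctl->numListeners; i++)
    {
        /* wake ctl->backend[i].pid here if its pos lags behind ctl->head */
    }
}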

regards, tom lane




Re: stress test for parallel workers

2019-07-23 Thread Thomas Munro
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby  wrote:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a 
> parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT:  CREATE UNIQUE INDEX 
> unused0_huawei_umts_nodeb_locell_201907_unique_idx ON 
> child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...

> ...  I've set
> max_parallel_workers_per_gather=0, ...

Just by the way, parallelism in CREATE INDEX is controlled by
max_parallel_maintenance_workers, not max_parallel_workers_per_gather.

-- 
Thomas Munro
https://enterprisedb.com




Re: make \d pg_toast.foo show its indices ; and, \d toast show its main table ; and \d relkind=I show its partitions

2019-07-23 Thread Tom Lane
Fabien COELHO  writes:
> Field relnamespace oid in pg_class appears with pg 7.3, maybe it would be 
> appropriate to guard against older versions, with "pset.sversion >= 70300". 
> It seems that there are other unguarded instances in "describe.c", so 
> maybe this is considered too old.

Per the comment at the head of describe.c, we only expect it to work
back to 7.4.  I tested against a 7.4 server, the modified queries
seem fine.

> Marked as ready.

Pushed with minor fiddling with the toast-table code, and rather
more significant hacking on the partitioned-index code.  Notably,
0003 had broken output of Tablespace: footers for everything except
indexes.  It's possibly not Justin's fault that that wasn't noticed,
because we had no regression tests covering it :-(.  We do now.

regards, tom lane




Re: Fetching timeline during recovery

2019-07-23 Thread Jehan-Guillaume de Rorthais
On Tue, 23 Jul 2019 16:00:29 -0400
David Steele  wrote:

> On 7/23/19 2:59 PM, Andrey Borodin wrote:
> >   
> >> On 23 July 2019, at 21:05, Jehan-Guillaume de Rorthais
> >> wrote:
> >>
> >> Fetching the timeline from a standby could be useful in various situations,
> >> either for backup tools [1] or for failover tools during some kind of election
> >> process.
> > That backup tool is reading timeline from pg_control_checkpoint(). And
> > formats WAL file name itself when necessary.  
> 
> We do the same [1].

Thank you both for your comments.

OK, so backup tools are fine with reading slightly outdated data from the
controldata file.

Anyway, my use case is mostly about auto failover. During election, I currently
have to force a checkpoint on standbys to get their real timeline from the
controldata.

However, the forced checkpoint could be very long[1] (considering auto
failover). I need to be able to compare TL without all the burden of a
CHECKPOINT just for this.

As I wrote, my favorite solution would be a function returning BOTH the
current TL and LSN at the same time. I'll send a patch to the list tomorrow
and do the bikeshedding later depending on the feedback.

In the meantime, the previous patch might still be useful for some other purpose.
Comments are welcome.

Thanks,

[1] this exact use case is actually hiding behind this thread:
https://www.postgresql.org/message-id/flat/CAEkBuzeno6ztiM1g4WdzKRJFgL8b2nfePNU%3Dq3sBiEZUm-D-sQ%40mail.gmail.com




Re: Fetching timeline during recovery

2019-07-23 Thread David Steele
On 7/23/19 2:59 PM, Andrey Borodin wrote:
> 
>> On 23 July 2019, at 21:05, Jehan-Guillaume de Rorthais
>> wrote:
>>
>> Fetching the timeline from a standby could be useful in various situations,
>> either for backup tools [1] or for failover tools during some kind of election
>> process.
> That backup tool is reading timeline from pg_control_checkpoint(). And 
> formats WAL file name itself when necessary.

We do the same [1].

>> Please, find in attachment a first trivial patch to support pg_walfile_name()
>> and pg_walfile_name_offset() on a standby.
> 
> You just cannot format a WAL file name for an LSN when the timeline has
> changed, because there are at least three WALs for that point: previous, new
> and partial.  However, reading the TLI from the checkpoint seems safe for
> backup purposes.
> The only reason for WAL-G to read that timeline is to mark a backup invalid:
> if its name is base_0001YY0YY and a timeline change happens, it
> should be named base_0002YY0YY (the consistency point is not on
> TLI 2), but WAL-G cannot rename a backup during backup-push.

Naming considerations aside, I don't think that a timeline switch during
a standby backup is a good idea, mostly because it is (currently) not
tested.  We don't allow it in pgBackRest.

[1]
https://github.com/pgbackrest/pgbackrest/blob/release/2.15.1/lib/pgBackRest/Db.pm#L1008

-- 
-David
da...@pgmasters.net




Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-23 Thread Martijn van Oosterhout
On Tue, 23 Jul 2019 at 19:21, Tom Lane  wrote:
>
> Martijn van Oosterhout  writes:
> > 2. Add a field to AsyncQueueEntry which points to the next listening
> > backend. This would allow the loops over all listening backends to
> > complete much faster, especially in the normal case where there are
> > not many listeners relative to the number of backends. The downside is
> > this requires an exclusive lock to remove listeners, but that doesn't
> > seem a big problem.
>
> I don't understand how that would work?  The sending backend doesn't
> know what the "next listening backend" is.  Having to scan the whole
> queue when a listener unlistens seems pretty awful too, especially
> if you need exclusive lock while doing so.

I mean tracking the listening backends specifically, so you can
replace the loops:

for (i=0; i < MaxBackends; i++)

with

for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = QUEUE_NEXT_LISTENING_BACKEND(i))

Such loops occur often when trying to advance the tail, when adding a new
listener, when sending a notify, etc., all while holding an (exclusive) lock.
Seems like such an easy win to only loop over the listening backends rather
than all of them.
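
A minimal sketch of the bookkeeping that would need, assuming the list head
lives in the shared queue-control struct, each per-backend slot carries a
nextListener link, 0 marks the end of the list, and the exclusive queue lock
is held; all names are illustrative only:

#define QUEUE_FIRST_LISTENING_BACKEND   (asyncQueueControl->firstListener)
#define QUEUE_NEXT_LISTENING_BACKEND(i) (asyncQueueControl->backend[i].nextListener)

/* register a new listener: push onto the head of the list */
static void
queue_add_listener(int backendId)
{
    QUEUE_NEXT_LISTENING_BACKEND(backendId) = QUEUE_FIRST_LISTENING_BACKEND;
    QUEUE_FIRST_LISTENING_BACKEND = backendId;
}

/* unregister: unlink by walking the listeners (not all backends) once */
static void
queue_remove_listener(int backendId)
{
    if (QUEUE_FIRST_LISTENING_BACKEND == backendId)
        QUEUE_FIRST_LISTENING_BACKEND = QUEUE_NEXT_LISTENING_BACKEND(backendId);
    else
    {
        for (int i = QUEUE_FIRST_LISTENING_BACKEND; i > 0;
             i = QUEUE_NEXT_LISTENING_BACKEND(i))
        {
            if (QUEUE_NEXT_LISTENING_BACKEND(i) == backendId)
            {
                QUEUE_NEXT_LISTENING_BACKEND(i) =
                    QUEUE_NEXT_LISTENING_BACKEND(backendId);
                break;
            }
        }
    }
    QUEUE_NEXT_LISTENING_BACKEND(backendId) = 0;
}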

> > Do 2 & 3 seem like a good direction to go? I can probably work something up.
>
> I'm on board with 3, obviously.  Not following what you have in mind
> for 2.

Hope this clears it up a bit. Only waking up one at a time is a good idea,
but needs some careful thinking to prove it actually works.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: Fetching timeline during recovery

2019-07-23 Thread Andrey Borodin



> On 23 July 2019, at 21:05, Jehan-Guillaume de Rorthais
> wrote:
> 
> Fetching the timeline from a standby could be useful in various situations,
> either for backup tools [1] or for failover tools during some kind of election
> process.
That backup tool is reading timeline from pg_control_checkpoint(). And formats 
WAL file name itself when necessary.

> Please, find in attachment a first trivial patch to support pg_walfile_name()
> and pg_walfile_name_offset() on a standby.

You just cannot format a WAL file name for an LSN when the timeline has
changed, because there are at least three WALs for that point: previous, new
and partial.  However, reading the TLI from the checkpoint seems safe for
backup purposes.
The only reason for WAL-G to read that timeline is to mark a backup invalid:
if its name is base_0001YY0YY and a timeline change happens, it
should be named base_0002YY0YY (the consistency point is not on
TLI 2), but WAL-G cannot rename a backup during backup-push.

Hope this information is useful. Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/wal-g/wal-g/blob/master/internal/timeline.go#L39



pgbench - allow to create partitioned tables

2019-07-23 Thread Fabien COELHO


Hello devs,

While doing some performance tests and reviewing patches, I needed to 
create partitioned tables. Given the current syntax this is time 
consuming.


The attached patch adds two options to create a partitioned "account" 
table in pgbench.


It allows one to quickly answer simple questions, e.g. "what is the overhead of
hash partitioning on a simple select on my laptop"? Answer:


 # N=0..?
 sh> pgbench -i -s 1 --partition-number=$N --partition-type=hash

 # then run
 sh> pgbench -S -M prepared -P 1 -T 10

 # and look at latency:
 # no parts = 0.071 ms
 #   1 hash = 0.071 ms (did someone optimize this case?!)
 #   2 hash ~ 0.126 ms (+ 0.055 ms)
 #  50 hash ~ 0.155 ms
 # 100 hash ~ 0.178 ms
 # 150 hash ~ 0.232 ms
 # 200 hash ~ 0.279 ms
 # overhead ~ (0.050 + [0.0005-0.0008] * nparts) ms

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 816f9cc4c7..c10789262c 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -306,6 +306,32 @@ pgbench  options  d
   
  
 
+ 
+  --partition-number=NUM
+  
+   
+Create a partitioned pgbench_accounts table with
+NUM partitions of nearly equal size for
+the scaled number of accounts.
+Default is 0, meaning no partitioning.
+   
+  
+ 
+
+ 
+  --partition-type=TYPE
+  
+   
+Create a partitioned pgbench_accounts table of type
+TYPE.
+Expected values are range or hash.
+This option is only taken into account if
+--partition-number is non-zero.
+Default is range.
+   
+  
+ 
+
  
   --tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf3306a..0b262eff13 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -186,6 +186,11 @@ int64		latency_limit = 0;
 char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
+/* partitioning for pgbench_accounts table, 0 for no partitioning */
+int 		partition_number = 0;
+enum { PART_RANGE, PART_HASH }
+			partition_type = PART_RANGE;
+
 /* random seed used to initialize base_random_sequence */
 int64		random_seed = -1;
 
@@ -617,6 +622,8 @@ usage(void)
 		   "  --foreign-keys   create foreign key constraints between tables\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "   create indexes in the specified tablespace\n"
+		   "  --partition-number=NUM   partition account table in NUM parts (defaults: 0)\n"
+		   "  --partition-type=TYPEpartition type (range or hash; default: range)\n"
 		   "  --tablespace=TABLESPACE  create tables in the specified tablespace\n"
 		   "  --unlogged-tablescreate tables as unlogged tables\n"
 		   "\nOptions to select what to run:\n"
@@ -3601,6 +3608,17 @@ initDropTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
+/*
+ * add fillfactor percent option if not 100.
+ */
+static void
+append_fillfactor(char *opts, int len)
+{
+	if (fillfactor < 100)
+		snprintf(opts + strlen(opts), len - strlen(opts),
+ " with (fillfactor=%d)", fillfactor);
+}
+
 /*
  * Create pgbench's standard tables
  */
@@ -3625,6 +3643,7 @@ initCreateTables(PGconn *con)
 		const char *bigcols;	/* column decls if accountIDs are 64 bits */
 		int			declare_fillfactor;
 	};
+
 	static const struct ddlinfo DDLs[] = {
 		{
 			"pgbench_history",
@@ -3651,11 +3670,10 @@ initCreateTables(PGconn *con)
 			1
 		}
 	};
-	int			i;
 
 	fprintf(stderr, "creating tables...\n");
 
-	for (i = 0; i < lengthof(DDLs); i++)
+	for (int i = 0; i < lengthof(DDLs); i++)
 	{
 		char		opts[256];
 		char		buffer[256];
@@ -3664,9 +3682,17 @@ initCreateTables(PGconn *con)
 
 		/* Construct new create table statement. */
 		opts[0] = '\0';
-		if (ddl->declare_fillfactor)
+
+		/* Partition pgbench_accounts table */
+		if (partition_number >= 1 && strcmp(ddl->table, "pgbench_accounts") == 0)
+		{
 			snprintf(opts + strlen(opts), sizeof(opts) - strlen(opts),
-	 " with (fillfactor=%d)", fillfactor);
+	 " partition by %s (aid)",
+	 partition_type == PART_RANGE ? "range" : "hash");
+		}
+		else if (ddl->declare_fillfactor)
+			append_fillfactor(opts, sizeof(opts));
+
 		if (tablespace != NULL)
 		{
 			char	   *escape_tablespace;
@@ -3686,6 +3712,54 @@ initCreateTables(PGconn *con)
 
 		executeStatement(con, buffer);
 	}
+
+	if (partition_number >= 1)
+	{
+		int64		part_size = (naccounts * (int64) scale + partition_number - 1) / partition_number;
+		char		ff[64];
+		ff[0] = '\0';
+		append_fillfactor(ff, sizeof(ff));
+
+		fprintf(stderr, "creating %d partitions...\n", partition_number);
+
+		for (int p = 1; p <= partition_number; p++)
+		{
+			char		query[256];
+
+			if (partition_type == PART_RANGE)
+			{
+char		minvalue[32], maxvalue[32];
+
+if (p == 1)
+	sprintf(minvalue, "MINVALUE");
+else
+	sprintf(minvalue, INT64_FORMAT, (p-1) * part_size + 1);
+
+if (p < partition_number)
+	

Re: block-level incremental backup

2019-07-23 Thread Jeevan Ladhe
Hi Vignesh,

This backup technology extends pg_basebackup itself, which means we can still
take online backups.  This is internally done using pg_start_backup and
pg_stop_backup.  pg_start_backup performs a checkpoint, and this checkpoint is
used in the recovery process while starting the cluster from a backup image.
What incremental backup modifies (as compared to traditional pg_basebackup) is
this: after doing the checkpoint, instead of copying the entire relation
files, it takes an input LSN, scans all the blocks in all relation files, and
stores the blocks having LSN >= InputLSN.  This means it considers all the
changes that are already written into relation files, including
insert/update/delete etc., up to the checkpoint performed internally by
pg_start_backup, and as Jeevan Chalke mentioned upthread the incremental
backup will also contain a copy of the WAL files.  Once this incremental
backup is combined with the parent backup by means of the new combine process
(which will be introduced as part of this feature itself), the result should
ideally look like a full pg_basebackup.  Note that any changes done by
insert/update/delete operations while the incremental backup was being taken
will still be available via the WAL files and, as in the normal restore
process, will be replayed from the checkpoint onwards up to a consistent
point.
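
To illustrate just the block filtering described above, here is a very rough
standalone sketch.  This is not the patch's code: the output format, names
and the command-line interface are invented, and error handling, checksums
and partial reads are ignored.  It only relies on the fact that each page
stores its LSN (pd_lsn) as two 32-bit halves at the start of the page header.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

typedef uint64_t XLogRecPtr;

/* read the page LSN from the first 8 bytes of the page header */
static XLogRecPtr
page_lsn(const char *page)
{
    uint32_t hi;
    uint32_t lo;

    memcpy(&hi, page, sizeof(hi));
    memcpy(&lo, page + sizeof(hi), sizeof(lo));
    return ((XLogRecPtr) hi << 32) | lo;
}

/* keep only blocks whose page LSN is >= input_lsn, tagged with their number */
static void
scan_segment(FILE *in, FILE *out, XLogRecPtr input_lsn)
{
    char      page[BLCKSZ];
    uint32_t  blkno = 0;

    while (fread(page, 1, BLCKSZ, in) == BLCKSZ)
    {
        if (page_lsn(page) >= input_lsn)
        {
            fwrite(&blkno, sizeof(blkno), 1, out);
            fwrite(page, 1, BLCKSZ, out);
        }
        blkno++;
    }
}

int
main(int argc, char **argv)
{
    FILE *in;
    FILE *out;

    if (argc != 4)
    {
        fprintf(stderr, "usage: %s segment-file output-file input-lsn\n", argv[0]);
        return 1;
    }
    in = fopen(argv[1], "rb");
    out = fopen(argv[2], "wb");
    if (in == NULL || out == NULL)
        return 1;
    scan_segment(in, out, (XLogRecPtr) strtoull(argv[3], NULL, 0));
    fclose(in);
    fclose(out);
    return 0;
}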

My two cents!

Regards,
Jeevan Ladhe

On Sat, Jul 20, 2019 at 11:22 PM vignesh C  wrote:

> Hi Jeevan,
>
> The idea is very nice.
> When insert/update/delete and truncate/drop happen in various
> combinations, how does the incremental backup handle the copying of the
> blocks?
>
>
> On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
>  wrote:
> >
> >
> >
> > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed 
> wrote:
> >>
> >>
> >>
> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
> >>>
> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed 
> wrote:
> 
> 
>  At what stage you will apply the WAL generated in between the
> START/STOP backup.
> >>>
> >>>
> >>> In this design, we are not touching any WAL related code. The WAL files
> >>> will get copied with each backup, either full or incremental. And thus,
> >>> the last incremental backup will have the final WAL files which will be
> >>> copied as-is in the combined full-backup, and they will get applied
> >>> automatically if that data directory is used to start the server.
> >>
> >>
> >> Ok, so you keep all the WAL files since the first backup, right?
> >
> >
> > The WAL files will anyway be copied while taking a backup (full or
> > incremental), but only the last incremental backup's WAL files are copied
> > to the combined synthetic full backup.
> >
> >>>
> 
>  --
>  Ibrar Ahmed
> >>>
> >>>
> >>> --
> >>> Jeevan Chalke
> >>> Technical Architect, Product Development
> >>> EnterpriseDB Corporation
> >>>
> >>
> >>
> >> --
> >> Ibrar Ahmed
> >
> >
> >
> > --
> > Jeevan Chalke
> > Technical Architect, Product Development
> > EnterpriseDB Corporation
> >
>
>
> --
> Regards,
> vignesh
> EnterpriseDB: http://www.enterprisedb.com
>
>
>


Re: stress test for parallel workers

2019-07-23 Thread Justin Pryzby
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though.  Did you try syslog?)

Nothing in /var/log/messages (nor dmesg ring).

I enabled abrtd while trying to reproduce it last week.  Since you asked, I
looked again in messages, and found it had logged the following about 10 hours ago:

(gdb) bt
#0  0x00395be32495 in raise () from /lib64/libc.so.6
#1  0x00395be33c75 in abort () from /lib64/libc.so.6
#2  0x0085ddff in errfinish (dummy=) at elog.c:555
#3  0x006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4  0x004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, 
flags=128) at xlog.c:9150
#5  0x004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6  0x006d49e2 in CheckpointerMain () at checkpointer.c:491
#7  0x0050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at 
bootstrap.c:451
#8  0x006dcf54 in StartChildProcess (type=CheckpointerProcess) at 
postmaster.c:5337
#9  0x006de78a in reaper (postgres_signal_arg=) at 
postmaster.c:2867
#10 
#11 0x00395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x006e1488 in ServerLoop (argc=, argv=) at postmaster.c:1671
#13 PostmasterMain (argc=, argv=) at 
postmaster.c:1380
#14 0x00656420 in main (argc=3, argv=0x27ae410) at main.c:228

#2  0x0085ddff in errfinish (dummy=) at elog.c:555
edata = 
elevel = 22
oldcontext = 0x27e15d0
econtext = 0x0
__func__ = "errfinish"
#3  0x006f7e94 in CheckPointReplicationOrigin () at origin.c:588
save_errno = 
tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
path = 0x9c4300 "pg_logical/replorigin_checkpoint"
tmpfd = 64
i = 
magic = 307747550
crc = 4294967295
__func__ = "CheckPointReplicationOrigin"
#4  0x004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, 
flags=128) at xlog.c:9150
No locals.
#5  0x004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
shutdown = false
checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID 
= 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid = 
496731439, nextMulti = 1, nextMultiOffset = 0, 
  oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, 
time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0, 
oldestActiveXid = 2141308}
recptr = 
_logSegNo = 
Insert = 
freespace = 
PriorRedoPtr = 
curInsert = 
last_important_lsn = 
vxids = 0x280afb8
nvxids = 0
__func__ = "CreateCheckPoint"
#6  0x006d49e2 in CheckpointerMain () at checkpointer.c:491
ckpt_performed = false
do_restartpoint = 
flags = 128
do_checkpoint = 
now = 1563781930
elapsed_secs = 
cur_timeout = 
rc = 
local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 
140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662}, 
__mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103, 0, 246358747096, 140728912407296, 
140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040, 
140447207471460, 0, 140447207471424, 140446084917816, 0, 
7864320
checkpointer_context = 0x27e15d0
__func__ = "CheckpointerMain"

Supposedly it's trying to do this:

|   ereport(PANIC,  
|   (errcode_for_file_access(),
|errmsg("could not write to file \"%s\": %m",
|   tmppath)));

And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective).  Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.

Justin




Re: [bug fix] Produce a crash dump before main() on Windows

2019-07-23 Thread Alvaro Herrera
On 2019-Jul-23, Tom Lane wrote:

> Kyotaro Horiguchi  writes:

> > My investigation convinced me that there is no way for a process
> > to detect whether it is running as a service (except the process
> > directly called from system (aka entry function)).

Maybe we can try calling pgwin32_is_service()?

> But I will say that in my experience, behavioral differences between
> Postgres started manually and Postgres started as a daemon are bad.
> So I think going out of our way to make the cases behave differently
> on Windows is probably not a good plan.

We already have such a difference on Windows -- elog.c uses it
extensively to print to the eventlog if running as a service, and to the
console if not.

Given this thread's OP, ISTM failing to do likewise for this case makes
debugging problems difficult for no good reason.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: stress test for parallel workers

2019-07-23 Thread Tom Lane
Justin Pryzby  writes:
> Does anyone have a stress test for parallel workers ?
> On a customer's new VM, I got this several times while (trying to) migrate 
> their DB:

> < 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a 
> parallel transaction

We've been seeing this irregularly in the buildfarm, too.  I've been
suspicious that it's from an OOM kill on the postmaster in the
buildfarm cases, but ...

> There's nothing in dmesg nor in postgres logs.

... you'd think an OOM kill would show up in the kernel log.
(Not necessarily in dmesg, though.  Did you try syslog?)

> Ideally a minimal test, since I'm apparently going to
> have to run under gdb to see how it's dying, or even what process is failing.

Like it told you, it's the postmaster that's going away.
That's Not Supposed To Happen, of course, but unfortunately Linux'
OOM kill heuristic preferentially targets the postmaster when
its children are consuming too many resources.

If that is the problem, there's some info on working around it at

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

regards, tom lane




Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-23 Thread Tom Lane
Martijn van Oosterhout  writes:
> There are a number of possible improvements here:

> 1. Do what sinval does and separate the reader and writer locks so
> they can't block each other. This is the ultimate solution, but it's a
> significant refactor and it's not clear that's actually worthwhile
> here. This would almost be adopting the sinvaladt structure wholesale.

I agree that that's probably more ambitious than is warranted.

> 2. Add a field to AsyncQueueEntry which points to the next listening
> backend. This would allow the loops over all listening backends to
> complete much faster, especially in the normal case where there are
> not many listeners relative to the number of backends. The downside is
> this requires an exclusive lock to remove listeners, but that doesn't
> seem a big problem.

I don't understand how that would work?  The sending backend doesn't
know what the "next listening backend" is.  Having to scan the whole
queue when a listener unlistens seems pretty awful too, especially
if you need exclusive lock while doing so.

> 3. The other idea from sinval where you only wake up one worker at a
> time is a good one as you point out. This seems quite doable, however,
> it seems wasteful to try and wake everyone up the moment we switch to
> a new page. The longer you delay the lower the chance you need to wake
> anyone at all because they've because they'll have caught up by
> themselves. A single SLRU page can hold hundreds, or even thousands of
> messages.

Not entirely following your comment here either.  The point of the change
is exactly that we'd wake up only one backend at a time (and only the
furthest-behind one, so that anyone who catches up of their own accord
stops being a factor).  Also, "hundreds or thousands" seems
over-optimistic given that the minimum size of AsyncQueueEntry is 20
bytes --- in practice it'll be more because people don't use empty
strings as notify channel names.  I think a few hundred messages per
page is the upper limit, and it could be a lot less.
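
A sketch of what that wakeup rule could look like, purely for illustration:
QUEUE_BACKEND_PID/QUEUE_BACKEND_POS/QUEUE_HEAD stand for however the listener
bookkeeping is accessed, and pos_precedes() and wake_backend() are invented
helpers, not existing async.c functions.

static void
wake_furthest_behind_listener(void)
{
    int           target = 0;
    QueuePosition minpos;
    bool          found = false;

    for (int i = 1; i <= MaxBackends; i++)
    {
        if (QUEUE_BACKEND_PID(i) == InvalidPid)
            continue;           /* not listening */
        if (!found || pos_precedes(QUEUE_BACKEND_POS(i), minpos))
        {
            minpos = QUEUE_BACKEND_POS(i);
            target = i;
            found = true;
        }
    }

    /* wake only the laggard; once caught up it can wake the next one */
    if (found && pos_precedes(minpos, QUEUE_HEAD))
        wake_backend(target);
}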

> Do 2 & 3 seem like a good direction to go? I can probably work something up.

I'm on board with 3, obviously.  Not following what you have in mind
for 2.

regards, tom lane




Re: [bug fix] Produce a crash dump before main() on Windows

2019-07-23 Thread Tom Lane
Kyotaro Horiguchi  writes:
> We are obliged to assume that we won't have the desired behavior
> without detecting whether running as a service or not.

> My investigation convinced me that there is no way for a process
> to detect whether it is running as a service (except the process
> directly called from system (aka entry function)). In other
> words, only pg_ctl knows that and other processes don't have a
> clue about that. The processes other than postmaster can receive
> that information via backend variables. But the postmaster has no
> way to get the information from pg_ctl other than command line
> parameter, environment variable or filesystem (or PIPE?).

> If we see that the complexity is worth the benefit, we can use, say, a
> command line parameter; the WER dialog can be shown when the server is
> started in a console with the parameter specified, but I don't
> think it is a problem.

Not being a Windows user, I don't have much to say about the big
question of whether disabling WER is still a good idea or not.  But
I will say that in my experience, behavioral differences between
Postgres started manually and Postgres started as a daemon are bad.
So I think going out of our way to make the cases behave differently
on Windows is probably not a good plan.

regards, tom lane




Re: Change ereport level for QueuePartitionConstraintValidation

2019-07-23 Thread Alvaro Herrera
On 2019-Jul-23, David Rowley wrote:

> Also, if I'm not wrong, the votes so far appear to be:
> 
> NOTICE:  Robert, Amit
> DEBUG1: Tom, Alvaro (I'm entirely basing this on the fact that they
> mentioned possible ways to test with DEBUG1)
> 
> I'll be happy with DEBUG1 if we can get tests to test it.

Well, I think the user doesn't *care* to see a message about the
optimization.  They just want the command to be fast.  *We* (developers)
want the message in order to ensure the command remains fast.  So some
DEBUG level seems the right thing.

Another way to reach the same conclusion is to think about the "building
index ... serially" messages, which are are pretty much in the same
category and are using DEBUG1.  (I do think the TOAST ones are just
noise though, and since they disrupt potential testing with
client_min_messages=debug1, another way to go about this is to reduce
those to DEBUG2 or just elide them.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




stress test for parallel workers

2019-07-23 Thread Justin Pryzby
Does anyone have a stress test for parallel workers ?

On a customer's new VM, I got this several times while (trying to) migrate 
their DB:

< 2019-07-23 10:33:51.552 CDT postgres >FATAL:  postmaster exited during a 
parallel transaction
< 2019-07-23 10:33:51.552 CDT postgres >STATEMENT:  CREATE UNIQUE INDEX 
unused0_huawei_umts_nodeb_locell_201907_unique_idx ON 
child.unused0_huawei_umts_nodeb_locell_201907 USING btree ...

There's nothing in dmesg nor in postgres logs.

At first I thought it's maybe because of a network disconnection, then I
thought it's because we ran out of space (wal), then they had a faulty blade.
After that, I'd tried and failed to reproduce it a number of times, but it's
just recurred during what was intended to be their final restore.  I've set
max_parallel_workers_per_gather=0, but I'm planning to try to diagnose an issue
in another instance.  Ideally a minimal test, since I'm apparently going to
have to run under gdb to see how it's dying, or even what process is failing.

DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, 
BIOS 6.00 09/21/2015
CentOS release 6.9 (Final)
Linux alextelsasrv01 2.6.32-754.17.1.el6.x86_64 #1 SMP Tue Jul 2 12:42:48 UTC 
2019 x86_64 x86_64 x86_64 GNU/Linux
version | PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 
20120313 (Red Hat 4.4.7-23), 64-bit

Justin




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread Tom Lane
David Rowley  writes:
> I've attached v7, which really is v6 with some comments adjusted and
> the order of the hash_get_num_entries and hash_get_max_bucket function
> calls swapped.  I think hash_get_num_entries() will return 0 most of
> the time where we're calling it, so it makes sense to put the
> condition that's less likely to be true first in the if condition.

> I'm pretty happy with v7 now. If anyone has any objections to it,
> please speak up very soon.  Otherwise, I plan to push it about this
> time tomorrow.

I dunno, this seems close to useless in this form.  As it stands,
once hash_get_max_bucket has exceeded the threshold, you will
arbitrarily reset the table 1000 transactions later (since the
max bucket is certainly not gonna decrease otherwise).  So that's
got two big problems:

1. In the assumed-common case where most of the transactions take
few locks, you wasted cycles for 999 transactions.

2. You'll reset the table independently of subsequent history,
even if the session's usage is pretty much always over the
threshold.  Admittedly, if you do this only once per 1K
transactions, it's probably not a horrible overhead --- but you
can't improve point 1 without making it a bigger overhead.

I did complain about the cost of tracking the stats proposed by
some earlier patches, but I don't think the answer to that is to
not track any stats at all.  The proposed hash_get_max_bucket()
function is quite cheap, so I think it wouldn't be out of line to
track the average value of that at transaction end over the
session's lifespan, and reset if the current value is more than
some-multiple of the average.

The tricky part here is that if some xact kicks that up to
say 64 entries, and we don't choose to reset, then the reading
for subsequent transactions will be 64 even if they use very
few locks.  So we probably need to not use a plain average,
but account for that effect somehow.  Maybe we could look at
how quickly the number goes up after we reset?

[ thinks for awhile... ]  As a straw-man proposal, I suggest
the following (completely untested) idea:

* Make the table size threshold variable, not constant.

* If, at end-of-transaction when the table is empty,
the table bucket count exceeds the threshold, reset
immediately; but if it's been less than K transactions
since the last reset, increase the threshold (by say 10%).

I think K can be a constant; somewhere between 10 and 100 would
probably work.  At process start, we should act as though the last
reset were more than K transactions ago (i.e., don't increase the
threshold at the first reset).

The main advantage this has over v7 is that we don't have the
1000-transaction delay before reset, which ISTM is giving up
much of the benefit of the whole idea.  Also, if the session
is consistently using lots of locks, we'll adapt to that after
awhile and not do useless table resets.
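
Spelled out as (completely untested) code, with hash_get_max_bucket() being
the function proposed in this patch series; the constants, the helper that
recreates the table, and all variable names are invented for illustration:

#define THRESHOLD_GROWTH_MIN_XACTS  50      /* the "K" above */

static long locallock_size_threshold = 64;  /* arbitrary starting point */
static long xacts_since_locallock_reset = THRESHOLD_GROWTH_MIN_XACTS;

static void
AtEOXact_MaybeShrinkLocalLockHash(void)
{
    xacts_since_locallock_reset++;

    /* only consider a rebuild when the table is empty at end of xact */
    if (hash_get_num_entries(LockMethodLocalHash) != 0)
        return;

    if (hash_get_max_bucket(LockMethodLocalHash) <= locallock_size_threshold)
        return;

    /*
     * If we grew past the threshold again this quickly, the session really
     * does use lots of locks: tolerate a larger table before the next reset.
     */
    if (xacts_since_locallock_reset < THRESHOLD_GROWTH_MIN_XACTS)
        locallock_size_threshold += locallock_size_threshold / 10;

    xacts_since_locallock_reset = 0;
    hash_destroy(LockMethodLocalHash);
    RecreateLocalLockHash();    /* hypothetical: rebuild at the initial size */
}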

regards, tom lane




Fetching timeline during recovery

2019-07-23 Thread Jehan-Guillaume de Rorthais
Hello,

Fetching the timeline from a standby could be useful in various situations,
either for backup tools [1] or for failover tools during some kind of election
process.

Please, find in attachment a first trivial patch to support pg_walfile_name()
and pg_walfile_name_offset() on a standby.

The previous restriction on these functions seems related to ThisTimeLineID not
being safe on a standby. This patch fetches the timeline from
WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
this is updated each time some data is flushed to the WAL.

As the SQL function pg_last_wal_receive_lsn() reads WalRcv->receivedUpto,
which is updated at the same time, any tool relying on these functions should be
quite fine. It will just have to parse the TL from the walfile name.

It doesn't seem perfectly sane though. I suspect a race condition in any SQL
statement that tries to get the LSN and the walfile name at the same time
if the timeline changes in the meantime. Ideally, a function should be able to
return both the LSN and the TL together, with only one read from WalRcv. I'm not
sure if I should change the result of pg_last_wal_receive_lsn() or add a
brand new admin function. Any advice?
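
For illustration, a function reading both values with a single read of WalRcv could look roughly like this sketch (the name pg_last_wal_receive_info and the corresponding pg_proc entry returning a (pg_lsn, int4) row are assumptions, not part of the attached patch):

/*
 * Hypothetical SQL-callable function returning both the last received LSN
 * and its timeline, read consistently with a single call into the shared
 * walreceiver state.  Illustration only.
 */
Datum
pg_last_wal_receive_info(PG_FUNCTION_ARGS)
{
	TupleDesc	tupdesc;
	Datum		values[2];
	bool		nulls[2] = {false, false};
	XLogRecPtr	recptr;
	TimeLineID	tli;

	/* one call, so LSN and TLI come from the same spinlock-protected read */
	recptr = GetWalRcvWriteRecPtr(NULL, &tli);

	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	if (XLogRecPtrIsInvalid(recptr))
		nulls[0] = nulls[1] = true;
	else
	{
		values[0] = LSNGetDatum(recptr);
		values[1] = Int32GetDatum(tli);
	}

	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}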

Last, I plan to produce an extension to support this on older releases. Is
it something that could be integrated into the official source tree during a
minor release, or should I publish it on e.g. pgxn?

Regards,

[1]
https://www.postgresql.org/message-id/flat/BF2AD4A8-E7F5-486F-92C8-A6959040DEB6%40yandex-team.ru
>From 9d0fb73d03c6e7e06f2f8be62abab4e54cf01117 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Tue, 23 Jul 2019 17:28:44 +0200
Subject: [PATCH] Support pg_walfile_name on standby

Support executing both SQL functions pg_walfile_name() and
pg_walfile_name_offset() on a standby.
---
 src/backend/access/transam/xlogfuncs.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..a8184a20c4 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -460,13 +460,12 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 	TupleDesc	resultTupleDesc;
 	HeapTuple	resultHeapTuple;
 	Datum		result;
+	TimeLineID  tl;
 
 	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
- errhint("%s cannot be executed during recovery.",
-		 "pg_walfile_name_offset()")));
+		GetWalRcvWriteRecPtr(NULL, &tl);
+	else
+		tl = ThisTimeLineID;
 
 	/*
 	 * Construct a tuple descriptor for the result row.  This must match this
@@ -484,7 +483,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 	 * xlogfilename
 	 */
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
+	XLogFileName(xlogfilename, tl, xlogsegno, wal_segment_size);
 
 	values[0] = CStringGetTextDatum(xlogfilename);
 	isnull[0] = false;
@@ -517,16 +516,15 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 	XLogSegNo	xlogsegno;
 	XLogRecPtr	locationpoint = PG_GETARG_LSN(0);
 	char		xlogfilename[MAXFNAMELEN];
+	TimeLineID  tl;
 
 	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
- errhint("%s cannot be executed during recovery.",
-		 "pg_walfile_name()")));
+		GetWalRcvWriteRecPtr(NULL, &tl);
+	else
+		tl = ThisTimeLineID;
 
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
+	XLogFileName(xlogfilename, tl, xlogsegno, wal_segment_size);
 
 	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
 }
-- 
2.20.1



Re: pg_upgrade version checking questions

2019-07-23 Thread Daniel Gustafsson
> On 22 Jul 2019, at 10:46, Peter Eisentraut  
> wrote:
> 
> On 2019-04-04 15:40, Daniel Gustafsson wrote:
>> On Wednesday, March 27, 2019 1:43 PM, Christoph Berg  wrote:
>> 
>>> Re: Daniel Gustafsson 2019-03-26 
>>> pC-NMmh4vQLQP76YTwY4AuoD4OdNw9egikekyQpXFpgqmTlGjIZXOTd2W5RDZPpRski5N3ADRrLYgLk6QUuvmuT5fWC9acPAYyDU1AVxJcU=@yesql.se
>>> 
 0003 - Make -B default to CWD and remove the exec_path check
 Defaulting to CWD for the new bindir has the side effect that the default
 sockdir is in the bin/ directory which may be less optimal.
>>> 
>>> Hmm, I would have thought that the default for the new bindir is the
>>> directory where pg_upgrade is located, not the CWD, which is likely to
>>> be ~postgres or the like?
>> 
>> Yes, thinking on it that's obviously better.  The attached v2 repurposes the
>> find_my_exec() check to make the current directory of pg_upgrade the default
>> for new_cluster.bindir (the other two patches are left as they were).

Thanks for reviewing!

> 0001-Only-allow-upgrades-by-the-same-exact-version-new-v2.patch
> 
> I don't understand what this does.  Please explain.

This patch makes the version check stricter to ensure that pg_upgrade and the
new cluster are of the same major and minor version.  The code grabs the full
version from the various formats we have (x.y.z, x.z, xdevel), where we used to
skip the minor rev.  This is done to address one of Tom's original complaints in
this thread.

> 0002-Check-all-used-executables-v2.patch
> 
> I think we'd also need a check for pg_controldata.

Fixed.  I also rearranged the new cluster checks to be in alphabetical order,
since the list makes more sense that way (starting with initdb etc).

> Perhaps this comment could be improved:
> 
> /* these are only needed in the new cluster */
> 
> to
> 
> /* these are only needed for the target version */
> 
> (pg_dump runs on the old cluster but has to be of the new version.)

I like this suggestion, fixed with a little bit of wordsmithing.

> 0003-Default-new-bindir-to-exec_path-v2.patch
> 
> I don't like how the find_my_exec() code has been moved around.  That
> makes the modularity of the code worse.  Let's keep it where it was and
> then structure it like this:
> 
> if -B was given:
>new_cluster.bindir = what was given for -B
> else:
># existing block

The reason for moving is that we print default values in usage(), and that
requires the value to be computed before calling usage().  We already do this
for resolving environment values in parseCommandLine().  If we do it in setup,
then we’d have to split out resolving the new_cluster.bindir into its own
function exposed to option.c, or do you have any other suggestions there?

I’ve attached all three patches as v3 to be compatible with the CFBot, only
0002 changed so far.

cheers ./daniel



0003-Default-new-bindir-to-exec_path-v3.patch
Description: Binary data


0002-Check-all-used-executables-v3.patch
Description: Binary data


0001-Only-allow-upgrades-by-the-same-exact-version-new-v3.patch
Description: Binary data


Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-23 Thread Martijn van Oosterhout
Hoi Tom,

Thank you for the detailed response. Sorry for the delay, I was on holiday.

You are absolutely correct when you point out that the queue pointers
were accessed without the lock and this is probably unsafe.  For the
first two patches this can be remedied, though I find it makes the
logic a bit harder to follow.  The comments will need to be updated to
reflect the new logic. I hope to post something soon.

As for your point about the third patch, you are right that it's
probably not saving many cycles. However I do think it's worthwhile
actually optimising this loop, because the number of backends that are
listening is likely to be much smaller than the total number of
backends, so there's a lot of cycles being wasted here already. Fixing
the thundering herd issue (like in sinval as you point out) doesn't
actually reduce the amount of work being done, just spreads it out.
Since readers and writers block each other, blocking a writer means
blocking commits across the whole cluster.

There are a number of possible improvements here:

1. Do what sinval does and separate the reader and writer locks so
they can't block each other. This is the ultimate solution, but it's a
significant refactor and it's not clear that's actually worthwhile
here. This would almost be adopting the sinvaladt structure wholesale.

2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends (see the sketch
after this list). The downside is this requires an exclusive lock to
remove listeners, but that doesn't seem like a big problem.

3. The other idea from sinval where you only wake up one worker at a
time is a good one, as you point out. This seems quite doable; however,
it seems wasteful to try to wake everyone up the moment we switch to
a new page. The longer you delay, the lower the chance you need to wake
anyone at all, because they'll have caught up by themselves. A single
SLRU page can hold hundreds, or even thousands, of messages.
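
To make idea 2 concrete, here is a small standalone toy (all names are invented; this is not async.c code) showing how chaining the listening slots together lets loops skip the non-listening backends:

#include <stdio.h>

#define MAX_BACKENDS 16

typedef struct BackendSlot
{
	int		pid;			/* 0 means slot unused */
	int		nextListener;	/* index of next listening slot, or -1 */
} BackendSlot;

static BackendSlot slots[MAX_BACKENDS];
static int	firstListener = -1;

static void
register_listener(int idx, int pid)
{
	/* push onto the head of the listener list */
	slots[idx].pid = pid;
	slots[idx].nextListener = firstListener;
	firstListener = idx;
}

int
main(void)
{
	/* only 3 of 16 backends are listening */
	register_listener(2, 1002);
	register_listener(7, 1007);
	register_listener(11, 1011);

	/* visit listeners only: 3 iterations instead of MAX_BACKENDS */
	for (int i = firstListener; i != -1; i = slots[i].nextListener)
		printf("listener slot %d, pid %d\n", i, slots[i].pid);

	/*
	 * Removing a listener would unlink its slot here; in the real code
	 * that is where the exclusive lock mentioned above comes in.
	 */
	return 0;
}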

Do 2 & 3 seem like a good direction to go? I can probably work something up.

Thanks in advance,
Martijn


> Martijn van Oosterhout  writes:
> > Please find attached updated versions of the patches, I've now tested
> > them. Also attached is a reproduction script to verify that they
> > actually work.
>
> I looked through these (a bit cursorily).
>
> I'm generally on board with the idea of 0001, but not with the patch
> details.  As coded, asyncQueueAdvanceTail is supposing that it can
> examine the shared QUEUE_HEAD and QUEUE_TAIL pointers without any
> lock whatsoever.  That's probably unsafe, and if it is safe for some
> reason, you haven't made the argument why.  Moreover, it seems
> unnecessary to make any such assumption.  Why not put back the
> advanceTail tests you removed, but adjust them so that advanceTail
> isn't set true unless QUEUE_HEAD and QUEUE_TAIL point to different
> pages?  (Note that in the existing coding, those tests are made
> while holding an appropriate lock, so it's safe to look at those
> pointers there.)
>
> It might be a good idea to make a macro encapsulating this new,
> more complicated rule for setting advanceTail, instead of relying
> on keeping the various call sites in sync.
>
> More attention to comments is also needed.  For instance, you've
> made a lie out of the documentation of the tail pointer:
>
> QueuePosition tail; /* the global tail is equivalent to the pos of
>  * the "slowest" backend */
>
> It needs to say something like "is <= the pos of the slowest backend",
> instead.  I think the explanation of why this algorithm is good could
> use more effort, too.
>
> Comments for 0002 are about the same: for no explained reason, and
> certainly no savings, you've put the notify_all test in an unsafe
> place rather than a safe one (viz, two lines down, *after* taking
> the relevant lock).  And 0002 needs more commentary about why
> its optimization is safe and useful, too.  In particular it's
> not obvious why QUEUE_HEAD being on a different page from QUEUE_TAIL
> has anything to do with whether we should wake up other backends.
>
> I'm not very persuaded by 0003, mainly because it seems likely to
> me that 0001 and 0002 will greatly reduce the possibility that
> the early-exit can happen.  So it seems like it's adding cycles
> (in a spot where we hold exclusive lock) without a good chance of
> saving any cycles.
>
> Taking a step back in hopes of seeing the bigger picture ...
> as you already noted, these changes don't really fix the "thundering
> herd of wakeups" problem, they just arrange for it to happen
> once per SLRU page rather than once per message.  I wonder if we
> could improve matters by stealing an idea from the sinval code:
> when we're trying to cause advance of the global QUEUE_TAIL, waken
> only the slowest backend, and 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-23 Thread Amit Khandekar
On Tue, 23 Jul 2019 at 08:48, Amit Kapila  wrote:
>
> On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar  wrote:
> >
> > On Mon, 22 Jul 2019 at 14:21, Amit Kapila  wrote:
> >
> > -
> >
> > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> > +{
> > + /* Block concurrent access. */
> > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> > +
> > + MyUndoWorker = >workers[slot];
> > Not sure why MyUndoWorker is used here. Can't we use a local variable
> > ? Or do we intentionally attach to the slot as a side-operation ?
> >
> > -
> >
>
> I think here, we can use a local variable as well.  Do you see any
> problem with the current code or do you think it is better to use a
> local variable here?

I think, even though there might not be a correctness issue with the
current code as it stands, we should still use a local variable.
Updating MyUndoWorker is a big side-effect, which the caller is not
supposed to be aware of, because all that function should do is just
get the slot info.

>
> > --
> >
> > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > I was thinking what happens if for some reason
> > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> > entry will not be marked invalid, and so there will be no undo action
> > carried out because I think the undo worker will exit. What happens
> > next with this entry ?
>
> The same entry is present in two queues xid and size, so next time it
> will be executed from the second queue based on it's priority in that
> queue.  However, if it fails again a second time in the same way, then
> we will be in trouble because now the hash table has entry, but none
> of the queues has entry, so none of the workers will attempt to
> execute again.  Also, when discard worker again tries to register it,
> we won't allow adding the entry to queue thinking either some backend
> is executing the same or it must be part of some queue.
>
> The one possibility to deal with this could be that we somehow allow
> discard worker to register it again in the queue or we can do this in
> critical section so that it allows system restart on error.  However,
> the main thing is it possible that InsertRequestIntoErrorUndoQueue
> will fail unless there is some bug in the code?  If so, we might want
> to have an Assert for this rather than handling this condition.

Yes, I also think that the function would error out only because of
can't-happen cases, like "too many locks taken" or "out of binary heap
slots" or "out of memory" (this last one is not such a can't-happen
case). These cases probably happen due to some bug, I suppose. But I
was wondering: generally, when the code errors out with such
can't-happen elog() calls, the worst thing that happens is that the
transaction gets aborted. Whereas, in this case, the worst thing that
could happen is that the undo action would never get executed, which
means selects for this tuple will keep on accessing the undo log?
This does not sound like a data consistency issue, so we should be
fine after all?



Some further review comments for undoworker.c :


+/* Sets the worker's lingering status. */
+static void
+UndoWorkerIsLingering(bool sleep)
The function name sounds like "is the worker lingering ?". Can we
rename it to something like "UndoWorkerSetLingering" ?

-

+ errmsg("undo worker slot %d is empty, cannot attach",
+ slot)));

+ }
+
+ if (MyUndoWorker->proc)
+ {
+ LWLockRelease(UndoWorkerLock);
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("undo worker slot %d is already used by "
+ "another worker, cannot attach", slot)));

These two error messages can have a common error message "could not
attach to worker slot", with errdetail separate for each of them :
slot %d is empty.
slot %d is already used by another worker.

--

+static int
+IsUndoWorkerAvailable(void)
+{
+ int i;
+ int alive_workers = 0;
+
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);

Should have bool return value.

Also, why is it keeping track of the number of alive workers?  It sounds like
earlier it used to return the number of alive workers?  If it indeed needs
to just return true/false, we can do away with alive_workers.

Also, *exclusive* lock is unnecessary.

--

+if (UndoGetWork(false, false, &urinfo, NULL) &&
+IsUndoWorkerAvailable())
+UndoWorkerLaunch(urinfo);

There is no lock acquired between IsUndoWorkerAvailable() and
UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
returns true, there is a small window where UndoWorkerLaunch() does
not find any worker slot with in_use false, causing assertion failure
for (worker != NULL).
--

+UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
+{
+ /* Block concurrent access. */
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
*Exclusive* lock is unnecessary.
-

+ LWLockRelease(UndoWorkerLock);
+ ereport(ERROR,
+ 

Re: SQL/JSON: JSON_TABLE

2019-07-23 Thread Pavel Stehule
Hi

út 16. 7. 2019 v 16:06 odesílatel Nikita Glukhov 
napsal:

> On 29.06.2019 8:40, Pavel Stehule wrote:
>
> Hi
>
> so 29. 6. 2019 v 7:26 odesílatel Nikita Glukhov 
> napsal:
>
>> Attached 36th version of patches rebased onto jsonpath v36.
>>
>
> I cannot to apply these patches on master. Please, can you check these
> patches?
>
>
> Attached 37th version of patches rebased onto current master.
>

I got a warning:

ar crs libpgcommon.a base64.o config_info.o controldata_utils.o d2s.o
exec.o f2s.o file_perm.o ip.o keywords.o kwlookup.o link-canary.o md5.o
pg_lzcompress.o pgfnames.o psprintf.o relpath.o rmtree.o saslprep.o
scram-common.o string.o unicode_norm.o username.o wait_error.>
...skipping...
clauses.c:1076:3: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
 1076 |   if (ExecEvalJsonNeedsSubTransaction(jsexpr, NULL))
  |   ^~
clauses.c:1078:4: note: ...this statement, but the latter is misleadingly
indented as if it were guarded by the ‘if’
 1078 |return true;
  |^~
gcc -Wall -Wmissing-protot

Regress tests diff is not empty - see attached file

some strange fragments from code:

deparseExpr(node->arg, context);
-   if (node->relabelformat != COERCE_IMPLICIT_CAST)
+   if (node->relabelformat != COERCE_IMPLICIT_CAST &&
+   node->relabelformat == COERCE_INTERNAL_CAST)


Now, "format" is a type_func_name_keyword, so when you use it, nobody
can use "format" as a column name. It can break a lot of applications. "format"
is a common name. It is relatively unfortunate, and it can affect a lot of users.

This patch set (JSON functions & JSON_TABLE) has more than 20K rows. Moreover,
more than a few features are implemented in it.

Is it possible to isolate these features better (more deeply), please? I have
nothing against any implemented feature, but it is hard to work intensively
(to test hard) on such a large patch. JSON_TABLE has only 184kB, can we start
with this patch?

SQLJSON_FUNCTIONS has 760kB - it is maybe too much for one feature, one
patch.


Pavel



> --
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


regression.diffs
Description: Binary data


Re: Support for jsonpath .datetime() method

2019-07-23 Thread Peter Eisentraut
I think the best way forward here is to focus first on patch 0002 and
get the additional format templates in, independent of any surrounding
JSON functionality.

In particular, remove parse_datetime() and all the related API changes,
then it becomes much simpler.

The codes FF1..FF6 that you added appear to be correct, but reading the
spec I find there is more missing, specifically

- RRRR and RR
- SSSSS (currently only SSSS is supported, but that's not standard)

Also in some cases we allow timestamps with seven digits of fractional
precision, so perhaps FF7 should be supported as well.  I'm not quite
sure about the details here.  You tests only cover 6 and 9 digits.  It
would be good to cover 7 and perhaps 8 as well, since those are the
boundary cases.

Some concrete pieces of review:

+   
+FF1
+decisecond (0-9)
+   

Let's not use such weird terms as "deciseconds".  We could say
"fractional seconds, 1 digit" etc. or something like that.

+/* Return flags for DCH_from_char() */
+#define DCH_DATED  0x01
+#define DCH_TIMED  0x02
+#define DCH_ZONED  0x04

I think you mean do_to_timestamp() here.  These terms "dated" etc. are
from the SQL standard text, but they should be explained somewhere for
the readers of the code.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add CREATE DATABASE LOCALE option

2019-07-23 Thread Peter Eisentraut
On 2019-07-23 00:18, Fabien COELHO wrote:
> It applies cleanly, compiles, global & pg_dump make check ok, doc gen ok.
> 
> I'm still unconvinced of the interest of breaking backward compatibility, 
> but this is no big deal.
> 
> I do not like much calling strlen() to check whether a string is empty, 
> but this is pre-existing.
> 
> I switched the patch to READY.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_receivewal documentation

2019-07-23 Thread Jesper Pedersen

Hi,

On 7/22/19 8:08 PM, Michael Paquier wrote:

+to something other than


Looks fine to me.  Just a tiny nit.  For the second reference to
synchronous_commit, I would change the link to a  markup.


Sure.

Best regards,
 Jesper


>From f6c5e9128e0779f928d94bf9bcc8813bf03150f0 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Robert Haas
Review-by: Michael Paquier, Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..a7536bed92 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -52,7 +52,15 @@ PostgreSQL documentation
Unlike the WAL receiver of a PostgreSQL standby server, pg_receivewal
by default flushes WAL data only when a WAL file is closed.
The option --synchronous must be specified to flush WAL data
-   in real time.
+   in real time. Since pg_receivewal does not apply WAL,
+   you should not allow it to become a synchronous standby when
+equals remote_apply.
+   If it does, it will appear to be a standby which never catches up,
+   which may cause commits to block.  To avoid this, you should either
+   configure an appropriate value for , or
+   specify an application_name for
+   pg_receivewal that does not match it, or change the value of
+   synchronous_commit to something other than remote_apply.
   
 
   
-- 
2.21.0



[GSoC] Second period status report for the de-TOAST iterator

2019-07-23 Thread Binguo Bao
Hi hackers,
It's a report about my GSoC project[1] during the second period.

What I've done:
- Submitted the patch[2] to optimize partial TOAST decompression
- Submitted the patch[3] as a prototype of the de-TOAST iterator and tested its
performance on the position() function
- Improved the patches according to hackers' review comments

What I'm doing:
- Running overall tests on the iterator
- Applying the de-TOAST iterator to the json/jsonb data type

Thanks to hackers and mentors for helping me advance this project.

-- 
Best regards,
Binguo Bao

[1] https://summerofcode.withgoogle.com/projects/#6467011507388416
[2]
https://www.postgresql.org/message-id/flat/CAL-OGkthU9Gs7TZchf5OWaL-Gsi=hxquftxkv9qpng73d5n...@mail.gmail.com
[3]
https://www.postgresql.org/message-id/flat/cal-ogks_onzpc9m9bxpcztmofwulcfkyecekiagxzwrl8kx...@mail.gmail.com


Re: SegFault on 9.6.14

2019-07-23 Thread Amit Kapila
On Tue, Jul 23, 2019 at 9:11 AM Thomas Munro  wrote:
>
> On Fri, Jul 19, 2019 at 3:00 PM Amit Kapila  wrote:
> > I am thinking that why not we remove the part of destroying the
> > parallel context (and shared memory) from ExecShutdownGather (and
> > ExecShutdownGatherMerge) and then do it at the time of ExecEndGather
> > (and ExecEndGatherMerge)?   This should fix the bug in hand and seems
> > to be more consistent with our overall design principles.  I have not
> > tried to code it to see if there are any other repercussions of the
> > same but seems worth investigating.  What do you think?
>
> I tried moving ExecParallelCleanup() into ExecEndGather().  The first
> problem is that ExecutePlan() wraps execution in
> EnterParallelMode()/ExitParallelMode(), but ExitParallelMode() fails
> an assertion that no parallel context is active because
> ExecEndGather() hasn't run yet.  The enclosing
> ExecutorStart()/ExecutorEnd() calls are further down the call stack,
> in ProcessQuery().  So some more restructuring might be needed to exit
> parallel mode later, but then I feel like you might be getting way out
> of back-patchable territory, especially if it involves moving code to
> the other side of the executor hook boundary.  Is there an easier way?
>

If we have to follow a solution along these lines, then I don't see an
easier way.  One idea could be that we relax the assert in
ExitParallelMode so that it doesn't expect the parallel context to be gone
by that time, but I'm not sure that is a good idea because it is used
in some other places as well.  I feel that in general it is a good
assertion: before we leave parallel mode, the parallel context
should be gone, as that ensures we won't do any parallel activity after
that.

> Another idea from the band-aid-solutions-that-are-easy-to-back-patch
> department: in ExecutePlan() where we call ExecShutdownNode(), we
> could write EXEC_FLAG_DONE into estate->es_top_eflags, and then have
> ExecGatherShutdown() only run ExecParallelCleanup() if it sees that
> flag.  That's not beautiful, but it's less churn that the 'bool final'
> argument we discussed before, and could be removed in master when we
> have a better way.
>

Right, that will be less code churn and it can also work.  However,
one thing that needs some thought is that until now es_top_eflags is only
set in ExecutorStart (and the same is mentioned in the comments where it is
declared), and it seems we are going to change that with this idea.  How
about having a separate function ExecBlahShutdown which will clean up
resources such as the parallel context and can be called only from ExecutePlan
where we are calling ExecShutdownNode?  I think both these and the
other solution we have discussed are along similar lines, and another idea
could be to relax the assert, which again is not a superb idea.
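
For reference, a rough sketch of the band-aid idea quoted above (EXEC_FLAG_DONE is a hypothetical new flag whose value is picked arbitrarily here; ExecutePlan(), ExecShutdownNode(), ExecShutdownGather() and ExecParallelCleanup() are existing executor functions, and these fragments only sketch how they might be touched):

/* In executor.h: hypothetical new eflag bit (value picked arbitrarily). */
#define EXEC_FLAG_DONE			0x0100	/* top-level execution is complete */

/* In ExecutePlan(), when the plan has produced its last tuple: */
if (TupIsNull(slot))
{
	estate->es_top_eflags |= EXEC_FLAG_DONE;
	(void) ExecShutdownNode(planstate);
	break;
}

/* In ExecShutdownGather(): destroy the parallel context only when done. */
ExecShutdownGatherWorkers(node);
if ((node->ps.state->es_top_eflags & EXEC_FLAG_DONE) != 0 &&
	node->pei != NULL)
{
	ExecParallelCleanup(node->pei);
	node->pei = NULL;
}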

> Stepping back a bit, it seems like we need two separate tree-walking
> calls: one to free resources not needed anymore by the current rescan
> (workers), and another to free resources not needed ever again
> (parallel context).  That could be spelled ExecShutdownNode(false) and
> ExecShutdownNode(true), or controlled with the EXEC_FLAG_DONE kluge,
> or a new additional ExecSomethingSomethingNode() function, or as you
> say, perhaps the second thing could be incorporated into
> ExecEndNode().  I suspect that the Shutdown callbacks for Hash, Hash
> Join, Custom Scan and Foreign Scan might not be needed anymore if we
> could keep the parallel context around until after the run
> ExecEndNode().
>

I think we need those to collect instrumentation information.  I guess
that has to be done before we call InstrStopNode, otherwise, we might
miss some instrumentation information.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [bug fix] Produce a crash dump before main() on Windows

2019-07-23 Thread Kyotaro Horiguchi
At Mon, 1 Jul 2019 00:04:47 +, "Tsunakawa, Takayuki" 
 wrote in 
<0A3221C70F24FB45833433255569204D1FC6515F@G01JPEXMBYT05>
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> > Tsunakawa/Haribabu - By reading this thread briefly, it seems we need
> > some more inputs from other developers on whether to fix this or not,
> > so ideally the status of this patch should be 'Needs Review'.  Why it
> > is in 'Waiting on Author' state?  If something is required from
> > Author, then do we expect to see the updated patch in the next few
> > days?
> 
> Thank you for paying attention to this.  I think the patch is good, but 
> someone else may have a different solution.  So I marked it as needs review.

We are obliged to assume that we won't get the desired behavior
without detecting whether we are running as a service or not.

My investigation convinced me that there is no way for a process
to detect whether it is running as a service (except for the process
directly called by the system, i.e. the entry function). In other
words, only pg_ctl knows that, and the other processes don't have a
clue about it. The processes other than the postmaster can receive
that information via backend variables, but the postmaster has no
way to get the information from pg_ctl other than a command line
parameter, an environment variable or the filesystem (or a PIPE?).

If we decide the benefit is worth the complexity, we can use, say, a
command line parameter. The WER dialog could then be shown when the
server is started in a console but with the parameter specified;
however, I don't think that is a problem.

Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Optimze usage of immutable functions as relation

2019-07-23 Thread Anastasia Lubennikova

08.07.2019 4:18, Thomas Munro:

The July Commitfest is here.  Could we please have a rebase of this patch?

Updated patch is in attachments.
I've only resolved one small cosmetic merge conflict.

Later this week I'm going to send a more thoughtful review.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 959dab6ef869821e757f93a82de40b6bf883ad71
Author: Anastasia 
Date:   Tue Jul 23 14:22:18 2019 +0300

[PATCH v6] Simplify immutable RTE_FUNCTION to RTE_RESULT

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36fefd9..acc3776 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1071,7 +1071,8 @@ preprocess_expression(PlannerInfo *root, Node *expr, int kind)
 		expr = flatten_join_alias_vars(root->parse, expr);
 
 	/*
-	 * Simplify constant expressions.
+	 * Simplify constant expressions. For function RTEs, this was already done
+	 * by pullup_simple_subqueries.
 	 *
 	 * Note: an essential effect of this is to convert named-argument function
 	 * calls to positional notation and insert the current actual values of
@@ -1085,7 +1086,8 @@ preprocess_expression(PlannerInfo *root, Node *expr, int kind)
 	 * careful to maintain AND/OR flatness --- that is, do not generate a tree
 	 * with AND directly under AND, nor OR directly under OR.
 	 */
-	expr = eval_const_expressions(root, expr);
+	if (kind != EXPRKIND_RTFUNC && kind != EXPRKIND_RTFUNC_LATERAL)
+		expr = eval_const_expressions(root, expr);
 
 	/*
 	 * If it's a qual or havingQual, canonicalize it.
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 4fbc03f..fca78b7 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -79,6 +79,9 @@ static Node *pull_up_simple_union_all(PlannerInfo *root, Node *jtnode,
 static void pull_up_union_leaf_queries(Node *setOp, PlannerInfo *root,
 	   int parentRTindex, Query *setOpQuery,
 	   int childRToffset);
+static void transform_const_function_to_result(PlannerInfo *root, Node *jtnode,
+	 RangeTblEntry *rte,
+	 JoinExpr *lowest_nulling_outer_join);
 static void make_setop_translation_list(Query *query, Index newvarno,
 		List **translated_vars);
 static bool is_simple_subquery(Query *subquery, RangeTblEntry *rte,
@@ -754,6 +757,18 @@ pull_up_subqueries_recurse(PlannerInfo *root, Node *jtnode,
 			is_simple_values(root, rte))
 			return pull_up_simple_values(root, jtnode, rte);
 
+		/*
+		 * Or is it an immutable function that evaluated to a single Const?
+		 */
+		if (rte->rtekind == RTE_FUNCTION)
+		{
+			rte->functions = (List *)
+eval_const_expressions(root, (Node *) rte->functions);
+			transform_const_function_to_result(root, jtnode, rte,
+			   lowest_nulling_outer_join);
+			return jtnode;
+		}
+
 		/* Otherwise, do nothing at this node. */
 	}
 	else if (IsA(jtnode, FromExpr))
@@ -1784,6 +1799,76 @@ is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes)
 }
 
 /*
+ * If the function of this RTE_FUNCTION entry evaluated to a single Const
+ * after eval_const_expressions, transform it to RTE_RESULT.
+ */
+static void
+transform_const_function_to_result(PlannerInfo *root, Node *jtnode,
+   RangeTblEntry *rte,
+   JoinExpr *lowest_nulling_outer_join)
+{
+	ListCell *lc;
+	pullup_replace_vars_context rvcontext;
+	RangeTblFunction *rtf = (RangeTblFunction *) linitial(rte->functions);
+	Query *parse = root->parse;
+
+	if (!IsA(rtf->funcexpr, Const))
+		return;
+
+	if (rte->funcordinality)
+		return;
+
+	if (list_length(rte->functions) != 1)
+		return;
+
+	rvcontext.targetlist = list_make1(makeTargetEntry((Expr *) rtf->funcexpr,
+		1 /* resno */, NULL /* resname */, false /* resjunk */));
+	rvcontext.root = root;
+	rvcontext.target_rte = rte;
+
+	/*
+	 * Since this function was reduced to Const, it can't really have lateral
+	 * references, even if it's marked as LATERAL. This means we don't need
+	 * to fill relids.
+	 */
+	rvcontext.relids = NULL;
+
+	rvcontext.outer_hasSubLinks = &parse->hasSubLinks;
+	rvcontext.varno = ((RangeTblRef *) jtnode)->rtindex;
+
+	/*
+	 * If this RTE is on a nullable side of an outer join, we have to insert
+	 * PHVs around our Consts so that they go to null when needed.
+	 */
+	rvcontext.need_phvs = lowest_nulling_outer_join != NULL;
+
+	rvcontext.wrap_non_vars = false;
+	rvcontext.rv_cache = palloc0((list_length(rvcontext.targetlist) + 1)
+ * sizeof(Node *));
+
+	parse->targetList = (List *)
+			pullup_replace_vars((Node *) parse->targetList, &rvcontext);
+	parse->returningList = (List *)
+			pullup_replace_vars((Node *) parse->returningList, &rvcontext);
+	replace_vars_in_jointree((Node *) parse->jointree, &rvcontext, NULL);
+
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *otherrte = (RangeTblEntry *) lfirst(lc);
+
+		if (otherrte->rtekind == RTE_JOIN)
+			

Re: Runtime pruning problem

2019-07-23 Thread David Rowley
On Sat, 25 May 2019 at 18:55, David Rowley  wrote:
> and an updated patch, rebased after the pgindent run.
>
> Hopefully, this will make the CF bot happy again.

and rebased again due to a conflict with some List changes that
touched ruleutils.c.

I also made another couple of passes over this adding a few comments
and fixing some spelling mistakes. I also added another regression
test to validate the EXPLAIN VERBOSE target list output of a
MergeAppend that's had all its subnodes pruned. Previously the Vars
from the pruned rel were only shown in the MergeAppend's sort clause.
After doing all that I'm now pretty happy with it.

The part I wouldn't mind another set of eyes on is the ruleutils.c
changes. The patch changes things around so that we don't just pass
around and track PlanStates, we also pass around the Plan node for
that state.  In some cases, the PlanState can be NULL if the Plan has
no PlanState. Currently, that only happens when run-time pruning
didn't initialise any PlanStates for the given subplan's Plan node.
I've coded it so that Append and MergeAppend use the first PlanState
to resolve Vars. I only resort to using the first Plan's vars when
there are no PlanStates.  If we just took the first Plan node all the
time then it might get confusing for users reading an EXPLAIN when the
first subplan was run-time pruned as we'd be resolving Vars from a
pruned subnode.  It seems much less confusing to print the Plan vars
when the Append/MergeAppend has no subplans.

If there are no objections to the changes then I'd really like to be
pushing this early next week.

The v3 patch is attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


improve_runtime_pruning_explain_output_v3.patch
Description: Binary data


Re: Berserk Autovacuum (let's save next Mandrill)

2019-07-23 Thread Masahiko Sawada
On Sun, Apr 14, 2019 at 9:59 PM Darafei "Komяpa" Praliaskouski
 wrote:
>>
>>
>> >I don't think it's helpful to force emergency vacuuming more
>> >frequently;
>> >quite the contrary, it's likely to cause even more issues.  We should
>> >tweak autovacuum to perform freezing more preemtively instead.
>>
>> I still think the fundamental issue with making vacuum less painful is that 
>> the all indexes have to be read entirely. Even if there's not much work (say 
>> millions of rows frozen, hundreds removed). Without that issue we could 
>> vacuum much more frequently. And do it properly in insert only workloads.
>
>
> Deletion of hundreds of rows on default settings will cause the same behavior 
> now.
> If there was 0 updates currently the index cleanup will be skipped.
>
> https://commitfest.postgresql.org/22/1817/ got merged. This means Autovacuum 
> can have two separate thresholds - the current, on dead tuples, triggering 
> the VACUUM same way it triggers it now, and a new one, on inserted tuples 
> only, triggering VACUUM (INDEX_CLEANUP FALSE)?
>

Agreed.

To invoke autovacuum even on insert-only tables we would need to check
the number of inserted tuples since the last vacuum. I think we can keep
track of the number of inserted tuples since the last vacuum in the stats
collector and add a threshold for invoking vacuum with INDEX_CLEANUP =
false. If an autovacuum worker confirms that the number of inserted
tuples exceeds the threshold, it invokes vacuum with INDEX_CLEANUP =
false. However, if the number of dead tuples also exceeds the
autovacuum thresholds (autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor), it should invoke vacuum with
INDEX_CLEANUP = true. Therefore the new threshold makes sense only when
it's lower than the autovacuum thresholds.

I guess we can have one new GUC parameter to control the scale factor.
Since only relatively large tables will require this feature, we might
not need a threshold based on the number of tuples.
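
A rough sketch of that trigger logic, in the style of relation_needs_vacanalyze() (the stats counter n_ins_since_vacuum, the GUC autovacuum_vacuum_insert_scale_factor and the skip_index_cleanup output are hypothetical names used only for illustration; autovacuum_vac_thresh, autovacuum_vac_scale and n_dead_tuples exist today):

static void
insert_only_vacuum_check(PgStat_StatTabEntry *tabentry, float4 reltuples,
						 bool *dovacuum, bool *skip_index_cleanup)
{
	float4		vacthresh = autovacuum_vac_thresh +
		autovacuum_vac_scale * reltuples;
	/* hypothetical GUC controlling the insert-driven scale factor */
	float4		insthresh = autovacuum_vacuum_insert_scale_factor * reltuples;

	*dovacuum = false;
	*skip_index_cleanup = false;

	if (tabentry->n_dead_tuples > vacthresh)
	{
		/* the regular path: vacuum including index cleanup */
		*dovacuum = true;
	}
	else if (tabentry->n_ins_since_vacuum > insthresh)
	{
		/* insert-driven vacuum: freezing/visibility work only */
		*dovacuum = true;
		*skip_index_cleanup = true;
	}
}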

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-23 Thread Michael Paquier
On Mon, Jul 22, 2019 at 12:58:40PM -0400, Alvaro Herrera wrote:
> Maybe it's just me, but it seems weird to try to forestall a problem
> that cannot occur by definition.  I would rather remove the escaping,
> and add a one-line comment explaining why we don't do it?

No objections with doing that either, really.  Perhaps you would
prefer pushing a patch among those lines by yourself?

One argument that I have in mind to justify the escaping would be if we
add a new feature in pg_basebackup to write a new set of recovery
options on an existing data folder, which does not require an option.
In this case, if the escaping does not exist, starting the server
would fail with a confusing parsing error if a quote is added to the
slot name.  But if the escaping is done, then we would get a correct
error that the replication slot value includes an incorrect character.
If such a hypothetical option is added, most likely this would be
noticed anyway, so that's mainly nannyism from my side.
--
Michael


signature.asc
Description: PGP signature


Re: Race conditions with TAP test for syncrep

2019-07-23 Thread Michael Paquier
On Mon, Jul 22, 2019 at 11:45:53PM -0700, Noah Misch wrote:
> PostgresNode uses "print" the same way.  The patch does close the intended
> race conditions, and its implementation choices look fine to me.

Thanks Noah for the review.  I have reviewed the thread and improved a
couple of comments based on Alvaro's previous input.  Attached is v2.
If there are no objections, I would be fine to commit it.
--
Michael
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index bba47da17a..8ce77ffbcf 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -27,6 +27,23 @@ sub test_sync_state
 	return;
 }
 
+# Start a standby and check that it is registered within the WAL sender
+# array of the given primary.  This polls the primary's pg_stat_replication
+# until the standby is confirmed as registered.
+sub start_standby_and_wait
+{
+	my ($master, $standby) = @_;
+	my $master_name  = $master->name;
+	my $standby_name = $standby->name;
+	my $query =
+	  "SELECT count(1) = 1 FROM pg_stat_replication WHERE application_name = '$standby_name'";
+
+	$standby->start;
+
+	print "### Waiting for standby \"$standby_name\" on \"$master_name\"\n";
+	$master->poll_query_until('postgres', $query);
+}
+
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
@@ -36,23 +53,27 @@ my $backup_name = 'master_backup';
 # Take backup
 $node_master->backup($backup_name);
 
+# Create all the standbys.  Their status on the primary is checked to
+# ensure the ordering of each one of them in the WAL sender array of the
+# primary.
+
 # Create standby1 linking to master
 my $node_standby_1 = get_new_node('standby1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_1->start;
+start_standby_and_wait($node_master, $node_standby_1);
 
 # Create standby2 linking to master
 my $node_standby_2 = get_new_node('standby2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_2->start;
+start_standby_and_wait($node_master, $node_standby_2);
 
 # Create standby3 linking to master
 my $node_standby_3 = get_new_node('standby3');
 $node_standby_3->init_from_backup($node_master, $backup_name,
 	has_streaming => 1);
-$node_standby_3->start;
+start_standby_and_wait($node_master, $node_standby_3);
 
 # Check that sync_state is determined correctly when
 # synchronous_standby_names is specified in old syntax.
@@ -82,8 +103,10 @@ $node_standby_1->stop;
 $node_standby_2->stop;
 $node_standby_3->stop;
 
-$node_standby_2->start;
-$node_standby_3->start;
+# Make sure that each standby reports back to the primary in
+# the wanted order.
+start_standby_and_wait($node_master, $node_standby_2);
+start_standby_and_wait($node_master, $node_standby_3);
 
 # Specify 2 as the number of sync standbys.
 # Check that two standbys are in 'sync' state.
@@ -94,7 +117,7 @@ standby3|3|sync),
 	'2(standby1,standby2,standby3)');
 
 # Start standby1
-$node_standby_1->start;
+start_standby_and_wait($node_master, $node_standby_1);
 
 # Create standby4 linking to master
 my $node_standby_4 = get_new_node('standby4');
@@ -126,14 +149,16 @@ standby4|1|sync),
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
-# chosen properly even in case of that setting.
-# The priority of standby2 should be 2 because it matches * first.
+# chosen properly even in case of that setting. standby1 is selected
+# as synchronous as it has the highest priority, and is followed by a
+# second standby listed first in the WAL sender array, which is
+# standby2 in this case.
 test_sync_state(
 	$node_master, qq(standby1|1|sync
 standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
-	'asterisk comes before another standby name',
+	'asterisk before another standby name',
 	'2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored


signature.asc
Description: PGP signature


Re: pgbench - add pseudo-random permutation function

2019-07-23 Thread Fabien COELHO


Hello Thomas,


Function nbits(), which was previously discussed, has been simplified by
using the function pg_popcount64().


Hi Fabien, Suzuki-san,

I am not smart enough to commit this


I'm in no hurry:-)


or judge its value for benchmarking,


I think that it is valuable given that we have non uniform random 
generators in pgbench.


I'm wondering about the manual modular_multiply implementation, which adds
quite a few lines of non-trivial code. If int128 is available on all/most
platforms, it could be removed and marked as not supported on the platforms
without it, although it would create an issue with non-regression tests.



but I have a few trivial comments on the language:

+It allows to mix the output of non uniform random functions so that

"It allows the output of non-uniform random functions to be mixed so that"


Fixed.


+ensures that a perfect permutation is applied: there are no collisions
+nor holes in the output values.

"neither collisions nor holes", or "no collisions or holes"


I choose the first.


+The function errors if size is not positive.

"raises an error"


Fixed.


+ * 24 bits mega primes from https://primes.utm.edu/lists/small/millions/

"24 bit mega primes"


Fixed.


+/* length of n binary representation */
+static int
+nbits(uint64 n)
+{
+/* set lower bits to 1 and count them */
+return pg_popcount64(compute_mask(n));
+}

I suppose you could use n == 0 ? 0 : pg_leftmost_one_pos64(n) + 1, and then...


It would create a branch, which I'm trying to avoid.


+/* return smallest mask holding n  */
+static uint64
+compute_mask(uint64 n)
+{
+n |= n >> 1;
+n |= n >> 2;
+n |= n >> 4;
+n |= n >> 8;
+n |= n >> 16;
+n |= n >> 32;
+return n;
+}

... here you could use 1 << nbits(n)) - 1.  I have no idea if doing it
that way around is better, it's just a thought and removes a few lines
of bit-swizzling code.


This would create an infinite recursion, as nbits currently uses
compute_mask. The 6 shift-and-or operations above are pretty efficient; I'd
leave it at that.


Attached a v16.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 816f9cc4c7..1747a9f66a 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -947,7 +947,7 @@ pgbench  options  d
 
   
 default_seed 
-   seed used in hash functions by default
+   seed used in hash and pseudo-random permutation functions by default
   
 
   
@@ -1443,6 +1443,13 @@ SELECT 2 AS two, 3 AS three \gset p_
pow(2.0, 10), power(2.0, 10)
1024.0
   
+  
+   pr_perm(i, size [, seed ] )
+   integer
+   pseudo-random permutation in [0,size)
+   pr_perm(0, 4)
+   0, 1, 2 or 3
+  
   
random(lb, ub)
integer
@@ -1608,6 +1615,24 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
 
   
 
+  
+Function pr_perm implements a pseudo-random permutation.
+It allows the output of non uniform random functions to be mixed so that
+values drawn more often are not trivially correlated.
+It permutes integers in [0, size) using a seed by applying rounds of
+simple invertible functions, similarly to an encryption function,
+although beware that it is not at all cryptographically secure.
+Compared to hash functions discussed above, the function
+ensures that a perfect permutation is applied: there are neither collisions
+nor holes in the output values.
+Values outside the interval are interpreted modulo the size.
+The function raises an error if size is not positive.
+If no seed is provided, :default_seed is used.
+For a given size and seed, the function is fully deterministic: if two
+permutations on the same size must not be correlated, use distinct seeds
+as outlined in the previous example about hash functions.
+  
+
   
As an example, the full definition of the built-in TPC-B-like
transaction is:
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index 17c9ec32c5..13410787d4 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,6 +19,7 @@
 #define PGBENCH_NARGS_VARIABLE	(-1)
 #define PGBENCH_NARGS_CASE		(-2)
 #define PGBENCH_NARGS_HASH		(-3)
+#define PGBENCH_NARGS_PRPERM	(-4)
 
 PgBenchExpr *expr_parse_result;
 
@@ -370,6 +371,9 @@ static const struct
 	{
 		"hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A
 	},
+	{
+		"pr_perm", PGBENCH_NARGS_PRPERM, PGBENCH_PRPERM
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
@@ -482,6 +486,19 @@ make_func(yyscan_t yyscanner, int fnumber, PgBenchExprList *args)
 			}
 			break;
 
+		/* pseudo-random permutation function with optional seed argument */
+		case PGBENCH_NARGS_PRPERM:
+			if (len < 2 || len > 3)
+expr_yyerror_more(yyscanner, "unexpected number of arguments",
+  PGBENCH_FUNCTIONS[fnumber].fname);
+
+			if (len == 2)
+			{
+PgBenchExpr *var = 

Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-23 Thread Michael Paquier
On Mon, Jul 22, 2019 at 02:40:19PM +0200, Julien Rouhaud wrote:
> Right, so I kept the long option.  Also this comment was outdated, as
> the --jobs is now just ignored with a list of indexes, so I fixed that
> too.

+   if (!parallel)
+   {
+   if (user_list == NULL)
+   {
+   /*
+* Create a dummy list with an empty string, as user requires an
+* element.
+*/
+   process_list = pg_malloc0(sizeof(SimpleStringList));
+   simple_string_list_append(process_list, "");
+   }
+   }
This part looks a bit hacky and this is needed because we don't have a
list of objects when doing a non-parallel system or database reindex.
The deal is that we just want a list with one element: the database of
the connection.  Wouldn't it be more natural to assign the database
name here using PQdb(conn)?  Then add an assertion at the top of
run_reindex_command() checking for a non-NULL name?
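
For illustration, that suggestion could look roughly like this (a sketch only, assuming the connection has already been established at this point; not the actual patch):

	if (!parallel && user_list == NULL)
	{
		/* seed the single-element list with the connection's database name */
		process_list = pg_malloc0(sizeof(SimpleStringList));
		simple_string_list_append(process_list, PQdb(conn));
	}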

> I considered this, but it would require to adapt all code that declare
> SimpleStringList stack variable (vacuumdb.c, clusterdb.c,
> createuser.c, pg_dumpall.c and pg_dump.c), so it looked like too much
> trouble to avoid 2 local variables here and 1 in vacuumdb.c.  I don't
> have a strong opinion here, so I can go for it if you prefer.

Okay.  This is a tad wider than the original patch proposal, and this
adds two lines.  So let's discard that for now and keep it simple.

>> +$node->issues_sql_like([qw(reindexdb -j2)],
>> +   qr/statement: REINDEX TABLE public.test1/,
>> +   'Global and parallel reindex will issue per-table REINDEX');
>> Would it make sense to have some tests for schemas here?
>>
>> One of my comments in [1] has not been answered.  What about
>> the decomposition of a list of schemas into a list of tables when
>> using the parallel mode?
> 
> I did that in attached v6, and added suitable regression tests.

The two tests for "reindexdb -j2" can be combined into a single call,
checking for both commands to be generated in the same output.  The
second command triggering a reindex on two schemas cannot be used to
check for the generation of both s1.t1 and s2.t2 as the ordering may
not be guaranteed.  The commands arrays also looked inconsistent with
the rest.  Attached is an updated patch with some format modifications
and the fixes for the tests.
--
Michael
diff --git a/doc/src/sgml/ref/reindexdb.sgml b/doc/src/sgml/ref/reindexdb.sgml
index 25b5a72770..477b77c6e9 100644
--- a/doc/src/sgml/ref/reindexdb.sgml
+++ b/doc/src/sgml/ref/reindexdb.sgml
@@ -166,6 +166,31 @@ PostgreSQL documentation
   
  
 
+ 
+  -j njobs
+  --jobs=njobs
+  
+   
+Execute the reindex commands in parallel by running
+njobs
+commands simultaneously.  This option reduces the time of the
+processing but it also increases the load on the database server.
+   
+   
+reindexdb will open
+njobs connections to the
+database, so make sure your 
+setting is high enough to accommodate all connections.
+   
+   
+Note that this option is ignored with the --index
+option to prevent deadlocks when processing multiple indexes from
+the same relation, and incompatible with the --system
+option.
+   
+  
+ 
+
  
   -q
   --quiet
diff --git a/src/bin/scripts/Makefile b/src/bin/scripts/Makefile
index 3cd793b134..ede665090f 100644
--- a/src/bin/scripts/Makefile
+++ b/src/bin/scripts/Makefile
@@ -29,7 +29,7 @@ dropdb: dropdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-
 dropuser: dropuser.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 clusterdb: clusterdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 vacuumdb: vacuumdb.o common.o scripts_parallel.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
-reindexdb: reindexdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
+reindexdb: reindexdb.o common.o scripts_parallel.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 pg_isready: pg_isready.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 
 install: all installdirs
diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index 219a9a9211..895dc03737 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -10,10 +10,14 @@
  */
 
 #include "postgres_fe.h"
+
+#include "catalog/pg_class_d.h"
 #include "common.h"
 #include "common/logging.h"
+#include "fe_utils/connect.h"
 #include "fe_utils/simple_list.h"
 #include "fe_utils/string_utils.h"
+#include "scripts_parallel.h"
 
 typedef enum ReindexType
 {
@@ -25,16 +29,26 @@ typedef enum ReindexType
 } ReindexType;
 
 
-static void reindex_one_database(const char *name, const char *dbname,
- ReindexType type, const char *host,
+static SimpleStringList 

Re: make \d pg_toast.foo show its indices ; and, \d toast show its main table ; and \d relkind=I show its partitions

2019-07-23 Thread Fabien COELHO




Find attached updated patches which also work against old servers.


I can't check that for sure.

* About toast table addition v7:

Patch applies cleanly, compiles, make check ok, no doc.

This addition show the main table of a toast table, which is useful.

The relnamespace oid field in pg_class appears with pg 7.3; maybe it would be
appropriate to guard against older versions with "pset.sversion >= 70300".
It seems that there are other unguarded instances in "describe.c", so
maybe this is considered too old.


Test is ok.

* About toast index v7:

Patch applies cleanly on top of previous, compiles, make check ok, no doc.

This patch simply enables an existing query on toast tables so as to show 
corresponding indices.


Test is ok.

* About toast part v7.

Patch applies cleanly, compiles, make check ok, no doc.

It shows the partition info for an index the same way it is shown for a table,
which is useful.


There are some changes in the query on older systems, which seem harmless.
The code is rather simplified because a special case is removed, which is 
a good thing.


Test is ok.

Marked as ready.

--
Fabien.




query1 followed by query2 at maximum distance vs current fixed distance

2019-07-23 Thread Wh isere
Is this possible with the current websearch_to_tsquery function?

Thanks.

Hello everyone, I am wondering if
> AROUND(N) or  is still possible? I found this thread below and the
> original post
> https://www.postgresql.org/message-id/fe93ff7e9ad79196486ada79e268%40postgrespro.ru
> mentioned the proposed feature: 'New operator AROUND(N). It matches if the
> distance between words(or maybe phrases) is less than or equal to N.'
>
> currently in tsquery_phrase(query1 tsquery, query2 tsquery, distance
> integer) the distaince is searching a fixed distance, is there way to
> search maximum distance so the search returns query1 followed by query2 up
> to a certain distance? like the AROUND(N) or  mentioned in the
> thread?
>
> Thank you!
>
>
>
> On Mon, Jul 22, 2019 at 9:13 AM Dmitry Ivanov 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to share some intermediate results. Here's what has changed:
>>
>>
>> 1. OR operator is now case-insensitive. Moreover, trailing whitespace is
>> no longer used to identify it:
>>
>> select websearch_to_tsquery('simple', 'abc or');
>>   websearch_to_tsquery
>> --
>>   'abc' & 'or'
>> (1 row)
>>
>> select websearch_to_tsquery('simple', 'abc or(def)');
>>   websearch_to_tsquery
>> --
>>   'abc' | 'def'
>> (1 row)
>>
>> select websearch_to_tsquery('simple', 'abc or!def');
>>   websearch_to_tsquery
>> --
>>   'abc' | 'def'
>> (1 row)
>>
>>
>> 2. AROUND(N) has been dropped. I hope that  operator will allow us
>> to implement it with a few lines of code.
>>
>> 3. websearch_to_tsquery() now tolerates various syntax errors, for
>> instance:
>>
>> Misused operators:
>>
>> 'abc &'
>> '| abc'
>> '<- def'
>>
>> Missing parentheses:
>>
>> 'abc & (def <-> (cat or rat'
>>
>> Other sorts of nonsense:
>>
>> 'abc &--|| def'  =>  'abc' & !!'def'
>> 'abc:def'  =>  'abc':D & 'ef'
>>
>> This, however, doesn't mean that the result will always be adequate (who
>> would have thought?). Overall, current implementation follows the GIGO
>> principle. In theory, this would allow us to use user-supplied websearch
>> strings (but see gotchas), even if they don't make much sense. Better
>> then nothing, right?
>>
>> 4. A small refactoring: I've replaced all WAIT* macros with a enum for
>> better debugging (names look much nicer in GDB). Hope this is
>> acceptable.
>>
>> 5. Finally, I've added a few more comments and tests. I haven't checked
>> the code coverage, though.
>>
>>
>> A few gotchas:
>>
>> I haven't touched gettoken_tsvector() yet. As a result, the following
>> queries produce errors:
>>
>> select websearch_to_tsquery('simple', );
>> ERROR:  syntax error in tsquery: "'"
>>
>> select websearch_to_tsquery('simple', '\');
>> ERROR:  there is no escaped character: "\"
>>
>> Maybe there's more. The question is: should we fix those, or it's fine
>> as it is? I don't have a strong opinion about this.
>>
>> --
>> Dmitry Ivanov
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>
>


Re: Possible race condition in pg_mkdir_p()?

2019-07-23 Thread Ning Yu
Hi Michael,

The patches are attached.  To make reviewing easier we split them into small
pieces:

- v1-0001-Fix-race-condition-in-pg_mkdir_p.patch: the fix to pg_mkdir_p()
  itself; basically we are following the `mkdir -p` logic (a rough sketch of
  the idea is included below);
- v1-0002-Test-concurrent-call-to-pg_mkdir_p.patch: the tests for
  pg_mkdir_p(); we can see how it fails by reverting the first patch, and a
  reproducer with initdb is also provided in the README.  As we do not know
  how to create a unit test in postgresql, we have to employ a test module
  to do the job; we are not sure if this is a proper solution;
- v1-0003-Fix-callers-of-pg_mkdir_p.patch &
  v1-0004-Fix-callers-of-MakePGDirectory.patch: fix callers of pg_mkdir_p()
  and MakePGDirectory(); tests are not provided for these changes.

MakePGDirectory() is also called in TablespaceCreateDbspace(), where EEXIST is
treated as a non-error for the parent directories; however, since it treats
EEXIST as a failure for the last level of the path, the logic there is still
correct, so we do not add a final stat() check for it.
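
For illustration, here is a minimal sketch of that "mkdir -p" approach.  This
is not the patch itself, just the idea; mkdirp_sketch() is a made-up stand-in
for pg_mkdir_p(), and detailed error handling is omitted:

#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create every missing component of "path".  EEXIST from mkdir() is not an
 * error here: either the directory already existed or a concurrent process
 * created it first, which is exactly the race we want to tolerate.
 */
static int
mkdirp_sketch(char *path, mode_t mode)
{
    char       *sep = path;
    struct stat st;

    for (;;)
    {
        sep = strchr(sep + 1, '/');
        if (sep)
            *sep = '\0';        /* temporarily terminate at this level */

        if (mkdir(path, mode) != 0 && errno != EEXIST)
        {
            if (sep)
                *sep = '/';
            return -1;          /* a real failure, e.g. EPERM or ENOTDIR */
        }

        if (sep)
            *sep = '/';         /* restore the separator and go deeper */
        else
            break;              /* the last component has been handled */
    }

    /* Final check: the full path must exist and actually be a directory. */
    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
        return -1;

    return 0;
}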

Best Regards
Ning


On Thu, Jul 18, 2019 at 4:57 PM Michael Paquier  wrote:

> On Thu, Jul 18, 2019 at 04:17:22PM +0800, Ning Yu wrote:
> > This is still wrong with the current code logic, because when the
> > statusdir is a file the errno is also EEXIST, but it can pass the check
> > here.  Even if we fix pg_mkdir_p() by following the `mkdir -p` way, the
> > errno check here is still wrong.
>
> Would you like to send a patch?
> --
> Michael
>


v1-0003-Fix-callers-of-pg_mkdir_p.patch
Description: Binary data


v1-0004-Fix-callers-of-MakePGDirectory.patch
Description: Binary data


v1-0002-Test-concurrent-call-to-pg_mkdir_p.patch
Description: Binary data


v1-0001-Fix-race-condition-in-pg_mkdir_p.patch
Description: Binary data


Re: Race conditions with TAP test for syncrep

2019-07-23 Thread Noah Misch
On Tue, Jun 18, 2019 at 09:59:08AM +0900, Michael Paquier wrote:
> On Mon, Jun 17, 2019 at 10:50:39AM -0400, Alvaro Herrera wrote:
> > I think this should be note() rather than print(), or maybe diag().  (I
> > see that we have a couple of other cases which use print() in the tap
> > tests, which I think should be note() as well.)
> 
> OK.  Let's change it for this patch.

PostgresNode uses "print" the same way.  The patch does close the intended
race conditions, and its implementation choices look fine to me.




Re: initdb recommendations

2019-07-23 Thread Peter Eisentraut
On 2019-07-22 21:16, Andrew Dunstan wrote:
> Modulo this issue, experimentation shows that adding '-A trust' to the
> line in run_build.pl where initdb is called fixes the issue. If we're
> going to rely on a buildfarm client fix, that one seems simplest.

Yes, that is the right fix.  It's what the in-tree test drivers
(pg_regress, PostgresNode.pm) do.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-23 Thread David Rowley
On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki
 wrote:
> OTOH, how about my original patch that is based on the local lock list?  I
> expect that it won't cause that significant a slowdown in the same test
> case.  If it's not satisfactory, then v6 is the best to commit.

I think we need to move beyond the versions that have been reviewed
and rejected. I don't think the reasons for their rejection will have
changed.

I've attached v7, which really is v6 with some comments adjusted and
the order of the hash_get_num_entries and hash_get_max_bucket function
calls swapped.  I think hash_get_num_entries() will return 0 most of
the time where we're calling it, so it makes sense to put the
condition that's less likely to be true first in the if condition.
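
For illustration only, the check in question has roughly this shape (this is
not the actual patch text; the threshold name and value and the rebuild step
are placeholders, and the fragment assumes the lock.c context where
LockMethodLocalHash lives):

#define LOCALLOCK_SHRINK_THRESHOLD  64  /* placeholder value */

    /*
     * After releasing all locks, shrink the backend-local lock hash table
     * if it has grown well past its default size.  The rarely-true bucket
     * test comes first so that && short-circuits and the second call is
     * skipped in the common case.
     */
    if (hash_get_max_bucket(LockMethodLocalHash) > LOCALLOCK_SHRINK_THRESHOLD &&
        hash_get_num_entries(LockMethodLocalHash) == 0)
    {
        hash_destroy(LockMethodLocalHash);
        /* recreate LockMethodLocalHash at its default size here */
    }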

I'm pretty happy with v7 now. If anyone has any objections to it,
please speak up very soon.  Otherwise, I plan to push it about this
time tomorrow.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v7.patch
Description: Binary data