Re: Bug in pg_restore with EventTrigger in parallel mode

2020-02-19 Thread Michael Paquier
On Wed, Feb 12, 2020 at 01:59:05PM -0300, Fabrízio de Royes Mello wrote:
> In parallel mode it's firing the EventTrigger, which shouldn't happen.
> Poking around it I did some tests with the attached patch, just leaving
> EventTriggers in pending_list and processing them in
> restore_toc_entries_postfork, and everything is OK; but my solution is very
> ugly, so maybe we need to invent a new RestorePass to take care of it, like
> RESTORE_PASS_ACL and RESTORE_PASS_REFRESH. I can provide a more polished
> patch if that's a good way to do it.

That sounds right, as event triggers could interact with GRANT and
REFRESH of matviews, so they should be logically last.  Looking at the
recent commit history, this would be similar to 3eb9a5e as we don't
really have a way to treat event triggers as dependency-sortable
objects.  What kind of errors did you see in this customer
environment?  Errors triggered by one or more event triggers blocking
some commands based on a tag match? 
--
Michael




Re: Autovacuum on partitioned table

2020-02-19 Thread Amit Langote
Hosoya-san,

On Thu, Feb 20, 2020 at 3:34 PM yuzuko  wrote:
> Attached is the latest patch based on the discussion in this thread.
>
> > > Yeah that is what I meant. In addition, adding partition's
> > > changes_since_analyze to its parent needs to be done recursively as
> > > the parent table could also be a partitioned table.
> >
> > That's a good point.  So, changes_since_analyze increments are
> > essentially propagated from leaf partitions to all the way up to the
> > root table, including any intermediate partitioned tables.  We'll need
> > to consider whether we should propagate only one level at a time (from
> > bottom of the tree) or update all parents up to the root, every time a
> > leaf partition is analyzed.
>
> For multi-level partitioning, all parents' changes_since_analyze will be
> updated whenever analyzing a leaf partition in this patch.
> Could you please check the patch again?

Thank you for the new patch.

I built and confirmed that the patch works.

Here are some comments:

* White-space noise in the diff (space used where tab is expected);
please check with git diff --check and fix.

* Shouldn't the names changes_tuples and m_changes_tuples be changed_tuples
and m_changed_tuples, respectively?

* Did you intend to make it so that we now report *all* inherited
stats to the stats collector, not just those for partitioned tables?
IOW, did you intend the new feature to also cover traditional
inheritance parents? I am talking about the following diff:

 /*
- * Report ANALYZE to the stats collector, too.  However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.  Reset the changes_since_analyze counter only
- * if we analyzed all columns; otherwise, there is still work for
- * auto-analyze to do.
+ * Report ANALYZE to the stats collector, too.  If the table is a
+ * partition, report changes_since_analyze of its parent because
+ * autovacuum process for partitioned tables needs it.  Reset the
+ * changes_since_analyze counter only if we analyzed all columns;
+ * otherwise, there is still work for auto-analyze to do.
  */
-if (!inh)
-pgstat_report_analyze(onerel, totalrows, totaldeadrows,
-  (va_cols == NIL));
+pgstat_report_analyze(onerel, totalrows, totaldeadrows,
+  (va_cols == NIL));

* I may be missing something, but why doesn't do_autovacuum() fetch a
partitioned table's entry from pgstat directly, instead of fetching the
entries of individual children and adding them up? That is, why do we need
to do the following:

+/*
+ * If the relation is a partitioned table, we check it using reltuples
+ * added up childrens' and changes_since_analyze tracked by stats collector.
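
(For context, fetching the parent's own cumulative counters would look
roughly like the sketch below; pgstat_fetch_stat_tabentry() is the existing
accessor, while relid and anltuples are just illustrative names for the
partitioned table's OID and the resulting counter.)

	PgStat_StatTabEntry *tabentry;

	tabentry = pgstat_fetch_stat_tabentry(relid);
	if (tabentry != NULL)
		anltuples = tabentry->changes_since_analyze;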


More later...

Thanks,
Amit




Re: Internal key management system

2020-02-19 Thread Masahiko Sawada
On Wed, 19 Feb 2020 at 09:29, Cary Huang  wrote:
>
> Hi
>
> I have tried the attached kms_v3 patch and have some comments:
>
> 1) In the comments, I think you meant hmac + iv + encrypted data instead of 
> iv + hmac + encrypted data?
>
> ---> in kmgr_wrap_key( ):
> +   /*
> +* Assemble the wrapped key. The order of the wrapped key is iv, hmac 
> and
> +* encrypted data.
> +*/

Right, will fix.

>
>
> 2) I see that create_keywrap_ctx function in kmgr_utils.c and regular cipher 
> context init will both call ossl_aes256_encrypt_init to initialise context 
> for encryption and key wrapping. In ossl_aes256_encrypt_init, the cipher 
> method always initialises to aes-256-cbc, which is ok for keywrap because 
> under CBC block cipher mode, we only had to supply one unique IV as initial 
> value. But for actual WAL and buffer encryption that will come in later, I 
> think the discussion is to use CTR block cipher mode, which requires one 
> unique IV for each block, and the sequence id from WAL and buffer can be used 
> to derive unique IV for each block for better security? I think it would be 
> better to allow caller to decide which EVP_CIPHER to initialize? 
> EVP_aes_256_cbc(), EVP_aes_256_ctr() or others?
>
> +ossl_aes256_encrypt_init(pg_cipher_ctx *ctx, uint8 *key)
> +{
> +   if (!EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, NULL, NULL))
> +   return false;
> +   if (!EVP_CIPHER_CTX_set_key_length(ctx, PG_AES256_KEY_LEN))
> +   return false;
> +   if (!EVP_EncryptInit_ex(ctx, NULL, NULL, key, NULL))
> +   return false;
> +
> +   /*
> +* Always enable padding. We don't need to check the return
> +* value as EVP_CIPHER_CTX_set_padding always returns 1.
> +*/
> +   EVP_CIPHER_CTX_set_padding(ctx, 1);
> +
> +   return true;
> +}

That seems good. We can expand it so that the caller decides the block
cipher mode of operation and the key length. I removed such code from the
previous patch to keep it simple, since currently we support only
AES-256 CBC.
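
Just to illustrate that direction, the init function could take the EVP
cipher as a parameter (a sketch only, not the patch's actual interface;
pg_cipher_ctx and PG_AES256_KEY_LEN are the existing names from the patch):

bool
ossl_aes256_encrypt_init(pg_cipher_ctx *ctx, const EVP_CIPHER *cipher,
						 uint8 *key)
{
	if (!EVP_EncryptInit_ex(ctx, cipher, NULL, NULL, NULL))
		return false;
	if (!EVP_CIPHER_CTX_set_key_length(ctx, PG_AES256_KEY_LEN))
		return false;
	if (!EVP_EncryptInit_ex(ctx, NULL, NULL, key, NULL))
		return false;

	/* Padding only matters for CBC; it is a no-op for CTR. */
	EVP_CIPHER_CTX_set_padding(ctx, 1);

	return true;
}

Callers would then pass EVP_aes_256_cbc() for key wrapping and, later,
EVP_aes_256_ctr() for WAL/buffer encryption.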

>
> 3) Following up point 2), I think we should enhance the enum to include not 
> only the Encryption algorithm and key size, but also the block cipher mode as 
> well because having all 3 pieces of information can describe exactly how KMS 
> is performing the encryption and decryption. So when we call 
> "ossl_aes256_encrypt_init", we can include the new enum as input parameter 
> and it will initialise the EVP_CIPHER_CTX with either EVP_aes_256_cbc() or 
> EVP_aes_256_ctr() for different purposes (key wrapping, or WAL 
> encryption..etc).
>
> ---> kmgr.h
> +/* Value of key_management_cipher */
> +enum
> +{
> +   KMGR_CIPHER_OFF = 0,
> +   KMGR_CIPHER_AES256
> +};
> +
>
> so it would become
> +enum
> +{
> +   KMGR_CIPHER_OFF = 0,
> +   KMGR_CIPHER_AES256_CBC = 1,
> +   KMGR_CIPHER_AES256_CTR = 2
> +};
>
> If you agree with this change, several other places will need to be changed 
> as well, such as "kmgr_cipher_string", "kmgr_cipher_value" and the initdb 
> code

KMGR_CIPHER_XXX refers to the cipher mode used by the KMS, and the KMS would
still use AES-256 CBC even if we had TDE, which would use AES-256 CTR.

After more thought, we can currently specify -e aes-256 to initdb, but that
is not actually necessary. When --cluster-passphrase-command is specified,
we enable the internal KMS and always use AES-256 CBC. Something like the -e
option will only be needed once we support TDE with several cipher options.
Thoughts?

>
> 4) the pg_wrap_key and pg_unwrap_key SQL functions defined in kmgr.c
> I tried these new SQL functions and found that the pg_unwrap_key will produce 
> the original key with 4 bytes less. This is because the result length is not 
> set long enough to accommodate the 4 byte VARHDRSZ header used by the 
> multi-type variable.
>
> the len variable in SET_VARSIZE(res, len) should include also the variable 
> header VARHDRSZ. Now it is 4 byte short so it will produce incomplete output.
>
> ---> pg_unwrap_key function in kmgr.c
> +   if (!kmgr_unwrap_key(UnwrapCtx, (uint8 *) VARDATA_ANY(data), datalen,
> +(uint8 *) VARDATA(res), 
> &len))
> +   ereport(ERROR,
> +   (errmsg("could not unwrap the given 
> secret")));
> +
> +   /*
> +* The size of unwrapped key can be smaller than the size estimated
> +* before unwrapping since the padding is removed during unwrapping.
> +*/
> +   SET_VARSIZE(res, len);
> +   PG_RETURN_BYTEA_P(res);
>
> I am only testing their functionalities with random key as input data. It is 
> currently not possible for a user to obtain the wrapped key from the server 
> in order to use these wrap/unwrap functions. I personally don't think it is a 
> good idea to expose these functions to user

Thank you for testing. I'm going to include regression tests and
documentation in the next version.
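
For reference, a minimal sketch of the sizing fix for point (4) above:
SET_VARSIZE() expects the total length including the 4-byte varlena header,
so the data length returned by the unwrap needs to be offset by VARHDRSZ.
The variable names below (maxlen in particular) are illustrative only:

	bytea	   *res;
	int			len;

	/* allocate with room for the header plus the maximum unwrapped size */
	res = (bytea *) palloc(VARHDRSZ + maxlen);

	/* ... kmgr_unwrap_key() fills VARDATA(res) and sets len ... */

	SET_VARSIZE(res, VARHDRSZ + len);
	PG_RETURN_BYTEA_P(res);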

Re: Error on failed COMMIT

2020-02-19 Thread Haumacher, Bernhard

Am 17.02.2020 um 23:12 schrieb Dave Cramer:
On Mon, 17 Feb 2020 at 13:02, Haumacher, Bernhard wrote:


... would be an appropriate solution. PostgreSQL reports the
"unsuccessful" commit through the "ROLLBACK" status code and the
driver
translates this into a Java SQLException, because this is the only
way
to communicate the "non-successfullness" from the void commit()
method.
Since the commit() was not successful, from the API point of view
this
is an error and it is fine to report this using an exception.


Well, it doesn't always report the unsuccessful commit as a rollback;
sometimes it says

"there is no transaction" depending on what happened in the transaction

even worse...


Also, when there is an error there is also a status provided by the backend.
Since this is not an error to the backend, there is no status that the
exception can provide.

feel free to choose/define one...


Doing this in a (non-default) driver setting is not ideal, because I
expect to be notified *by default* from a database (driver) if a
commit
was not successful (and since the API is void, the only notification
path is an exception). We already have a non-default option named
"autosafe", which fixes the problem somehow.


The challenge with making this the default is, as Tom noted, that many
other people don't expect this.


Nobody expects a database to report a successful commit while it
internally rolled back.


If there is code out there depending on this bug, it is fair to provide 
a backwards-compatible option to re-activate this unexpected behavior.



What many other frameworks do is have vendor specific behaviour.
Perhaps writing a proxying driver might solve the problem?


That's exactly what we do - extending our database abstraction layer to 
work around database-specific interpretations of the JDBC API.


But of course, the abstraction layer is not able to reconstruct an error
from a commit() call that has been dropped by the driver. Of course, I
could try to insert another dummy entry into a dummy table immediately
before each commit in order to again get the exception reporting that the
transaction is in rollback-only mode... but this does not sound
reasonable to me.



If we really need both behaviors ("silently ignore failed commits"
and
"notify about failed commits") I would prefer adding a
backwards-compatible option
"silently-ignore-failed-commit-due-to-auto-rollback" (since it is a
really absurd setting from my point of view, since consistency is
at risk
if this happens - the worst thing to expect from a database).


The error has been reported to the client. At this point the client is 
expected to do a rollback.


As I explained, there is no single "client"; there are several software
layers, and the error has only been reported to some of these layers,
which may decide not to communicate the problem down the road. Therefore,
the final commit() must report the problem again.


Best regards, Bernhard



Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-19 Thread Michael Paquier
On Wed, Feb 19, 2020 at 12:37:00PM -0600, David Steele wrote:
> On 2/19/20 2:13 AM, Michael Paquier wrote:
>> Please note that pg_internal.init is listed within noChecksumFiles in
>> basebackup.c, so we would miss any temporary pg_internal.init.PID if
>> we don't check after the file prefix and the base backup would issue
>> extra WARNING messages, potentially masking messages that could
>> matter.  So let's fix that as well.
> 
> Agreed.  Though, I think pg_internal.init.XX should be excluded from the
> backup as well.

Sure.  That's the intention.  pg_rewind, pg_checksums and basebackup.c
are all the things on my list.

> As far as I can see, the pg_internal.init.XX will not be cleaned up by
> PostgreSQL on startup.  I've only tested this in 9.6 so far, but I don't see
> any differences in the code since then that would lead to better behavior.
> Perhaps that's also something we should fix?

Not sure that it is worth spending cycles on that at the beginning of
recovery as when a mapping file is written its temporary entry
truncates any existing one present matching its name.

> I'm really not a fan of a blind prefix match.  I think we should stick with
> only excluding files that are created by Postgres.

Thinking more on that, excluding any backup_label with a custom suffix
worries me as it could cause a potential breakage for existing backup
solutions.  So attached is an updated patch doing things in a smarter
way: I have added to the exclusion lists the possibility to match an
entry based on its prefix, the choice being made per entry.
This solves the problem with pg_internal.init.PID and is careful not to
exclude unnecessary entries like suffixed backup labels or such.  This
leads to some extra duplication within pg_rewind, basebackup.c and
pg_checksums, but I think we can live with that, and it makes
back-patching simpler.  Refactoring is still tricky though, as it
relates to the use of paths across the backend and the frontend...
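
To illustrate the intention, checking an entry could look roughly like the
sketch below, based on the exclude_list_item struct introduced in the
attached patch; is_excluded_file() is just an illustrative name, not code
from the patch:

static bool
is_excluded_file(const char *filename, const struct exclude_list_item *items)
{
	for (int i = 0; items[i].name != NULL; i++)
	{
		/* prefix entries match anything starting with the name */
		if (items[i].match_prefix
			? strncmp(filename, items[i].name, strlen(items[i].name)) == 0
			: strcmp(filename, items[i].name) == 0)
			return true;
	}
	return false;
}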

> So backup_label.old and
> tablespace_map.old should just be added to the exclude list.  That's how we
> have it in pgBackRest.

That would be a behavior change.  We could change that on HEAD, but I
don't think that this can be back-patched as this does not cause an
actual problem.

For now, my proposal is to fix the prefix first, and then let's look
at the business with tablespaces where needed. 
--
Michael
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index dea8aab45e..c03bc0c00b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -121,6 +121,18 @@ static long long int total_checksum_failures;
 /* Do not verify checksums. */
 static bool noverify_checksums = false;
 
+/*
+ * Definition of one item part of an exclusion list, used for checksum
+ * or base backup items.  "name" is the name of the file or path to
+ * check for exclusion.  If "match_prefix" is true, any items matching the
+ * name as prefix are excluded.
+ */
+struct exclude_list_item
+{
+	const char *name;
+	bool		match_prefix;
+};
+
 /*
  * The contents of these directories are removed or recreated during server
  * start so they are not included in backups.  The directories themselves are
@@ -170,16 +182,19 @@ static const char *const excludeDirContents[] =
 /*
  * List of files excluded from backups.
  */
-static const char *const excludeFiles[] =
+static const struct exclude_list_item excludeFiles[] =
 {
 	/* Skip auto conf temporary file. */
-	PG_AUTOCONF_FILENAME ".tmp",
+	{PG_AUTOCONF_FILENAME ".tmp", false},
 
 	/* Skip current log file temporary file */
-	LOG_METAINFO_DATAFILE_TMP,
+	{LOG_METAINFO_DATAFILE_TMP, false},
 
-	/* Skip relation cache because it is rebuilt on startup */
-	RELCACHE_INIT_FILENAME,
+	/*
+	 * Skip relation cache because it is rebuilt on startup.  This includes
+	 * temporary files.
+	 */
+	{RELCACHE_INIT_FILENAME, true},
 
 	/*
 	 * If there's a backup_label or tablespace_map file, it belongs to a
@@ -187,14 +202,14 @@ static const char *const excludeFiles[] =
 	 * for this backup.  Our backup_label/tablespace_map is injected into the
 	 * tar separately.
 	 */
-	BACKUP_LABEL_FILE,
-	TABLESPACE_MAP,
+	{BACKUP_LABEL_FILE, false},
+	{TABLESPACE_MAP, false},
 
-	"postmaster.pid",
-	"postmaster.opts",
+	{"postmaster.pid", false},
+	{"postmaster.opts", false},
 
 	/* end of list */
-	NULL
+	{NULL, false}
 };
 
 /*
@@ -203,16 +218,15 @@ static const char *const excludeFiles[] =
  * Note: this list should be kept in sync with what pg_checksums.c
  * includes.
  */
-static const char *const noChecksumFiles[] = {
-	"pg_control",
-	"pg_filenode.map",
-	"pg_internal.init",
-	"PG_VERSION",
+static const struct exclude_list_item noChecksumFiles[] = {
+	{"pg_control", false},
+	{"pg_filenode.map", false},
+	{"pg_internal.init", true},
+	{"PG_VERSION", false},
 #ifdef EXEC_BACKEND
-	"config_exec_params",
-	"config_exec_params.new",
+	{"config_exec_params", true},
 #endif
-	NULL,
+	{NULL, false}
 };
 
 /*
@@

Re: Print physical file path when checksum check fails

2020-02-19 Thread Hubert Zhang
Thanks Kyotaro,

On Wed, Feb 19, 2020 at 2:02 PM Kyotaro Horiguchi 
wrote:

> At Wed, 19 Feb 2020 13:28:04 +0900, Michael Paquier 
> wrote in
> > On Wed, Feb 19, 2020 at 01:07:36PM +0900, Kyotaro Horiguchi wrote:
> > > If we also verify checksum in md layer, callback is overkill since the
> > > immediate caller consumes the event immediately.  We can signal the
> > > error by somehow returning a file tag.
> >
> > FWIW, I am wondering if there is any need for a change here and
> > complicate more the code.  If you know the block number, the page size
> > and the segment file size you can immediately guess where is the
> > damaged block.  The first information is already part of the error
>
> I have had support requests related to broken blocks several times, and
> (I think) most of *them* had a hard time locating the broken block or
> even the broken file.  I don't think it is useless at all, but I'm not
> sure it is worth the additional complexity.
>
> > damaged block.  The first information is already part of the error
> > message, and the two other ones are constants defined at
> > compile-time.
>
> Maybe you have misread the snippet?
>
> What Hubert proposed is:
>
>  "invalid page in block %u of relation file %s; zeroing out page",
> blkno, 
>
> The second format in my messages just before is:
>   "invalid page in block %u in relation %u, file \"%s\"",
>  blockNum, smgr->smgr_rnode.node.relNode, smgrfname()
>
> All of them are not compile-time constant at all.
>
>
I like your error message; the block number is relation-level, not
file-level.
I'll change the error message to
"invalid page in block %u of relation %u, file %s"


-- 
Thanks

Hubert Zhang


Re: Print physical file path when checksum check fails

2020-02-19 Thread Hubert Zhang
Thanks,

On Thu, Feb 20, 2020 at 11:36 AM Andres Freund  wrote:

> Hi,
>
> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
> > On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote:
> > > I have had support requests related to broken blocks several times, and
> > > (I think) most of *them* had a hard time locating the broken block or
> > > even the broken file.  I don't think it is useless at all, but I'm not
> > > sure it is worth the additional complexity.
> >
> > I handle stuff like that from time to time, and any reports usually
> > go down to people knowledgeable about PostgreSQL enough to know the
> > difference.  My point is that in order to know where a broken block is
> > physically located on disk, you need to know four things:
> > - The block number.
> > - The physical location of the relation.
> > - The size of the block.
> > - The length of a file segment.
> > The first two items are printed in the error message, and you can
> > guess easily the actual location (file, offset) with the two others.
>
> > I am not necessarily against improving the error message here, but
> > FWIW I think that we need to consider seriously if the code
> > complications and the maintenance cost involved are really worth
> > saving from one simple calculation.
>
> I don't think it's that simple for most.
>
> And if we e.g. ever get the undo stuff merged, it'd get more
> complicated, because they segment entirely differently. Similar, if we
> ever manage to move SLRUs into the buffer pool and checksummed, it'd
> again work differently.
>
> Nor is it architecturally appealing to handle checksums in multiple
> places above the smgr layer: For one, that requires multiple places to
> compute verify them. But also, as the way checksums are computed depends
> on the page format etc, it'll likely change for things like undo/slru -
> which then again will require additional smarts if done above the smgr
> layer.
>

So considering the undo stuff, it's better to move the checksum logic into
md.c. I will keep it there in the new patch.

On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:

> Particularly, quickly reading through the patch, I am rather unhappy
> > about the shape of the second patch which pushes down the segment
> > number knowledge into relpath.c, and creates more complication around
> > the handling of zero_damaged_pages and zero'ed pages.  -- Michael
>
> I do not like the SetZeroDamagedPageInChecksum stuff at all however.
>
>
I'm +1 on the first concern; I will delete the newly added function
`GetRelationFilePath` in relpath.c and append the segno directly to the
error message inside `VerifyPage`.

As for SetZeroDamagedPageInChecksum, the reason I introduced it is that
when smgrread() hits a damaged page that fails the checksum check, we
cannot directly error out, since the caller of smgrread() may tolerate
the error and just zero the damaged page with a warning message.
Also, we cannot rely on the GUC zero_damaged_pages alone for this branch,
since we also have the ReadBufferMode (i.e. RBM_ZERO_ON_ERROR) to control it.

To get rid of SetZeroDamagedPageInChecksum, one idea is to pass a
zero_damaged_page flag into smgrread(), something like below:
==

extern void smgrread(SMgrRelation reln, ForkNumber forknum,
                     BlockNumber blocknum, char *buffer, int flag);

===
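
For illustration, a caller such as ReadBuffer_common() could then pass the
behavior down along these lines; SMGR_READ_ZERO_ON_ERROR is a hypothetical
flag name, not an existing symbol:

/* hypothetical flag bit understood by smgrread() */
#define SMGR_READ_ZERO_ON_ERROR		0x01

	/* inside ReadBuffer_common(), roughly */
	int			flags = 0;

	if (zero_damaged_pages || mode == RBM_ZERO_ON_ERROR)
		flags |= SMGR_READ_ZERO_ON_ERROR;

	smgrread(smgr, forkNum, blockNum, (char *) bufBlock, flags);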


Any comments?



-- 
Thanks

Hubert Zhang


Re: Autovacuum on partitioned table

2020-02-19 Thread yuzuko
Hello,

I'm sorry for the delay.
Attached is the latest patch based on the discussion in this thread.

> > Yeah that is what I meant. In addition, adding partition's
> > changes_since_analyze to its parent needs to be done recursively as
> > the parent table could also be a partitioned table.
>
> That's a good point.  So, changes_since_analyze increments are
> essentially propagated from leaf partitions to all the way up to the
> root table, including any intermediate partitioned tables.  We'll need
> to consider whether we should propagate only one level at a time (from
> bottom of the tree) or update all parents up to the root, every time a
> leaf partition is analyzed.

For multi-level partitioning, all parents' changes_since_analyze will be
updated whenever analyzing a leaf partition in this patch.
Could you please check the patch again?

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center


v3_autovacuum_on_partitioned_table.patch
Description: Binary data


Re: Improve heavyweight locks instead of building new lock managers?

2020-02-19 Thread Amit Langote
Hi Andres,

On Thu, Feb 20, 2020 at 1:14 PM Andres Freund  wrote:
> Here's a *draft* patch series for removing all use of SHM_QUEUE, and
> subsequently removing SHM_QUEUE. There's a fair bit of polish needed,
> but I do think it makes the code considerably easier to read
> (particularly for predicate.c). The diffstat is nice too:
>
> 0005: Remove now unused SHMQueue*.
> 0006: Remove PROC_QUEUE.size.

Maybe you're aware, but there still seem to be places using it.  In
LOCK_DEBUG build:

lock.c: In function ‘LOCK_PRINT’:
lock.c:320:20: error: ‘PROC_QUEUE’ {aka ‘const struct PROC_QUEUE’} has
no member named ‘size’
 lock->waitProcs.size,

lock.c: In function ‘DumpLocks’:
lock.c:3906:2: error: unknown type name ‘SHM_QUEUE’; did you mean ‘SI_QUEUE’?

Fwiw, I for one, am all for removing specialized data structures when
more widely used data structures will do, especially if that
specialization is no longer relevant.

Thanks,
Amit




Re: Memory-Bounded Hash Aggregation

2020-02-19 Thread Adam Lee
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote:
> 5) Assert(nbuckets > 0);
> ... 
> This however quickly fails on this assert in BuildTupleHashTableExt (see
> backtrace1.txt):
> 
>   Assert(nbuckets > 0);
> 
> The value is computed in hash_choose_num_buckets, and there seem to be
> no protections against returning bogus values like 0. So maybe we should
> return
> 
>   Min(nbuckets, 1024)
> 
> or something like that, similarly to hash join. OTOH maybe it's simply
> due to agg_refill_hash_table() passing bogus values to the function?
> 
> 
> 6) Another thing that occurred to me was what happens to grouping sets,
> which we can't spill to disk. So I did this:
> 
>   create table t2 (a int, b int, c int);
> 
>   -- run repeatedly, until there are about 20M rows in t2 (1GB)
>   with tx as (select array_agg(a) as a, array_agg(b) as b
> from (select a, b from t order by random()) foo),
>ty as (select array_agg(a) AS a
> from (select a from t order by random()) foo)
>   insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b)
> from tx, ty;
> 
>   analyze t2;
> ...
> 
> which fails with segfault at execution time:
> 
>   tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
>   870 for (i = 0; i < tb->size; i++)
>   (gdb) bt
>   #0  tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
>   #1  0x00654e49 in agg_retrieve_hash_table_in_memory ...
> 
> That's not surprising, because 0x18 pointer is obviously bogus. I guess
> this is simply an offset 18B added to a NULL pointer?

I did some investigation; did you disable the assert when this panic
happens? If so, it's the same issue as "5) nbuckets == 0", which passes a
zero size to the allocator when creating the hash table that ends up as 0x18.

Sorry, my testing environment is acting up right now; I haven't reproduced
it yet.

-- 
Adam Lee




Re: pg_regress cleans up tablespace twice.

2020-02-19 Thread Michael Paquier
On Wed, Feb 19, 2020 at 04:06:33PM -0500, Tom Lane wrote:
> I think the existing coding dates from before we had a Perl driver for
> this, or else we had it but there were other less-functional ways to
> replace "make check" on Windows.  +1 for taking the code out of
> pg_regress.c --- but I'm not in a position to say whether the other
> part of your patch is sufficient.

Removing this code from pg_regress.c also makes sense to me.  Now, the
patch breaks "vcregress installcheck", as it is missing a change to
installcheck_internal() for the tablespace path creation.  I would
also recommend using a full path for the directory location to avoid
any potential issues if this code is refactored or moved around, as the
patch currently relies on the current path used.
--
Michael




Re: [PoC] Non-volatile WAL buffer

2020-02-19 Thread Andres Freund
Hi,

On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:
> I applied my patchset that mmap()-s WAL segments as WAL buffers to
> refs/tags/REL_12_0, and measured and analyzed its performance with
> pgbench.  Roughly speaking, When I used *SSD and ext4* to store WAL,
> it was "obviously worse" than the original REL_12_0.  VTune told me
> that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
> larger than before.

FWIW, this might largely be because of page faults. In contrast to
before we wouldn't reuse the same pages (because they've been
munmap()/mmap()ed), so the first time they're touched, we'll incur page
faults.  Did you try mmap()ing with MAP_POPULATE? It's probably also
worthwhile to try to use MAP_HUGETLB.
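
A sketch of what that would look like at the mmap() call site (Linux-specific
flags; fd stands for the opened segment's descriptor, and MAP_HUGETLB would
additionally require hugetlbfs-backed files, so it likely needs its own path):

#include <sys/mman.h>

	char	   *buf;

	buf = (char *) mmap(NULL, wal_segment_size, PROT_READ | PROT_WRITE,
						MAP_SHARED | MAP_POPULATE,	/* pre-fault all pages */
						fd, 0);
	if (buf == MAP_FAILED)
		elog(ERROR, "could not map WAL segment: %m");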

Still doubtful it's the right direction, but I'd rather have good
numbers to back me up :)

Greetings,

Andres Freund




Re: Improve heavyweight locks instead of building new lock managers?

2020-02-19 Thread Andres Freund
Hi,

Some of the discussions about improving the locking code, in particular
the group locking / deadlock detector discussion with Robert, again made
me look at lock.c.  While looking at how much work it'd be to a) remove
the PROCLOCK hashtable b) move more of the group locking logic into
lock.c, rather than smarts in deadlock.c, I got sidetracked by all the
verbose and hard to read SHM_QUEUE code.

Here's a *draft* patch series for removing all use of SHM_QUEUE, and
subsequently removing SHM_QUEUE. There's a fair bit of polish needed,
but I do think it makes the code considerably easier to read
(particularly for predicate.c). The diffstat is nice too:

 src/include/lib/ilist.h | 132 +
 src/include/replication/walsender_private.h |   3 +-
 src/include/storage/lock.h  |  10 +-
 src/include/storage/predicate_internals.h   |  49 +++-
 src/include/storage/proc.h  |  18 +--
 src/include/storage/shmem.h |  22 
 src/backend/access/transam/twophase.c   |   4 +-
 src/backend/lib/ilist.c |   8 +-
 src/backend/replication/syncrep.c   |  89 ++
 src/backend/replication/walsender.c |   2 +-
 src/backend/storage/ipc/Makefile|   1 -
 src/backend/storage/ipc/shmqueue.c  | 190 
--
 src/backend/storage/lmgr/deadlock.c |  76 +---
 src/backend/storage/lmgr/lock.c | 129 
 src/backend/storage/lmgr/predicate.c| 692 
+++
 src/backend/storage/lmgr/proc.c | 197 
+--
 16 files changed, 569 insertions(+), 1053 deletions(-)

I don't want to invest a lot of time into this if there's not some
agreement that this is a good direction to go into. So I'd appreciate a
few cursory looks before spending more time.

Overview:
0001: Add additional prev/next & detached node functions to ilist.
  I think the additional prev/next accessors are nice. I am less
  convinced by the 'detached' stuff, but it's used by some SHM_QUEUE
  users. I don't want to make the plain dlist_delete reset the node's
  prev/next pointers, it's not needed in the vast majority of cases...

0002: Use dlists instead of SHM_QUEUE for heavyweight locks.
  I mostly removed the odd reliance on PGPROC.links needing to be the
  first struct member - seems odd.

  I think we should rename PROC_QUEUE.links, elsewhere that's used for
  list membership nodes, so it's imo confusing/odd.

0003: Use dlist for syncrep queue.
  This seems fairly simple to me.

0004: Use dlists for predicate locking.
  Unfortunately pretty large. I think it's a huge improvement, but it's
  also subtle code. Wonder if there's something better to do here wrt
  OnConflict_CheckForSerializationFailure?

0005: Remove now unused SHMQueue*.
0006: Remove PROC_QUEUE.size.
  I'm not sure whether this is a good idea. I was looking primarily at
  that because I thought it'd allow us to remove PROC_QUEUE as a whole
  if we wanted to. But as PROC_QUEUE.size doesn't really seem to buy us
  much, we should perhaps just do something roughly like in the patch?
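
To give a flavor of the conversion in 0002, iterating over a lock's wait
queue would change from SHMQueueNext() loops to something like the sketch
below (assuming lock->waitProcs becomes a dlist_head and PGPROC.links a
dlist_node; illustrative only, not a hunk from the patches):

	dlist_iter	iter;

	dlist_foreach(iter, &lock->waitProcs)
	{
		PGPROC	   *proc = dlist_container(PGPROC, links, iter.cur);

		/* ... inspect or modify proc ... */
	}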

Greetings,

Andres Freund
From 777808a9f51063dc31abcef54f49e6a327efcfb8 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 19 Feb 2020 12:06:08 -0800
Subject: [PATCH v1 1/6] Add additional prev/next & detached node functions to
 ilist.

---
 src/include/lib/ilist.h | 132 +---
 src/backend/lib/ilist.c |   8 +--
 2 files changed, 113 insertions(+), 27 deletions(-)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index 98db885f6ff..1ad9ceb033f 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -255,8 +255,8 @@ typedef struct slist_mutable_iter
 extern void slist_delete(slist_head *head, slist_node *node);
 
 #ifdef ILIST_DEBUG
-extern void dlist_check(dlist_head *head);
-extern void slist_check(slist_head *head);
+extern void dlist_check(const dlist_head *head);
+extern void slist_check(const slist_head *head);
 #else
 /*
  * These seemingly useless casts to void are here to keep the compiler quiet
@@ -270,6 +270,11 @@ extern void slist_check(slist_head *head);
 
 /* doubly linked list implementation */
 
+/* prototypes dlist helpers */
+static inline void *dlist_element_off(dlist_node *node, size_t off);
+static inline void *dlist_head_element_off(dlist_head *head, size_t off);
+static inline void *dlist_tail_element_off(dlist_head *head, size_t off);
+
 /*
  * Initialize a doubly linked list.
  * Previous state will be thrown away without any cleanup.
@@ -280,13 +285,24 @@ dlist_init(dlist_head *head)
 	head->head.next = head->head.prev = &head->head;
 }
 
+/*
+ * Initialize a doubly linked element.
+ *
+ * This is only needed when dlist_node_is_detached() may be needed.
+ */
+static inline void
+dlist_

Re: Memory-Bounded Hash Aggregation

2020-02-19 Thread Adam Lee
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote:
> 4) lookup_hash_entries says
> 
>   /* check to see if we need to spill the tuple for this grouping set */
> 
> But that seems bogus, because AFAIK we can't spill tuples for grouping
> sets. So maybe this should say just "grouping"?

As I see it, it does traverse all hash sets, fill the hash table and
spill if needed, for each tuple.

The segfault is probably related to this and MixedAggregate, I'm looking
into it.

-- 
Adam Lee




Experimenting with transactional memory for SERIALIZABLE

2020-02-19 Thread Thomas Munro
Hello hackers,

Here's a *highly* experimental patch set that tries to skip the LWLock
protocol in predicate.c and use HTM[1] instead.  HTM is itself a sort
of hardware-level implementation of SSI for shared memory.  My
thinking was that if your workload already suits the optimistic nature
of SSI, perhaps it could make sense to go all-in and remove the rather
complicated pessimistic locking it's built on top of.  It falls back
to an LWLock-based path at compile time if you don't build with
--enable-htm, or at runtime if a startup test discovered that your CPU
doesn't have the Intel TSX instruction set (microarchitectures older
than Skylake, and some mobile and low power variants of current ones),
or if a hardware transaction is aborted for various reasons.

The good news is that it seems to produce correct results in simple
tests (well, some lock-held-by-me assertions can fail in an
--enable-cassert build, that's trivial to fix).  The bad news is that
it doesn't perform very well yet, and I think the reason for that is
that there are some inherently serial parts of the current design that
cause frequent conflicts.  In particular, the
FinishedSerializableTransactions list, protected by
SerializableFinishedListLock, produces a stream of conflicts, and
falls back to the traditional behaviour which involves long lock wait
queues and thereby more HTM conflicts.  I think we probably need a
more concurrent way to release SSI transactions, entirely independent
of this HTM experiment.  There's another point of serialisation at
snapshot acquisition time, which may be less avoidable; I don't know.
For much of the code that runs between snapshot acquisition and
transaction release, we really only care about touching memory
directly related to the SQL objects we touch in our SQL transaction,
and the other SQL transactions which have also touched them.  The
question is whether it's possible to get to a situation where
non-overlapping read/write sets at the SQL level don't cause conflicts
at the memory level and everything goes faster, or whether the SSI
algorithm is somehow inherently unsuitable for running on top of, erm,
SSI-like technology.  It seems like a potentially interesting research
project.

Here's my one paragraph introduction to HTM programming:  Using the
wrapper macros from my 0001 patch, you call pg_htm_begin(), and if
that returns true you're in a memory transaction and should eventually
call pg_htm_commit() or pg_htm_abort(), and if it returns false your
transaction has aborted and you need to fall back to some other
strategy.  (Retrying is also an option, but the reason codes are
complicated, and progress is not guaranteed, so introductions to the
topic often advise going straight to a fallback.)  No other thread is
allowed to see your changes to memory until you commit, and if you
abort (explicitly, due to lack of cache for uncommitted changes, due
to a serialisation conflict, or due to other internal details possibly
known only to Intel), all queued changes to memory are abandoned, and
control returns at pg_htm_begin(), a bit like the way setjmp() returns
non-locally when you call longjmp().  There are plenty of sources to
read about this stuff in detail, but for a very gentle introduction I
recommend Maurice Herlihy's 2-part talk[2][3] (the inventor of this
stuff at DEC in the early 90s), despite some strange claims he makes
about database hackers.
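
In other words, a typical call site looks something like this (a sketch;
pg_htm_begin() and pg_htm_commit() come from the 0001 patch, while
do_critical_section() and fallback_lock are placeholders):

	if (pg_htm_begin())
	{
		/* speculative path: our stores stay invisible to others until commit */
		do_critical_section();
		pg_htm_commit();
	}
	else
	{
		/* aborted, or HTM not available: fall back to pessimistic locking */
		LWLockAcquire(fallback_lock, LW_EXCLUSIVE);
		do_critical_section();
		LWLockRelease(fallback_lock);
	}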

In theory this should work on POWER and future ARM systems too, with a
bit more work, but I haven't looked into that.  There are doubtless
many other applications for this type of technology within PostgreSQL.
Perhaps some more fruitful.

[1] https://en.wikipedia.org/wiki/Transactional_memory
[2] https://www.youtube.com/watch?v=S3Fx-7avfs4
[3] https://www.youtube.com/watch?v=94ieceVxSHs
From c80a75a51ae4dd5a67ac801deefe61fdd112279a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 20 Feb 2020 11:31:45 +1300
Subject: [PATCH 1/2] Add infrastructure for hardware transactional memory.

If --enable-htm is provided to configure, add support for
using the Intel TSX RTM facilities.  Provide a very simple
abstraction that tries once but requires the caller to
provide a fallback.  In later patches, it should be
possible to map this to the builtins required for other
ISAs including ARM TME and POWER8.

Perform a runtime test at startup to check if the chip
we're running on has those instructions (eg Skylake and
later server-class chips; many mobile/laptop chips don't
have them, unfortunately).

This facility will be used by later patches to perform
hardware memory transactions, to replace LWLocks and
spinlocks optimistically, with a fallback.  If compiled
without --enable-htm, a constant value is provided so that
the HTM path can be removed by constant folding leaving
only the fallback path.

WORK IN PROGRESS: solution looking for a problem

Author: Thomas Munro
---
 configure   | 40 +++
 configure.in   

Re: backend type in log_line_prefix?

2020-02-19 Thread Andres Freund
Hi,

On 2020-02-13 09:56:38 +0100, Peter Eisentraut wrote:
> Attached is a demo patch that adds a placeholder %b for log_line_prefix (not
> in the default setting) that contains the backend type, the same that you
> see in pg_stat_activity and in the ps status.  I would have found this
> occasionally useful when analyzing logs, especially if you have a lot of
> background workers active.  Thoughts?

I wished for this several times.


> @@ -342,7 +342,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
>   statmsg = "??? process";
>   break;
>   }
> - init_ps_display(statmsg, "", "", "");
> + init_ps_display((backend_type_str = statmsg), "", "", "");
>   }

But I'm decidedly not a fan of this.

Greetings,

Andres Freund




Re: Print physical file path when checksum check fails

2020-02-19 Thread Andres Freund
Hi,

On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
> On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote:
> > I have had support requests related to broken blocks several times, and
> > (I think) most of *them* had a hard time locating the broken block or
> > even the broken file.  I don't think it is useless at all, but I'm not
> > sure it is worth the additional complexity.
>
> I handle stuff like that from time to time, and any reports usually
> go down to people knowledgeable about PostgreSQL enough to know the
> difference.  My point is that in order to know where a broken block is
> physically located on disk, you need to know four things:
> - The block number.
> - The physical location of the relation.
> - The size of the block.
> - The length of a file segment.
> The first two items are printed in the error message, and you can
> guess easily the actual location (file, offset) with the two others.

> I am not necessarily against improving the error message here, but
> FWIW I think that we need to consider seriously if the code
> complications and the maintenance cost involved are really worth
> saving from one simple calculation.

I don't think it's that simple for most.

And if we e.g. ever get the undo stuff merged, it'd get more
complicated, because they segment entirely differently. Similar, if we
ever manage to move SLRUs into the buffer pool and checksummed, it'd
again work differently.

Nor is it architecturally appealing to handle checksums in multiple
places above the smgr layer: For one, that requires multiple places to
compute verify them. But also, as the way checksums are computed depends
on the page format etc, it'll likely change for things like undo/slru -
which then again will require additional smarts if done above the smgr
layer.


> Particularly, quickly reading through the patch, I am rather unhappy
> about the shape of the second patch which pushes down the segment
> number knowledge into relpath.c, and creates more complication around
> the handling of zero_damaged_pages and zero'ed pages.  -- Michael

I do not like the SetZeroDamagedPageInChecksum stuff at all however.

Greetings,

Andres Freund




Re: Clean up some old cruft related to Windows

2020-02-19 Thread Michael Paquier
On Wed, Feb 19, 2020 at 07:02:55PM +0100, Juan José Santamaría Flecha wrote:
> Please find attached a patch for doing so. I have tried to make it more
> version neutral.

+  You can change certain build options, such as the targeted CPU
+  architecture, build type, and the selection of the SDK by using either
+  VSVARS32.bat or VsDevCmd.bat depending
+  on your Visual Studio release. All commands
+  should be run from the src\tools\msvc directory.

Both commands have different ways of doing things, and don't really
shine with their documentation, so it could save the reader some time to
add more explicit details about how to switch to 32-bit mode, like
with "VsDevCmd -arch=x86".  And I am not actually sure which
environment variable to touch when using VSVARS32.bat or
VSVARSALL.bat with MSVC <= 2017.
--
Michael




Re: Constraint's NO INHERIT option is ignored in CREATE TABLE LIKE statement

2020-02-19 Thread Amit Langote
On Thu, Feb 20, 2020 at 8:20 AM David G. Johnston
 wrote:
> On Wed, Feb 19, 2020 at 4:02 PM Tom Lane  wrote:
>> Ildar Musin  writes:
>> > My colleague Chris Travers discovered something that looks like a bug.
>> > Let's say we have a table with a constraint that is declared as NO INHERIT.
>> > ...
>> > Now when we want to make a copy of the table structure into a new table
>> > the `NO INHERIT` option is ignored.
>>
>> Hm, I agree that's a bug, since the otherwise-pretty-detailed CREATE TABLE
>> LIKE documentation makes no mention of such a difference between original
>> and cloned constraint.
>>
>> However, I'd be disinclined to back-patch, since it's barely possible
>> somebody out there is depending on the existing behavior.
>
> Not sure I agree with the premise that it is not supposed to be copied; is
> there some other object type that allows NO INHERIT that isn't copied when
> CREATE TABLE LIKE is used and check constraints are the odd ones out?

Syntax currently allows only CHECK constraints to be marked NO INHERIT.

Thanks,
Amit




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-19 Thread Andres Freund
Hi,

On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> I think till we know the real need for changing group locking, going
> in the direction of what Tom suggested to use an array of LWLocks [1]
> to address the problems in hand is a good idea.

-many

I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.


> It is not very clear to me that are we thinking to give up on Tom's
> idea [1] and change group locking even though it is not clear or at
> least nobody has proposed an idea/patch which requires that?  Or are
> we thinking that we can do what Tom suggested for relation extension
> lock and also plan to change group locking for future parallel
> operations that might require it?

What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated. And if there's concerns
about the cost of lock.c, I outlined a pretty long list of improvements
that'll help everyone, and I showed that the locking itself isn't
actually a large fraction of the scalability issues that extension has.

Regards,

Andres




Re: Constraint's NO INHERIT option is ignored in CREATE TABLE LIKE statement

2020-02-19 Thread Amit Langote
On Thu, Feb 20, 2020 at 8:02 AM Tom Lane  wrote:
> Ildar Musin  writes:
> > My colleague Chris Travers discovered something that looks like a bug.
> > Let's say we have a table with a constraint that is declared as NO INHERIT.
> > ...
> > Now when we want to make a copy of the table structure into a new table
> > the `NO INHERIT` option is ignored.
>
> Hm, I agree that's a bug, since the otherwise-pretty-detailed CREATE TABLE
> LIKE documentation makes no mention of such a difference between original
> and cloned constraint.

By the way, partitioned tables do not allow constraints that are
marked NO INHERIT.  For example,

create table b (a int check (a > 0) no inherit) partition by list (a);
ERROR:  cannot add NO INHERIT constraint to partitioned table "b"

We must ensure that partitioned tables don't accidentally end up with
one via the CREATE TABLE LIKE path.  I tested Ildar's patch and things
seem fine, but it might be better to add a test.  Attached is an updated
patch with that taken care of.

Thanks,
Amit
diff --git a/src/backend/parser/parse_utilcmd.c 
b/src/backend/parser/parse_utilcmd.c
index ee2d2b54a1..38d8849fdb 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1165,6 +1165,7 @@ transformTableLikeClause(CreateStmtContext *cxt, 
TableLikeClause *table_like_cla
n->contype = CONSTR_CHECK;
n->location = -1;
n->conname = pstrdup(ccname);
+   n->is_no_inherit = 
tupleDesc->constr->check[ccnum].ccnoinherit;
n->raw_expr = NULL;
n->cooked_expr = nodeToString(ccbin_node);
cxt->ckconstraints = lappend(cxt->ckconstraints, n);
diff --git a/src/test/regress/expected/create_table_like.out 
b/src/test/regress/expected/create_table_like.out
index 94d48582db..632a970567 100644
--- a/src/test/regress/expected/create_table_like.out
+++ b/src/test/regress/expected/create_table_like.out
@@ -405,3 +405,17 @@ DROP TYPE ctlty1;
 DROP VIEW ctlv1;
 DROP TABLE IF EXISTS ctlt4, ctlt10, ctlt11, ctlt11a, ctlt12;
 NOTICE:  table "ctlt10" does not exist, skipping
+/* LIKE must respect NO INHERIT property of constraints */
+CREATE TABLE noinh_con_copy (a int CHECK (a > 0) NO INHERIT);
+CREATE TABLE noinh_con_copy1 (LIKE noinh_con_copy INCLUDING ALL);
+\d noinh_con_copy1
+  Table "public.noinh_con_copy1"
+ Column |  Type   | Collation | Nullable | Default 
++-+---+--+-
+ a  | integer |   |  | 
+Check constraints:
+"noinh_con_copy_a_check" CHECK (a > 0) NO INHERIT
+
+-- error as partitioned tables don't allow NO INHERIT constraints
+CREATE TABLE noinh_con_copy1_parted (LIKE noinh_con_copy INCLUDING ALL) 
PARTITION BY LIST (a);
+ERROR:  cannot add NO INHERIT constraint to partitioned table 
"noinh_con_copy1_parted"
diff --git a/src/test/regress/sql/create_table_like.sql 
b/src/test/regress/sql/create_table_like.sql
index 589ee12ebc..a873812576 100644
--- a/src/test/regress/sql/create_table_like.sql
+++ b/src/test/regress/sql/create_table_like.sql
@@ -170,3 +170,10 @@ DROP SEQUENCE ctlseq1;
 DROP TYPE ctlty1;
 DROP VIEW ctlv1;
 DROP TABLE IF EXISTS ctlt4, ctlt10, ctlt11, ctlt11a, ctlt12;
+
+/* LIKE must respect NO INHERIT property of constraints */
+CREATE TABLE noinh_con_copy (a int CHECK (a > 0) NO INHERIT);
+CREATE TABLE noinh_con_copy1 (LIKE noinh_con_copy INCLUDING ALL);
+\d noinh_con_copy1
+-- error as partitioned tables don't allow NO INHERIT constraints
+CREATE TABLE noinh_con_copy1_parted (LIKE noinh_con_copy INCLUDING ALL) 
PARTITION BY LIST (a);


Re: WIP: expression evaluation improvements

2020-02-19 Thread Soumyadeep Chakraborty
Hey Andres,

> Awesome! +1. Attached 2nd version of patch rebased on master.
> (v2-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch)
>
>
>
> > Did you check whether there's any cases this fails in the tree with your
> > patch applied? The way I usually do that is by running the regression
> > tests like
> > PGOPTIONS='-cjit_above_cost=0' make -s -Otarget check-world
> >
> > (which will take a bit longer if use an optimized LLVM build, and a
> > *lot* longer if you use a debug llvm build)
>
>
>
> Great suggestion! I used:
> PGOPTIONS='-c jit_above_cost=0' gmake installcheck-world
> It all passed except a couple of logical decoding tests that never pass
> on my machine for any tree (t/006_logical_decoding.pl and
> t/010_logical_decoding_timelines.pl) and point (which seems to be failing
> even on master as of d80be6f2f). I have attached the regression.diffs,
> which captures the point failure.

I have attached the 3rd version of the patch, rebased on master. I made one
slight modification to the previous patch: PL handlers, such as that of
plsh, can be in an external library, so I account for that in modname
(earlier I naively set it to NULL). There are also some minor changes to the
comments, and I have rehashed the commit message.

Apart from running the regress tests as you suggested above, I installed
plsh
and forced JIT on the following:

CREATE FUNCTION query_plsh (x int) RETURNS text
LANGUAGE plsh
AS $$
#!/bin/sh
psql -At -c "select 1"
$$;

SELECT query_plsh(5);

and I also ran plsh's make installcheck with jit_above_cost = 0. Everything
looks good. I think this is ready for another round of review. Thanks!!

Soumyadeep


v3-0001-Resolve-PL-handler-names-for-JITed-code-instead-o.patch
Description: Binary data


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2020-02-19 Thread Peter Geoghegan
On Wed, Feb 19, 2020 at 11:16 AM Peter Geoghegan  wrote:
> On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
>  wrote:
> > Thank you for this work. I've looked through the patches and they seem
> > to be ready for commit.
> > I haven't yet read recent documentation and readme changes, so maybe
> > I'll send some more feedback tomorrow.
>
> Great.

I should add: I plan to commit the patch within the next 7 days.

I believe that the design of deduplication itself is solid; it has
many more strengths than weaknesses. It works in a way that
complements the existing approach to page splits. The optimization can
easily be turned off (and easily turned back on again).
contrib/amcheck can detect almost any possible form of corruption that
could affect a B-Tree index that has posting list tuples. I have spent
months microbenchmarking every little aspect of this patch in
isolation. I've also spent a lot of time on conventional benchmarking.

It seems quite possible that somebody won't like some aspect of the
user interface. I am more than willing to work with other contributors
on any issue in that area that comes to light. I don't see any point
in waiting for other hackers to speak up before the patch is
committed, though. Anastasia posted the first version of this patch in
August of 2015, and there have been over 30 revisions of it since the
project was revived in 2019. Everyone has been given ample opportunity
to offer input.

--
Peter Geoghegan




Re: Parallel copy

2020-02-19 Thread David Fetter
On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> This work is to parallelize the copy command and in particular "Copy
>  from 'filename' Where ;" command.

Apropos of the initial parsing issue generally, there's an interesting
approach taken here: https://github.com/robertdavidgraham/wc2

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Constraint's NO INHERIT option is ignored in CREATE TABLE LIKE statement

2020-02-19 Thread David G. Johnston
On Wed, Feb 19, 2020 at 4:02 PM Tom Lane  wrote:

> Ildar Musin  writes:
> > My colleague Chris Travers discovered something that looks like a bug.
> > Let's say we have a table with a constraint that is declared as NO
> INHERIT.
> > ...
> > Now when we want to make a copy of the table structure into a new table
> > the `NO INHERIT` option is ignored.
>
> Hm, I agree that's a bug, since the otherwise-pretty-detailed CREATE TABLE
> LIKE documentation makes no mention of such a difference between original
> and cloned constraint.
>
> However, I'd be disinclined to back-patch, since it's barely possible
> somebody out there is depending on the existing behavior.
>

Not sure I agree with the premise that it is not supposed to be copied; is
there some other object type that allows NO INHERIT that isn't copied when
CREATE TABLE LIKE is used and check constraints are the odd ones out?

Inheritance is what NO INHERIT is about, and CREATE TABLE LIKE pointedly
doesn't set up an inheritance structure.  The documentation seems OK, since
saying that NO INHERIT is ignored when inheritance is not being used seems
self-evident.  Sure, maybe some clarity here could be had, but it's not like
this comes up with any regularity.

David J.


Re: Constraint's NO INHERIT option is ignored in CREATE TABLE LIKE statement

2020-02-19 Thread Tom Lane
Ildar Musin  writes:
> My colleague Chris Travers discovered something that looks like a bug.
> Let's say we have a table with a constraint that is declared as NO INHERIT.
> ...
> Now when we want to make a copy of the table structure into a new table
> the `NO INHERIT` option is ignored.

Hm, I agree that's a bug, since the otherwise-pretty-detailed CREATE TABLE
LIKE documentation makes no mention of such a difference between original
and cloned constraint.

However, I'd be disinclined to back-patch, since it's barely possible
somebody out there is depending on the existing behavior.

regards, tom lane




Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-19 Thread Peter Geoghegan
On Wed, Feb 19, 2020 at 2:45 PM Tom Lane  wrote:
> The buildfarm would only tell you if it compiles, not whether the
> performance results are what you hoped for.

Right. Plus I think that more granular control might make more sense
in this particular instance anyway.

-- 
Peter Geoghegan




Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-19 Thread Tom Lane
Andres Freund  writes:
> On 2020-02-19 15:55:38 -0500, Tom Lane wrote:
>> Boy, I'd be pretty darn hesitant to go there, even with our new
>> expectation of C99 compilers.  What we found out when we last experimented
>> with non-static inlines was that the semantics were not very portable nor
>> desirable.  I've forgotten the details unfortunately.

> I think most of those problems were about putting extern inlines into
> headers however - not about putting an inline onto an extern within one
> translation unit only.  Given that potential fallout should be within a
> single file, and can fairly easily be fixed with adding wrappers etc, I
> think it's a fairly low risk experiment to see what the buildfarm
> thinks.

The buildfarm would only tell you if it compiles, not whether the
performance results are what you hoped for.

regards, tom lane




Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-19 Thread Andres Freund
Hi,

On 2020-02-19 15:55:38 -0500, Tom Lane wrote:
> Peter Geoghegan  writes:
> > It also inlines (in the second patch) by marking the _bt_compare
> > definition as inline, while not changing anything in nbtree.h. I
> > believe that this is portable C99 -- let's see what CF Tester thinks
> > of it.

> Boy, I'd be pretty darn hesitant to go there, even with our new
> expectation of C99 compilers.  What we found out when we last experimented
> with non-static inlines was that the semantics were not very portable nor
> desirable.  I've forgotten the details unfortunately.

I think most of those problems were about putting extern inlines into
headers however - not about putting an inline onto an extern within one
translation unit only.  Given that potential fallout should be within a
single file, and can fairly easily be fixed with adding wrappers etc, I
think it's a fairly low risk experiment to see what the buildfarm
thinks.

Greetings,

Andres Freund




bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-02-19 Thread Alex Malek
Hello Postgres Hackers -

We are having a reoccurring issue on 2 of our replicas where replication
stops due to this message:
"incorrect resource manager data checksum in record at ..."
This has been occurring on average once every 1 to 2 weeks during large
data imports (100s of GBs being written)
on one of two replicas.
Fixing the issue has been relatively straightforward: shut down the replica,
remove the bad WAL file, restart the replica, and
the good WAL file is retrieved from the master.
We are doing streaming replication using replication slots.
However, twice now the master had already removed the WAL file, so the file
had to be retrieved from the WAL archive.

The WAL log directories on the master and the replicas are on ZFS file
systems.
All servers are running RHEL 7.7 (Maipo)
PostgreSQL 10.11
ZFS v0.7.13-1

The issue seems similar to
https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
 and to https://github.com/timescale/timescaledb/issues/1443

One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
sees our array as a single device.

Right before the issue started we did some upgrades and altered some
postgres configs and ZFS settings.
We have been slowly rolling back changes but so far the issue continues.

Some interesting data points while debugging:
We had lowered the ZFS recordsize from 128K to 32K and for that week the
issue started happening every other day.
Using xxd and diff we compared "good" and "bad" wal files and the
differences were not random bad bytes.

The bad file either had a block of zeros that was not in the good file at
that position, or other data.  Occasionally the bad data has contained
legible strings not in the good file at that position. At least one of
those exact strings has existed elsewhere in the files.
However I am not sure if that is the case for all of them.

This made me think that maybe there was an issue with WAL file recycling and
ZFS under heavy load, so we tried lowering
min_wal_size in order to "discourage" WAL file recycling, but my
understanding is that a low value only discourages recycling and it will still
happen (unless wal_recycle is turned off, which is possible in PostgreSQL 12).
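As an aside, on PostgreSQL 12 or later recycling can simply be switched off;
a minimal illustrative setting (not an option on the 10.11 servers above
without an upgrade):

wal_recycle = off    # postgresql.conf, PostgreSQL 12+ only; new WAL segments
                     # are created instead of renaming old ones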

There is a third replica where this bug has not (yet?) surfaced.
This leads me to guess the bad data does not originate on the master.
This replica is older than the other replicas, with slower CPUs, less RAM, and
a WAL disk array on spinning disks.
The OS, version of Postgres, and version of ZFS are the same as the other
replicas.
This replica is not using a replication slot.
This replica does not serve users so load/contention is much lower than the
others.
The other replicas often have 100% utilization of the disk array that
houses the (non-wal) data.

Any insight into the source of this bug or how to address it?

Since the master has a good copy of the WAL file, can the replica
re-request  the file from the master? Or from the archive?

When using replication slots, what circumstances would cause the master to
not save the WAL file?
(I can't remember if it always had the next wal file or the one after that)

Thanks in advance,
Alex Malek


Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-19 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Feb 19, 2020 at 12:55 PM Tom Lane  wrote:
>> Boy, I'd be pretty darn hesitant to go there, even with our new
>> expectation of C99 compilers.  What we found out when we last experimented
>> with non-static inlines was that the semantics were not very portable nor
>> desirable.  I've forgotten the details unfortunately.

> My original approach to inlining was to alter the nbtsearch.c
> _bt_compare() callers (the majority) to call _bt_compare_inl(). This
> function matches our current _bt_compare() function, except it's a
> static inline. A "new" function, _bt_compare(), is also added. That's a
> shim function that simply calls _bt_compare_inl().

Yeah, that's pretty much the approach we concluded was necessary
for portable results.

regards, tom lane




Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-19 Thread Peter Geoghegan
On Wed, Feb 19, 2020 at 12:55 PM Tom Lane  wrote:
> Boy, I'd be pretty darn hesitant to go there, even with our new
> expectation of C99 compilers.  What we found out when we last experimented
> with non-static inlines was that the semantics were not very portable nor
> desirable.  I've forgotten the details unfortunately.

My original approach to inlining was to alter the nbtsearch.c
_bt_compare() callers (the majority) to call _bt_compare_inl(). This
function matches our current _bt_compare() function, except it's a
static inline. A "new" function, _bt_compare(), is also added. That's a
shim function that simply calls _bt_compare_inl().
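As a toy illustration of that arrangement (simplified, hypothetical names and
types -- not the real nbtree code), the pattern is:

#include <stdint.h>

static inline int32_t
my_compare_inl(int32_t a, int32_t b)
{
    /* "hot" comparison; callers in this file call it directly, so the
     * compiler is free to inline it */
    return (a > b) - (a < b);
}

/* extern shim under the original name, for callers in other files */
int32_t
my_compare(int32_t a, int32_t b)
{
    return my_compare_inl(a, b);
}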

This earlier approach now seems to work better than the approach I took
in v3. In fact, my overnight testing shows that v3 was a minor loss
for performance. I guess we can scrap the non-static inline thing
right away.

> But why do we need
> this, and what exactly are you hoping the compiler will do with it?

Well, the patch's approach to inlining prior to v3 was kind of ugly,
and it would have been nice to not have to do it that way. That's all.
Some further refinement is probably possible.

More generally, my goal is to realize a pretty tangible performance
benefit from avoiding a pipeline stall -- this work was driven by a
complaint from Andres. I haven't really explained the reason why the
inlining matters here, and there are a few things that need to be
weighed. I'll have to do some kind of microarchitectural analysis
before proceeding with commit. This is still unsettled.

--
Peter Geoghegan




Re: Yet another fast GiST build

2020-02-19 Thread Thomas Munro
On Wed, Feb 19, 2020 at 8:00 PM Thomas Munro  wrote:
> Could this function please have a comment that explains why it works?
> I mean, just a breadcrumb... the name of the technique or something...
> so that uninitiated hackers can google their way to a clue (is it
> "Morton encoding"?)

Ok I think I get it now after doing some homework.

1.  We expect floats to be in IEEE format, and the sort order of IEEE
floats is mostly correlated to the binary sort order of the bits
reinterpreted as an int.  It isn't in some special cases, but for this
use case we don't really care about that, we're just trying to
encourage locality.
2.  We generate a Morton code that interleaves the bits of N integers
to produce a single integer that preserves locality: things that were
close in the N dimensional space are close in the resulting integer.

Cool.

+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+	/* esteblish order between x and y */
+
+	return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}

This example code from the documentation looks wrong, probably missing
eg int64 z1 = DatumGetInt64(x).
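For the archives, here is a self-contained sketch of the two ideas above --
mapping an IEEE float to a comparably-ordered unsigned integer, and 2-D Morton
(z-order) interleaving.  It is illustrative only, not taken from the patch:

#include <stdint.h>
#include <string.h>

/* map a float to a uint32 whose unsigned order matches float order
 * (NaN ignored), per point 1 */
static uint32_t
float_to_ordered(float f)
{
    uint32_t    u;

    memcpy(&u, &f, sizeof(u));
    return (u & 0x80000000U) ? ~u : (u | 0x80000000U);
}

/* spread 32 bits out so there is a zero bit between each of them */
static uint64_t
spread_bits(uint32_t v)
{
    uint64_t    x = v;

    x = (x | (x << 16)) & UINT64_C(0x0000FFFF0000FFFF);
    x = (x | (x << 8))  & UINT64_C(0x00FF00FF00FF00FF);
    x = (x | (x << 4))  & UINT64_C(0x0F0F0F0F0F0F0F0F);
    x = (x | (x << 2))  & UINT64_C(0x3333333333333333);
    x = (x | (x << 1))  & UINT64_C(0x5555555555555555);
    return x;
}

/* Morton code for point (x, y), per point 2: bit i of x lands at
 * position 2*i, bit i of y at position 2*i + 1 */
static uint64_t
morton2(float x, float y)
{
    return spread_bits(float_to_ordered(x)) |
           (spread_bits(float_to_ordered(y)) << 1);
}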




Re: pg_regress cleans up tablespace twice.

2020-02-19 Thread Tom Lane
Kyotaro Horiguchi  writes:
> But in the first place it comes from [1] and the comment says:

>> * XXX it would be better if pg_regress.c had nothing at all to do with
>> * testtablespace, and this were handled by a .BAT file or similar on
>> * Windows.  See pgsql-hackers discussion of 2008-01-18.

> Is there any reason not to do that in vcregress.pl?  I think the
> commands other than 'check' don't need this.

I think the existing coding dates from before we had a Perl driver for
this, or else we had it but there were other less-functional ways to
replace "make check" on Windows.  +1 for taking the code out of
pg_regress.c --- but I'm not in a position to say whether the other
part of your patch is sufficient.

regards, tom lane




Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-19 Thread Tom Lane
Peter Geoghegan  writes:
> It also inlines (in the second patch) by marking the _bt_compare
> definition as inline, while not changing anything in nbtree.h. I
> believe that this is portable C99 -- let's see what CF Tester thinks
> of it.

Boy, I'd be pretty darn hesitant to go there, even with our new
expectation of C99 compilers.  What we found out when we last experimented
with non-static inlines was that the semantics were not very portable nor
desirable.  I've forgotten the details unfortunately.  But why do we need
this, and what exactly are you hoping the compiler will do with it?

regards, tom lane




Re: Increase psql's password buffer size

2020-02-19 Thread Tom Lane
Fujii Masao  writes:
> Attached is the patch that Nathan proposed at [1] and I think that
> it's worth applying. I'd like to add this to next CommitFest.
> Thought?

I can't get excited about this in the least.  For any "normal" use of
passwords, 100 bytes is surely far more than sufficient.  Furthermore,
if there is someone out there for whom it isn't sufficient, they're not
going to want to build custom versions of Postgres to change it.

If we think that longer passwords are actually a thing to be concerned
about, then what we need to do is change all these places to support
expansible buffers.  I'm not taking a position on whether that's worth
the trouble ... but I do take the position that just inserting a
#define is a waste of time.

regards, tom lane




Re: error context for vacuum to include block number

2020-02-19 Thread Justin Pryzby
Rebased on top of 007491979461ff10d487e1da9bcc87f2fd834f26

Also, I was thinking that lazy_scan_heap doesn't need to do this:

+   /* Pop the error context stack while calling vacuum */
+   error_context_stack = errcallback.previous;
...
+   /* Set the error context while continuing heap scan */
+   error_context_stack = &errcallback;

It seems to me that's not actually necessary, since lazy_vacuum_heap will just
*push* a context handler onto the stack, and then pop it back off.  We don't
need to pop our context beforehand.  We also vacuum the FSM, and one might say
that we shouldn't report "...while scanning block number..." if it was
"vacuuming FSM" instead of "scanning heap".  To that I would reply that either
vacuuming the FSM could be considered part of scanning the heap, or we should
add an additional callback for that, which is only slightly awkward since we'd
need to add a PROGRESS enum for which we don't actually report progress
(or stop using that enum).
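For reference, the push/pop idiom in question looks roughly like this (a
schematic sketch with made-up names, not the patch itself); because every user
of error_context_stack saves and restores the previous pointer, a caller's
entry survives a callee that does the same thing:

#include "postgres.h"

static void
my_error_callback(void *arg)
{
    errcontext("while doing the step described by \"%s\"", (const char *) arg);
}

static void
do_guarded_work(void)
{
    ErrorContextCallback errcallback;

    errcallback.callback = my_error_callback;
    errcallback.arg = (void *) "example step";
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* ... work that might ereport() ... */

    error_context_stack = errcallback.previous;
}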

I tested using variations on this, which work as expected: the context is
correct during vacuum while scanning and after vacuum while scanning:

template1=# SET statement_timeout=0; SET maintenance_work_mem='1MB';
DROP TABLE tt; CREATE UNLOGGED TABLE tt(i int);
INSERT INTO tt SELECT generate_series(1,39);
CREATE INDEX ON tt(i); UPDATE tt SET i=i-1;
SET statement_timeout=1222; VACUUM VERBOSE tt;

From 91158171f75cd20e69b18843dd3b6525961e4e8b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 12 Dec 2019 20:54:37 -0600
Subject: [PATCH v21 1/2] vacuum errcontext to show block being processed

Discussion:
https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com
---
 src/backend/access/heap/vacuumlazy.c | 130 ++-
 1 file changed, 129 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 03c43ef..9e69294 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -292,6 +292,14 @@ typedef struct LVRelStats
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+typedef struct
+{
+	char 		*relnamespace;
+	char		*relname;
+	char 		*indname;
+	BlockNumber blkno;	/* used only for heap operations */
+	int			phase;	/* Reusing same enums as for progress reporting */
+} vacuum_error_callback_arg;
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -361,6 +369,9 @@ static void end_parallel_vacuum(Relation *Irel, IndexBulkDeleteResult **stats,
 LVParallelState *lps, int nindexes);
 static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
 static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static void vacuum_error_callback(void *arg);
+static void init_vacuum_error_callback(ErrorContextCallback *errcallback,
+		vacuum_error_callback_arg *errcbarg, Relation onerel, int phase);
 
 
 /*
@@ -724,6 +735,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	ErrorContextCallback errcallback;
+	vacuum_error_callback_arg errcbarg;
 
 	pg_rusage_init(&ru0);
 
@@ -870,6 +883,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	else
 		skipping_blocks = false;
 
+	/* Setup error traceback support for ereport() */
+	init_vacuum_error_callback(&errcallback, &errcbarg, onerel,
+	PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+	error_context_stack = &errcallback;
+
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
@@ -891,6 +909,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
+		errcbarg.blkno = blkno;
+
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 		if (blkno == next_unskippable_block)
@@ -987,6 +1007,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 vmbuffer = InvalidBuffer;
 			}
 
+			/* Pop the error context stack while calling vacuum */
+			error_context_stack = errcallback.previous;
+
 			/* Work on all the indexes, then the heap */
 			lazy_vacuum_all_indexes(onerel, Irel, indstats,
 	vacrelstats, lps, nindexes);
@@ -1011,6 +1034,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* Report that we are once again scanning the heap */
 			pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 		 PROGRESS_VACUUM_PHASE_SCAN_HEAP);
+
+			/* Set the error context while continuing heap scan */
+			error_context_stack = &errcallback;
 		}
 
 		/*
@@ -1597,6 +1623,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
 	}
 
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+
 	/* r

Re: Unicode normalization SQL functions

2020-02-19 Thread Peter Eisentraut

On 2020-02-17 20:08, Daniel Verite wrote:

Concerning execution speed, there's an excessive CPU usage when
normalizing into NFC or NFKC. Looking at pre-existing code, it looks
like recompose_code() in unicode_norm.c looping over the
UnicodeDecompMain array might be very costly.


Yes, this is a known issue and I think room for future optimization work.


Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.


That would require linking libpq to ICU (for SASLprep), and in general 
would either make ICU required or require maintaining multiple 
implementations.  I don't think we're there yet.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: explain HashAggregate to report bucket and memory stats

2020-02-19 Thread Justin Pryzby
On Sun, Feb 16, 2020 at 11:53:07AM -0600, Justin Pryzby wrote:
> Updated:
> 
>  . remove from explain analyze those tests which would display sort
>Memory/Disk.  Oops.

 . Rebased on top of 5b618e1f48aecc66e3a9f60289491da520faae19
 . Updated to avoid sort's Disk output, for real this time.
 . And fixed a syntax error in an intermediate commit.

-- 
Justin
From 19ad84696710de7b5ac19e1124856701697d28c0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 15 Feb 2020 12:03:11 -0600
Subject: [PATCH v4 1/7] Run some existing tests with explain (ANALYZE)..

..in a separate, earlier patch, to better show what bits are added by later
patches for hashtable instrumentation.
---
 src/test/regress/expected/groupingsets.out| 57 +++---
 src/test/regress/expected/select_parallel.out | 20 
 src/test/regress/expected/subselect.out   | 69 +++
 src/test/regress/expected/union.out   | 43 +
 src/test/regress/sql/groupingsets.sql | 12 ++---
 src/test/regress/sql/select_parallel.sql  |  4 +-
 src/test/regress/sql/subselect.sql| 25 ++
 src/test/regress/sql/union.sql|  4 +-
 8 files changed, 166 insertions(+), 68 deletions(-)

diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c..95d619c 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -458,16 +458,17 @@ ERROR:  aggregate functions are not allowed in FROM clause of their own query le
 LINE 3:lateral (select a, b, sum(v.x) from gstest_data(v.x) ...
  ^
 -- min max optimization should still work with GROUP BY ()
-explain (costs off)
+explain (costs off, timing off, summary off, analyze)
   select min(unique1) from tenk1 GROUP BY ();
- QUERY PLAN 
-
- Result
+ QUERY PLAN 
+
+ Result (actual rows=1 loops=1)
InitPlan 1 (returns $0)
- ->  Limit
-   ->  Index Only Scan using tenk1_unique1 on tenk1
+ ->  Limit (actual rows=1 loops=1)
+   ->  Index Only Scan using tenk1_unique1 on tenk1 (actual rows=1 loops=1)
  Index Cond: (unique1 IS NOT NULL)
-(5 rows)
+ Heap Fetches: 0
+(6 rows)
 
 -- Views with GROUPING SET queries
 CREATE VIEW gstest_view AS select a, b, grouping(a,b), sum(c), count(*), max(c)
@@ -1126,14 +1127,14 @@ select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a)
 ---+---+-+---
 (0 rows)
 
-explain (costs off)
+explain (costs off, timing off, summary off, analyze)
   select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a);
-   QUERY PLAN   
-
- HashAggregate
+   QUERY PLAN   
+
+ HashAggregate (actual rows=0 loops=1)
Hash Key: a, b
Hash Key: a
-   ->  Seq Scan on gstest_empty
+   ->  Seq Scan on gstest_empty (actual rows=0 loops=1)
 (4 rows)
 
 select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),());
@@ -1150,16 +1151,16 @@ select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),()
|   | | 0
 (3 rows)
 
-explain (costs off)
+explain (costs off, timing off, summary off, analyze)
   select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),(),(),());
-   QUERY PLAN   
-
- MixedAggregate
+   QUERY PLAN   
+
+ MixedAggregate (actual rows=3 loops=1)
Hash Key: a, b
Group Key: ()
Group Key: ()
Group Key: ()
-   ->  Seq Scan on gstest_empty
+   ->  Seq Scan on gstest_empty (actual rows=0 loops=1)
 (6 rows)
 
 select sum(v), count(*) from gstest_empty group by grouping sets ((),(),());
@@ -1170,15 +1171,15 @@ select sum(v), count(*) from gstest_empty group by grouping sets ((),(),());
  | 0
 (3 rows)
 
-explain (costs off)
+explain (costs off, timing off, summary off, analyze)
   select sum(v), count(*) from gstest_empty group by grouping sets ((),(),());
-   QUERY PLAN   
-
- Aggregate
+   QUERY PLAN   
+
+ Aggregate (actual rows=3 loops=1)
Group Key: ()
Group Key: ()
Group Key: ()
-   ->  Seq Scan on gstest_empty
+   ->  Seq Scan on gstest_empty (actual rows=0 loops=1)
 (5 rows)
 
 -- check that functionally dependent cols are not nulled
@@ -1193,16 +1194,16 @@ sele

Re: Unicode normalization SQL functions

2020-02-19 Thread Peter Eisentraut

On 2020-02-17 20:14, Daniel Verite wrote:

The comment in full says:

/*  
  * unicode_normalize - Normalize a Unicode string to the specified form.
  * 
  * The input is a 0-terminated array of codepoints.
  * 
  * In frontend, returns a 0-terminated array of codepoints, allocated with
  * malloc. Or NULL if we run out of memory. In frontend, the returned  
  * string is palloc'd instead, and OOM is reported with ereport(). 
  */

It looks like the 2nd occurrence of "frontend" was meant to be "backend".


This was a pre-existing problem, so I have fixed that separately.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Resolving the python 2 -> python 3 mess

2020-02-19 Thread Tom Lane
Peter Eisentraut  writes:
> Your scheme appears to center around the assumption that people will 
> want to port their functions at the same time as not building plpython2u 
> anymore.

Not really; use of the proposed porting infrastructure is the same whether
plpython2u still works or not.  You end up with functions that are labeled
plpython3u, so what bare "plpythonu" means is not a factor.

It is true that as this patch is written, switching of plpythonu to
point at Python 3 rather than 2 is coupled to disabling plpython2u.
If we'd have gotten this done a year or two ago, I'd have made it more
complex to allow more separation there.  But events have passed us by:
the info we are getting from packagers is that Python 2 is getting
dropped *this year*, not in some distant future.  So I think that allowing
the plpythonu redefinition to be separate is no longer of any great value,
and not worth extra complication for.  People are just going to be
shipping v13 with both things changed in any case.

If we wanted to do something to help people port their functions in
advance of the big changeover, the thing to do would be to back-patch
the proposed convert_python3 extension into existing branches.

regards, tom lane




Re: Resolving the python 2 -> python 3 mess

2020-02-19 Thread Peter Eisentraut

On 2020-02-19 05:39, Tom Lane wrote:

After thinking about this awhile longer, I'm starting to believe
we should do some of each.  That is, the stub replacement for
plpython2.so should redirect "plpythonu" functions to plpython3.so,
but throw errors for "plpython2u" functions.


I'm not sure these complications are worth it.  They don't match 
anything that is done in other Python 2/3 porting schemes.  I think 
there should just be an option "plpython is: {2|3|don't build it at 
all}".  Then packagers can match this to what their plan for 
/usr/bin/python* is -- which appears to be different everywhere.


Your scheme appears to center around the assumption that people will 
want to port their functions at the same time as not building plpython2u 
anymore.  This would defeat testing functions before and after in the 
same installation.  I think the decisions of what plpythonu points to 
and which variants are built at all should be separate.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory-Bounded Hash Aggregation

2020-02-19 Thread Tomas Vondra

Hi,

I've started reviewing the 20200218 version of the patch. In general it
seems fine, but I have a couple minor comments and two crashes.


1) explain.c currently does this:

I wonder if we could show something for plain explain (without analyze).
At least the initial estimate of partitions, etc. I know not showing
those details until after execution is what e.g. sort does, but I find
it a bit annoying.

A related comment is that maybe this should report also the initial
number of partitions, not just the total number. With just the total
it's impossible to say if there were any repartitions, etc.


2) The ExecBuildAggTrans comment should probably explain "spilled".


3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to
just add a new flag to the agg_* structs instead? I haven't tried hacking
this, so maybe it's a silly idea.


4) lookup_hash_entries says

  /* check to see if we need to spill the tuple for this grouping set */

But that seems bogus, because AFAIK we can't spill tuples for grouping
sets. So maybe this should say just "grouping"?


5) Assert(nbuckets > 0);

I was curious what happens in case of extreme skew, when a lot of (or all)
rows consistently fall into a single partition. So I did this:

  create table t (a int, b real);

  insert into t select i, random()
from generate_series(-20, 20) s(i)
   where mod(hashint4(i), 16384) = 0;

  analyze t;

  set work_mem = '64kB';
  set max_parallel_workers_per_gather = 0;
  set enable_sort = 0;

  explain select a, sum(b) from t group by a;

                             QUERY PLAN
  ---------------------------------------------------------------
   HashAggregate  (cost=23864.26..31088.52 rows=244631 width=8)
     Group Key: a
     ->  Seq Scan on t  (cost=0.00..3529.31 rows=244631 width=8)
  (3 rows)

This however quickly fails on this assert in BuildTupleHashTableExt (see
backtrace1.txt):

  Assert(nbuckets > 0);

The value is computed in hash_choose_num_buckets, and there seem to be
no protections against returning bogus values like 0. So maybe we should
return

  Max(nbuckets, 1024)

or something like that, similarly to hash join. OTOH maybe it's simply
due to agg_refill_hash_table() passing bogus values to the function?


6) Another thing that occurred to me was what happens to grouping sets,
which we can't spill to disk. So I did this:

  create table t2 (a int, b int, c int);

  -- run repeatedly, until there are about 20M rows in t2 (1GB)
  with tx as (select array_agg(a) as a, array_agg(b) as b
from (select a, b from t order by random()) foo),
   ty as (select array_agg(a) AS a
from (select a from t order by random()) foo)
  insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b)
from tx, ty;

  analyze t2;

This produces a table with two independent columns, skewed the same as
the column t.a. I don't know which of these actually matters, considering
grouping sets don't spill, so maybe the independence is sufficient and
the skew may be irrelevant?

And then do this:

  set work_mem = '200MB';
  set max_parallel_workers_per_gather = 0;
  set enable_sort = 0;

  explain select a, b, sum(c) from t2 group by cube (a,b);;

                              QUERY PLAN
  -------------------------------------------------------------------
   MixedAggregate  (cost=0.00..833064.27 rows=2756495 width=16)
     Hash Key: a, b
     Hash Key: a
     Hash Key: b
     Group Key: ()
     ->  Seq Scan on t2  (cost=0.00..350484.44 rows=22750744 width=12)
  (6 rows)

which fails with segfault at execution time:

  tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
  870   for (i = 0; i < tb->size; i++)
  (gdb) bt
  #0  tuplehash_start_iterate (tb=0x18, iter=iter@entry=0x2349340)
  #1  0x00654e49 in agg_retrieve_hash_table_in_memory ...

That's not surprising, because 0x18 pointer is obviously bogus. I guess
this is simply an offset 18B added to a NULL pointer?

Disabling hashagg spill (setting both GUCs to off) makes no difference,
but on master it fails like this:

  ERROR:  out of memory
  DETAIL:  Failed on request of size 3221225472 in memory context 
"ExecutorState".

which is annoying, but expected with an under-estimate and hashagg. And
much better than just crashing the whole cluster.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
#0  0x7308e8e06e35 in raise () from /lib64/libc.so.6
#1  0x7308e8df1895 in abort () from /lib64/libc.so.6
#2  0x008c2aeb in ExceptionalCondition 
(conditionName=conditionName@entry=0x9f8583 "nbuckets > 0", 
errorType=errorType@entry=0x915029 "FailedAssertion", 
fileName=fileName@entry=0x9f84b7 "execGrouping.c", 
lineNumber=lineNumber@entry=174) at assert.c:67
#3  0x0063f704 in BuildTupleHashTableExt 
(parent=parent@entry=0x2344538, inputDesc=, n

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2020-02-19 Thread Peter Geoghegan
On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
 wrote:
> Thank you for this work. I've looked through the patches and they seem
> to be ready for commit.
> I haven't yet read recent documentation and readme changes, so maybe
> I'll send some more feedback tomorrow.

Great.

> Is there any specific reason, why we need separate btnameequalimage,
> bpchar_equalimage and bttextequalimage functions?
> As far as I see, they have the same implementation.

Not really. This approach allows us to reverse the decision to enable
deduplication in a point release, which is theoretically useful. OTOH,
if that's so important, why not have many more support function 4
implementations (one per opclass)?

I suspect that we would just disable deduplication in a hard-coded
fashion if we needed to disable it due to some issue that transpired.
For example, we could do this by modifying _bt_allequalimage() itself.

> I would simply move it to debug level for all cases. Since from user's
> perspective it doesn't differ that much from the case where
> deduplication is applicable in general, but not very efficient due to
> data distribution.

I was more concerned about cases where the user would really like to
use deduplication, but wants to make sure that it gets used. And
doesn't want to install pageinspect to find out.

> I also noticed that this is not consistent with ALTER index. For
> example, alter index idx_n set (deduplicate_items =true); won't show any
> message about deduplication.

But that's a change in the user's preference. Not a change in whether
or not it's safe in principle.

> In my opinion, this message is too specific for default behavior. It
> exposes internal details without explanation and may look to user like
> something went wrong.

You're probably right about that. I just wish that there was some way
of showing the same information that was discoverable, and didn't
require the use of pageinspect. If I make it a DEBUG1 message, then it
cannot really be documented.

-- 
Peter Geoghegan




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-19 Thread David Steele

On 1/31/20 3:59 AM, Michael Banck wrote:

Hi,

Am Freitag, den 31.01.2020, 13:53 +0900 schrieb Michael Paquier:

On Thu, Jan 30, 2020 at 06:11:22PM +0100, Michael Banck wrote:
Having a past tablespace version left
around after an upgrade is a pilot error in my opinion because
pg_upgrade generates a script to cleanup past tablespaces, no?  So
your patch does not look like a good idea to me.


Not sure I agree with it, but if that (i.e. after pg_upgrade in copy
mode, you have no business to use the old cluster as well as the new
one) is project policy, fair enough.


I don't see how this is project policy.  The directories for other 
versions of Postgres should be ignored as they are in other utilities, 
e.g. pg_basebackup.



However, Postgres does not disallow to just create tablespaces in the
same location from two different versions, so you don't need the
pg_upgade scenario to get into this (pg_checksums checking the wrong
cluster's data) problem:


Exactly.

Regards,
--
-David
da...@pgmasters.net




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-19 Thread David Steele

On 2/19/20 2:13 AM, Michael Paquier wrote:

On Fri, Jan 31, 2020 at 05:39:36PM +0900, Kyotaro Horiguchi wrote:

At Fri, 31 Jan 2020 17:30:43 +0900 (JST), Kyotaro Horiguchi 
 wrote in

I don't think that is a problem right away, of course. It looks good
to me except for the possible excessive exclusion.  So, I don't object
it if we don't mind that.


That's a bit wrong.  All the discussion is only on excludeFiles.  I
think we should refrain from letting more files match to
noChecksumFiles.


I am not sure what you are saying here.  Are you saying that we should
not use a prefix matching for that part?  Or are you saying that we
should not touch this list at all?


Perhaps he is saying that if it is already excluded it won't be 
checksummed.  So, if pg_internal.init* is excluded from the backup, that 
is all that is needed.  If so, I agree.  This might not help 
pg_verify_checksums, though, except that it should be applying the same 
rules.



Please note that pg_internal.init is listed within noChecksumFiles in
basebackup.c, so we would miss any temporary pg_internal.init.PID if
we don't check after the file prefix and the base backup would issue
extra WARNING messages, potentially masking messages that could
matter.  So let's fix that as well.


Agreed.  Though, I think pg_internal.init.XX should be excluded from the 
backup as well.


As far as I can see, the pg_internal.init.XX will not be cleaned up by 
PostgreSQL on startup.  I've only tested this in 9.6 so far, but I don't 
see any differences in the code since then that would lead to better 
behavior.  Perhaps that's also something we should fix?



I agree that a side effect of this change would be to discard anything
prefixed with "backup_label" or "tablespace_map", including any old,
renamed files.  Do you know of any backup solutions which could be
impacted by that?  I am adding David Steele and Stephen Frost in CC so
as they can comment based on their experience in this area.  I recall
that backrest stuff uses the replication protocol, but I may be
wrong.


I'm really not a fan of a blind prefix match.  I think we should stick 
with only excluding files that are created by Postgres.  So 
backup_label.old and tablespace_map.old should just be added to the 
exclude list.  That's how we have it in pgBackRest.


Regards,
--
-David
da...@pgmasters.net




Re: Clean up some old cruft related to Windows

2020-02-19 Thread Juan José Santamaría Flecha
On Wed, Feb 19, 2020 at 4:49 AM Michael Paquier  wrote:

> On Tue, Feb 18, 2020 at 12:26:06PM +0100, Juan José Santamaría Flecha
> wrote:
> > I cannot point when SetEnv.bat was exactly dropped, probably Windows 7
> SDK
> > was the place where it was *last* included [1], so that needs to be
> updated.
> >
> > Using VS2019 and VS2017 this would be done using VsDevCmd.bat [2], while
> > VS2015 and VS2013 use VSVARS32.bat.
>
> Would you like to write a patch for that part?
>

Please find attached a patch for that. I have tried to make it more version-neutral.

Regards,

Juan José Santamaría Flecha


install-windows_doc_cleanup.patch
Description: Binary data


Re: Make ringbuffer threshold and ringbuffer sizes configurable?

2020-02-19 Thread Justin Pryzby
On Wed, Feb 05, 2020 at 08:00:26PM -0800, Andres Freund wrote:
> I think it would make sense to have seqscan_ringbuffer_threshold,
> {bulkread,bulkwrite,vacuum}_ringbuffer_size.

I suggest the possibility of somehow forcing a ringbuffer for nonbulk writes
for the current session.

In our use-case, we have loader processes INSERTing data using prepared
statements, UPSERT, and/or multiple VALUES(),() lists.  Some of that data will
be accessed in the near future (15min-24hr) but some parts (large parts, even)
may never be accessed.  I imagine most of the buffer pages never get
usagecount > 0 before being evicted.

I think it'd still be desirable to make the backend write() its own dirty
buffers to the OS, rather than leaving behind large numbers of dirty buffers
for another backend to deal with, since that *could* be a customer-facing
report.  I'd prefer the report to run 10% faster due to rarely hitting dirty
buffers (by avoiding the need to write out lots of someone else's data), than
have the loaders run 25% slower due to constantly writing to the OS.

The speed of loaders is not something our customers would be concerned with.
It's okay if they are slower than they might be.  They need to keep up with
incoming data, but it'd rarely matter if we load a 15min interval of data in
5min instead of in 4.

We would use COPY if we could, to get a ring buffer during writes, but we
cannot due to UPSERT (and maybe other reasons).

I have considered the possibility of loading data into a separate instance with
small (or in any case separate) shared_buffers and then transferring its data to
a customer-facing report instance using pg_restore (COPY)...but the overhead to
maintain that would be significant for us (me).

-- 
Justin




Re: BUG #16108: Colorization to the output of command-line has unproperly behaviors at Windows platform

2020-02-19 Thread Juan José Santamaría Flecha
On Tue, Feb 18, 2020 at 11:39 PM Michail Nikolaev <
michail.nikol...@gmail.com> wrote:

>
> I have checked the patch source code and it seems to be working. But a
> few moments I want to mention:
>

Thanks for looking into this.


> I think it is not good idea to mix the logic of detecting the fact of
> TTY with enabling of the VT100 mode. Yeah, it seems to be correct for
> current case but a little confusing.
> Maybe is it better to detect terminal using *isatty* and later call
> *enable_vt_mode*?
>

Most of what enable_vt_mode() does is actually detecting the terminal, but
I can see why that is confusing without better comments.


> Also, it seems like if GetConsoleMode returns
> ENABLE_VIRTUAL_TERMINAL_PROCESSING flag already set - we could skip
> SetConsoleMode call (not a big deal of course).
>

Agreed.
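For illustration, the shape of that check could be something like the
following minimal standalone sketch (assuming stdout; this is not the patch
itself):

#include <windows.h>
#include <stdbool.h>

static bool
enable_vt_mode(void)
{
    HANDLE      h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD       mode;

    if (h == INVALID_HANDLE_VALUE || !GetConsoleMode(h, &mode))
        return false;   /* not a console, e.g. output redirected to a file */

    if (mode & ENABLE_VIRTUAL_TERMINAL_PROCESSING)
        return true;    /* already enabled, skip SetConsoleMode */

    return SetConsoleMode(h, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
}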

The patch about making color the default [1] introduces the
function terminal_supports_color(), which I think is relevant to this
issue. Please find attached a new version based on that idea.

Also, adding Peter to weight on this approach.

[1] https://commitfest.postgresql.org/27/2406/

Regards,

Juan José Santamaría Flecha


v3-0001-command-line-colorization-on-windows.patch
Description: Binary data


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2020-02-19 Thread Anastasia Lubennikova

On 14.02.2020 05:57, Peter Geoghegan wrote:


Attached is v33, which adds the last piece we need: opclass
infrastructure that tells nbtree whether or not deduplication can be
applied safely. This is based on work by Anastasia that was shared
with me privately.
Thank you for this work. I've looked through the patches and they seem 
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe 
I'll send some more feedback tomorrow.



New opclass proc


In general, supporting deduplication is the rule for B-Tree opclasses,
rather than the exception. Most can use the generic
btequalimagedatum() routine as their support function 4, which
unconditionally indicates that deduplication is safe. There is a new
test that tries to catch opclasses that omitted to do this. Here is
the opr_sanity.out changes added by the first patch:

-- Almost all Btree opclasses can use the generic btequalimagedatum function
-- as their equalimage proc (support function 4).  Look for opclasses that
-- don't do so; newly added Btree opclasses will usually be able to support
-- deduplication with little trouble.
SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
FROM pg_am am
JOIN pg_opclass opc ON opc.opcmethod = am.oid
JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
 amproclefttype = opcintype AND
 amprocnum = 4
WHERE am.amname = 'btree' AND
 amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
        proc        |  opfamily_name   |   opclass_name   |     opcintype
 -------------------+------------------+------------------+------------------
  bpchar_equalimage | bpchar_ops       | bpchar_ops       | character
  btnameequalimage  | text_ops         | name_ops         | name
  bttextequalimage  | text_ops         | text_ops         | text
  bttextequalimage  | text_ops         | varchar_ops      | text
                    | array_ops        | array_ops        | anyarray
                    | enum_ops         | enum_ops         | anyenum
                    | float_ops        | float4_ops       | real
                    | float_ops        | float8_ops       | double precision
                    | jsonb_ops        | jsonb_ops        | jsonb
                    | money_ops        | money_ops        | money
                    | numeric_ops      | numeric_ops      | numeric
                    | range_ops        | range_ops        | anyrange
                    | record_image_ops | record_image_ops | record
                    | record_ops       | record_ops       | record
                    | tsquery_ops      | tsquery_ops      | tsquery
                    | tsvector_ops     | tsvector_ops     | tsvector
 (16 rows)



Is there any specific reason, why we need separate btnameequalimage, 
bpchar_equalimage and bttextequalimage functions?

As far as I see, they have the same implementation.

Since using deduplication is supposed to pretty much be the norm from
now on, it seemed like it might make sense to add a NOTICE about it
during CREATE INDEX -- a notice letting the user know that it isn't
being used due to a lack of opclass support:

regression=# create table foo(bar numeric);
CREATE TABLE
regression=# create index on foo(bar);
NOTICE:  index "foo_bar_idx" cannot use deduplication
CREATE INDEX

Note that this NOTICE isn't seen with an INCLUDE index, since that's
expected to not support deduplication.

I have a feeling that not everybody will like this, which is why I'm
pointing it out.

Thoughts?


I would simply move it to debug level for all cases. Since from user's 
perspective it doesn't differ that much from the case where 
deduplication is applicable in general, but not very efficient due to 
data distribution.
I also noticed that this is not consistent with ALTER index. For 
example, alter index idx_n set (deduplicate_items =true); won't show any 
message about deduplication.


I've tried several combinations with an index on a numeric column:

1) postgres=# create index idx_nd on tbl (n) with (deduplicate_items = 
true);

NOTICE:  index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think we should restrict
creation of the index even when the deduplicate_items parameter is set
explicitly; rather, we may warn the user that it won't be efficient.


2) postgres=# create index idx_n on tbl (n) with (deduplicate_items = 
false);

NOTICE:  index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me.
Why should we show a notice about the fact that deduplication is not 
possible if that is exactly what was requested?


3)
postgres=# create index idx on tbl (n);
NOTICE:  index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It 
exposes internal d

Re: Internal key management system

2020-02-19 Thread Masahiko Sawada
On Sat, 15 Feb 2020 at 01:00, Robert Haas  wrote:
>
> On Thu, Feb 6, 2020 at 9:19 PM Masahiko Sawada
>  wrote:
> > This feature protects data from disk thefts but cannot protect data
> > from attackers who are able to access PostgreSQL server. In this
> > design application side still is responsible for managing the wrapped
> > secret in order to protect it from attackers. This is the same as when
> > we use pgcrypto now. The difference is that data is safe even if
> > attackers steal the wrapped secret and the disk. The data cannot be
> > decrypted either without the passphrase which can be stored to other
> > key management systems or without accessing postgres server. IOW for
> > example, attackers can get the data if they get the wrapped secret
> > managed by application side then run pg_kmgr_unwrap() to get the
> > secret and then steal the disk.
>
> If you only care about protecting against the theft of the disk, you
> might as well just encrypt the whole filesystem, which will probably
> perform better and probably be a lot harder to break since you won't
> have short encrypted strings but instead large encrypted blocks of
> data. Moreover, I think a lot of people who are interested in these
> kinds of features are hoping for more than just protecting against the
> theft of the disk. While some people may be hoping for too much in
> this area, setting your sights only on encryption at rest seems like a
> fairly low bar.

This feature also protects data from being read directly out of the database
files. And it's also good that it's platform-independent.

To be clear, let me summarize scenarios where we will be able to
protect data and won't. We can put the cluster key which will be
obtained by cluster_passphrase_command into another component in the
system, for instance into KMS ideally. The user key is wrapped and
saved to an application server or somewhere it can obtain promptly.
The PostgreSQL server has the master key on disk, wrapped by
the cluster key, alongside the user data encrypted by the user key.
While the PostgreSQL server is running, the user can unwrap the user key
using pg_unwrap_key. Given that attackers stole the
database disk that includes the encrypted user data and the wrapped master
key, what they need to complete their attack is (1) the wrapped user
key and access to the PostgreSQL server, (2) the cluster key and the
wrapped user key, or (3) the master key and the wrapped user key. They
cannot get user data with only one of those secrets: the cluster key,
the master key or the wrapped user key.

In case (1), PostgreSQL needs to be running and they need to be able
to access the PostgreSQL server, which may require a password, to
execute pg_unwrap_key with the wrapped user key they stole. In case
(2), since the wrapped user key is stored in the application server
and will likely be accessible without special privileges, it may be
easy for attackers to get it; in addition, however, they need to
attack the KMS to get the cluster key. Finally, in case (3), again, they
may be able to steal the wrapped user key, but they also need to be
able to log in to the OS in an unauthorized way and then illegally read
the PostgreSQL shared buffers.

ISTM none of these cases will be easy for attackers.

>
> It also doesn't seem very likely to actually provide any security.
> You're talking about sending the encryption key in the query string,
> which means that there's a good chance it's going to end up in a log
> file someplace. One way that could happen is if the user has
> configured log_statement=all or log_min_duration_statement, but it
> could also happen any time the query throws an error. In theory, you
> might arrange for the log messages to be sent to another server that
> is protected by separate layers of security, but a lot of people are
> going to just log locally. And, even if you do have a separate server,
> do you really want to have the logfile over there be full of
> passwords? I know I can be awfully negative some times, but that it
> seems like a weakness so serious as to make this whole thing
> effectively useless.
>

Since the user key could end up in the server logs, attackers would be
able to get user data by stealing only the database disk if the server
logs locally. But I personally think that it's not a serious enough
problem to make this feature meaningless; it depends on the use case.
Users will likely have one user key per user, or one key per instance.
So, for example, in the case where the system doesn't add new users
while running, the user can wrap the user key before the system starts
serving, and therefore needs to pay attention only at that time.
If users can take care of that, we can accept such a restriction.

> One way to plug this hole is to use new protocol messages for key
> exchanges. For example, suppose that after authentication is complete,
> you can send the server a new protocol message: KeyPassphrase
>  . The server stores the passphrase in
> backend

Re: Do we need to handle orphaned prepared transactions in the server?

2020-02-19 Thread Hamid Akhtar
All,

Attached is version 1 of a POC patch for notifying of orphaned
prepared transactions via warnings emitted to a client
application and/or the log file. It applies to the PostgreSQL
"master" branch on top of commit "e2e02191".

I've tried to keep the patch as non-invasive as I could, with
minimal impact on vacuum processes, so the performance impact
and the changes are minimal in that area of the PostgreSQL core.


- What's in this Patch:

This patch throws warnings when an autovacuum worker encounters
an orphaned prepared transaction. It also throws warnings to a
client when a vacuum command is issued. This patch also
introduces two new GUCs:

(1) max_age_prepared_xacts
- The age after creation of a prepared transaction after which it
will be considered an orphan.

(2) prepared_xacts_vacuum_warn_timeout
- The timeout period for an autovacuum (essentially any of its
worker) to check for orphaned prepared transactions and throw
warnings if any are found.


- What This Patch Does:

If the GUCs are enabled (set to a value higher than -1), an
autovacuum worker running in the background checks if the
timeout has expired. If so, it checks if there are any orphaned
prepared transactions (i.e. their age has exceeded
max_age_prepared_xacts). If it finds any, it throws a warning for
every such transaction. It also emits the total number of orphaned
prepared transactions if one or more are found.

When a vacuum command is issued from a client, say psql,
we skip the vacuum timeout check and simply scan
for any orphaned prepared transactions. Warnings are emitted to
the client and the log file if any are found.


- About the New GUCs:

= max_age_prepared_xacts:
Sets the maximum age after which a prepared transaction is considered an
orphan. It applies when "prepared transactions" are enabled. The
age of a transaction is calculated from the time it was created to
the current time. If this value is specified without units, it is taken
as milliseconds. The default value is -1, which allows prepared
transactions to live forever.

= prepared_xacts_vacuum_warn_timeout:
Sets the timeout after which vacuum starts throwing warnings for every
prepared transaction that has exceeded the maximum age defined by
"max_age_prepared_xacts". If this value is specified without units,
it is taken as milliseconds. The default value of -1 disables
this warning mechanism. Setting too small a value could potentially fill
up the log with orphaned prepared transaction warnings, so this
parameter must be set to a value that is large enough not to
fill up the log file, but small enough to give notice of long-running
and potentially orphaned prepared transactions. There is no additional
timer or worker introduced with this change. Whenever a vacuum
worker runs, it first checks for any orphaned prepared transactions.
So at best, this GUC serves as a guideline for a vacuum worker on
whether a warning should be written to the log file or sent to a client
issuing a vacuum command.
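For illustration, the two GUCs might be set like this (the values here are
examples only, not recommendations from the patch):

# postgresql.conf, with this patch applied
max_age_prepared_xacts = '1d'               # prepared transactions older than
                                            # a day are considered orphaned
prepared_xacts_vacuum_warn_timeout = '1h'   # autovacuum workers warn at most
                                            # once an hour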


- What this Patch Does Not Cover:

The warning is not thrown when the user either runs vacuumdb or passes
individual relations to be vacuumed. Specifically, in the case of vacuumdb,
it breaks a vacuum command down into attribute-wise vacuum commands,
so the vacuum command is indirectly run many times. Considering that
we want to emit warnings for every manual vacuum command, this simply
floods the terminal and log with orphaned prepared transaction
warnings. We could potentially handle that, but the overhead
seemed too much to me (and I've not invested any time looking
into fixing it either). Hence, warnings are not thrown when the user
runs vacuumdb or a relation-specific vacuum.



On Fri, Jan 31, 2020 at 7:02 PM Hamid Akhtar  wrote:

>
>
> On Thu, Jan 30, 2020 at 8:28 AM Craig Ringer 
> wrote:
>
>> On Thu, 30 Jan 2020 at 02:04, Hamid Akhtar 
>> wrote:
>> >
>> > So having seen the feedback on this thread, and I tend to agree with
>> most of what has been said here, I also agree that the server core isn't
>> really the ideal place to handle the orphan prepared transactions.
>> >
>> > Ideally, these must be handled by a transaction manager, however, I do
>> believe that we cannot let database suffer for failing of an external
>> software, and we did a similar change through introduction of idle in
>> transaction timeout behavior.
>>
>> The difference, IMO, is that idle-in-transaction aborts don't affect
>> anything we've promised to be durable.
>>
>> Once you PREPARE TRANSACTION the DB has made a promise that that txn
>> is durable. We don't have any consistent feedback channel to back to
>> applications and say "Hey, if you're not going to finish this up we
>> need to get rid of it soon, ok?". If a distributed transaction manager
>> gets consensus for commit and goes to COMMIT PREPARED a previously
>> prepared txn only to find that it has vanished, that's a major
>> problem, and one that may bring the entire DTM to a halt until the
>> admin can intervene.
>>
>> This isn't like idle-in-transaction aborts. It's closer to so

Re: DROP OWNED CASCADE vs Temp tables

2020-02-19 Thread ahsan hadi
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I have tested the patch with REL_12_STABLE for the given error scenario by 
running the "CREATE TEMPORARY TABLE..." and "DROP OWNED BY..." commands 
concurrently using parallel background workers. I have also tested a few 
related scenarios and the patch does seem to fix the reported bug. Ran make 
installcheck-world, no difference with and without patch.

Constraint's NO INHERIT option is ignored in CREATE TABLE LIKE statement

2020-02-19 Thread Ildar Musin
Hi hackers,

My colleague Chris Travers discovered something that looks like a bug.
Let's say we have a table with a constraint that is declared as NO INHERIT.

CREATE TABLE test (
x INT CHECK (x > 0) NO INHERIT
);
\d test
Table "public.test"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 x  | integer |   |  |
Check constraints:
"test_x_check1" CHECK (x > 0) NO INHERIT

Now when we want to make a copy of the table structure into a new table
the `NO INHERIT` option is ignored.

CREATE TABLE test2 (LIKE test INCLUDING CONSTRAINTS);
\d test2
   Table "public.test2"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 x  | integer |   |  |
Check constraints:
"test_x_check1" CHECK (x > 0)

Is this a bug or expected behaviour? Just in case I've attached a patch
that fixes this.

Regards,
Ildar
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 2406ca7a5d..d75ec6766a 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1154,6 +1154,7 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla
 			n->contype = CONSTR_CHECK;
 			n->location = -1;
 			n->conname = pstrdup(ccname);
+			n->is_no_inherit = tupleDesc->constr->check[ccnum].ccnoinherit;
 			n->raw_expr = NULL;
 			n->cooked_expr = nodeToString(ccbin_node);
 			cxt->ckconstraints = lappend(cxt->ckconstraints, n);


Re: Increase psql's password buffer size

2020-02-19 Thread Fujii Masao



On 2020/01/22 11:01, Fujii Masao wrote:



On 2020/01/22 0:12, Bruce Momjian wrote:

On Tue, Jan 21, 2020 at 02:42:07PM +0900, Fujii Masao wrote:

I have no strong opinion about the maximum length of password,
for now. But IMO it's worth committing that 0001 patch as the first step
for this problem.

Also IMO the more problematic thing is that psql silently truncates
the password specified in the prompt into 99B if its length is
more than 99B. I think that psql should emit a warning in this case
so that users can notice that.


I think we should be using a macro to define the maximum length, rather
than have 100 used in various places.


+1 as the first step for this issue. The patch that I mentioned
upthread actually does that.


Attached is the patch that Nathan proposed at [1] and I think that
it's worth applying. I'd like to add this to next CommitFest.
Thought?

[1] 
https://www.postgresql.org/message-id/09512c4f-8cb9-4021-b455-ef4c4f0d5...@amazon.com

Regards,


--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
From ba8fb45bfad75ebba9b22bfaf397d2647a416f8e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 12 Oct 2018 15:31:31 +
Subject: [PATCH v1 1/3] Refactor maximum password length enforced by client
 utilities.

---
 contrib/oid2name/oid2name.c|  2 +-
 contrib/vacuumlo/vacuumlo.c|  2 +-
 src/bin/initdb/initdb.c|  4 ++--
 src/bin/pg_basebackup/streamutil.c |  2 +-
 src/bin/pg_dump/pg_backup_db.c |  4 ++--
 src/bin/pg_dump/pg_dumpall.c   |  2 +-
 src/bin/pgbench/pgbench.c  |  2 +-
 src/bin/psql/command.c |  6 +++---
 src/bin/psql/startup.c |  2 +-
 src/bin/scripts/common.c   |  2 +-
 src/bin/scripts/createuser.c   |  4 ++--
 src/common/saslprep.c  |  4 ++--
 src/include/postgres_fe.h  | 11 +++
 13 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c
index aa122ca0e9..08904ffd12 100644
--- a/contrib/oid2name/oid2name.c
+++ b/contrib/oid2name/oid2name.c
@@ -285,7 +285,7 @@ sql_conn(struct options *my_opts)
 {
PGconn     *conn;
bool        have_password = false;
-   char        password[100];
+   char        password[PROMPT_MAX_PASSWORD_LENGTH];
bool        new_pass;
PGresult   *res;
 
diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c
index 3075781abe..5acfa65289 100644
--- a/contrib/vacuumlo/vacuumlo.c
+++ b/contrib/vacuumlo/vacuumlo.c
@@ -70,7 +70,7 @@ vacuumlo(const char *database, const struct _param *param)
bool        new_pass;
bool        success = true;
static bool have_password = false;
-   static char password[100];
+   static char password[PROMPT_MAX_PASSWORD_LENGTH];
 
/* Note: password can be carried over from a previous call */
if (param->pg_prompt == TRI_YES && !have_password)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ab5cb7f0c1..a9663531de 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1528,8 +1528,8 @@ setup_auth(FILE *cmdfd)
 static void
 get_su_pwd(void)
 {
-   char        pwd1[100];
-   char        pwd2[100];
+   char        pwd1[PROMPT_MAX_PASSWORD_LENGTH];
+   char        pwd2[PROMPT_MAX_PASSWORD_LENGTH];
 
if (pwprompt)
{
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 52f1e559b7..8ecb412ab8 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -51,7 +51,7 @@ char *dbport = NULL;
 char  *dbname = NULL;
 intdbgetpassword = 0;  /* 0=auto, -1=never, 1=always */
 static bool have_password = false;
-static char password[100];
+static char password[PROMPT_MAX_PASSWORD_LENGTH];
 PGconn*conn = NULL;
 
 /*
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index 5e32ee8a5b..402ee3cff1 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -126,7 +126,7 @@ _connectDB(ArchiveHandle *AH, const char *reqdb, const char 
*requser)
const char *newdb;
const char *newuser;
char   *password;
-   charpassbuf[100];
+   charpassbuf[PROMPT_MAX_PASSWORD_LENGTH];
boolnew_pass;
 
if (!reqdb)
@@ -246,7 +246,7 @@ ConnectDatabase(Archive *AHX,
 {
ArchiveHandle *AH = (ArchiveHandle *) AHX;
char   *password;
-   charpassbuf[100];
+   charpassbuf[PROMPT_MAX_PASSWORD_LENGTH];
boolnew_pass;
 
if (AH->connection)
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 5176626476..0521f8c700 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -1536,7 +1536,7

Re: Parallel copy

2020-02-19 Thread Amit Kapila
On Wed, Feb 19, 2020 at 4:08 PM Tomas Vondra
 wrote:
>
> The one piece of information I'm missing here is at least a very rough
> quantification of the individual steps of CSV processing - for example
> if parsing takes only 10% of the time, it's pretty pointless to start by
> parallelising this part and we should focus on the rest. If it's 50% it
> might be a different story.
>

Right, this is important information to know.

> Has anyone done any measurements?
>

Not yet, but planning to work on it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-19 Thread Amit Langote
On Wed, Feb 19, 2020 at 9:49 PM Fujii Masao  wrote:
> On 2020/02/19 11:22, Amit Langote wrote:
> > On Tue, Feb 18, 2020 at 9:32 PM Fujii Masao  
> > wrote:
> >> OK, I changed the doc that way. Attached the updated version of the patch.
> >
> > Thank you.  Looks good to me.
>
> Thanks for the review!
> You think that the patch can be marked as "ready for committer"?

As far as I am concerned, yes.  :)

Thanks,
Amit




Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-19 Thread Fujii Masao




On 2020/02/19 11:22, Amit Langote wrote:

On Tue, Feb 18, 2020 at 9:32 PM Fujii Masao  wrote:

OK, I changed the doc that way. Attached the updated version of the patch.


Thank you.  Looks good to me.


Thanks for the review!
You think that the patch can be marked as "ready for committer"?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: RecoveryWalAll and RecoveryWalStream wait events

2020-02-19 Thread Fujii Masao



On 2020/02/18 14:20, Kyotaro Horiguchi wrote:

Hello.

At Tue, 18 Feb 2020 12:25:51 +0900, Fujii Masao  
wrote in

Hi,

RecoveryWalAll and RecoveryWalStream wait events are documented as
follows.

 RecoveryWalAll
 Waiting for WAL from any kind of source (local, archive or stream) at
 recovery.

 RecoveryWalStream
 Waiting for WAL from a stream at recovery.

But as far as I read the code, RecoveryWalAll is reported only when
waiting for WAL from a stream, so the current description looks
incorrect. What's described now for RecoveryWalStream seems to rather
fit RecoveryWalAll. I'd like to change the description of
RecoveryWalAll to "Waiting for WAL from a stream at recovery".


Good catch!


Regarding RecoveryWalStream, as far as I read the code, while this
event is being reported, the startup process is waiting for the next
attempt to retrieve WAL data when WAL data is not available from any
sources, based on wal_retrieve_retry_interval. So this current
description also looks incorrect. I'd like to change it to "Waiting
when WAL data is not available from any kind of sources (local,
archive or stream) before trying again to retrieve WAL data".

Thoughts?


I agree that the corrected description sounds correct in meaning. The
latter seems a bit lengthy, though.


Yeah, so any better idea?

Anyway, attached is the patch (fix_recovery_wait_event_doc_v1.patch)
that fixes the descriptions of those wait events. This should be
back-patched to v9.5.


Also the current names of these wait events sound confusing. I think
that RecoveryWalAll should be changed to RecoveryWalStream.
RecoveryWalStream should be RecoveryRetrieveRetryInterval or
something.


I agree with the former. I think RecoveryWalInterval works well enough.


RecoveryWalInterval sounds confusing to me...

Attached is the patch (improve_recovery_wait_event_for_master_v1.patch) that
changes the names and types of wait events. This patch uses
RecoveryRetrieveRetryInterval, but if there is better name,
I will adopt that.

Note that this patch needs to be applied after
fix_recovery_wait_event_doc_v1.patch is applied.


Another problem is that the current wait event types of these events
also look strange. Currently their type is Activity, but IMO it's
better to use IPC for RecoveryWalAll because it's waiting for
walreceiver to receive new WAL. Also it's better to use Timeout for
RecoveryWalStream because its wait depends on
wal_retrieve_retry_interval.


Do you mean a condition variable by "IPC"? But the WaitLatch waits
not only for new WAL but also for trigger, SIGHUP, shutdown and
walreceiver events other than new WAL. I'm not sure that a condition
variable fits the purpose.


OK, I didn't change the type of RecoveryWalStream to IPC in the patch.
Its type is still Activity.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87586a7b06..c641361418 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1270,12 +1270,16 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
 
 
  RecoveryWalAll
- Waiting for WAL from any kind of source (local, archive or 
stream) at recovery.
-
-
- RecoveryWalStream
  Waiting for WAL from a stream at recovery.
 
+
+ RecoveryWalStream
+ 
+  Waiting when WAL data is not available from any kind of sources
+  (local, archive or stream) before trying again to retrieve WAL data,
+  at recovery.
+ 
+
 
  SysLoggerMain
  Waiting in main loop of syslogger process.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c641361418..8fbaad197b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1236,7 +1236,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
  Waiting to acquire a pin on a buffer.
 
 
- Activity
+ Activity
  ArchiverMain
  Waiting in main loop of the archiver process.
 
@@ -1268,17 +1268,9 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
  PgStatMain
  Waiting in main loop of the statistics collector 
process.
 
-
- RecoveryWalAll
- Waiting for WAL from a stream at recovery.
-
 
  RecoveryWalStream
- 
-  Waiting when WAL data is not available from any kind of sources
-  (local, archive or stream) before trying again to retrieve WAL data,
-  at recovery.
- 
+ Waiting for WAL from a stream at recovery.
 
 
  SysLoggerMain
@@ -1488,7 +1480,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 po

Re: small improvement of the elapsed time for truncating heap in vacuum

2020-02-19 Thread Fujii Masao




On 2020/02/17 17:52, Fujii Masao wrote:



On 2020/02/17 14:28, Kasahara Tatsuhito wrote:

Hi,

On Mon, Feb 17, 2020 at 1:07 PM Masahiko Sawada
 wrote:

For the patch, we can also put the declaration of ru0 into the loop.

Thanks for your reply.
Hmm, certainly it may be better.

Fixed the v2 patch and attached it.


Thanks for updating the patch!
Barring any objection, I will commit this.

As far as I checked the back branches, ISTM that
this patch needs to be back-patched to v9.5.


Pushed. Thanks!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Portal->commandTag as an enum

2020-02-19 Thread John Naylor
Hi Mark,

On Wed, Feb 19, 2020 at 10:40 AM Mark Dilger
 wrote:
> This would only make sense to me if the string held in $_ had already been 
> checked for safety, but Catalog.pm does very little to verify that the string 
> is safe to eval.  The assumption here, so far as I can infer, is that we 
> don’t embed anything dangerous in our .dat files, so this should be ok.  That 
> may be true for the moment, but I can imagine a day when we start embedding 
> perl functions as quoted text inside a data file, or shell commands as text 
> which look enough like perl for eval() to be able to execute them.  
> Developers who edit these .dat files and mess up the quoting, and then rerun 
> ‘make’ to get the new .c and .h files generated, may not like the side 
> effects.  Perhaps I’m being overly paranoid….

The use case for that seems slim. However, at a brief glance your
module seems more robust in other ways.

> Rather than add more code generation logic based on the design of Catalog.pm, 
> I wrote a perl based data file parser that parses .dat files and returns 
> vivified perl data, as Catalog.pm does, but with much stricter parsing logic 
> to make certain nothing dangerous gets eval()ed.  I put the new module in 
> DataFile.pm.
> [...]
> The new parser is more flexible about the structure of the data, which seems 
> good to me for making it easier to add or modify data files in the future.  
> The new parser does not yet have a means of hacking up the data to add 
> autogenerated fields and such that Catalog.pm does, but I think a more clean 
> break between parsing and autovivifying fields would be good anyway.

Separation of concerns sounds like a good idea, but I've not fully
thought it through. For one advantage, I think it might be nicer to
have indexing.dat and toasting.dat instead of having to dig the info
out of C macros. This was evident while recently experimenting with
generating catcache control data.

As for the patch, I have not done a full review, but I have some
comments based on a light read-through:

utils/Makefile:

+# location of commandtag.dat
+headerdir = $(top_srcdir)/src/include/utils

This variable name is too generic for what the comment says it is. A
better abstraction, if we want one, would be the full path to the
commandtag input file. The other script invocations in this Makefile
don't do it this way, but that's a separate patch.

+# location to write generated headers
+sourcedir = $(top_srcdir)/src/backend/utils

Calling the output the source is bound to confuse people. The comment
implies all generated headers, not just the ones introduced by the
patch. I would just output to the current directory (i.e. have an
--output option with a default empty string). Also, if we want to
output somewhere else, I would imagine it'd be under the top builddir,
not srcdir.

+$(PERL) -I $(top_srcdir)/src/include/utils $<
--headerdir=$(headerdir) --sourcedir=$(sourcedir)
--inputfile=$(headerdir)/commandtag.dat

1. headerdir is entirely unused by the script
2. We can default to working dir for the output as mentioned above
3. -I $(top_srcdir)/src/include/utils is supposed to point to the dir
containing DataFile.pm, but since gencommandtag.pl has "use lib..."
it's probably not needed here. I don't recall why we keep the "-I"
elsewhere. (ditto in Solution.pm)

I'm thinking it would look something like this:

+$(PERL) $<  --inputfile=$(top_srcdir)/src/include/utils/commandtag.dat

--
utils/misc/Makefile

+all: distprep
+
 # Note: guc-file.c is not deleted by 'make clean',
 # since we want to ship it in distribution tarballs.
 clean:
  @rm -f lex.yy.c
+
+maintainer-clean: clean

Seems non-functional.

--
DataFile.pm

I haven't studied this in detail, but I'll suggest that if this meant
to have wider application, maybe it should live in src/tools ?

I'm not familiar with using different IO routines depending on the OS
-- what's the benefit of that?

--
gencommandtag.pl

slurp_without_comments() is unused.

sanity_check_data() seems longer than the main body of the script
(minus header boilerplate), and I wonder if we can pare it down some.
For one, I have difficulty imagining anyone would accidentally type an
unprintable or non-ascii character in a command tag and somehow not
realize it. For another, duplicating checks that were done earlier
seems like a maintenance headache.

dataerror() is defined near the top, but other functions are defined
at the bottom.

+# Generate all output internally before outputting anything, to avoid
+# partially overwriting generated files under error conditions

My personal preference is that having this as a design requirement
sacrifices readability for unclear gain, especially since a "chunk"
also includes things like header boilerplate. That said, the script is
also short enough that it doesn't make a huge difference either way.
Speaking of boilerplate, it's better for readability to separate that
from actual code such as:

typedef enum CommandTag
{
#define FIRST_COMM

Re: Updating row and width estimates in postgres_fdw

2020-02-19 Thread Etsuro Fujita
Hi Ashutosh,

Long time no see!

On Thu, Feb 13, 2020 at 1:17 AM Ashutosh Bapat
 wrote:
> In postgresGetForeignJoinPaths(), I see
>
>/* Estimate costs for bare join relation */
> estimate_path_cost_size(root, joinrel, NIL, NIL, NULL,
> &rows, &width, &startup_cost, &total_cost);
> /* Now update this information in the joinrel */
> joinrel->rows = rows;
> joinrel->reltarget->width = width;
>
> This code is good as well as bad.
>
> For a join relation, we estimate the number of rows in
> set_joinrel_size_estimates() inside build_*_join_rel() and set the width of
> the join when building the targetlist. For a foreign join, the size estimates
> may not be correct but the width estimate should be. So updating the number of
> rows looks good since it would be better than what
> set_joinrel_size_estimates() might come up with, but here are the problems with
> this code:
> 1. The rows estimated by estimate_path_cost_size() are better only when 
> use_remote_estimates is true. So, we should be doing this only when 
> use_remote_estimate is true.

I think it's actually harmless to do that even when
use_remote_estimate=false because in that case we get the rows
estimate from joinrel->rows in estimate_path_cost_size() and return to
the caller the estimate as-is, IIRC.

> 2. This function gets called after local paths for the first pair for this 
> join have been added. So those paths are not being judged fairly and perhaps 
> we might be throwing away better paths just because the local estimates with 
> which they were created were very different from the remote estimates.

Yeah, but I'm not sure we really need to fix that because I think the
remote-join path would usually win against any of the local-join paths.
Could you show me an example causing an issue?

Best regards,
Etsuro Fujita




Re: Parallel copy

2020-02-19 Thread Tomas Vondra

On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote:

On Wed, 19 Feb 2020 at 06:22, Amit Kapila  wrote:


On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma  wrote:
>
> On Tue, 18 Feb 2020 at 15:21, Amit Kapila  wrote:
> >
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma  wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila  wrote:
> > > > This is something similar to what I had also in mind for this idea.  I
> > > > had thought of handing over complete chunk (64K or whatever we
> > > > decide).  The one thing that slightly bothers me is that we will add
> > > > some additional overhead of copying to and from shared memory which
> > > > was earlier from local process memory.  And, the tokenization (finding
> > > > line boundaries) would be serial.  I think that tokenization should be
> > > > a small part of the overall work we do during the copy operation, but
> > > > will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> > >
> >
> > I am talking about access to shared memory instead of the process
> > local memory.  I understand that an extra copy won't be required.
> >
> > > The reader can directly
> > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> > >
> >
> > I am slightly confused here.  AFAIU, the for(;;) loop in
> > CopyReadLineText is about finding the line endings which we thought
> > that the reader process will do.
>
> Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
> currently copies data from cstate->raw_buf to the StringInfo in
> cstate->line_buf. In parallel mode it would copy it from the shared data 
buffer
> to local line_buf until it hits the line end found by the data reader. The
> amount of copying done is still exactly the same as it is now.
>

Yeah, on a broader level it will be something like that, but actual
details might vary during implementation.  BTW, have you given any
thoughts on one other approach I have shared above [1]?  We might not
go with that idea, but it is better to discuss different ideas and
evaluate their pros and cons.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com


It seems that at least for the general CSV case the tokenization into
tuples is an inherently serial task. Adding thread synchronization to that path
for coordinating between multiple workers is only going to make it slower. It
may be possible to enforce limitations on the input (e.g. no quotes allowed) or
do some speculative tokenization (e.g. if we encounter quote before newline
assume the chunk started in a quoted section) to make it possible to do the
tokenization in parallel. But given that the simpler and more featured approach
of handling it in a single reader process looks to be fast enough, I don't see
the point. I rather think that the next big step would be to overlap reading
input and tokenization, hopefully by utilizing Andres's work on asyncio.



I generally agree with the impression that parsing CSV is tricky and
unlikely to benefit from parallelism in general. There may be cases with
restrictions making it easier (e.g. restrictions on the format) but that
might be a bit too complex to start with.

For example, I had an idea to parallelise the parsing by splitting it
into two phases:

1) indexing

Splits the CSV file into equally-sized chunks and makes each worker just
scan through its chunk and store positions of delimiters, quotes,
newlines etc. This is probably the most expensive part of the parsing
(essentially going char by char), and we'd speed it up linearly. A
rough sketch of this phase follows below.

2) merge

Combine the information from (1) in a single process, and actually parse
the CSV data - we would not have to inspect each character, because we'd
know positions of interesting chars, so this should be fast. We might
have to recheck some stuff (e.g. escaping) but it should still be much
faster.

But yes, this may be a bit complex and I'm not sure it's worth it.
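
To make phase (1) a bit more concrete, here is a rough, self-contained
sketch of the per-chunk indexing step. All names and structures here
are made up for illustration only and are not taken from any proposed
patch:

#include <stddef.h>
#include <stdlib.h>

/* Offsets of "interesting" characters found in one worker's chunk. */
typedef struct ChunkIndex
{
	size_t	   *offsets;	/* positions of delimiters, quotes, newlines */
	char	   *kinds;		/* which character was found at each offset */
	size_t		nused;
	size_t		nalloc;
} ChunkIndex;

/*
 * Phase (1): scan [start, end) of the raw buffer and record the positions
 * of delimiters, quotes and newlines.  Quoting is deliberately not resolved
 * here; that is left to the merge phase (2).  Error handling is omitted.
 */
static void
index_chunk(const char *buf, size_t start, size_t end, ChunkIndex *idx)
{
	for (size_t i = start; i < end; i++)
	{
		char		c = buf[i];

		if (c != ',' && c != '"' && c != '\n')
			continue;

		if (idx->nused == idx->nalloc)
		{
			idx->nalloc = idx->nalloc ? idx->nalloc * 2 : 1024;
			idx->offsets = realloc(idx->offsets, idx->nalloc * sizeof(size_t));
			idx->kinds = realloc(idx->kinds, idx->nalloc);
		}
		idx->offsets[idx->nused] = i;
		idx->kinds[idx->nused] = c;
		idx->nused++;
	}
}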

The one piece of information I'm missing here is at least a very rough
quantification of the individual steps of CSV processing - for example
if parsing takes only 10% of the time, it's pretty pointless to start by
parallelising this part and we should focus on the rest. If it's 50% it
might be a different story. Has anyone done any measurements?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Parallel copy

2020-02-19 Thread Ants Aasma
On Wed, 19 Feb 2020 at 06:22, Amit Kapila  wrote:
>
> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma  wrote:
> >
> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila  wrote:
> > >
> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma  wrote:
> > > >
> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila  
> > > > wrote:
> > > > > This is something similar to what I had also in mind for this idea.  I
> > > > > had thought of handing over complete chunk (64K or whatever we
> > > > > decide).  The one thing that slightly bothers me is that we will add
> > > > > some additional overhead of copying to and from shared memory which
> > > > > was earlier from local process memory.  And, the tokenization (finding
> > > > > line boundaries) would be serial.  I think that tokenization should be
> > > > > a small part of the overall work we do during the copy operation, but
> > > > > will do some measurements to ascertain the same.
> > > >
> > > > I don't think any extra copying is needed.
> > > >
> > >
> > > I am talking about access to shared memory instead of the process
> > > local memory.  I understand that an extra copy won't be required.
> > >
> > > > The reader can directly
> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > > CopyReadLineText() inner loop directly off of the buffer in shared 
> > > > memory.
> > > >
> > >
> > > I am slightly confused here.  AFAIU, the for(;;) loop in
> > > CopyReadLineText is about finding the line endings which we thought
> > > that the reader process will do.
> >
> > Indeed, I somehow misread the code while scanning over it. So 
> > CopyReadLineText
> > currently copies data from cstate->raw_buf to the StringInfo in
> > cstate->line_buf. In parallel mode it would copy it from the shared data 
> > buffer
> > to local line_buf until it hits the line end found by the data reader. The
> > amount of copying done is still exactly the same as it is now.
> >
>
> Yeah, on a broader level it will be something like that, but actual
> details might vary during implementation.  BTW, have you given any
> thoughts on one other approach I have shared above [1]?  We might not
> go with that idea, but it is better to discuss different ideas and
> evaluate their pros and cons.
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

It seems that at least for the general CSV case the tokenization into
tuples is an inherently serial task. Adding thread synchronization to that path
for coordinating between multiple workers is only going to make it slower. It
may be possible to enforce limitations on the input (e.g. no quotes allowed) or
do some speculative tokenization (e.g. if we encounter quote before newline
assume the chunk started in a quoted section) to make it possible to do the
tokenization in parallel. But given that the simpler and more featured approach
of handling it in a single reader process looks to be fast enough, I don't see
the point. I rather think that the next big step would be to overlap reading
input and tokenization, hopefully by utilizing Andres's work on asyncio.

Regards,
Ants Aasma




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-19 Thread Kyotaro Horiguchi
Sorry, just one fix. (omitting some typos, though..)

At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> > I think attached v35nm is ready for commit to master.  Would anyone like to
> > talk me out of back-patching this?  I would not enjoy back-patching it, but
> > it's hard to justify lack of back-patch for a data-loss bug.
> > 
> > Notable changes since v34:
> > 
> > - Separate a few freestanding fixes into their own patches.
> 
> All of the three patches look fine.
> 
> > On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> > >   /* Record largest maybe-unsynced block of files under tracking  */
> > >   pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> > > HASH_FIND, NULL);
> > > - if (pending)
> > > - {
> > > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> > > -
> > > - if (pending->max_truncated < nblocks)
> > > - pending->max_truncated = nblocks;
> > > - }
> > > + pending->is_truncated = true;
> > 
> > - Fix this crashing when "pending" is NULL, as it is in this test case:
> > 
> >   begin;
> >   create temp table t ();
> >   create table t2 ();  -- cause pendingSyncHash to exist
> >   truncate t;
> >   rollback;
> 
> That's terrible... Thanks for fixing it.
> 
> > - Fix the "deleted while still in use" problem that Thomas Munro reported, 
> > by
> >   removing the heap_create() change.  Restoring the saved rd_createSubid had
> >   made obsolete the heap_create() change.  check-world now passes with
> >   wal_level=minimal and CLOBBER_CACHE_ALWAYS.
> 
> Ok, as in the previous mail.
> 
> > - Set rd_droppedSubid in RelationForgetRelation(), not
> >   RelationClearRelation().  RelationForgetRelation() knows it is processing 
> > a
> >   drop, but RelationClearRelation() could only infer that from 
> > circumstantial
> >   evidence.  This seems more future-proof to me.
> 
> Agreed. Unlike RelationClearRelation, RelationForgetRelation
> is called only when dropping the relation.
> 
> > - When reusing an index build, instead of storing the dropped relid in the
> >   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), 
> > store
> >   the subid fields in the IndexStmt.  This is less code, and I felt
> >   RelationIdGetRelationCache() invited misuse.
> 
> Hmm. I'm not sure that it is good for index_create to have the new
> subid parameters. And the last if(OidIsValid) clause handles storage
> persistence, so I did that there. But I'm not strongly against it.

Hmm. I'm not sure that it is good for index_create to have the new
subid parameters. And the last if(OidIsValid) clause in ATExecAddIndex
handles storage persistence, so I did that there. But I'm not strongly
against it.

> Please give me a bit more time to look at it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-02-19 Thread Kyotaro Horiguchi
At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch  wrote in 
> I think attached v35nm is ready for commit to master.  Would anyone like to
> talk me out of back-patching this?  I would not enjoy back-patching it, but
> it's hard to justify lack of back-patch for a data-loss bug.
> 
> Notable changes since v34:
> 
> - Separate a few freestanding fixes into their own patches.

All of the three patches look fine.

> On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> > /* Record largest maybe-unsynced block of files under tracking  */
> > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> >   HASH_FIND, NULL);
> > -   if (pending)
> > -   {
> > -   BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> > -
> > -   if (pending->max_truncated < nblocks)
> > -   pending->max_truncated = nblocks;
> > -   }
> > +   pending->is_truncated = true;
> 
> - Fix this crashing when "pending" is NULL, as it is in this test case:
> 
>   begin;
>   create temp table t ();
>   create table t2 ();  -- cause pendingSyncHash to exist
>   truncate t;
>   rollback;

That's terrible... Thanks for fixing it.
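
For reference, a minimal sketch of the guarded version (illustrative
only, not necessarily the exact committed hunk): the truncation is
recorded only when the relation is actually tracked in
pendingSyncHash.

	pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
						  HASH_FIND, NULL);
	/* Nothing to do if this relation is not under pending-sync tracking. */
	if (pending)
		pending->is_truncated = true;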

> - Fix the "deleted while still in use" problem that Thomas Munro reported, by
>   removing the heap_create() change.  Restoring the saved rd_createSubid had
>   made obsolete the heap_create() change.  check-world now passes with
>   wal_level=minimal and CLOBBER_CACHE_ALWAYS.

Ok, as in the previous mail.

> - Set rd_droppedSubid in RelationForgetRelation(), not
>   RelationClearRelation().  RelationForgetRelation() knows it is processing a
>   drop, but RelationClearRelation() could only infer that from circumstantial
>   evidence.  This seems more future-proof to me.

Agreed. Unlike RelationClearRelation, RelationForgetRelation
is called only when dropping the relation.

> - When reusing an index build, instead of storing the dropped relid in the
>   IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
>   the subid fields in the IndexStmt.  This is less code, and I felt
>   RelationIdGetRelationCache() invited misuse.

Hmm. I'm not sure that it is good for index_create to have the new
subid parameters. And the last if(OidIsValid) clause handles storage
persistence, so I did that there. But I'm not strongly against it.

Please give me a bit more time to look at it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Make pg_checksums skip foreign tablespace directories

2020-02-19 Thread Michael Paquier
On Fri, Jan 31, 2020 at 05:39:36PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 31 Jan 2020 17:30:43 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
>> I don't think that is a problem right away, of course. It looks good
>> to me except for the possible excessive exclusion.  So, I don't object
>> to it if we don't mind that.
> 
> That's a bit wrong.  All the discussion is only on excludeFiles.  I
> think we should refrain from letting more files match
> noChecksumFiles.

I am not sure what you are saying here.  Are you saying that we should
not use prefix matching for that part?  Or are you saying that we
should not touch this list at all?

Please note that pg_internal.init is listed within noChecksumFiles in
basebackup.c, so we would miss any temporary pg_internal.init.PID file
if we don't match on the file prefix, and the base backup would issue
extra WARNING messages, potentially masking messages that could
matter.  So let's fix that as well.

I agree that a side effect of this change would be to discard anything
prefixed with "backup_label" or "tablespace_map", including any old,
renamed files.  Do you know of any backup solutions which could be
impacted by that?  I am adding David Steele and Stephen Frost in CC so
that they can comment based on their experience in this area.  I recall
that pgBackRest uses the replication protocol, but I may be
wrong.
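
To make the intent concrete, here is a rough sketch of the kind of
prefix matching being discussed; the function name and the exact
contents of the list are illustrative only and not taken from the
patch:

#include <stdbool.h>
#include <string.h>

/*
 * File name prefixes to skip.  "pg_internal.init" also matches temporary
 * pg_internal.init.PID files left behind by a crashed backend.
 */
static const char *const skip_prefixes[] = {
	"pg_internal.init",
	"backup_label",
	"tablespace_map",
	NULL
};

static bool
skip_file(const char *filename)
{
	for (int i = 0; skip_prefixes[i] != NULL; i++)
	{
		if (strncmp(filename, skip_prefixes[i],
					strlen(skip_prefixes[i])) == 0)
			return true;
	}
	return false;
}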
--
Michael


signature.asc
Description: PGP signature


Re: logical copy_replication_slot issues

2020-02-19 Thread Masahiko Sawada
On Mon, 10 Feb 2020 at 23:01, Arseny Sher  wrote:
>
>
> Masahiko Sawada  writes:
>
> > I've attached the draft patch fixing this issue but I'll continue
> > investigating it more deeply.
>
> There should also be a check that the source slot itself has a consistent
> snapshot (valid confirmed_flush) -- otherwise it might be possible to
> create an uninitialized slot, which is probably not an error, but weird
> and somewhat meaningless. Paranoically, this ought to be checked in both
> source slot lookups.
>
> With this patch it seems like the only thing
> create_logical_replication_slot does is ReplicationSlotCreate, which
> calls into question the usefulness of this function. On second look,
> CreateInitDecodingContext checks plugin sanity (ensures it exists), so
> probably it's fine.
>

Thank you for reviewing this patch.

I've attached the updated version of the patch, which incorporates
your comments. I believe we're going in the right direction for fixing
this bug. I'll register this item in the next commit fest so as not to
forget it.
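
The consistent-snapshot check suggested above boils down to something
like the following sketch (illustrative only; the variable name
src_confirmed_flush and the error wording are made up here and are not
necessarily what the attached patch uses):

	/* Refuse to copy a logical slot that has not reached a consistent point. */
	if (XLogRecPtrIsInvalid(src_confirmed_flush))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot copy a logical replication slot that has not reached a consistent point")));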

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


fix_copy_slot_v2.patch
Description: Binary data