Is ParsePrepareRecord dead function

2019-07-29 Thread vignesh C
Hi,

I could not locate any caller of the ParsePrepareRecord function in twophase.c.
Any idea how it gets called, or is it a dead function?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


[no subject]

2019-07-29 Thread Kyotaro Horiguchi
Hello.

On 2019/07/29 4:17, Dmitry Dolgov wrote:
>> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi wrote:
> Yeah, will change both (hopefully soon)

Thanks. 

>> +  /*
>> +   * XXX: In case of index scan quals evaluation happens after
>> +   * ExecScanFetch, which means skip results could be filtered out
>> +   */
>>
>> Why can't we use the skip scan path when there is a filter condition?  If
>> something bad happens, the reason must be written here instead of
>> what we do.
> 
> Sorry, it looks like I failed to express this clearly in the
> comment. The point is that when an index scan (not an index-only
> scan) has some conditions, their evaluation happens after the
> skipping, and I don't see a way to apply the skip correctly that
> isn't too invasive.

Yeah, your explanation was clear. What I had failed to understand
was what is expected to happen in that case. I reconsidered and now
understand it as follows:

For example, for the following query:

select distinct (a, b) a, b, c from t where c < 100;

the skip scan returns one tuple for each distinct set of (a, b) with an
arbitrary c. If the chosen c doesn't match the qual but some other c for
the same (a, b) does, we miss that tuple. (For instance, if the index
contains (1, 1, 200) and (1, 1, 50), the skip might pick (1, 1, 200),
which the qual c < 100 then filters out, even though (1, 1, 50) should
have been returned.)

If this is correct, an explanation like the above might help.


>> + * If advancing direction is different from index direction, we must
>> + * skip right away, but _bt_skip requires a starting point.
>>
>> It doesn't seem needed to me. Could you elaborate on the reason
>> for that?
> 
> This is needed e.g. for a backward scan with a cursor when there is no
> index condition.
> E.g. if we have:
> 
>  1 1 2 2 3 3
>  1 2 3 4 5 6
> 
> and do
> 
>  DECLARE c SCROLL CURSOR FOR
>  SELECT DISTINCT ON (a) a,b FROM ab ORDER BY a, b;
> 
>  FETCH ALL FROM c;
> 
> we should get
> 
>  1 2 3
>  1 3 5
> 
> When afterwards we do
> 
>  FETCH BACKWARD ALL FROM c;
> 
> we should get
> 
>  3 2 1
>  5 2 1
> 
> 
> If we use _bt_next the first time without _bt_skip, the first pair would be
> 3 6 (the first from the end of the tuples, not from the end of the cursor).

Thanks for the explanation. Sorry, I had somehow thought otherwise.
You're right.

>> + * If advancing direction is different from index direction, we must
>> + * skip right away, but _bt_skip requires a starting point.
>> + */
>> +if (direction * indexonlyscan->indexorderdir < 0 &&
>> +  !node->ioss_FirstTupleEmitted)
>>
>> I'm confused by this. "direction" there is the physical scan
>> direction (fwd/bwd) of the index scan, which is already compensated
>> for by indexorderdir. Thus the condition means we do that when the
>> logical ordering (ASC/DESC) is DESC. (Though I'm not sure what
>> "index direction" exactly means...)
> 
> I'm not sure I follow, what do you mean by compensated? In general you're

I meant that the "direction" has already been converted to the physical
order at that point.

> right: as David Rowley mentioned above, indexorderdir is the general scan
> direction, and direction is the flipped estate->es_direction, which is the
> cursor direction. The goal of this condition is to catch the case when those
> two are different and we need to advance and read in different directions.

Mmm. Sorry, and thank you for the explanation. You're right; I had
misread the meaning of indexorderdir. Maybe something like the
following would work *for me* :p

| When we are fetching a cursor in backward direction, return the
| tuples that forward fetching should have returned. In other
| words, we return the last scanned tuple in a DISTINCT set. Skip
| to that tuple before returning the first tuple.

# Of course, I need someone to correct this!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Masahiko Sawada
On Sat, Jul 27, 2019 at 12:27 AM Masahiko Sawada  wrote:
>
> On Fri, Jul 26, 2019 at 10:57 AM Jonathan S. Katz  
> wrote:
> >
> > Hi,
> >
> > Before my reply, I wanted to say that I've been lurking on this thread
> > for a bit as I've tried to better inform myself on encryption at rest
> > and how it will apply to what we want to build. I actually built a
> > (poor) prototype in Python of the key management system that Joe &
> > Masahiko both laid out, in addition to performing some "buffer
> > encryption" with it. It's not worth sharing at this point.
> >
> > With the disclaimer that I'm not as familiar with a lot of concepts as I
> > would like to be:
> >
> > On 7/25/19 1:54 PM, Masahiko Sawada wrote:
> > > On Fri, Jul 26, 2019 at 2:18 AM Bruce Momjian  wrote:
> > >>
> > >> On Thu, Jul 18, 2019 at 12:04:25PM +0900, Masahiko Sawada wrote:
> > >>> I've reconsidered the design of the TDE feature based on the discussion
> > >>> so far. One of the main open questions is the granularity of
> > >>> encryption objects: cluster-level encryption or more-granular-than-cluster
> > >>> encryption. The following describes the new TDE design when we
> > >>> choose table-level encryption or something-new-group-level encryption.
> > >>>
> > >>> General
> > >>> 
> > >>> We will use AES and support both AES-128 and AES-256. The user can specify
> > >>> a new initdb option, something like --aes-128 or --aes-256, to enable
> > >>> encryption and must specify --encryption-key-passphrase-command along
> > >>> with it. (I guess we also require the openssl library.) If these options are
> > >>> specified, we write the key length to the control file and derive the
> > >>> KEK and generate the MDEK during initdb. wal_log_hints will be enabled
> > >>> automatically in encryption mode, like we do for checksum mode.
> > >>
> > >> Agreed.  pg_control will store the none/AES128/AES256 indicator.
> > >>
> > >>> Key Management
> > >>> ==
> > >>> We will use 3-tier key architecture as Joe proposed.
> > >>>
> > >>>   1. A master key encryption key (KEK): this is the key supplied by the
> > >>>  database admin using something akin to ssl_passphrase_command
> > >>>
> > >>>   2. A master data encryption key (MDEK): this is a generated key using 
> > >>> a
> > >>>  cryptographically secure pseudo-random number generator. It is
> > >>>  encrypted using the KEK, probably with Key Wrap (KW):
> > >>>  or maybe better Key Wrap with Padding (KWP):
> > >>>
> > >>>   3a. Per table data encryption keys (TDEK): use MDEK and HKDF to 
> > >>> generate
> > >>>   table specific keys.
> > >>
> > >> What is the value of a per-table encryption key?  How is HKDF derived?
> > >
> > > The per-table encryption key is derived from the MDEK with a salt and the
> > > table's OID as info. I think we can store the salts for each encryption key
> > > in a separate file so that an off-line tool can also read them.
> >
> > +1 with using the info/salt for the HKDF as described above. The other
> > decision will be the hashing algorithm to use. SHA-256?
>
> Yeah, SHA-256 would be better for safety.
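
For concreteness, the per-table derivation quoted above could look roughly
like the following sketch (HKDF-SHA-256 with a per-key salt and the table
OID as info). This uses Python's cryptography package purely for
illustration; the names, key lengths, and info encoding here are
assumptions, not code from any patch.

# Sketch: TDEK = HKDF-SHA-256(MDEK, salt, info = table OID).
# Illustration only; names and parameters are assumptions.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

mdek = os.urandom(32)        # master data encryption key (random 256-bit)
salt = os.urandom(16)        # per-key salt, stored separately for offline tools
table_oid = 16384            # hypothetical pg_class OID of the table

tdek = HKDF(
    algorithm=hashes.SHA256(),
    length=32,               # AES-256 table key
    salt=salt,
    info=str(table_oid).encode(),
).derive(mdek)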

After more thought, I'm confused about why we need to have the MDEK. We
could use a KEK derived from the passphrase, with the TDEK and WDEK
derived from the KEK. That way, we don't need to store any key in a
database file. What is the advantage of the 3-tier key architecture?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Andres Freund
Hi

On 2019-06-26 01:29:57 +0530, Amit Kapila wrote:
> From 67845a7afa675e973bd0ea9481072effa1eb219d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar 
> Date: Wed, 24 Apr 2019 14:36:28 +0530
> Subject: [PATCH 05/14] Add prefetch support for the undo log
> 
> Add prefetching function for undo smgr and also provide mechanism
> to prefetch without relcache.


> +#ifdef USE_PREFETCH
>  /*
> - * PrefetchBuffer -- initiate asynchronous read of a block of a relation
> + * PrefetchBufferGuts -- Guts of prefetching a buffer.

>   * No-op if prefetching isn't compiled in.

This isn't true for this function, as you've defined it?

>  
> diff --git a/src/backend/storage/smgr/undofile.c 
> b/src/backend/storage/smgr/undofile.c
> index 2aa4952..14ccc52 100644
> --- a/src/backend/storage/smgr/undofile.c
> +++ b/src/backend/storage/smgr/undofile.c
> @@ -117,7 +117,18 @@ undofile_extend(SMgrRelation reln, ForkNumber forknum,
>  void
>  undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber 
> blocknum)
>  {
> - elog(ERROR, "undofile_prefetch is not supported");
> +#ifdef USE_PREFETCH
> + Filefile;
> + off_t   seekpos;
> +
> + Assert(forknum == MAIN_FORKNUM);
> + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
> + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
> +
> + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
> +
> + (void) FilePrefetch(file, seekpos, BLCKSZ, 
> WAIT_EVENT_UNDO_FILE_PREFETCH);
> +#endif   /* USE_PREFETCH 
> */
>  }

This looks like it should be part of the commit that introduces
undofile_prefetch(), rather than separately? Afaics there's no reason to
have it in this commit.


> From 7206c40e4cee3391c537cdb22c854889bb417d0e Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Wed, 6 Mar 2019 16:46:04 +1300
> Subject: [PATCH 03/14] Add undo log manager.

> +/*
> + * If the caller doesn't know the the block_id, but does know the 
> RelFileNode,
> + * forknum and block number, then we try to find it.
> + */
> +XLogRedoAction
> +XLogReadBufferForRedoBlock(XLogReaderState *record,
> +SmgrId smgrid,
> +RelFileNode rnode,
> +ForkNumber forknum,
> +BlockNumber blockno,
> +ReadBufferMode mode,
> +bool get_cleanup_lock,
> +Buffer *buf)

I find that a somewhat odd function comment. Nor does the function name
tell me much. A buffer is always block sized.  And you pass in a block
number.


> @@ -347,7 +409,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
>* Make sure that if the block is marked with WILL_INIT, the caller is
>* going to initialize it. And vice versa.
>*/
> - zeromode = (mode == RBM_ZERO_AND_LOCK || mode == 
> RBM_ZERO_AND_CLEANUP_LOCK);
> + zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK ||
> + mode == RBM_ZERO_AND_CLEANUP_LOCK);
>   willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
>   if (willinit && !zeromode)
>   elog(PANIC, "block with WILL_INIT flag in WAL record must be 
> zeroed by redo routine");
> @@ -463,7 +526,7 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, 
> ForkNumber forknum,
>   {
>   /* page exists in file */
>   buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, 
> blkno,
> - 
>mode, NULL);
> + 
>mode, NULL, RELPERSISTENCE_PERMANENT);
>   }
>   else
>   {
> @@ -488,7 +551,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, 
> ForkNumber forknum,
>   ReleaseBuffer(buffer);
>   }
>   buffer = ReadBufferWithoutRelcache(smgrid, rnode, 
> forknum,
> - 
>P_NEW, mode, NULL);
> + 
>P_NEW, mode, NULL,
> + 
>RELPERSISTENCE_PERMANENT);
>   }
>   while (BufferGetBlockNumber(buffer) < blkno);
>   /* Handle the corner case that P_NEW returns non-consecutive 
> pages */
> @@ -498,7 +562,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, 
> ForkNumber forknum,
>   LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>   ReleaseBuffer(buffer);
>

Re: ANALYZE: ERROR: tuple already updated by self

2019-07-29 Thread Dean Rasheed
On Sun, 28 Jul 2019 at 11:15, Tomas Vondra  wrote:
>
> Attached is a patch fixing the error by not building extended stats for
> the inh=true case (as already proposed in this thread). That's about the
> simplest way to resolve this issue for v12. It should add a simple
> regression test too, I guess.
>

Seems like a reasonable thing to do for 10, 11 and possibly also 12
(actually, as you noted, I think it's the only thing that can be done
for 10 and 11).

> To fix this properly we need to add a flag similar to stainherit
> somewhere. And I've started working on that, thinking that maybe we
> could do that even for v12 - it'd be yet another catversion bump, but
> there's already been one since beta2 so maybe it would be OK.
>

Yeah, I think that makes sense, if it's not too hard. Since v12 is
where the stats definition is split out from the stats data, this
might work out quite neatly, since the inh flag would apply only to
the stats data.

> But it's actually a bit more complicated than just adding a column to
> the catalog, for two reasons:
>
> 1) The optimizer part has to be tweaked to pick the right object, with
> the flag set to either true/false. Not trivial, but doable.
>

Isn't it just a matter of passing the inh flag to
get_relation_statistics() from get_relation_info(), so that the
optimiser would get the right kind of stats data, depending on whether
or not inheritance was requested in the query?

Regards,
Dean




Re: using explicit_bzero

2019-07-29 Thread Peter Eisentraut
Another patch, with various fallback implementations.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c040e7aa31a7a12f804c8d32d403861390dec8d5 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 29 Jul 2019 11:25:49 +0200
Subject: [PATCH v4] Use explicit_bzero

Use the explicit_bzero() function in places where it is important that
security information such as passwords is cleared from memory.  There
might be other places where it could be useful; this is just an
initial collection.

For platforms that don't have explicit_bzero(), provide various
fallback implementations.  (explicit_bzero() itself isn't standard,
but as Linux/glibc and OpenBSD have it, it's the most common spelling,
so it makes sense to make that the invocation point.)

Discussion: 
https://www.postgresql.org/message-id/flat/42d26bde-5d5b-c90d-87ae-6cab875f73be%402ndquadrant.com
---
 configure| 15 +++-
 configure.in |  2 +
 src/backend/libpq/be-secure-common.c |  3 ++
 src/include/pg_config.h.in   |  6 +++
 src/include/port.h   |  4 ++
 src/interfaces/libpq/fe-connect.c|  8 
 src/port/explicit_bzero.c| 55 
 7 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 src/port/explicit_bzero.c

diff --git a/configure b/configure
index 7a6bfc2339..d3b1108218 100755
--- a/configure
+++ b/configure
@@ -15143,7 +15143,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul strsignal symlink sync_file_range uselocale utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memset_s memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul strsignal symlink sync_file_range uselocale utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
@@ -15808,6 +15808,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "explicit_bzero" "ac_cv_func_explicit_bzero"
+if test "x$ac_cv_func_explicit_bzero" = xyes; then :
+  $as_echo "#define HAVE_EXPLICIT_BZERO 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" explicit_bzero.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS explicit_bzero.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "fls" "ac_cv_func_fls"
 if test "x$ac_cv_func_fls" = xyes; then :
   $as_echo "#define HAVE_FLS 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index dde3eec89f..c639a32a79 100644
--- a/configure.in
+++ b/configure.in
@@ -1596,6 +1596,7 @@ AC_CHECK_FUNCS(m4_normalize([
getpeerucred
getrlimit
mbstowcs_l
+   memset_s
memmove
poll
posix_fallocate
@@ -1694,6 +1695,7 @@ fi
 AC_REPLACE_FUNCS(m4_normalize([
crypt
dlopen
+   explicit_bzero
fls
getopt
getrusage
diff --git a/src/backend/libpq/be-secure-common.c 
b/src/backend/libpq/be-secure-common.c
index 4abbef5bf1..880b758e70 100644
--- a/src/backend/libpq/be-secure-common.c
+++ b/src/backend/libpq/be-secure-common.c
@@ -86,6 +86,7 @@ run_ssl_passphrase_command(const char *prompt, bool 
is_server_start, char *buf,
{
if (ferror(fh))
{
+   explicit_bzero(buf, size);
ereport(loglevel,
(errcode_for_file_access(),
 errmsg("could not read from command 
\"%s\": %m",
@@ -97,6 +98,7 @@ run_ssl_passphrase_command(const char *prompt, bool 
is_server_start, char *buf,
pclose_rc = ClosePipeStream(fh);
if (pclose_rc == -1)
{
+   explicit_bzero(buf, size);
ereport(loglevel,
(errcode_for_file_access(),
 errmsg("could not close pipe to external 
command: %m")));
@@ -104,6 +106,7 @@ run_ssl_passphrase_command(const char *prompt, bool 
is_server_start, char *buf,
}
else if (pclose_rc != 0)
{
+   explicit_bzero(buf, size);
ereport(loglevel,
(errcode_for_file_access(),
 errmsg("command \"%s\" failed",
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 512213aa32..8282d11dad 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -201,6 +201,9 @@
 /* Define to 1 if you have the  header file. */
 #undef HAVE_EDITLINE_READLINE_H
 
+/* Define

Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?

2019-07-29 Thread Dilip Kumar
On Wed, Jul 17, 2019 at 2:39 PM Amit Langote  wrote:
>
> On Wed, Jul 10, 2019 at 2:43 PM Dilip Kumar  wrote:
> > On Wed, Jul 10, 2019 at 10:15 AM Amit Langote  
> > wrote:
> > > Thanks for checking.  There has been a lot of churn in the inheritance
> > > planning code since my last email on this thread, so I'd like to
> > > reconsider.  I'm busy this week with some things, so I'll try posting
> > > something on next Tuesday.
> > >
> > Sounds good.
>
> I looked at this today and concluded that the problem and the patches
> discussed here are fairly isolated from inheritance planning changes
> committed to PG 12.
>
> I've combined the two patches into one.
Looks fine to me, moved to ready for committer.

> I tried to think up test cases to go with the code changes, but
> couldn't come up with one.

I am also not sure how to test whether we have access to the
statistics of the table.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: ANALYZE: ERROR: tuple already updated by self

2019-07-29 Thread Tomas Vondra

On Mon, Jul 29, 2019 at 10:15:36AM +0100, Dean Rasheed wrote:

On Sun, 28 Jul 2019 at 11:15, Tomas Vondra  wrote:


Attached is a patch fixing the error by not building extended stats for
the inh=true case (as already proposed in this thread). That's about the
simplest way to resolve this issue for v12. It should add a simple
regression test too, I guess.



Seems like a reasonable thing to do for 10, 11 and possibly also 12
(actually, as you noted, I think it's the only thing that can be done
for 10 and 11).



OK, will do.


To fix this properly we need to add a flag similar to stainherit
somewhere. And I've started working on that, thinking that maybe we
could do that even for v12 - it'd be yet another catversion bump, but
there's already been one since beta2 so maybe it would be OK.



Yeah, I think that makes sense, if it's not too hard. Since v12 is
where the stats definition is split out from the stats data, this
might work out quite neatly, since the inh flag would apply only to
the stats data.



Agreed, we need to add the inh flag to the pg_statistic_ext_data
catalog. The trouble is this makes the maintenance somewhat more
complicated, because we suddenly don't have a 1:1 mapping :-(

But if we want to address this in master only, I think that's fine.


But it's actually a bit more complicated than just adding a column to
the catalog, for two reasons:

1) The optimizer part has to be tweaked to pick the right object, with
the flag set to either true/false. Not trivial, but doable.



Isn't it just a matter of passing the inh flag to
get_relation_statistics() from get_relation_info(), so then the
optimiser would get the right kind of stats data, depending on whether
or not inheritance was requested in the query.



Yes, you're right. I had only skimmed how the existing code uses the inh
flag (for regular stats) and it seemed somewhat more complex, but you're
right that for extended stats it'd be much simpler.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 4:39 AM Masahiko Sawada  wrote:
> After more thoughts, I'm confused why we need to have MDEK. We can use
> KEK derived from passphrase and TDEK and WDEK that are derived from
> KEK. That way, we don't need store any key in database file. What is
> the advantage of 3-tier key architecture?

The separate MDEK serves a couple of purposes:

1. It allows rotating the passphrase without actually changing any of
the downstream derived keys.
2. It allows verifying that the passphrase itself is correct by checking
whether it can unlock and authenticate (via a MAC) the MDEK.
3. It ensures the key is generated from a strong random source (ex: /dev/urandom).

If the MDEK were directly derived via a deterministic function of the
passphrase, then that passphrase could never change, as all your
derived keys would also change (and thus could no longer decrypt
existing data). The encrypted MDEK provides a level of indirection for
passphrase rotation.

An argument could be made to push that problem upstream, i.e. let the
supplier of the passphrase deal with the indirection. You would still
need to verify that the supplied passphrase/key is correct via something
like authenticating against a stored MAC. If you're going to do that, you
might as well directly support decrypting and managing your own MDEK.
That also lets you ensure it was properly generated from a strong random
source.
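
In rough pseudo-code, the indirection looks something like the sketch
below (Python's cryptography package, used purely for illustration; the
KDF choice, iteration count, and key sizes are placeholders, not a
proposal):

# Sketch: KEK from passphrase, random MDEK wrapped by the KEK, and
# passphrase rotation that re-wraps the MDEK without changing it.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

def derive_kek(passphrase: bytes, salt: bytes) -> bytes:
    # KEK is derived from the admin-supplied passphrase
    return PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                      salt=salt, iterations=100_000).derive(passphrase)

salt = os.urandom(16)
mdek = os.urandom(32)                    # generated once from a strong RNG
kek = derive_kek(b"old passphrase", salt)
wrapped_mdek = aes_key_wrap(kek, mdek)   # only the wrapped form is stored

# Passphrase rotation: unwrap with the old KEK, re-wrap with the new one.
# The MDEK, and every key derived from it, stays the same.
new_salt = os.urandom(16)
new_kek = derive_kek(b"new passphrase", new_salt)
wrapped_mdek = aes_key_wrap(new_kek, aes_key_unwrap(kek, wrapped_mdek))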

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: ANALYZE: ERROR: tuple already updated by self

2019-07-29 Thread Tomas Vondra

On Sun, Jul 28, 2019 at 09:53:20PM -0700, Andres Freund wrote:

Hi,

On 2019-07-28 21:21:51 +0200, Tomas Vondra wrote:

AFAICS it applies to 10+ versions, because that's where extended stats
were introduced. We certainly can't mess with catalogs there, so this is
about the only backpatchable fix I can think of.


AFAIU the inh version wouldn't be used anyway, and this has never
worked. So imo it's fine to fix it that way for < master. For
master we should have something better, even if perhaps not immediately.



Agreed. I'll get the simple fix committed soon, and put a TODO on my
list for pg13.


thanks

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Fetching timeline during recovery

2019-07-29 Thread Jehan-Guillaume de Rorthais
On Fri, 26 Jul 2019 18:22:25 +0200
Jehan-Guillaume de Rorthais  wrote:

> On Fri, 26 Jul 2019 10:02:58 +0200
> Jehan-Guillaume de Rorthais  wrote:
> 
> > On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
> > Kyotaro Horiguchi  wrote:
> [...]
> > > We have an LSN reporting function each for several objectives.
> > > 
> > >  pg_current_wal_lsn
> > >  pg_current_wal_insert_lsn
> > >  pg_current_wal_flush_lsn
> > >  pg_last_wal_receive_lsn
> > >  pg_last_wal_replay_lsn  
> > 
> > Yes. In fact, my current implementation might be split as:
> > 
> >   pg_current_wal_tl: returns TL on a production cluster
> >   pg_last_wal_received_tl: returns last received TL on a standby
> > 
> > If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
> > *flush_tl would be useful as a cluster in production is not supposed to
> > change its timeline during its lifetime.
> > 
> > > But, I'm not sure just adding further pg_last_*_timeline() to
> > > this list is a good thing..  
> > 
> > I think this is a much better idea than mixing different case (production
> > and standby) in the same function as I did. Moreover, it's much more
> > coherent with other existing functions.
> 
> Please, find in attachment a new version of the patch. It now creates two new
> fonctions: 
> 
>   pg_current_wal_tl()
>   pg_last_wal_received_tl()

I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
Please find the corrected patch in attachment:
0001-v3-Add-functions-to-get-timeline.patch

Also, TimeLineID is declared as a uint32. So why do we use
PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
See e.g. pg_stat_get_wal_receiver().

Regards,
From 031d60de3e4239c83554c89c0c382c6390545434 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Thu, 25 Jul 2019 19:36:40 +0200
Subject: [PATCH] Add functions to get timeline

pg_current_wal_tl() returns the current timeline of a cluster in production.

pg_last_wal_received_tl() returns the timeline of the last xlog record
flushed to disk.
---
 src/backend/access/transam/xlog.c  | 17 +
 src/backend/access/transam/xlogfuncs.c | 20 
 src/backend/replication/walreceiver.c  | 19 +++
 src/include/access/xlog.h  |  1 +
 src/include/catalog/pg_proc.dat| 12 
 5 files changed, 69 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index da3d250986..fd30c88534 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12243,3 +12243,20 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Returns current active timeline.
+ */
+TimeLineID
+GetCurrentTimeLine(void)
+{
+	TimeLineID	localTimeLineID;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+
+	localTimeLineID = XLogCtl->ThisTimeLineID;
+
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return localTimeLineID;
+}
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..ae877be351 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -776,3 +776,23 @@ pg_promote(PG_FUNCTION_ARGS)
 			(errmsg("server did not promote within %d seconds", wait_seconds)));
 	PG_RETURN_BOOL(false);
 }
+
+/*
+ * Returns the current timeline on a production cluster
+ */
+Datum
+pg_current_wal_tl(PG_FUNCTION_ARGS)
+{
+	TimeLineID currentTL;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("%s cannot be executed during recovery.",
+		 "pg_current_wal_tl()")));
+
+	currentTL = GetCurrentTimeLine();
+
+	PG_RETURN_INT32(currentTL);
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6abc780778..9bffd822ff 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1454,3 +1454,22 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
+
+/*
+ * Returns the timeline of the last xlog record flushed to WAL
+ */
+Datum
+pg_last_wal_received_tl(PG_FUNCTION_ARGS)
+{
+	TimeLineID	lastReceivedTL;
+	WalRcvData *walrcv = WalRcv;
+
+	SpinLockAcquire(&walrcv->mutex);
+	lastReceivedTL = walrcv->receivedTLI;
+	SpinLockRelease(&walrcv->mutex);
+
+	if (!lastReceivedTL)
+		PG_RETURN_NULL();
+
+	PG_RETURN_INT32(lastReceivedTL);
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..f0502c0b41 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -313,6 +313,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
+extern TimeLineID GetCurrentTimeLine(void);
 
 exte

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Masahiko Sawada
On Mon, Jul 29, 2019 at 7:17 PM Sehrope Sarkuni  wrote:
>
> On Mon, Jul 29, 2019 at 4:39 AM Masahiko Sawada  wrote:
> > After more thoughts, I'm confused why we need to have MDEK. We can use
> > KEK derived from passphrase and TDEK and WDEK that are derived from
> > KEK. That way, we don't need store any key in database file. What is
> > the advantage of 3-tier key architecture?
>
> The separate MDEK serves a couple purposes:
>
> 1. Allows for rotating the passphrase without actually changing any of
> the downstream derived keys.
> 2. Verification that the passphrase itself is correct by checking if
> it can unlock and authenticate (via a MAC) the MDEK.
> 3. Ensures it's generated from a strong random source (ex: /dev/urandom).
>
> If the MDEK was directly derived via a deterministic function of the
> passphrase, then that passphrase could never change as all your
> derived keys would also change (and thus could not be decrypt their
> existing data). The encrypted MDEK provides a level of indirection for
> passphrase rotation.

Understood. Thank you for explanation!

>
> An argument could be made to push that problem upstream, i.e. let the
> supplier of the passphrase deal with the indirection. You would still
> need to verify the supplied passphrase/key is correct via something
> like authenticating against a stored MAC.

So do we need a key for the MAC of the passphrase/key in order to verify it?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada  wrote:
> > An argument could be made to push that problem upstream, i.e. let the
> > supplier of the passphrase deal with the indirection. You would still
> > need to verify the supplied passphrase/key is correct via something
> > like authenticating against a stored MAC.
>
> So do we need the key for MAC of passphrase/key in order to verify?

Yes. Any 128- or 256-bit value is a valid AES key, and any 16-byte input
can be "decrypted" with it in both CTR and CBC mode; you'll just end
up with garbage data if the key does not match. Verification of the
key prior to usage (i.e. before starting the DB and encrypting/decrypting
data) is a must, as otherwise you'll end up with all kinds of corruption
or data loss.

From a single user-supplied passphrase you would derive the MDEK and
compute a MAC (either using the same key or via a separate derived
MDEK-MAC key). If the computed MAC matches the previously stored
value, then you know the MDEK is correct as well.
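
As a concrete sketch of that check (Python, purely for illustration; the
derived MAC-key label, context strings, and lengths are made up):

# Sketch: derive a MAC key from the MDEK and verify a stored HMAC before
# using any derived keys. Illustration only; names are assumptions.
import hashlib
import hmac
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def mdek_mac_key(mdek: bytes) -> bytes:
    # separate derived key used only for authentication
    return HKDF(algorithm=hashes.SHA256(), length=32,
                salt=None, info=b"mdek-mac").derive(mdek)

def make_verifier(mdek: bytes) -> bytes:
    # computed and stored alongside the wrapped MDEK at initdb time
    return hmac.new(mdek_mac_key(mdek), b"mdek-verify", hashlib.sha256).digest()

def mdek_ok(candidate_mdek: bytes, stored_verifier: bytes) -> bool:
    # fails if the passphrase (and hence the unwrapped MDEK) is wrong
    expected = hmac.new(mdek_mac_key(candidate_mdek), b"mdek-verify",
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, stored_verifier)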

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Masahiko Sawada
On Mon, Jul 29, 2019 at 11:33 AM Bruce Momjian  wrote:
>
> On Thu, Jul 25, 2019 at 01:03:06PM -0400, Bruce Momjian wrote:
> > On Tue, Jul 16, 2019 at 01:24:54PM +0900, Masahiko Sawada wrote:
> > > On Sat, Jul 13, 2019 at 12:33 AM Bruce Momjian  wrote:
> > > > then each row change gets its own LSN.  You are asking if an update that
> > > > just expires one row and adds it to a new page gets the same LSN.  I
> > > > don't know.
> > >
> > > The following scripts can reproduce that different two pages have the 
> > > same LSN.
> > >
> > > =# create table test (a int);
> > > CREATE TABLE
> > > =# insert into test select generate_series(1, 226);
> > > INSERT 0 226
> > > =# update test set a = a where a = 1;
> > > UPDATE 1
> > > =# select lsn from page_header(get_raw_page('test', 0));
> > > lsn
> > > ---
> > >  0/1690488
> > > (1 row)
> > >
> > > =# select lsn from page_header(get_raw_page('test', 1));
> > > lsn
> > > ---
> > >  0/1690488
> > > (1 row)
> > >
> > > So I think it's better to use the LSN and page number to create the IV.
> > > If we modify different tables with a single WAL record we would also need
> > > the OID or relfilenode, but I don't think we currently have such operations.
> >
> > OK, good to know, thanks.
>
> I did some more research on which cases use a single LSN to modify
> multiple 8k pages.  The normal program flow is:
>
> XLogBeginInsert();
> ...
> XLogRegisterBuffer(0, meta, ...
> --> recptr = XLogInsert(RM_BRIN_ID, XLOG_...
>
> page = BufferGetPage(meta);
> PageSetLSN(page, recptr);
>
> XLogInsert() calls BufferGetTag(), which fills in the buffer's
> RelFileNode (which internally is the tablespace, database, and
> pg_class.relfilenode).  So, to use the LSN and page-number for the IV,
> we need to make sure that there is no other encryption of those values
> in a different relation.  What I did was to find cases where
> XLogRegisterBuffer/PageSetLSN are called more than once for a single
> LSN.  I found cases in:
>
> brin_doupdate
> brin_doinsert
> brinRevmapDesummarizeRange
> revmap_physical_extend
> GenericXLogFinish
> ginPlaceToPage
> shiftList
> ginDeletePage
> gistXLogSplit
> gistXLogPageDelete
> gistXLogUpdate
> hashbucketcleanup
> _hash_doinsert
> _hash_vacuum_one_page
> _hash_addovflpage
> _hash_freeovflpage
> _hash_squeezebucket
> _hash_init
> _hash_expandtable
> _hash_splitbucket
> log_heap_visible
> log_heap_update
> _bt_insertonpg
> _bt_split
> _bt_newroot
> _bt_getroot
> _bt_mark_page_halfdead
> _bt_unlink_halfdead_page
> addLeafTuple
> moveLeafs
> doPickSplit
> spgSplitNodeAction
> log_newpage_range
>
> Most of these are either updating the different pages in the same
> relation (so the page-number for the IV would be different), or are
> modifying other types of files, like vm.  (We have not discussed if we
> are going to encrypt vm or fsm.  I am guessing we are not.)
>
> You might say, well, is it terrible if we reuse the LSN in a different
> relation with the same page number?  Yes.  The way CTR works, it
> generates a stream of bits using the key and IV (which will be LSN and
> page number).  It then XORs it with the page contents to encrypt it.  If
> we encrypt an all-zero gap in a page, or a place in the page where the
> page format is known, a user can XOR that with the encrypted data and
> get the bit stream at that point.  They can then go to another page that
> uses the same key and IV and XOR that to get the decrypted data.  CBC
> mode is slightly better because it mixes the user data into the future
> 16-byte blocks, but lots of our early-byte page format is known, so it
> isn't great.
>
> You might say, wow, that is a lot of places to make sure we don't reuse
> the LSN in a different relation with the same page number --- let's mix
> the relfilenode in the IV so we are sure the IV is not reused.
>
> Well, the pg_class.relfilenode is only unique within the
> tablespace/database, i.e., from src/include/storage/relfilenode.h:
>
>  * relNode identifies the specific relation.  relNode corresponds to
>  * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
>  * to assign new physical files to relations in some situations).
>  * Notice that relNode is only unique within a database in a particular
>  * tablespace.
>
> So, we would need to mix the tablespace, database oid, and relfilenode
> into the IV to be unique.  We would then need to have pg_upgrade
> preserve the relfilenode, change CREATE DATABASE to decrypt/encrypt when
> creating a new database, and no longer allow files to be moved between
> tablespaces without decryption/encryption.
>
> There are just a whole host of complexities we add to encryption if we
> add the requirement of preserv

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Masahiko Sawada
On Mon, Jul 29, 2019, 20:43 Masahiko Sawada  wrote:

> That way, we never reuse an IV in a different relation with the same page
> number, because relNode is unique within a database in a particular
> tablespace, as you mentioned.
>

Sorry, I meant that we can ensure IV+key is unique.

--
Masahiko Sawada


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Bruce Momjian
On Mon, Jul 29, 2019 at 08:43:06PM +0900, Masahiko Sawada wrote:
> > I am thinking of writing some Assert() code that checks that all buffers
> > using a single LSN are from the same relation (and therefore different
> > page numbers).  I would do it by creating a static array, clearing it on
> > XLogBeginInsert(), adding to it for each  XLogInsert(), then checking on
> > PageSetLSN() that everything in the array is from the same file.  Does
> > that make sense?
> 
> I had the same concern before. We could have BKPBLOCK_SAME_REL flag in
> XLogRecordBlockHeader, which indicates that the relation of the block
> is the same as the previous block and therefore we skip to write
> RelFileNode. At first glance I thought it's possible that one WAL
> record can contain different RelFileNodes but I didn't find any code
> attempting to do that.

Yes, the point is that the WAL record makes it possible, so we either
have to test for it or allow it.

> Checking that all buffers using a single LSN are from the same
> relation would be a good idea, but I think it's hard to test it and
> regard the test result as conclusive. Even if we passed 'make check-world',
> it might still be possible for this to happen. And even assertion failures

Yes, the problem is that if you embed the relfilenode or tablespace or
database in the encryption IV, you then need to make sure you
re-encrypt any files that move between these.  I am hesitant to do that
since it then requires these workarounds for encryption going forward.
We know that most people will not be using encryption, so that will not
be well tested either.  For pg_upgrade, I used a minimal-impact
approach, and it has allowed dramatic changes in our code without
requiring changes and retesting of pg_upgrade.

> don't happen in production environments. So I guess it would be better
> to build the IV so that we never reuse it in a different relation with the
> same page number. An idea I came up with is to make the IV from (PageLSN,
> PageNumber, relNode) and have the encryption keys per tablespace.
> That way, we never reuse an IV in a different relation with the same page
> number, because relNode is unique within a database in a particular
> tablespace, as you mentioned.

Yes, this is what we are discussing.  Whether the relfilenode is part of
the IV, or we derive a key with a mix of the master encryption key and
relfilenode is mostly a matter of what fits into which bits.  With CTR,
I think we agreed it has to be LSN and page-number (and CTR counter),
and we only have 5 bits left.  If we wanted to add anything else, it
would be done via the creation of a derived key;  this was covered here:


https://www.postgresql.org/message-id/CAH7T-ap1Q9yHjGSO4ZJaVhU3L=u14tshmr++ccc_hk3eoqk...@mail.gmail.com
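
To make the keystream-reuse problem concrete, here is a toy sketch
(Python's cryptography package, purely for illustration; the page
contents and the IV layout of LSN plus page number are made-up examples,
not the layout being proposed):

# Toy demonstration of why a (key, IV) pair must never repeat with AES-CTR:
# known plaintext in one page decrypts another page encrypted with the
# same key and IV, without ever knowing the key.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def page_iv(lsn: int, blkno: int) -> bytes:
    # example layout: 8-byte LSN + 4-byte block number + 4 bytes of counter space
    return lsn.to_bytes(8, "big") + blkno.to_bytes(4, "big") + bytes(4)

def ctr_encrypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    enc = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
    return enc.update(data) + enc.finalize()

key = os.urandom(32)
iv = page_iv(lsn=0x1690488, blkno=0)      # same LSN *and* same page number

page_a = b"PAGE HEADER WITH A KNOWN FORMAT".ljust(64, b"\x00")
page_b = b"secret user data we should not be able to read".ljust(64, b"\x00")

ct_a = ctr_encrypt(key, iv, page_a)
ct_b = ctr_encrypt(key, iv, page_b)       # IV reuse in a different relation

# Attacker knows page_a's layout: keystream = ct_a XOR page_a,
# then page_b = ct_b XOR keystream, with no key required.
keystream = bytes(c ^ p for c, p in zip(ct_a, page_a))
recovered = bytes(c ^ k for c, k in zip(ct_b, keystream))
assert recovered == page_b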

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: query1 followed by query2 at maximum distance vs current fixed distance

2019-07-29 Thread Arthur Zakirov

Hello,

On 23.07.2019 09:55, Wh isere wrote:

Is this possible with the current websearch_to_tsquery function?

Thanks.

Hello everyone, I am wondering if
AROUND(N) or  is still possible? I found this thread below and
the original post

https://www.postgresql.org/message-id/fe93ff7e9ad79196486ada79e268%40postgrespro.ru
mentioned the proposed feature: 'New operator AROUND(N). It matches
if the distance between words(or maybe phrases) is less than or
equal to N.'

Currently, in tsquery_phrase(query1 tsquery, query2 tsquery, distance
integer), the distance is a fixed distance. Is there a way to search up
to a maximum distance, so the search returns query1 followed by query2
up to a certain distance, like the AROUND(N) or  mentioned in the
thread?
As far as I know AROUND(N) and  weren't committed, unfortunately.
So currently you can only search using a fixed distance.


websearch_to_tsquery() can't help here. It just transforms a search
pattern with OR and AND statements into tsquery syntax.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Define jsonpath functions as stable

2019-07-29 Thread Alexander Korotkov
Hi!

During my work on bringing the jsonpath patchset to commit, I always
kept in mind that we need to make the jsonb_path_*() functions
immutable.  With these functions immutable, users can build
expression indexes over them.  Naturally, in the majority of cases one
doesn't need to index whole json documents, but only some parts of
them.  jsonpath provides great facilities for extracting the indexable
parts of a document, much more powerful than our current operator set.

However, we've spotted some deviations between the standard and our implementation.
 * The like_regex predicate uses our regular expression engine, which
deviates from the standard.
 * We always do numeric computations using the numeric datatype, even if
the user explicitly calls the .double() method.  Probably, our current
implementation still fits the standard, but in the future we may want to
use floating-point computation in some cases as a performance optimization.

These deviations don't look critical by themselves.  But immutable
functions would make it problematic to fix them in the future.  Also, I'm
not sure this is the complete list of deviations we have.  We might have,
for example, hidden deviations in the handling of strict/lax modes, which
are hard to detect and understand.

Therefore, I'm going to mark the jsonb_path_*() functions stable, not
immutable.  Nevertheless, users will still have multiple options for
indexing:
1) jsonb_path_ops supports jsonpath matching operators in some cases.
2) One can wrap jsonb_path_*() in a pl/* function and mark it as
immutable at one's own risk.  This approach is widely used to build
indexes over to_date()/to_timestamp().
3) We're going to provide support for jsonpath operators in the jsquery
extension before the release of PostgreSQL 12.

I'd like to note that I don't mean we would never have immutable
functions for jsonpath evaluation.  Once we are sure enough that
we know the immutable subset of jsonpath, we may define immutable
functions for its evaluation.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


switch_jsonpath_functions_to_stable.patch
Description: Binary data


Re: Define jsonpath functions as stable

2019-07-29 Thread Tom Lane
Alexander Korotkov  writes:
> During my work on bringing jsonpath patchset to commit, I was always
> keeping in mind that we need to make jsonb_path_*() functions
> immutable.  Having these functions immutable, users can build
> expression indexes over them.

Right.

> However, we've spotted some deviations between standard and our 
> implementation.
>  * like_regex predicate uses our regular expression engine, which
> deviates from standard.
>  * We always do numeric computations using numeric datatype.  Even if
> user explicitly calls .double() method.  Probably, our current
> implementation still fits standard.  But in future we may like to use
> floating point computation in some cases for performance optimization.
> ...
> Therefore, I'm going to mark jsonb_path_*() functions stable, not
> immutable.

I dunno, I think you are applying a far more rigorous definition of
"immutable" than we ever have in the past.  The possibility that we
might change the implementation in the future should not be enough
to disqualify a function from being immutable --- if that were the
criterion, nothing more complex than int4pl could be immutable.

Wouldn't it be better that, in the hypothetical major version where
we change the implementation, we tell users that they must reindex
any affected indexes?

As a comparison point, we allow people to build indexes on tsvector
results, which are *easy* to change just by adjusting configuration
files.  The fact that this might force the need for reindexing hasn't
made it unworkable.

regards, tom lane




Re: how to run encoding-dependent tests by default

2019-07-29 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-07-28 20:12, Tom Lane wrote:
>> So I wish we could get rid of the Makefile changes, have the test
>> scripts be completely responsible for whether to run themselves or
>> not, and put them into the schedule files normally.

> Good points.  Updated patch attached.

v3 looks good and passes local testing.  I've marked it RFC.

> (The two tests create the same schema name, so they cannot be run in
> parallel.  I opted against changing that here, since it would blow up
> the patch and increase the diff between the two tests.)

This does create one tiny nit, which is that the order of the
parallel and serial schedule files doesn't match.  Possibly I'm
overly anal-retentive about that, but I think it's confusing
when they don't.

regards, tom lane




Re: Define jsonpath functions as stable

2019-07-29 Thread Chapman Flack
Hi,

On 7/29/19 10:25 AM, Alexander Korotkov wrote:

>  * like_regex predicate uses our regular expression engine, which
> deviates from standard.

I still favor adding some element to the syntax (like a 'posix' or 'pg'
keyword in the grammar for like_regex) that identifies it as using
a  different regexp flavor, so the way forward to a possible compliant
version later is not needlessly blocked (or consigned to a
standard_conforming_strings-like experience).

That would also resolve much of the case against calling that
predicate immutable.

It looks as if, in my first implementation of XQuery regexps, there
will have to be a "not-quite-standard" flag for those too, because
it turns out the SQL committee made some tweaks to XQuery regexps[1],
whereas any XQuery library one relies on is going to provide untweaked
XQuery regexps out of the box. (The differences only affect ^ $ . \s \S)

Regards,
-Chap


[1]
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions




Re: TopoSort() fix

2019-07-29 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote:
>> I think this is a sufficiently obvious bug, and a sufficiently
>> obvious fix, that we should just fix it and not insist on getting
>> a reproducible test case first.  I think a test case would soon
>> bit-rot anyway, and no longer exercise the problem.
>> I certainly do *not* want to wait so long that we miss the
>> upcoming minor releases.

> +1.  Any volunteers?

If Robert doesn't weigh in pretty soon, I'll take responsibility for it.

regards, tom lane




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread Robert Haas
On Mon, Jul 29, 2019 at 4:10 AM vignesh C  wrote:
> I could not locate the caller of ParsePrepareRecord function in twophase.c.
> Any idea how it gets called?
> or
> Is it a dead function?

It looks like it's not only dead, but stillborn.  Commit
1eb6d6527aae264b3e0b9c95aa70bb7a594ad1cf introduced it but without
introducing any code that called it, and nothing has changed since
then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Robert Haas
On Sun, Jul 28, 2019 at 9:38 PM Thomas Munro  wrote:
> Interesting.  One way to bring back posix_fallocate() without
> upsetting people on some filesystem out there would be to turn the new
> wal_init_zero GUC into a choice: write (current default, and current
> behaviour for 'on'), pwrite_hole (write just the final byte, current
> behaviour for 'off'), posix_fallocate (like that 2013 patch that was
> reverted) and posix_fallocate_and_write (do both as you said, to try
> to solve that problem you mentioned that led to the revert).
>
> I suppose there'd be a parallel GUC undo_init_zero.  Or some more
> general GUC for any fixed-sized preallocated files like that (for
> example if someone were to decide to do the same for SLRU files
> instead of growing them block-by-block), called something like
> file_init_zero.

I think it's pretty sane to have a GUC for how we extend files, but to
me it seems like overkill to have one for every separate kind of file.
It's not theoretically impossible that you could have the data and WAL
on separate partitions on separate mount points with, consequently,
separate needs, and the data (including undo) could be split among
multiple tablespaces each of which uses a different filesystem.
Probably, the right design would be a per-tablespace storage option
plus an overall default that is always used for WAL. However, that
strikes me as a lot of complexity for a pretty marginal use case: most
people have a favorite filesystem and stick with it.

And all of that seems like something a bit separate from coming up
with a good undo framework.  Why doesn't undo just do this like we do
it elsewhere, and leave the question of changing the way we do
extend-and-zero for another thread?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade fails with non-standard ACL

2019-07-29 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Bruce Momjian  writes:
> > On Thu, Jul 18, 2019 at 06:53:12PM +0300, Anastasia Lubennikova wrote:
> >> pg_upgrade from 9.6 fails if old cluster had non-standard ACL
> >> on pg_catalog functions that have changed between versions,
> >> for example pg_stop_backup(boolean).
> 
> > Uh, wouldn't this affect any default-installed function where the
> > permission are modified?  Is fixing only a few functions really helpful?
> 
> No, it's just functions whose signatures have changed enough that
> a GRANT won't find them.  I think the idea is that the set of
> potentially-affected functions is determinate.  I have to say that
> the proposed patch seems like a complete kluge, though.  For one
> thing we'd have to maintain the list of affected functions in each
> future release, and I have no faith in our remembering to do that.

Well, we aren't likely to do that ourselves, no, but perhaps we could
manage it with some prodding by having the buildfarm check for such
cases, not unlike how library maintainers check the ABI between versions
of the library they manage.  Extension authors also deal with these
kinds of changes routinely when writing the upgrade scripts to go
between versions of their extension.  I'm not convinced that this is a
great approach to go down either, to be clear.  When going across major
versions, making people update their systems/code and re-test is
typically entirely reasonable to me.

> It's also fair to question whether pg_upgrade should even try to
> cope with such cases.  If the function has changed signature,
> it might well be that it's also changed behavior enough so that
> any previously-made grants need reconsideration.  (Maybe we should
> just suppress the old grant rather than transferring it.)

Suppressing the GRANT strikes me as pretty reasonable as an approach but
wouldn't that require us to similarly track what's changed between
major versions..?  Unless we arrange to ignore the GRANT failing, but
that seems like it would involve a fair bit of hacking around in pg_dump
to have some option to ignore certain GRANTs failing.  Did you have some
other idea about how to suppress the old GRANT?

A way to make things work for users while suppressing the GRANTS would
be to add a default role for things like file-level-backup, which would
be allowed to execute file-level-backup related functions, presumably
even if we make changes to exactly what those function signatures are,
and then encourage users to GRANT that role to the role that's allowed
to log in and run the file-level backup.  Obviously we wouldn't be doing
that in the back-branches, but we could moving forward.

> Still, this does seem like a gap in the pg_init_privs mechanism.
> I wonder if Stephen has any thoughts about what ought to happen.

So, in an interesting sort of way, we have a way to deal with this
problem when it comes to *extensions* and I suppose that's why we
haven't seen it there- namely the upgrade script, which can decide if it
wants to drop an object and recreate it, or if it wants to do a
create-or-replace, which would preserve the privileges (though the API
has to stay the same, so that isn't exactly the same) and avoid dropping
dependent objects.

Unfortunately, we don't have any good way today to add an optional
argument to a function while preserving the privileges on it, which
would make a case like this one (and others where you'd prefer to not
drop/recreate the function due to dependencies) work, for extensions.

Suppressing the GRANT also seems reasonable for the case of objects
which have been renamed- clearly whatever is using those functions is
going to have to be modified to deal with the new name of the function,
requiring that the GRANT be re-issued doesn't seem like it's that much
more to ask of users.  On the other hand, properly written tools that
check the version of PG and use the right function names could possibly
"just work" following a major version upgrade, if the privilege was
brought across to the new major version correctly.

We also don't want to mistakenly GRANT users more access than they
should have, though- if pg_stop_backup() one day grows an
optional argument to run some server-side script, I don't think we'd
want to necessarily just give access to that ability to roles who,
today, can execute the current pg_stop_backup() function.  Of course, if
we added such a capability, hopefully we would do so in a way that less
privileged roles could continue to use the existing capability without
having access to run such a server-side script.

I also don't think that the current patch is actually sufficient to deal
with all the changes we've made between the versions- what about column
names on catalog tables/views that were removed, or changed/renamed..?

In an ideal world, it seems like we'd make a judgement call and arrange
to pull the privileges across when we can do so without granting the
user privileges beyond those that 

Re: benchmarking Flex practices

2019-07-29 Thread Tom Lane
John Naylor  writes:
> On Sun, Jul 21, 2019 at 3:14 AM Tom Lane  wrote:
>> So I'm feeling like maybe we should experiment to see what that
>> solution looks like, before we commit to going in this direction.
>> What do you think?

> Given the above wrinkles, I thought it was worth trying. Attached is a
> rough patch (don't mind the #include mess yet :-) ) that works like
> this:

> The lexer returns UCONST from xus and UIDENT from xui. The grammar has
> rules that are effectively:

> SCONST { do nothing}
> | UCONST { esc char is backslash }
> | UCONST UESCAPE SCONST { esc char is from $3 }

> ...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
> conflicts, I added UIDENT to the %nonassoc precedence list to match
> IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
> there's a more principled way. I also added an unsigned char type to
> the %union, but it worked fine on my compiler without it.

I think it might be better to drop the separate "Uescape" production and
just inline that into the calling rules, exactly per your sketch above.
You could avoid duplicating the escape-checking logic by moving that into
the str_udeescape support function.  This would avoid the need for the
"uchr" union variant, but more importantly it seems likely to be more
future-proof: IME, any time you can avoid or postpone shift/reduce
decisions, it's better to do so.

I didn't try, but I think this might allow dropping the %left for
UESCAPE.  That bothers me because I don't understand why it's
needed or what precedence level it ought to have.

> litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
> former had to be massaged to give error messages similar to HEAD. They're
> not quite identical, but the position info is preserved. Some of the
> functions I moved around don't seem to have any test coverage, so I
> should eventually do some work in that regard.

I don't terribly like the cross-calls you have between gram.y and scan.l
in this formulation.  If we have to make these functions (hexval() etc)
non-static anyway, maybe we should shove them all into scansup.c?

> -Binary size is very close to v6. That is to say the grammar tables
> grew by about the same amount the scanner table shrank, so the binary
> is still about  200kB smaller than HEAD.

OK.

> -Performance is very close to v6 with the information_schema and
> pgbench-like queries with standard strings, which is to say also very
> close to HEAD. When the latter was changed to use Unicode escapes,
> however, it was about 15% slower than HEAD. That's a big regression
> and I haven't tried to pinpoint why.

I don't quite follow what you changed to produce the slower test case?
But that seems to be something we'd better run to ground before
deciding whether to go this way.

> -The ecpg changes here are only the bare minimum from HEAD to get it
> to compile, since I'm borrowing its additional token names (although
> they mean slightly different things). After a bit of experimentation,
> it's clear there's a bit more work needed to get it functional, and
> it's not easy to debug, so I'm putting that off until we decide
> whether this is the way forward.

On the whole I like this approach, modulo the performance question.
Let's try to work that out before worrying about ecpg.

regards, tom lane




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread vignesh C
On Mon, Jul 29, 2019 at 8:24 PM Robert Haas  wrote:
>
> On Mon, Jul 29, 2019 at 4:10 AM vignesh C  wrote:
> > I could not locate the caller of ParsePrepareRecord function in twophase.c.
> > Any idea how it gets called?
> > or
> > Is it a dead function?
>
> It looks like it's not only dead, but stillborn.  Commit
> 1eb6d6527aae264b3e0b9c95aa70bb7a594ad1cf introduced it but without
> introducing any code that called it, and nothing has changed since
> then.
>
I feel the function can be safely removed.
A patch doing so is attached.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


001_Unused_function_ParsePrepareRecord_removal.patch
Description: Binary data


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Bruce Momjian
On Sun, Jul 28, 2019 at 10:33:03PM -0400, Bruce Momjian wrote:
> I did some more research on which cases use a single LSN to modify
> multiple 8k pages.  The normal program flow is:
> 
> XLogBeginInsert();
>   ...
> --> XLogRegisterBuffer(0, meta, ...
> recptr = XLogInsert(RM_BRIN_ID, XLOG_...
> 
> page = BufferGetPage(meta);
> PageSetLSN(page, recptr);
> 
> XLogInsert() calls BufferGetTag(), which fills in the buffer's

Correction, XLogRegisterBuffer() calls BufferGetTag().  I have updated the
quote above.  That is the function I checked, not XLogInsert().

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Built-in connection pooler

2019-07-29 Thread Konstantin Knizhnik




On 26.07.2019 23:24, Tomas Vondra wrote:

Hi Konstantin,

I've started reviewing this patch and experimenting with it, so let me
share some initial thoughts.


1) not handling session state (yet)

I understand handling session state would mean additional complexity, so
I'm OK with not having it in v1. That being said, I think this is the
primary issue with connection pooling on PostgreSQL - configuring and
running a separate pool is not free, of course, but when people complain
to us it's when they can't actually use a connection pool because of
this limitation.

So what are your plans regarding this feature? I think you mentioned
you already have the code in another product. Do you plan to submit it
in the pg13 cycle, or what's the plan? I'm willing to put some effort
into reviewing and testing that.


I completely agree with you. My original motivation for implementing a
built-in connection pooler
was to be able to preserve session semantics (prepared statements, GUCs,
temporary tables) for pooled connections.
Almost all production systems have to use some kind of pooling. But when
using pgbouncer & Co we lose the ability
to use prepared statements, which can cause up to a twofold performance
penalty (for simple OLTP queries).

So I have implemented such a version of the connection pooler in PgPro EE.
It requires many changes in the Postgres core, so I realized that there
was no chance of getting it committed in the community
(taking into account that my other patches, like autoprepare and libpq
compression, have been postponed for a very long time, although
they are much smaller and less invasive).

Then Dimitri Fontaine proposed that I implement a much simpler version of
the pooler based on a traditional proxy approach.

This patch is the result of our conversation with Dimitri.
You are asking me about my plans... I think that it will be better to
first polish this version of the patch and commit it, and only after
that add more sophisticated features like saving/restoring session state.





FWIW it'd be nice to expose it as some sort of interface, so that other
connection pools can leverage it too. There are use cases that don't
work with a built-in connection pool (say, PAUSE/RESUME in pgbouncer
allows restarting the database) so projects like pgbouncer or odyssey
are unlikely to disappear anytime soon.


Obviously a built-in connection pooler will never completely replace
external poolers like pgbouncer, which provide more flexibility, e.g. they
make it possible to install the pooler on a separate host or at the client
side.




I also wonder if we could make it more permissive even in v1, without
implementing dump/restore of session state.

Consider for example patterns like this:

 BEGIN;
 SET LOCAL enable_nestloop = off;
 ...
 COMMIT;

or

 PREPARE x(int) AS SELECT ...;
 EXECUTE x(1);
 EXECUTE x(2);
 ...
 EXECUTE x(10);
 DEALLOCATE x;

or perhaps even

 CREATE FUNCTION f() AS $$ ... $$
 LANGUAGE sql
 SET enable_nestloop = off;

In all those cases (and I'm sure there are other similar examples) the
connection pool considers the session 'tainted': it marks it as tainted
and we never reset that. So even when an application tries to play nice,
it can't use pooling.

Would it be possible to maybe track this with more detail (number of
prepared statements, ignore SET LOCAL, ...)? That should allow us to do
pooling even without full support for restoring session state.


Sorry, I do not completely understand your idea (how to implement these
features without maintaining session state).
To implement prepared statements we need to store them in the session
context, or at least add some session-specific prefix to the prepared
statement name.
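For example, the kind of name mangling I mean would look roughly like
this (the session_id and the naming scheme here are made up, purely to
illustrate the idea):

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch only: qualify the client-supplied statement name with a session
 * identifier so that statements from different pooled sessions cannot
 * collide inside one shared backend.
 */
static void
build_session_statement_name(char *dst, size_t dstlen,
                             uint32_t session_id, const char *client_name)
{
    snprintf(dst, dstlen, "s%u_%s", session_id, client_name);
}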
Temporary tables also require a per-session temporary table space. With
GUCs the situation is even more complicated - actually, most of the time
on my PgPro-EE pooler version
was spent fighting with GUCs (default values, reloading
configuration, memory allocation/deallocation, ...).
But the "show stopper" is temporary tables: if they are accessed
through local (non-shared) buffers, then you cannot reschedule a
session to some other backend.
This is why I now have a patch implementing global temporary
tables (a la Oracle), which have global metadata and are accessed through
shared buffers (which also allows them to be used
in parallel queries).




2) configuration

I think we need to rethink how the pool is configured. The options
available at the moment are more a consequence of the implementation and
are rather cumbersome to use in some cases.

For example, we have session_pool_size, which is (essentially) the
number of backends kept in the pool. Which seems fine at first, because
it seems like you might say

   max_connections = 100
   session_pool_size = 50

to say the connection pool will only ever use 50 connections, leaving
the rest for "direct" connection. But that does not work at all, because
the number of backends the pool can open is

   session_pool_size * connection_proxies * databas

Re: Built-in connection pooler

2019-07-29 Thread Konstantin Knizhnik



On 27.07.2019 14:49, Thomas Munro wrote:

On Tue, Jul 16, 2019 at 2:04 AM Konstantin Knizhnik
 wrote:

I have committed patch which emulates epoll EPOLLET flag and so should
avoid busy loop with poll().
I will be pleased if you can check it at FreeBSD  box.

I tried your v12 patch and it gets stuck in a busy loop during make
check.  You can see it on Linux with ./configure ...
CFLAGS="-DWAIT_USE_POLL".


--
Thomas Munro
https://enterprisedb.com


A new version of the patch is attached, which fixes the poll() and Win32
implementations of WaitEventSet.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a3..2758506 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,123 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is enabled.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connections are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  A backend is considered tainted if the client updates GUCs, creates a temporary table or prepared statements.
+  A tainted backend can serve only one client.
+
+
+  The default value is 10, so up to 10 backends will serve each database/user combination.
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to the main "port" will be assigned dedicated backends,
+  while clients connected to the proxy port will be connected to backends through a proxy which
+  performs transaction-level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets the number of connection proxies.
+  The postmaster spawns a separate worker process for each proxy and scatters connections between proxies using one of the scheduling policies (round-robin, random, load-balancing).
+  Each proxy launches its own subset of backends.
+  So the maximal number of non-tainted backends is session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  
+   session_schedule configuration parameter
+  
+  
+  
+
+  Specifies the scheduling policy for assigning sessions to proxies when
+  connection pooling is enabled. The default policy is round-robin.
+
+
+  With the round-robin policy the postmaster cyclically scatters sessions between proxies.
+
+
+  With the random policy the postmaster randomly chooses a proxy for each new session.
+
+
+  With the load-balancing policy the postmaster chooses the proxy with the lowest load average.
+  The load average of a proxy is estimated by the number of client connections assigned to this proxy, with extra weight for SSL connections.
+   
+  
+ 
+
+ 
+  restart_pooler_on_reload (string)
+  
+   restart_pooler_on_reload configuration parameter
+  
+  
+  
+
+  Restart session pool workers once pg_reload_conf() is called.
+  The default value is false.
+   
+  
+ 
+
  
   unix_socket_directories (string)
   
diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml
new file mode 100644
index 000..a4b2720
--- /dev/null
+++ b/doc/src/sgml/connpool.sgml
@@ -0,0 +1,173 @@
+
+
+ 
+  Connection pooling
+
+  
+   built-in connection pool proxy
+  
+
+  
+PostgreSQL spawns a separate process (backend) for each client.
+For a large number of clients this model can consume a large amount of system
+resources and lead to significant performance degradation, especially on computers with a large
+number of CPU cores. The reason is high

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Robert Haas
On Fri, Jul 19, 2019 at 7:28 PM Peter Geoghegan  wrote:
> If I'm not mistaken, you're tacitly assuming that you'll always be
> using zheap, or something sufficiently similar to zheap. It'll
> probably never be possible to UNDO changes to something like a GIN
> index on a zheap table, because you can never do that with sensible
> concurrency/deadlock behavior.

I mean, essentially any well-designed framework intended for any sort
of task whatsoever is going to have a design center where one can
foresee that it will work well, and then as a result of working well
for the thing for which it was designed, it will also work well for
other things that are sufficiently similar.  So, I think you're
correct, but I also don't think that's really saying very much. The
trick is to figure out whether and how the ideas you have could be
generalized with reasonable effort to handle other cases, and that's
easier with some projects than others.  I think when it comes to UNDO,
it's actually really hard. The system has some assumptions built into
it which are probably required for good performance and reasonable
complexity, and it's probably got other assumptions in it which are
unnecessary and could be eliminated if we only realized that we were
making those assumptions in the first place. The more involvement we
get from people who aren't coming at this from the point of view of
zheap, the more likely it is that we'll be able to find those
assumptions and wipe them out before they get set in concrete.
Unfortunately, we haven't had many takers so far -- thanks for chiming
in.

I don't really understand your comments about GIN.  My simplistic
understanding of GIN is that it's not very different from btree in
this regard.  Suppose we insert a row, and then the insert aborts;
suppose also that the index wants to use UNDO.  In the case of a btree
index, we're going to go insert an index entry for the new row; upon
abort, we should undo the index insertion by removing that index tuple
or at least marking it dead.  Unless a page split has happened,
triggered either by the insertion itself or by subsequent activity,
this puts the index in a state that is almost perfectly equivalent to
where we were before the now-aborted transaction did any work.  If a
page split has occurred, trying to undo the index insertion is going
to run into two problems. One, we probably can't undo the page split,
so the index will be logically equivalent but not physically
equivalent after we get rid of the new tuple. Two, if the page split
happened after the insertion of the new tuple rather than at the same
time, the index tuple may not be on the page where we left it.
Possibly we can walk right (or left, or sideways, or diagonally at a
35 degree angle, my index-fu is not great here) and be sure of finding
it, assuming the index is not corrupt.

Now, my mental model of a GIN index is that you go find N>=0 index
keys inside each value and do basically the same thing as you would
for a btree index for each one of them. Therefore it seems to me,
possibly stupidly, that you're going to have basically the same
problems, except each problem will now potentially happen up to N
times instead of up to 1 time.  I assume here that in either case -
GIN or btree - you would tentatively record where you left the tuple
that now needs to be zapped and that you can jump to that place
directly to try to zap it. Possibly those assumptions are bad and
maybe that's where you're seeing a concurrency/deadlock problem; if
so, a more detailed explanation would be very helpful.

To me, based on my more limited knowledge of indexing, I'm not really
seeing a concurrency/deadlock issue, but I do see that there's going
to be a horrid efficiency problem if page splits are common.  Suppose
for example that you bulk-load a bunch of rows into an indexed table
in descending order according to the indexed column, with all the new
values being larger than any existing values in that column.  The
insertion point basically doesn't change: you're always inserting
after what was the original high value in the column, and that point
is always on the same page, but that page is going to be repeatedly
split, so that, at the end of the load, almost none of the
newly-inserted rows are going to be on the page into which they were
originally inserted.  Now if you abort, you're going to either have to
walk right a long way from the original insertion point to find each
tuple, or re-find each tuple by traversing from the root of the tree
instead of remembering where you left it. Doing the first for N tuples
is O(N^2), and doing the second is O(N*H) where H is the height of the
btree.  The latter is almost like O(N) given the high fanout of a
btree, but with a much higher constant factor than the
remember-where-you-put-it strategy would be in cases where no splits
have occurred. Neither seems very good.  This seems to be a very
general problem with making undo and indexes work nicely together:
almost a

Re: SimpleLruTruncate() mutual exclusion

2019-07-29 Thread Tom Lane
Noah Misch  writes:
> On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
>>> b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other 
>>> than
>>> AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
>>> checkpoint, or in the startup process.  Hence, also arrange for only one
>>> backend to call SimpleLruTruncate(AsyncCtl) at a time.

>> More specifically, restrict vac_update_datfrozenxid() to one backend per
>> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
>> backend per cluster.  This, attached, was rather straightforward.

> Rebased.  The conflicts were limited to comments and documentation.

I tried to review this, along with your adjacent patch to adjust the
segment-roundoff logic, but I didn't get far.  I see the point that
the roundoff might create problems when we are close to hitting
wraparound.  I do not, however, see why serializing vac_truncate_clog
helps.  I'm pretty sure it was intentional that multiple backends
could run truncation directory scans concurrently, and I don't really
want to give that up if possible.

Also, if I understand the data-loss hazard properly, it's what you
said in the other thread: the latest_page_number could advance after
we make our decision about what to truncate, and then maybe we could
truncate newly-added data.  We surely don't want to lock out the
operations that can advance last_page_number, so how does serializing
vac_truncate_clog help?

I wonder whether what we need to do is add some state in shared
memory saying "it is only safe to create pages before here", and
make SimpleLruZeroPage or its callers check that.  The truncation
logic would advance that value, but only after completing a scan.
(Oh ..., hmm, maybe the point of serializing truncations is to
ensure that we know that we can safely advance that?) 

Can you post whatever script you used to detect/reproduce the problem
in the first place?

regards, tom lane

PS: also, re-reading this code, it's worrisome that we are not checking
for failure of the unlink calls.  I think the idea was that it didn't
really matter because if we did re-use an existing file we'd just
re-zero it; hence failing the truncate is an overreaction.  But probably
some comments about that are in order.




concerns around pg_lsn

2019-07-29 Thread Jeevan Ladhe
Hi all,

While reviewing some code around pg_lsn_in() I came across a couple of
(potential?) issues:

1.
Commit 21f428eb moves the LSN conversion functionality from pg_lsn_in() to a
new function, pg_lsn_in_internal(). It takes two parameters: the LSN string
and a pointer to a boolean (*have_error) that indicates whether there was an
error while converting the string format to an XLogRecPtr.

pg_lsn_in_internal() only sets *have_error to 'true' if there is an error,
but leaves it to the caller to initialize it to 'false' before the call.
Currently it is only called from pg_lsn_in() and timestamptz_in(), where
care is taken to set the flag to false before making the call. But I think
that in general this opens the door to unpredictable bugs if
pg_lsn_in_internal() gets called from other locations in the future and, by
mistake, the caller just checks the flag afterwards without having set it
to false before the call.

I am attaching a patch that makes sure *have_error is set to false in
pg_lsn_in_internal() itself, rather than depending on the caller.

Also, I think there might be callers that do not care whether there was an
error while converting, are fine with just comparing the return value
against InvalidXLogRecPtr, and may want to pass a NULL boolean pointer to
this function; for now, I have left that untouched. Maybe just adding an
Assert would improve the situation for the time being.

I have attached a patch (fix_have_error_flag.patch) to take care of the above.
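Just to make the intended behaviour concrete, here is a small standalone
sketch; it only mimics the shape of pg_lsn_in_internal() and is not the
patch itself:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real typedef */
#define InvalidXLogRecPtr 0

/*
 * Sketch of the proposed behaviour only: the callee clears *have_error
 * itself, so a careless caller can never read a stale value.
 */
static XLogRecPtr
lsn_in_sketch(const char *str, bool *have_error)
{
    unsigned int hi;
    unsigned int lo;

    *have_error = false;                /* no longer the caller's job */

    if (sscanf(str, "%X/%X", &hi, &lo) != 2)
    {
        *have_error = true;
        return InvalidXLogRecPtr;
    }
    return ((XLogRecPtr) hi << 32) | lo;
}

int
main(void)
{
    bool        have_error;
    XLogRecPtr  lsn = lsn_in_sketch("16/B374D848", &have_error);

    printf("%X/%X error=%d\n",
           (unsigned int) (lsn >> 32), (unsigned int) lsn, (int) have_error);
    return 0;
}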

2.
I happened to look into the test case pg_lsn.sql, and I started exploring
the macros around LSNs.

Following macros:

{code}
/*
 * Zero is used indicate an invalid pointer. Bootstrap skips the first possible
 * WAL segment, initializing the first WAL page at WAL segment size, so no XLOG
 * record can begin at zero.
 */
#define InvalidXLogRecPtr   0
#define XLogRecPtrIsInvalid(r)  ((r) == InvalidXLogRecPtr)
{code}

IIUC, the comment above clearly wants to mark 0 as an invalid LSN (and,
further IIUC, the comment states that LSNs would start from
(walSegSize + 1)). Given this, shouldn't it be invalid to allow "0/0" as a
value of type pg_lsn, or for that matter any number < walSegSize?

There is a test scenario in pg_lsn.sql which tests insertion of "0/0" into
a table having a pg_lsn column. I think this contradicts the comment.

I am not sure of the thought behind this and might be wrong in making the
above assumption, but I tried to look around a bit in the hackers archives
and could not locate any related discussion.

I have attached a patch (mark_lsn_0_invalid.patch) that makes the above changes.

Thoughts?

Regards,
Jeevan Ladhe


fix_have_error_flag.patch
Description: Binary data


mark_lsn_0_invalid.patch
Description: Binary data


Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Peter Geoghegan
On Mon, Jul 22, 2019 at 4:15 AM Thomas Munro  wrote:
> I had a similar thought: you might regret that choice if you were
> wanting to implement an AM with lock table-based concurrency control
> (meaning that there are lock ordering concerns for row and page locks,
> for DML statements, not just DDL).  That seemed a bit too far fetched
> to mention before, but are you saying the same sort of concerns might
> come up with indexes that support true undo (as opposed to indexes
> that still need VACUUM)?

Yes. It doesn't really make any difference with B-Trees, because the
locks there are very similar to row locks (you still need forwarding
UNDO metadata in index pages, probably for checking the visibility of
index tuples that have their ghost bit set). But when you need to undo
changes to an indexes with coarse grained index tuples (e.g. in a GIN
index), the transaction needs to roll back the index tuple as a whole,
necessitating that locks be held. Heap TIDs need to be completely
stable to avoid a VACUUM-like mechanism -- you cannot just create a
new HOT chain. You even have to be willing to store a single heap row
across two heap pages in extreme cases where an UPDATE makes it
impossible to fit a new row on the same heap page as the original --
this is called row forwarding.

Once heap TIDs are guaranteed to be associated with a logical row for
the lifetime of that row, and once you lock index entries, you're
always able to cleanly undo the changes in the index (i.e. remove new
tuples on abort). Then you have indexes that don't need VACUUMING, and
that have cheap index-only scans.

> For comparison, ARIES[1] has no-deadlock rollbacks as a basic property
> and reacquires locks during restart before new transactions are allowed
> to execute.  In its model, the locks in question can be on things like
> rows and pages.  We don't even use our lock table for those (except
> for non-blocking SIREAD locks, irrelevant here).

Right. ARIES has plenty to say about concurrency control, even though
we often think of it as something that is only concerned with crash
recovery. The undo phase is tied to how concurrency control works in
general in ARIES. There is something called ARIES/KVL, and something
else called ARIES/IM [1].

> After crash
> recovery, if zheap encounters a row with pending rollback from an
> aborted transaction, as usual it either needs to read an older version
> from an undo log (for reads) or help execute the rollback before
> updating (for writes).  That only requires page-at-a-time LWLocks
> ("latching"), so it's deadlock-free.  The only deadlock risk comes
> from the need to acquire heavyweight locks on relations which
> typically only conflict when you run DDL, so yeah, it's tempting to
> worry a lot less about those than the fine grained lock traffic from
> DML statements that DB2 and others have to deal with.

I think that DB2 index deletes are synchronous, and immediately remove
space from a leaf page. Rollbacks will re-insert the deleted tuple.
Systems that use a limited form of MVCC based on 2PL [2] set a ghost
bit instead of physically removing the tuple immediately. But I don't
think that that's actually very different to the DB2 classic 2PL
approach, since there is forwarding undo information that makes it
possible to reclaim tuples with the ghost bit set at the earliest
possible opportunity. And because you can immediately do an in-place
update of an index tuple's heap TID in the case of unique indexes,
which can be optimized as a special case. Queries like "UPDATE tab set
tab_pk = tab_pk + 1" work per the SQL standard (no duplicate
violation), and don't even bloat the index, because the changes in the
index can happen almost entirely in-place.

> I might as well put the quote marks on now:  "Perhaps we could
> implement A later."

I don't claim to have any real answers here. I don't claim to
understand how much of a problem this is.

[1] https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf
[2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- See
"6.7 Standard Practice"
-- 
Peter Geoghegan




Re: Proposal for Signal Detection Refactoring

2019-07-29 Thread Tom Lane
Chris Travers  writes:
> Here's a new patch.  No rush on it.  I am moving it to next commitfest
> anyway because as code documentation I think this is a low priority late in
> the release cycle.
> The changes  mostly address Andres's feedback above.

I took a look through this version.  (BTW, please add a version number
to the patchfile name in future, so people can keep track of which
one we're talking about.)

+src/backend/utils/misc/README.SIGNAL_HANDLING

Hmm ... I'm not very sure of a good place to put this, but utils/misc/
seems kinda random, because there's no code at all in there that's
particularly related to this topic.  One alternate suggestion is
utils/init/, which is at least related by virtue of the signal flags
being defined in globals.c.  Another idea is src/backend/tcop/, since
postgres.c is where the rubber meets the road for these issues,
at least in regular backends.

Also, "README.SIGNAL_HANDLING" seems both overly long and not very
consistent with the naming of other README files.  The ones that aren't
just "README" are

README.git
doc/src/sgml/README.links
src/backend/access/heap/README.HOT
src/backend/access/heap/README.tuplock
src/backend/access/transam/README.parallel
src/backend/libpq/README.SSL
src/backend/statistics/README.dependencies
src/backend/statistics/README.mcv
src/backend/storage/lmgr/README-SSI
src/backend/storage/lmgr/README.barrier
src/interfaces/ecpg/README.dynSQL
src/interfaces/ecpg/preproc/README.parser

so there's a pretty clear preference for "README.lowercasesomething".
Hence I suggest "README.signals".

+Implementation Notes on Globals and Signal/Event Handling
+=

I'd drop "Globals and", it's not very relevant here; it's surely not
the lead keyword.  Maybe "Signal Handling in Postgres" would be even
more apropos.

+The approch to signal handling in PostgreSQL is designed to strictly conform
+with the C89 standard and designed to run with as few platform assumptions as
+possible.

s/approch/approach/, s/conform with/conform to/, s/run with/make/ (or just
drop "and designed..." entirely; C spec compliance is as much as we need
to say).

+Most signals (except SIGSEGV, and SIGKILL) are blocked by PostgreSQL
+during process startup.

The parenthetical remark is wrong, or at least incomplete; there's other
stuff we don't block such as SIGBUS.  I think I'd skip any attempt to
list them here and just say "Most blockable signals are blocked ...".
(Also, reading down, I note you return to this topic where you mention
SIGILL.  Perhaps better to enlarge on never-blocked signals at that
point.)

+An exception is made for SIGQUIT which is used by the postmaster to terminate
+backend sessions quickly when another backend dies so that the postmaster
+may re-initialize shared memory and otherwise return to a known-good state.

It'd likely be worth the verbiage to enlarge on this.  There is a clear
risk of the SIGQUIT handler malfunctioning because it interrupts something
that's not safe to interrupt.  We live with this on the theory that we're
just trying to kill the process anyway, so corruption of its internal
state is tolerable; plus the postmaster has SIGKILL-after-a-timeout logic
in case a child's SIGQUIT handler gets completely stuck.

+CHECK_FOR_INTERRUPTS() may be called only when there is no more cleanup to do
+because the query or process may be aborted.

This seems pretty wrong or at least confusing.  Maybe better
"CHECK_FOR_INTERRUPTS must be called only in places where it is safe to
invoke transaction abort and/or process termination.  Because Postgres
generally expects backend subsystems to maintain enough state to clean
up after themselves at transaction abort, this is a weak restriction.
However, functions that execute only straight-line code are allowed to
put persistent variables into transiently inconsistent states; that's
safe exactly as long as no CHECK_FOR_INTERRUPTS is executed while the
process state is unsafe for transaction abort".
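Perhaps the README could even include a tiny contrived example of that
rule (a sketch, not real backend code):

static int  things_started = 0;
static int  things_finished = 0;    /* invariant: equal whenever idle */

void
do_one_thing(void)
{
    things_started++;
    /* a CHECK_FOR_INTERRUPTS() here would be unsafe: invariant is broken */
    things_finished++;
    CHECK_FOR_INTERRUPTS();         /* safe: state is consistent again */
}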

+SIGHUP  - Reload config file (Postmaster Only)

Nope, that's not postmaster-only.  Backends and most other child
processes also re-read the config file on SIGHUP.

+SIGINT  - (Postmaster) Fast shutdown, (Backend) Cancel Query
+SIGQUIT - (Postmaster) Immediate Shutdown, (Backend) Exit Immediately
+SIGTERM - Terminate backend gracefully

IIRC, SIGTERM is also Smart Shutdown in the postmaster; why isn't
this documented similarly to SIGINT/SIGQUIT?

+SIGUSR1 - Pending IPC or LWLock event

Probably better to explain this as a general purpose signal multiplexor,
because that's what it is.  LWLock etc are specific use-cases.

+SIGINFO - Parent died
+SIGPWR  - Parent died (alternative)

I was totally confused by this to start with, until grepping reminded me
that POSTMASTER_DEATH_SIGNAL is implemented as one or the other of
them.  It might be better to use that name in this documentation, and
then explain that under the hood it is SIGINFO, SIGPWR, or possibly
some other 

Re: minor fixes after pgindent prototype fixes

2019-07-29 Thread Alvaro Herrera
On 2019-Jul-28, Tom Lane wrote:

> Andres Freund  writes:
> > a few prototypes look odd. It appears to be cases where previously the
> > odd indentation was put to some use, by indenting parameters less:
> > ...
> > but now that looks odd:
> > extern void DefineCustomBoolVariable(
> >  const char *name,
> >  const char *short_desc,
> 
> > Unless somebody protests I'm going to remove the now pretty useless
> > looking newline in the cases I can find.
> 
> +1.  I think Alvaro was muttering something about doing this,
> but you beat him to it.

No, this is a different issue ...  I was talking about function *calls*
ending in parens, and it changed because of the previous round of
pgindent changes, not the last one.  The number of affected places was a
lot larger than the patch Andres posted.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Robert Haas
On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar  wrote:
> I think, even though there might not be a correctness issue with the
> current code as it stands, we should still use a local variable.
> Updating MyUndoWorker is a big side-effect, which the caller is not
> supposed to be aware of, because all that function should do is just
> get the slot info.

Absolutely right.  It's just routine good practice to avoid using
global variables when there is no compelling reason to do otherwise.
The reason you state here is one of several good ones.

> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering : Generally when the code errors out with such
> can't-happen elog() calls, worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is : the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log ?
> This does not sound like any data consistency issue, so we should be
> fine after all ?

I don't think so.  Every XID present in undo has to be something we
can look up in CLOG to figure out which transactions are aborted and
which transactions are committed, so that we know which transactions
need undo.  If we forget to undo the transaction, we can't discard it,
which means we can't advance the CLOG transaction horizon, which means
we'll eventually start failing to assign XIDs, leading to a refusal of
all write transactions.  Oops.

More generally, it's not OK for the generic undo layer to make
assumptions about whether the operations performed by the undo
handlers are essential or not.  We don't want to impose a design
constraint that undo can only be used for things that are not actually
critical, because that will make it hard to write AMs that use it.
And there's no reason to live with such a design constraint anyway,
because, as noted above, CLOG truncation requires it.

More generally still, some can't-happen situations should be checked
via Assert() and others via elog(). For example, consider some code
that looks up a syscache tuple and pulls data from the returned tuple.
If the code that handles DDL is written in such a way that the tuple
should always exist, then this is a can't-happen situation, but
generally the code checks this via elog(), not Assert(), because it
could also happen due to the catalog contents being corrupted.  If
Assert() were used, the checks would not run in production builds, and
a corrupt catalog would lead to a seg fault. An elog() is much
friendlier. As a general principle, when a certain thing ought to
always be true, but it being true depends on a whole lot of
assumptions elsewhere in the code, and especially if it also depends
on assumptions like "the database is not corrupted," I think elog() is
preferable.  Assert() is better for things that are more localized and
that really can't go wrong for any reason other than a bug.  In this
case, I think I would tend towards elog(PANIC), but it's arguable.
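For instance, the familiar syscache idiom looks roughly like this (just a
sketch of the usual pattern, not code from the patch):

/*
 * Sketch of the usual idiom (standard syscache calls): the "can't happen"
 * lookup failure is reported with elog(), not Assert(), so a corrupted
 * catalog gives a clean error even in production builds.
 */
static char *
get_relname_copy(Oid relid)
{
    HeapTuple   tup;
    char       *result;

    tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for relation %u", relid);
    result = pstrdup(NameStr(((Form_pg_class) GETSTRUCT(tup))->relname));
    ReleaseSysCache(tup);
    return result;
}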

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Robert Haas
On Mon, Jul 29, 2019 at 2:24 PM Peter Geoghegan  wrote:
> Yes. It doesn't really make any difference with B-Trees, because the
> locks there are very similar to row locks (you still need forwarding
> UNDO metadata in index pages, probably for checking the visibility of
> index tuples that have their ghost bit set). But when you need to undo
> changes to an indexes with coarse grained index tuples (e.g. in a GIN
> index), the transaction needs to roll back the index tuple as a whole,
> necessitating that locks be held. Heap TIDs need to be completely
> stable to avoid a VACUUM-like mechanism -- you cannot just create a
> new HOT chain. You even have to be willing to store a single heap row
> across two heap pages in extreme cases where an UPDATE makes it
> impossible to fit a new row on the same heap page as the original --
> this is called row forwarding.

I find this hard to believe, because an UPDATE can always be broken up
into a DELETE and an INSERT.  If that were to be done, you would not
have a stable heap TID and you would have a "new HOT chain," or your
AM's equivalent of that concept.  So if we can't handle an UPDATE that
changes the TID, then we also can't handle a DELETE + INSERT.  But
surely handling that case is a hard requirement for any AM.

Sorry if I'm being dense here, but I feel like you're making some
assumptions that I'm not quite following.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Peter Geoghegan
On Mon, Jul 29, 2019 at 9:35 AM Robert Haas  wrote:
> I mean, essentially any well-designed framework intended for any sort
> of task whatsoever is going to have a design center where one can
> foresee that it will work well, and then as a result of working well
> for the thing for which it was designed, it will also work well for
> other things that are sufficiently similar.  So, I think you're
> correct, but I also don't think that's really saying very much.

I agree that it's quite unclear how important this is. I don't
necessarily think it matters if zheap doesn't do that well with GIN
indexes. I think it's probably going to be useful to imagine how GIN
indexing might work for zheap because it clarifies the strengths and
weaknesses of your design. It's perfectly fine for there to be
weaknesses, provided that they are well understood.

> The
> trick is to figure out whether and how the ideas you have could be
> generalized with reasonable effort to handle other cases, and that's
> easier with some projects than others.  I think when it comes to UNDO,
> it's actually really hard.

I agree.

> Unfortunately, we haven't had many takers so far -- thanks for chiming
> in.

I don't have the ability to express my general concerns here in a very
crisp way. This is complicated stuff. Thanks for tolerating the
hand-wavy nature of my feedback about this.

> I don't really understand your comments about GIN.  My simplistic
> understanding of GIN is that it's not very different from btree in
> this regard.

GIN is quite similar to btree from a Postgres point of view -- GIN is
simply a btree that is good at storing duplicates (and has higher
level infrastructure to make things like FTS work). So I'd say that
your understanding is fairly complete, at least as far as traditional
Postgres goes. But if we imagine a system in which we have to roll
back in indexes, it's quite a different story. See my remarks to
Thomas just now about that.

> Suppose we insert a row, and then the insert aborts;
> suppose also that the index wants to use UNDO.  In the case of a btree
> index, we're going to go insert an index entry for the new row; upon
> abort, we should undo the index insertion by removing that index tuple
> or at least marking it dead.  Unless a page split has happened,
> triggered either by the insertion itself or by subsequent activity,
> this puts the index in a state that is almost perfectly equivalent to
> where we were before the now-aborted transaction did any work.  If a
> page split has occurred, trying to undo the index insertion is going
> to run into two problems. One, we probably can't undo the page split,
> so the index will be logically equivalent but not physically
> equivalent after we get rid of the new tuple. Two, if the page split
> happened after the insertion of the new tuple rather than at the same
> time, the index tuple may not be on the page where we left it.

Actually, page splits are the archetypal case where undo cannot
restore the original physical state. In general, we cannot expect the
undo process to reverse page splits. Undo might be able to merge the
pages together, but it also might not be able to. It won't be terribly
different to the situation with deletes where the transaction commits,
most likely.

Some other systems have something called "system transactions" for
things like page splits. They don't need to have their commit record
flushed synchronously, and occur in the foreground of the xact that
needs to split the page. That way, rollback doesn't have to concern
itself with rolling back things that are pretty much impossible to
roll back, like page splits.

> Now, my mental model of a GIN index is that you go find N>=0 index
> keys inside each value and do basically the same thing as you would
> for a btree index for each one of them. Therefore it seems to me,
> possibly stupidly, that you're going to have basically the same
> problems, except each problem will now potentially happen up to N
> times instead of up to 1 time.  I assume here that in either case -
> GIN or btree - you would tentatively record where you left the tuple
> that now needs to be zapped and that you can jump to that place
> directly to try to zap it. Possibly those assumptions are bad and
> maybe that's where you're seeing a concurrency/deadlock problem; if
> so, a more detailed explanation would be very helpful.

Imagine a world in which zheap cannot just create a new TID (or HOT
chain) for the same logical tuple, which is something that I believe
should be an important goal for zheap (again, see my remarks to
Thomas). Simplicity for rollbacks in access methods like GIN demands
that you lock the entire index tuple, which may point to hundreds of
logical rows (or TIDs, since they have a 1:1 correspondence with
logical rows in this imaginary world). Rolling back with more granular
locking seems very hard for the same reason that rolling back a page
split would be very hard -- you cannot possibly have enough

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Peter Geoghegan
On Mon, Jul 29, 2019 at 12:11 PM Robert Haas  wrote:
> I find this hard to believe, because an UPDATE can always be broken up
> into a DELETE and an INSERT.  If that were to be done, you would not
> have a stable heap TID and you would have a "new HOT chain," or your
> AM's equivalent of that concept.  So if we can't handle an UPDATE that
> changes the TID, then we also can't handle a DELETE + INSERT.  But
> surely handling that case is a hard requirement for any AM.

I'm not saying you can't handle it. But that necessitates "write
amplification", in the sense that you must now create new index tuples
even for indexes where the indexed columns were not logically altered.
Isn't zheap supposed to fix that problem, at least at in version 2 or
version 3? I also think that stable heap TIDs make index-only scans a
lot easier and more effective.

I think that indexes (or at least B-Tree indexes) will ideally almost
always have tuples that are the latest versions with zheap. The
exception is tuples whose ghost bit is set, whose visibility varies
based on the MVCC snapshot in use. But the instant that the
deleting/updating xact commits it becomes legal to recycle the old
heap TID. We don't need to go back to the index to permanently zap the
tuple whose ghost bit we already set, because there is an undo pointer
in the same leaf page, so nobody is in danger of getting confused and
following the now-recycled heap TID.

This ghost bit design owes plenty to 2PL (which will fully remove the
index tuple synchronously, rather than just setting a ghost bit). You
could say that it's a 2PL/MVCC hybrid, while classic Postgres is
"pure" MVCC because it uses explicit row versioning -- it doesn't need
to impose restrictions on TID stability. Which seems to be why we
offer such a large variety of index access methods -- it's relatively
straight forward for Postgres to add niche index AMs, such as SP-GiST.

-- 
Peter Geoghegan




Re: Remove page-read callback from XLogReaderState.

2019-07-29 Thread Heikki Linnakangas

On 12/07/2019 10:10, Kyotaro Horiguchi wrote:

At Tue, 28 May 2019 04:45:24 -0700, Andres Freund  wrote in 
<20190528114524.dvj6ymap2virl...@alap3.anarazel.de>

Hi,

On 2019-04-18 21:02:57 +0900, Kyotaro HORIGUCHI wrote:

Hello. As mentioned before [1], read_page callback in
XLogReaderState is a cause of headaches. Adding another
remote-controlling stuff to xlog readers makes things messier [2].

I refactored XLOG reading functions so that we don't need the
callback. In short, ReadRecord now calls XLogPageRead directly
with the attached patch set.

| while (XLogReadRecord(xlogreader, RecPtr, &record, &errormsg)
|== XLREAD_NEED_DATA)
| XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess);

On the other hand, XLogReadRecord became a bit complex. The patch
makes XLogReadRecord a state machine. I'm not confident that the
additional complexity is worth it. Anyway I'll register this
to the next CF.


Just FYI, to me this doesn't clearly enough look like an improvement,
for a change of this size.


Thanks for the opinion. I kinda agree about size, but it is a
decision between "having multiple callbacks called under the
hood" vs "just calling a series of functions".  I think the
patched XlogReadRecord is easy to use in many situations.

It would be better if I could completely refactor the function
without the syntax tricks but I think the current patch is still
smaller and clearer than overhauling it.


I like the idea of refactoring XLogReadRecord() to not use a callback, 
and return a XLREAD_NEED_DATA value instead. It feels like a nicer, 
easier-to-use, interface, given that all the page-read functions need 
quite a bit of state and internal logic themselves. I remember that I 
felt that that would be a nicer interface when we originally extracted 
xlogreader.c into a reusable module, but I didn't want to make such big 
changes to XLogReadRecord() at that point.


I don't much like the "continuation" style of implementing the state 
machine. Nothing wrong with such a style in principle, but we don't do 
that anywhere else, and the macros seem like overkill, and turning the 
local variables static is pretty ugly. But I think XLogReadRecord() 
could be rewritten into a more traditional state machine.


I started hacking on that, to get an idea of what it would look like and 
came up with the attached patch, to be applied on top of all your 
patches. It's still very messy, it needs quite a lot of cleanup before 
it can be committed, but I think the resulting switch-case state machine 
in XLogReadRecord() is quite straightforward at high level, with four 
states.


I made some further changes to the XLogReadRecord() interface:

* If you pass a valid ReadPtr (i.e. the starting point to read from) 
argument to XLogReadRecord(), it always restarts reading from that 
record, even if it was in the middle of reading another record 
previously. (Perhaps it would be more convenient to provide a separate 
function to set the starting point, and remove the RecPtr argument from 
XLogReadRecord altogther?)


* XLogReaderState->readBuf is now allocated and controlled by the 
caller, not by xlogreader.c itself. When XLogReadRecord() needs data, 
the caller makes the data available in readBuf, which can point to the 
same buffer in all calls, or the caller may allocate a new buffer, or it 
may point to a part of a larger buffer, whatever is convenient for the 
caller. (Currently, all callers just allocate a BLCKSZ'd buffer, 
though). The caller also sets readPagPtr, readLen and readPageTLI to 
tell XLogReadRecord() what's in the buffer. So all these read* fields 
are now set by the caller, XLogReadRecord() only reads them.


* In your patch, if XLogReadRecord() was called with state->readLen == 
-1, XLogReadRecord() returned an error. That seemed a bit silly; if an 
error happened while reading the data, why call XLogReadRecord() at all? 
You could just report the error directly. So I removed that.
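With those changes, I'd expect a typical caller to end up looking roughly
like this (only a sketch against the interface described above;
read_requested_page() is a hypothetical placeholder for however the caller
fetches the page, and error handling and the TLI field are omitted):

XLogRecPtr  ptr = RecPtr;           /* starting point, per the first point */
char       *buf = palloc(XLOG_BLCKSZ);

while (XLogReadRecord(xlogreader, ptr, &record, &errormsg) == XLREAD_NEED_DATA)
{
    /* read the page the reader asked for into a buffer we own */
    read_requested_page(xlogreader, buf);   /* hypothetical helper */
    xlogreader->readBuf = buf;
    xlogreader->readLen = XLOG_BLCKSZ;
    ptr = InvalidXLogRecPtr;        /* don't restart from RecPtr again */
}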


I'm not sure how intelligible this patch is in its current state. But I 
think the general idea is good. I plan to clean it up further next week, 
but feel free to work on it before that, either based on this patch or 
by starting afresh from your patch set.


- Heikki
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e7a086b71e8..8f07450f503 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1385,6 +1385,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
+	XLogRecPtr ptr;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size);
 	if (!xlogreader)
@@ -1392,10 +1393,15 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of memory"),
  errdetail("Failed while allocating a WAL reading processor.")));
+	xlogreader->readBuf = palloc(XLOG_BLCKSZ);
 
-	while (X

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Peter Geoghegan
On Mon, Jul 29, 2019 at 12:39 PM Peter Geoghegan  wrote:
> I think that indexes (or at least B-Tree indexes) will ideally almost
> always have tuples that are the latest versions with zheap. The
> exception is tuples whose ghost bit is set, whose visibility varies
> based on the MVCC snapshot in use. But the instant that the
> deleting/updating xact commits it becomes legal to recycle the old
> heap TID.

Sorry, I meant the instant the ghost bit index tuple cannot be visible
to any possible MVCC snapshot. Which, in general, will be pretty soon
after the deleting/updating xact commits.

-- 
Peter Geoghegan




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Robert Haas
On Mon, Jul 29, 2019 at 3:39 PM Peter Geoghegan  wrote:
> I'm not saying you can't handle it. But that necessitates "write
> amplification", in the sense that you must now create new index tuples
> even for indexes where the indexed columns were not logically altered.
> Isn't zheap supposed to fix that problem, at least at in version 2 or
> version 3? I also think that stable heap TIDs make index-only scans a
> lot easier and more effective.

I think there's a cost-benefit analysis here.  You're completely
correct that inserting new index tuples causes write amplification
and, yeah, that's bad. On the other hand, row forwarding has its own
costs. If a row ends up persistently moved to someplace else, then
every subsequent access to that row has an extra level of indirection.
If it ends up split between two places, every read of that row incurs
two reads. The "someplace else" where moved rows or ends of split rows
are stored has to be skipped by sequential scans, which is complex and
possibly inefficient if it breaks up a sequential I/O pattern. Those
things are bad, too.

It's a little difficult to compare the kinds of badness.  My thought
is that in the short run, the redirect strategy probably wins, because
there could be and likely are a bunch of indexes and it's cheaper to
just insert one redirect. But in the long term, the redirect thing
seems like a loser, because you have to keep following it. That
(perhaps naive) analysis is why zheap doesn't try to maintain TID
stability.  Instead it wants to do in-place updates (no new TID) as
often as possible, but the fallback strategy is simply to do a
non-in-place update (new TID) rather than a redirect.

> I think that indexes (or at least B-Tree indexes) will ideally almost
> always have tuples that are the latest versions with zheap. The
> exception is tuples whose ghost bit is set, whose visibility varies
> based on the MVCC snapshot in use. But the instant that the
> deleting/updating xact commits it becomes legal to recycle the old
> heap TID. We don't need to go back to the index to permanently zap the
> tuple whose ghost bit we already set, because there is an undo pointer
> in the same leaf page, so nobody is in danger of getting confused and
> following the now-recycled heap TID.

I haven't run across the "ghost bit" terminology before.  Is there a
good place to read about the technique you're assuming here?  A major
question is how you handle inserted rows, that are new now and thus
not yet visible to everyone, but which will later become all-visible.
One idea is: if the undo pointer is new enough that a write
transaction which modified the page could still be in-flight, check
the undo log to ascertain visibility of index tuples.  If not, then
any potentially-deleted index tuples are in fact deleted, and any
others are all-visible. With this design, you don't set the ghost bit
on new tuples, but are still able to stop following the undo pointers
for them after a while.

To put that another way, there seems to be pretty clearly a need for a
bit, but what does the bit mean?  It could mean "please check the undo
log," in which case it'd have to be set on insert, eventually cleared,
and then reset on delete, but I think that's likely to suck.  I think
therefore that the bit should mean
is-deleted-but-not-necessarily-all-visible-yet, which avoids that
problem.
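In pseudo-C, the rule I'm imagining is roughly the following; every name
here is hypothetical, it's only meant to pin the idea down:

/* Hypothetical names throughout -- just a sketch of the rule above. */
bool
index_tuple_visible(IndexTuple itup, Page page, Snapshot snapshot)
{
    if (UndoPtrIsRecent(PageGetUndoPtr(page)))
    {
        /* a writer could still be in flight: consult the undo log */
        return CheckVisibilityInUndo(itup, page, snapshot);
    }

    /*
     * Undo pointer is old enough: anything with the deleted bit set is
     * truly dead, everything else is all-visible; no undo lookup needed.
     */
    return !IndexTupleDeletedBitIsSet(itup);
}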

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 9:44 AM Bruce Momjian  wrote:

> > Checking that all buffers using a single LSN are from the same
> > relation would be a good idea but I think it's hard to test it and
> > regard the test result as okay. Even if we passed 'make checkworld',
> > it might still be possible to happen. And even assertion failures
>
> Yes, the problem is that if you embed the relfilenode or tablespace or
> database in the encryption IV, you then need to then make sure you
> re-encrypt any files that move between these.  I am hesitant to do that
> since it then requires these workarounds for encryption going forward.
> We know that most people will not be using encryption, so that will not
> be well tested either.  For pg_upgrade, I used a minimal-impact
> approach, and it has allowed dramatic changes in our code without
> requiring changes and retesting of pg_upgrade.
>

Will there be a per-relation salt stored in a separate file? I saw it
mentioned in a few places (most recently
https://www.postgresql.org/message-id/aa386c3f-fb89-60af-c7a3-9263a633ca1a%40postgresql.org)
but there's also discussion of trying to make the TDEK unique without a
separate salt so I'm unsure.

With a per-relation salt there is no need to include fixed attributes
(database, relfilenode, or tablespace) to ensure the derived key is unique
per relation. A long salt (32-bytes from /dev/urandom) alone guarantees
that uniqueness. Copying or moving files would then be possible by also
copying the salt. It does not need to be a salt per file on disk either,
one salt can be used for many files for the same relation by including the
fork number, type, or segment in the TDEK derivation (so each file on disk
for that relation ends up with a unique TDEK).
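As a simplified illustration of that kind of derivation (HMAC-SHA256 from
OpenSSL is used here purely as a stand-in KDF, and the input layout is my
assumption, not the proposed design):

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/*
 * Simplified sketch: derive a per-file TDEK from the master key plus a
 * per-relation salt, fork number and segment number.  HMAC-SHA256 and the
 * exact input layout are illustrative assumptions only.
 */
static void
derive_tdek_sketch(const uint8_t *master_key, size_t master_key_len,
                   const uint8_t salt[32], uint32_t forknum, uint32_t segno,
                   uint8_t tdek_out[32])
{
    uint8_t      info[32 + 2 * sizeof(uint32_t)];
    unsigned int outlen;

    memcpy(info, salt, 32);
    memcpy(info + 32, &forknum, sizeof(uint32_t));
    memcpy(info + 32 + sizeof(uint32_t), &segno, sizeof(uint32_t));

    HMAC(EVP_sha256(), master_key, (int) master_key_len,
         info, sizeof(info), tdek_out, &outlen);
}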

There are the usual gotchas of copying encrypted data, i.e. the copies are
exactly the same, so clearly they hold the same plaintext. But any subsequent
changes would have a different LSN and encrypt differently going forward. If the main use cases
are copying an entire database or moving a tablespace, having that be
simpler/faster seems like a good idea. It could be a known limitation like
the promoting of multiple replicas. Plus with a key rotation tool anyone
that wants everything re-encrypted could run one after the copy.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Alvaro Herrera
On 2019-Jul-27, Bruce Momjian wrote:

> I think using LSN and page number, we will _never_ reuse the IV, except
> for cases like promoting two standbys, which I think we have to document
> as an insecure practice.

Actually, why is it an insecure practice?  If you promote two standbys,
then the encrypted pages are the same pages, so it's not two different
messages with the same key/IV -- they're still *one* message.  And as
soon as they start getting queries, they will most likely diverge
because the LSNs of records after the promotion will (most likely) no
longer match.  It takes one different WAL record length for the
"encryption histories" to diverge completely ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Alvaro Herrera
On 2019-Jul-27, Sehrope Sarkuni wrote:

> Given the non-cryptographic nature of CRC and its 16-bit size, I'd
> round down the malicious tamper detection it provides to zero. At best
> it catches random disk errors so might as well keep it in plain text
> and checkable offline.

But what attack are we protecting against?  We fear that somebody will
steal a disk or a backup.  We don't fear that they will *write* data.
The CRC is there to protect against data corruption.  So whether or not
the CRC protects against malicious tampering is beside the point.

If we were trying to protect against an attacker having access to
*writing* data in the production server, this encryption scheme is
useless: they could just as well read unencrypted data from shared
buffers anyway.

I think trying to protect against malicious data tampering is a second
step *after* this one is done.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-07-29 Thread Robert Haas
On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
 wrote:
> In attachments, you can find a prototype of incremental pg_basebackup,
> which consists of 2 features:
>
> 1) To perform incremental backup one should call pg_basebackup with a
> new argument:
>
> pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
>
> where lsn is a start_lsn of parent backup (can be found in
> "backup_label" file)
>
> It calls BASE_BACKUP replication command with a new argument
> PREV_BACKUP_START_LSN 'lsn'.
>
> For datafiles, only pages with LSN > prev_backup_start_lsn will be
> included in the backup.
> They are saved into 'filename.partial' file, 'filename.blockmap' file
> contains an array of BlockNumbers.
> For example, if we backed up blocks 1,3,5, filename.partial will contain
> 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.

I think it's better to keep both the information about changed blocks
and the contents of the changed blocks in a single file.  The list of
changed blocks is probably quite short, and I don't really want to
double the number of files in the backup if there's no real need. I
suspect it's just overall a bit simpler to keep everything together.
I don't think this is a make-or-break thing, and welcome contrary
arguments, but that's my preference.

> 2) To merge incremental backup into a full backup call
>
> pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
> --merge-backups
>
> It will move all files from 'incremental_basedir' to 'basedir' handling
> '.partial' files correctly.

This, to me, looks like it's much worse than the design that I
proposed originally.  It means that:

1. You can't take an incremental backup without having the full backup
available at the time you want to take the incremental backup.

2. You're always storing a full backup, which means that you need more
disk space, and potentially much more I/O while taking the backup.
You save on transfer bandwidth, but you add a lot of disk reads and
writes, costs which have to be paid even if the backup is never
restored.

> 1) Whether we collect block maps using simple "read everything page by
> page" approach
> or WAL scanning or any other page tracking algorithm, we must choose a
> map format.
> I implemented the simplest one, while there are more ideas:

I think we should start simple.

I haven't had a chance to look at Jeevan's patch at all, or yours in
any detail, as yet, so these are just some very preliminary comments.
It will be good, however, if we can agree on who is going to do what
part of this as we try to drive this forward together.  I'm sorry that
I didn't communicate EDB's plans to work on this more clearly;
duplicated effort serves nobody well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: TopoSort() fix

2019-07-29 Thread Robert Haas
On Mon, Jul 29, 2019 at 10:56 AM Tom Lane  wrote:
> Michael Paquier  writes:
> > On Fri, Jul 26, 2019 at 08:24:16PM -0400, Tom Lane wrote:
> >> I think this is a sufficiently obvious bug, and a sufficiently
> >> obvious fix, that we should just fix it and not insist on getting
> >> a reproducible test case first.  I think a test case would soon
> >> bit-rot anyway, and no longer exercise the problem.
> >> I certainly do *not* want to wait so long that we miss the
> >> upcoming minor releases.
>
> > +1.  Any volunteers?
>
> If Robert doesn't weigh in pretty soon, I'll take responsibility for it.

That's fine, or if you prefer that I commit it, I will.

I just got back from a week's vacation and am only very gradually
unburying myself from mounds of email.  (Of course, the way
pgsql-hackers is getting, there's sort of always a mound of email
these days.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread Alvaro Herrera
On 2019-Jul-29, vignesh C wrote:

> On Mon, Jul 29, 2019 at 8:24 PM Robert Haas  wrote:
> >
> > On Mon, Jul 29, 2019 at 4:10 AM vignesh C  wrote:
> > > I could not locate the caller of ParsePrepareRecord function in 
> > > twophase.c.
> > > Any idea how it gets called?
> > > or
> > > Is it a dead function?
> >
> > It looks like it's not only dead, but stillborn.  Commit
> > 1eb6d6527aae264b3e0b9c95aa70bb7a594ad1cf introduced it but without
> > introducing any code that called it, and nothing has changed since
> > then.
>
> I feel the code can be safely removed.
> Patch for the same is attached.

I think there's a patch from Fujii Masao that touches that?  Might be
premature to remove it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Peter Geoghegan
On Mon, Jul 29, 2019 at 1:04 PM Robert Haas  wrote:
> I think there's a cost-benefit analysis here.  You're completely
> correct that inserting new index tuples causes write amplification
> and, yeah, that's bad. On the other hand, row forwarding has its own
> costs. If a row ends up persistently moved to someplace else, then
> every subsequent access to that row has an extra level of indirection.

The devil is in the details. It doesn't seem that optimistic to assume
that a good implementation could practically always avoid it, by being
clever about heap fillfactor. It can work a bit like external TOAST
pointers. The oversized datums can go on the other heap page, which
presumably will not be in the SELECT list of most queries. It won't be one
of the indexed columns in typical cases, so index scans will generally
only have to visit one heap page.

It occurs to me that the zheap design is still sensitive to heap
fillfactor in much the same way as it would be with reliably-stable
TIDs, combined with some amount of row forwarding. It's not essential
for correctness that you avoid creating a new HOT chain (or whatever
it's called in zheap) with new index tuples, but it is still quite
preferable on performance grounds. It's still worth going to a lot of
work to avoid having that happen, such as using external TOAST
pointers with some of the larger datums on the existing heap page.

> If it ends up split between two places, every read of that row incurs
> two reads. The "someplace else" where moved rows or ends of split rows
> are stored has to be skipped by sequential scans, which is complex and
> possibly inefficient if it breaks up a sequential I/O pattern. Those
> things are bad, too.
>
> It's a little difficult to compare the kinds of badness.

I would say that it's extremely difficult. I'm not going to speculate
about how the two approaches might compare today.

> I haven't run across the "ghost bit" terminology before.  Is there a
> good place to read about the technique you're assuming here?

"5.2 Key Range Locking and Ghost Records" from "A Survey of B-Tree
Locking Techniques" seems like a good place to start. As I said
earlier, the paper is available from:
https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf

This description won't define the term ghost record/bit in a precise
way that you can just adopt, since the details will vary somewhat
based on considerations like whether or not MVCC is used. But you'll
get the general idea from the paper, I think.

> To put that another way, there seems to be pretty clearly a need for a
> bit, but what does the bit mean?  It could mean "please check the undo
> log," in which case it'd have to be set on insert, eventually cleared,
> and then reset on delete, but I think that's likely to suck.  I think
> therefore that the bit should mean
> is-deleted-but-not-necessarily-all-visible-yet, which avoids that
> problem.

That sounds about right to me.

-- 
Peter Geoghegan




Re: psql - add SHOW_ALL_RESULTS option

2019-07-29 Thread Fabien COELHO


Hello Kyotaro-san,


Thanks. I looked this more closely.


Indeed! Thanks for this detailed review.


+ * Marshal the COPY data.  Either subroutine will get the
+ * connection out of its COPY state, then call PQresultStatus()
+ * once and report any error.

This comment doesn't explain what the result value means.


Ok, added.


+ * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
+ * the PGresult associated with these commands must be processed.  In that
+ * event, we'll marshal data for the COPY.

I think this is not needed. This phrase was needed to explain why
we need to loop over subsequent results after PQexec in the
current code, but in this patch PQsendQuery is used instead,
which doesn't suffer from that somewhat confusing behavior. All results are
handled without needing any unusual processing.


Hmmm. More or less. "COPY" commands have two results, one for taking over 
the input or output streams more or less directly at the protocol level, 
and one for the final summary, which is quite special compared to other 
commands, all that managed in "copy.c". So ISTM that the comment is 
somehow still appropriate.


The difference from the previous behavior is that results other than these
ones could previously be discarded immediately, while now they are all processed.


I've kept this comment and added another one to try to make that clear.


+ * Update result if further processing is necessary.  (Returning NULL
+ * prevents the command status from being printed, which we want in that
+ * case so that the status line doesn't get taken as part of the COPY data.)

It seems that the purpose of the returned PGresult is only
printing status of this COPY. If it is true, I'd like to see
something like the following example.

| Returns result in the case where queryFout is safe to output
| result status. That is, in the case of COPY IN, or in the case
| where COPY OUT is written to other than pset.queryFout.


I have tried to improved the comment based on your suggestion.


+if (!AcceptResult(result, false))
+{
+  /* some error occurred, record that */
+  ShowNoticeMessage(&notes);

The comment in the original code was:

-  /*
-   * Failure at this point is always a server-side failure or a
-   * failure to submit the command string.  Either way, we're
-   * finished with this command string.
-   */

The first half of the comment seems to be true for this
patch. Don't we preserve that comment?


Ok. Some form put back.


+success = handleCopyOut(pset.db,
+copystream,
+&copy_result)
+  && success
+  && (copystream != NULL);

success is always true at this point so "&& success" is no longer
useful.


Ok.


(It is same for the COPY IN case).


Ok.


+/* must handle COPY before changing the current result */
+result_status = PQresultStatus(result);
+if (result_status == PGRES_COPY_IN ||
+  result_status == PGRES_COPY_OUT)

I didn't get "before changing the curren result" in the comment. Isn't 
"handle COPY stream if any" enough?


Alas, I think not.

The issue is that I need to know whether this is the last result (eg \gset 
applies only on the last result), so I'll call PQgetResult() to get that.


However, on COPY, this is the second "final" result which says how much 
was copied. If I have not sent/received the data, the count will not be 
right.



+if (result_status == PGRES_COPY_IN ||
+  result_status == PGRES_COPY_OUT)
+{
+  ShowNoticeMessage(&notes);
+  HandleCopyResult(&result);
+}

It seems wrong that this ignores the return value of
HandleCopyResult().


Yep. Fixed.


+/* timing measure before printing the last result */
+if (last && pset.timing)

I'm not sure whether we reached any consensus with this
behavior. This means the timing includes the result-printing time of
results other than the last one. If we don't want to include printing time
at all, we can exclude it with a small amount of additional
complexity.


I think that this point is hopeless, because of the way timing is 
implemented client-side.


Although we could try to stop timing before each result is processed, it 
would not prevent the server from going on with other queries and sending back 
results, nor psql from receiving further results (next_result), so the final 
figures would be stupid anyway, just another form of stupid.


Basically the approach cannot work with combined queries: It only worked 
before because the intermediate results were coldly discarded.


Maybe the server could report its execution times for each query somehow, but 
then the communication time would not be included.


I have added a comment about why timing does not make much sense with 
combined queries.


Attached a v5.

--
Fabien.

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 6787ec1efd..de59a5cf65 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ 

Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Thomas Munro
On Tue, Jul 30, 2019 at 7:12 AM Peter Geoghegan  wrote:
> SQL Server 2019 has something called "instantaneous transaction
> rollback", which seems to make SQL Server optionally behave a lot more
> like Postgres [1], apparently with many of the same disadvantages as
> Postgres. I agree that there is probably a middle way that more or
> less has the advantages of both approaches. I don't really know what
> that should look like, though.
>
> [1] 
> https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf

Thanks for sharing that.  I see they're giving that paper at VLDB next
month in LA... I hope the talk video will be published on the web.
While we've been working on a hybrid vacuum/undo design, they've built
a hybrid undo/vacuum system.  I've only skimmed this, but one of their
concerns that caught my eye is log volume in the presence of long
running transactions ("3.6 Aggressive Log Truncation").  IIUC they
have only a single log for both redo and undo, so a long running
transaction requires them to keep all log data around as long as it
might be needed for that transaction, in traditional SQL Server.
That's basically the flip side of the problem we're trying to solve,
in-heap bloat.  I think we might have a different solution to that
problem, with our finer grained undo logs.  Our undo data is not mixed
in with redo data (though redo can recreate it, it's not needed after
that), and we have multiple undo logs with their own discard pointers,
so a long running transaction prevents only one single undo log
from being truncated, while other undo logs holding other transactions
can be truncated as soon as those transactions are committed/rolled
back and are either all visible (admittedly tracked with a system-wide
xmin approach for now, but could probably be made more granular) or a
snapshot-too-old threshold is reached (not implemented yet).

-- 
Thomas Munro
https://enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 4:10 PM Alvaro Herrera 
wrote:

> On 2019-Jul-27, Bruce Momjian wrote:
>
> > I think using LSN and page number, we will _never_ reuse the IV, except
> > for cases like promoting two standbys, which I think we have to document
> > as an insecure practice.
>
> Actually, why is it an insecure practice?  If you promote two standbys,
> then the encrypted pages are the same pages, so it's not two different
> messages with the same key/IV -- they're still *one* message.  And as
> soon as they start getting queries, they will most likely diverge
> because the LSNs of records after the promotion will (most likely) no
> longer match.  It takes one different WAL record length for the
> "encryption histories" to diverge completely ...
>

You could have a sequence of post promotion events like:

# Replica 1
LSN=t+0 Operation A
LSN=t+1 Operation B
...
LSN=t+n Operation C

# Replica 2
LSN=t+0 Operation X
LSN=t+1 Operation Y
...
LSN=t+n Operation Z

If the LSN and modified page numbers of C and Z are the same
... and the net effect of Z is known (ex: setting a bunch of bytes on the
row to zero)
... and you can read the encrypted pages of both replicas (ex: have access
to the encrypted storage tier but not necessarily the live server)
... then you can XOR the encrypted pages to get the plain text for the
bytes after operation C.
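
To spell out the XOR property being relied on here, a toy Python sketch
follows; it uses a hash-based counter keystream purely for illustration, not
AES or anything from the actual proposal:

import hashlib

def keystream(key, iv, length):
    # Toy counter-mode keystream; stands in for AES-CTR only to show the math.
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key, iv, plaintext):
    return bytes(p ^ k for p, k in zip(plaintext, keystream(key, iv, len(plaintext))))

key = b"k" * 32
iv = b"same LSN + block number"        # identical on both replicas
p1 = b"secret row data on replica 1"
p2 = b"\x00" * len(p1)                 # attacker knows operation Z zeroed these bytes
c1, c2 = encrypt(key, iv, p1), encrypt(key, iv, p2)
xored = bytes(a ^ b for a, b in zip(c1, c2))          # equals p1 XOR p2
assert bytes(x ^ z for x, z in zip(xored, p2)) == p1  # known p2 reveals p1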

Yes, it's not likely and yes it has a lot of "if..." involved, but it is
possible.

I don't think this will be an issue in practice, but it should be
documented. Otherwise, it's not unreasonable for someone to expect that a
promoted replica would be using new keys for everything after each
promotion.

Encryption for WAL can avoid this type of problem entirely by generating a
new random salt and adding a "Use new salt XYZ for WDEK going forward"
record. The two replicas would generate different salts so all subsequent
encrypted WAL data would be different (even the exact same records).
Unfortunately, that doesn't work for pages without a lot more complexity to
keep track of which key version to use based upon the LSN.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: TopoSort() fix

2019-07-29 Thread Tom Lane
[ removing , as that mailing address seems to be MIA ]

Robert Haas  writes:
> On Mon, Jul 29, 2019 at 10:56 AM Tom Lane  wrote:
>> If Robert doesn't weigh in pretty soon, I'll take responsibility for it.

> That's fine, or if you prefer that I commit it, I will.

FYI, I just got done inventing a way to reach that code, and I have
to suspect that it's impossible to do so in production, because under
ordinary circumstances no parallel worker will take any exclusive lock
that isn't already held by its leader.  (If you happen to know an
easy counterexample, let's see it.)

The attached heavily-hacked version of deadlock-soft.spec makes it go by
forcing duplicate advisory locks to be taken in worker processes, which
of course first requires disabling PreventAdvisoryLocksInParallelMode().
I kind of wonder if we should provide some debug-only, here-be-dragons
way to disable that restriction so that we could make this an official
regression test, because I'm now pretty suspicious that none of this code
has ever executed before.

Anyway, armed with this, I was able to prove that HEAD just hangs up
on this test case; apparently the deadlock checker never detects that
the additional holders of the advisory lock need to be rearranged.
And removing that "break" fixes it.

So I'll go commit the break-ectomy, but what do people think about
testing this better?

regards, tom lane

diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index ffd1970..1e815a2 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -658,10 +658,6 @@ pg_isolation_test_session_is_blocked(PG_FUNCTION_ARGS)
 static void
 PreventAdvisoryLocksInParallelMode(void)
 {
-	if (IsInParallelMode())
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
- errmsg("cannot use advisory locks during a parallel operation")));
 }
 
 /*
diff --git a/src/test/isolation/expected/deadlock-soft.out b/src/test/isolation/expected/deadlock-soft.out
index 24a35da..7abbd19 100644
--- a/src/test/isolation/expected/deadlock-soft.out
+++ b/src/test/isolation/expected/deadlock-soft.out
@@ -1,17 +1,47 @@
 Parsed test spec with 4 sessions
 
 starting permutation: d1a1 d2a2 e1l e2l d1a2 d2a1 d1c e1c d2c e2c
-step d1a1: LOCK TABLE a1 IN ACCESS SHARE MODE;
-step d2a2: LOCK TABLE a2 IN ACCESS SHARE MODE;
-step e1l: LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; 
-step e2l: LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; 
-step d1a2: LOCK TABLE a2 IN ACCESS SHARE MODE; 
-step d2a1: LOCK TABLE a1 IN ACCESS SHARE MODE; 
+step d1a1: select lock_share(1,x) from bigt limit 1;
+lock_share 
+
+1  
+step d2a2: select lock_share(2,x) from bigt limit 1;
+lock_share 
+
+1  
+step e1l: select lock_excl(1,x) from bigt limit 1; 
+step e2l: select lock_excl(2,x) from bigt limit 1; 
+step d1a2: SET force_parallel_mode = on;
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set parallel_leader_participation=off;
+set max_parallel_workers_per_gather=4;
+ select sum(lock_share(2,x)) from bigt; 
+step d2a1: SET force_parallel_mode = on;
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set parallel_leader_participation=off;
+set max_parallel_workers_per_gather=4;
+ select sum(lock_share(1,x)) from bigt; 
 step d1a2: <... completed>
+sum
+
+1  
 step d1c: COMMIT;
 step e1l: <... completed>
-step e1c: COMMIT;
+lock_excl  
+
+1  
 step d2a1: <... completed>
+sum
+
+1  
+step e1c: COMMIT;
 step d2c: COMMIT;
 step e2l: <... completed>
+lock_excl  
+
+1  
 step e2c: COMMIT;
diff --git a/src/test/isolation/specs/deadlock-soft.spec b/src/test/isolation/specs/deadlock-soft.spec
index 49d16e0..b03f0c7 100644
--- a/src/test/isolation/specs/deadlock-soft.spec
+++ b/src/test/isolation/specs/deadlock-soft.spec
@@ -1,3 +1,12 @@
+# Test deadlock resolution with parallel process groups.
+
+# It's fairly hard to get parallel worker processes to block on locks,
+# since generally they don't want any locks their leader didn't already
+# take.  We cheat like mad here by making a function that takes a lock,
+# and is incorrectly marked immutable so that it can execute in a worker.
+# (This also requires disabling the lockfuncs.c code that prevents that.)
+
+# Otherwise, this is morally equivalent to standard deadlock-soft.spec:
 # Four-process deadlock with two hard edges and two soft edges.
 # d2 waits for e1 (soft edge), e1 waits for d1 (hard edge),
 # d1 waits for e2 (soft edge), e2 waits for d2 (hard edge).
@@ -6,35 +15,67 @@
 
 setup
 {
-  CREATE TABLE a1 ();
-  CREATE TABLE a2 ();
+  create function lock_share(int,int) returns int language sql as
+  'select pg_advisory_xact_lock_shared($1); select 1;' immutable parallel safe;
+
+  create function lock_excl(int,int) returns int language sql as
+  'select pg_advisory_xact_lock($1); sel

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Sehrope Sarkuni
On Mon, Jul 29, 2019 at 4:15 PM Alvaro Herrera 
wrote:

> On 2019-Jul-27, Sehrope Sarkuni wrote:
>
> > Given the non-cryptographic nature of CRC and its 16-bit size, I'd
> > round down the malicious tamper detection it provides to zero. At best
> > it catches random disk errors so might as well keep it in plain text
> > and checkable offline.
>
> But what attack are we protecting against?  We fear that somebody will
> steal a disk or a backup.  We don't fear that they will *write* data.
> The CRC is there to protect against data corruption.  So whether or not
> the CRC protects against malicious tampering is beside the point.
>

That was in response to using an encrypted CRC for tamper detection. I
agree that it does not provide meaningful protection so there is no point
in adding complexity to use it for that.

I agree it's better to leave the CRC as-is for detecting corruption which
also has the advantage of playing nice with existing checksum tooling.


> If we were trying to protect against an attacker having access to
> *writing* data in the production server, this encryption scheme is
> useless: they could just as well read unencrypted data from shared
> buffers anyway.
>

The attack situation is someone being able to modify pages at the storage
tier. They cannot necessarily read server memory or the encryption key, but
they could make changes to existing data or an existing backup that would
be subsequently read by the server.

Dealing with that is way out of scope but similar to the replica promotion
I think it should be kept track of and documented.


> I think trying to protect against malicious data tampering is a second
> step *after* this one is done.
>

+1

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Define jsonpath functions as stable

2019-07-29 Thread Alexander Korotkov
On Mon, Jul 29, 2019 at 5:36 PM Tom Lane  wrote:
> > However, we've spotted some deviations between standard and our 
> > implementation.
> >  * like_regex predicate uses our regular expression engine, which
> > deviates from standard.
> >  * We always do numeric computations using numeric datatype.  Even if
> > user explicitly calls .double() method.  Probably, our current
> > implementation still fits standard.  But in future we may like to use
> > floating point computation in some cases for performance optimization.
> > ...
> > Therefore, I'm going to mark jsonb_path_*() functions stable, not
> > immutable.
>
> I dunno, I think you are applying a far more rigorous definition of
> "immutable" than we ever have in the past.  The possibility that we
> might change the implementation in the future should not be enough
> to disqualify a function from being immutable --- if that were the
> criterion, nothing more complex than int4pl could be immutable.
>
> Wouldn't it be better that, in the hypothetical major version where
> we change the implementation, we tell users that they must reindex
> any affected indexes?
>
> As a comparison point, we allow people to build indexes on tsvector
> results, which are *easy* to change just by adjusting configuration
> files.  The fact that this might force the need for reindexing hasn't
> made it unworkable.

Thank you for the explanation.  Given that, there is no need to mark
existing jsonb_path_*() functions as stable.  We can just advise users
to rebuild their indexes if we make incompatible changes.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Joe Conway
On 7/29/19 6:11 PM, Sehrope Sarkuni wrote:
> On Mon, Jul 29, 2019 at 4:15 PM Alvaro Herrera wrote:
> 
> On 2019-Jul-27, Sehrope Sarkuni wrote:
> 
> > Given the non-cryptographic nature of CRC and its 16-bit size, I'd
> > round down the malicious tamper detection it provides to zero. At best
> > it catches random disk errors so might as well keep it in plain text
> > and checkable offline.
> 
> But what attack are we protecting against?  We fear that somebody will
> steal a disk or a backup.  We don't fear that they will *write* data.
> The CRC is there to protect against data corruption.  So whether or not
> the CRC protects against malicious tampering is beside the point.
> 
> 
> That was in response to using an encrypted CRC for tamper detection. I
> agree that it does not provide meaningful protection so there is no
> point in adding complexity to use it for that.
> 
> I agree it's better to leave the CRC as-is for detecting corruption
> which also has the advantage of playing nice with existing checksum tooling.
>  
> 
> If we were trying to protect against an attacker having access to
> *writing* data in the production server, this encryption scheme is
> useless: they could just as well read unencrypted data from shared
> buffers anyway.
> 
> 
> The attack situation is someone being able to modify pages at the
> storage tier. They cannot necessarily read server memory or the
> encryption key, but they could make changes to existing data or an
> existing backup that would be subsequently read by the server.
> 
> Dealing with that is way out of scope but similar to the replica
> promotion I think it should be kept track of and documented.
>  
> 
> I think trying to protect against malicious data tampering is a second
> step *after* this one is done.
> 
> 
> +1

Well said; +1

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Define jsonpath functions as stable

2019-07-29 Thread Alexander Korotkov
On Mon, Jul 29, 2019 at 5:55 PM Chapman Flack  wrote:
> On 7/29/19 10:25 AM, Alexander Korotkov wrote:
>
> >  * like_regex predicate uses our regular expression engine, which
> > deviates from standard.
>
> I still favor adding some element to the syntax (like a 'posix' or 'pg'
> keyword in the grammar for like_regex) that identifies it as using
> a  different regexp flavor, so the way forward to a possible compliant
> version later is not needlessly blocked (or consigned to a
> standard_conforming_strings-like experience).

What do you think about renaming existing operator from like_regex to
pg_like_regex?  Or introducing special flag indicating that PostgreSQL
regex engine is used ('p' for instance)?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Andres Freund
Hi,

I realize that this might not be the absolutely newest version of the
undo storage part of this patchset - but I'm trying to understand the
whole context, and that's hard without reading through the whole stack
in a situation where the layers actually fit together.


On 2019-07-29 01:48:30 -0700, Andres Freund wrote:
> < more tomorrow >

> + /* Move the high log number pointer past this one. */
> + ++UndoLogShared->next_logno;

Fwiw, I find having "next" and "low" as variable names, and then
describing "next" as high in comments somewhat confusing.


> +/* check_hook: validate new undo_tablespaces */
> +bool
> +check_undo_tablespaces(char **newval, void **extra, GucSource source)
> +{
> + char   *rawname;
> + List   *namelist;
> +
> + /* Need a modifiable copy of string */
> + rawname = pstrdup(*newval);
> +
> + /*
> +  * Parse string into list of identifiers, just to check for
> +  * well-formedness (unfortunateley we can't validate the names in the
> +  * catalog yet).
> +  */
> + if (!SplitIdentifierString(rawname, ',', &namelist))
> + {
> + /* syntax error in name list */
> + GUC_check_errdetail("List syntax is invalid.");
> + pfree(rawname);
> + list_free(namelist);
> + return false;
> + }

Why can't you validate the catalog here? In a lot of cases this will be
called in a transaction, especially when changing it in a
session. E.g. temp_tablespaces does so?


> + /*
> +  * Make sure we aren't already in a transaction that has been assigned 
> an
> +  * XID.  This ensures we don't detach from an undo log that we might 
> have
> +  * started writing undo data into for this transaction.
> +  */
> + if (GetTopTransactionIdIfAny() != InvalidTransactionId)
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +  (errmsg("undo_tablespaces cannot be changed 
> while a transaction is in progress";

Hm. Is this really a great proxy? Seems like it'll block changing the
tablespace unnecessarily in a lot of situations, and like there could
even be holes in the future - it doesn't seem crazy that we'd want to
emit undo without assigning an xid in some situations (e.g. for deleting
files in error cases, or for more aggressive cleanup of dead index
entries during reads or such).

It seems like it'd be pretty easy to just check
CurrentSession->attached_undo_slots[i].slot->meta.unlogged.this_xact_start
or such?


> +static bool
> +choose_undo_tablespace(bool force_detach, Oid *tablespace)
> +{

> + else
> + {
> + /*
> +  * Choose an OID using our pid, so that if several backends 
> have the
> +  * same multi-tablespace setting they'll spread out.  We could 
> easily
> +  * do better than this if more serious load balancing is judged
> +  * useful.
> +  */

We're not really choosing an oid, we're choosing a tablespace. Obviously
one can understand it as is, but it confused me for a second.


> + int index = MyProcPid % length;

Hm. Is MyProcPid a good proxy here? Wouldn't it be better to use
MyProc->pgprocno or such? That's much more guaranteed to space out
somewhat evenly?


> + int first_index = index;
> + Oid oid = InvalidOid;
> +
> + /*
> +  * Take the tablespace create/drop lock while we look the name 
> up.
> +  * This prevents the tablespace from being dropped while we're 
> trying
> +  * to resolve the name, or while the called is trying to create 
> an
> +  * undo log in it.  The caller will have to release this lock.
> +  */
> + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);

Why exclusive?

I think any function that acquires a lock it doesn't release (or the
reverse) ought to have a big honking comment in its header warning of
that.  And an explanation as to why that is.


> + for (;;)
> + {
> + const char *name = list_nth(namelist, index);
> +
> + oid = get_tablespace_oid(name, true);
> + if (oid == InvalidOid)
> + {
> + /* Unknown tablespace, try the next one. */
> + index = (index + 1) % length;
> + /*
> +  * But if we've tried them all, it's time to 
> complain.  We'll
> +  * arbitrarily complain about the last one we 
> tried in the
> +  * error message.
> +  */
> + if (index == first_index)
> + ereport(ERROR,
> +  

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Bruce Momjian
On Mon, Jul 29, 2019 at 04:09:52PM -0400, Alvaro Herrera wrote:
> On 2019-Jul-27, Bruce Momjian wrote:
> 
> > I think using LSN and page number, we will _never_ reuse the IV, except
> > for cases like promoting two standbys, which I think we have to document
> > as an insecure practice.
> 
> Actually, why is it an insecure practice?  If you promote two standbys,
> then the encrypted pages are the same pages, so it's not two different
> messages with the same key/IV -- they're still *one* message.  And as
> soon as they start getting queries, they will most likely diverge
> because the LSNs of records after the promotion will (most likely) no
> longer match.  It takes one different WAL record length for the
> "encryption histories" to diverge completely ...

That is a very good point, but if the LSN was reused in _any_ table with
the same page number, it would be insecure, and it would be easy to scan
for such cases.  However, you are right that it is more rare than I
thought.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Bruce Momjian
On Mon, Jul 29, 2019 at 05:53:40PM -0400, Sehrope Sarkuni wrote:
> I don't think this will be an issue in practice, but it should be documented.
> Otherwise, it's not unreasonable for someone to expect that a promoted replica
> would be using new keys for everything after each promotion.
> 
> Encryption for WAL can avoid this type of problem entirely by generating a new
> random salt and adding a "Use new salt XYZ for WDEK going forward" record. The
> two replicas would generate different salts so all subsequent encrypted WAL
> data would be different (even the exact same records). Unfortunately, that
> doesn't work for pages without a lot more complexity to keep track of which 
> key
> version to use based upon the LSN.

Oh, yeah, WAL is the big issue here, not the heap/index files, since we
know they will use the same segment number in both clusters.  We can't
use the timeline in the WAL IV since they will both be on the same
timeline.  Anyway, I think the heap/index is still an issue so we should
just document "don't do that".

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-29 Thread Tomas Vondra

On Sun, Jul 21, 2019 at 01:34:22PM +0200, Tomas Vondra wrote:


...

I wonder if we're approaching this wrong. Maybe we should not reverse
engineer queries for the various places, but just start with a set of
queries that we want to optimize, and then identify which places in the
planner need to be modified.



I've decided to do a couple of experiments, trying to make my mind about
which modified places matter to diffrent queries. But instead of trying
to reverse engineer the queries, I've taken a different approach - I've
compiled a list of queries that I think are sensible and relevant, and
then planned them with incremental sort enabled in different places.

I don't have any clear conclusions at this point - it does show that some of
the places don't change the plan for any of the queries, although there may
be some additional queries where they'd make a difference.

But I'm posting this mostly because it might be useful. I've initially
planned to move changes that add incremental sort paths to separate
patches, and then apply/skip different subsets of those patches. But
then I realized there's a better way to do this - I've added a bunch of
GUCs, one for each such place. This allows doing this testing without
having to rebuild repeatedly.

I'm not going to post the patch(es) with extra GUCs here, because it'd
just confuse the patch tester, but it's available here:

 https://github.com/tvondra/postgres/tree/incremental-sort-20190730

There are 10 GUCs, one for each place in planner where incremental sort
paths are constructed. By default all those are set to 'false' so no
incremental sort paths are built. If you do

 SET devel_create_ordered_paths = on;

it'll start creating the non-parallel paths in create_ordered_paths.
Then you may enable devel_create_ordered_paths_parallel to also consider
parallel paths, etc.

The list of queries (synthetic, but hopefully sufficiently realistic)
and a couple of scripts to collect the plans is in this repository:

 https://github.com/tvondra/incremental-sort-tests-2

There's also a spreadsheet with a summary of results, with a visual
representation of which GUCs affect which queries.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Define jsonpath functions as stable

2019-07-29 Thread Chapman Flack
On 07/29/19 18:27, Alexander Korotkov wrote:

> What do you think about renaming existing operator from like_regex to
> pg_like_regex?  Or introducing special flag indicating that PostgreSQL
> regex engine is used ('p' for instance)?

Renaming the operator is simple and certainly solves the problem.

I don't have a strong technical argument for or against any of:


$.** ? (@ pg_like_regex "O(w|v)" flag "i")
$.** ? (@ pg_like_regex "O(w|v)")


$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)


$.** ? (@ like_regex "O(w|v)" flag "ip")
$.** ? (@ like_regex "O(w|v)" flag "p")


It seems more of an aesthetic judgment (on which I am no particular
authority).

I think I would be -0.3 on the third approach just because of the need
to still spell out ' flag "p"' when there is no other flag you want.

I assume the first two approaches would be about equally easy to
implement, assuming there's a parser that already has an optional
production for "flag" STRING.

Both of the first two seem pretty safe from colliding with a
future addition to the standard.

To my aesthetic sense, pg_like_regex feels like "another operator
to remember" while like_regex ... pg feels like "ok, a slight variant
on the operator from the spec".

Later on, if a conformant version is added, the grammar might be a bit
simpler with just one name and an optional pg.

Going with a flag, there is some question of the likelihood of
the chosen flag letter being usurped by the standard at some point.

I'm leaning toward a flag for now in my own effort to provide the five SQL
functions (like_regex, occurrences_regex, position_regex, substring_regex,
and translate_regex), as for the time being it will be as an extension,
so no custom grammar for me, and I don't really want to make five
pg_* variant function names, and have that expand to ten function names
someday if the real ones are implemented. (Hmm, I suppose I could add
an optional function argument, distinct from flags; that would be
analogous to adding a pg in the grammar ... avoids overloading the flags,
avoids renaming the functions.)

I see in the Saxon library there is already a convention where it
allows a few flags undefined by the standard, after a semicolon in the
flag string. That has no official status; the XQuery spec
defines [smixq] and requires an error for anything else. But it
does have the advantage that the flag string can just be chopped
at the semicolon to eliminate all but the standard flags, and the
advantage (?) that at least one thing is already doing it.

Regards,
-Chap




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-29 Thread Bruce Momjian
On Sun, Jul 28, 2019 at 10:33:03PM -0400, Bruce Momjian wrote:
> I am thinking of writing some Assert() code that checks that all buffers
> using a single LSN are from the same relation (and therefore different
> page numbers).  I would do it by creating a static array, clearing it on
> XLogBeginInsert(), adding to it for each  XLogInsert(), then checking on
> PageSetLSN() that everything in the array is from the same file.  Does
> that make sense?

So, I started looking at how to implement the Assert checks and found
that Heikki has already added (in commit 2c03216d83) Assert checks to
avoid duplicate block numbers in WAL.  I just added the attached patch
to check that all RelFileNodes are the same.

I ran the regression tests with asserts on and got no failures, so I
think we are good.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
new file mode 100644
index 3ec67d4..f6c678c
*** a/src/backend/access/transam/xloginsert.c
--- b/src/backend/access/transam/xloginsert.c
*** XLogRegisterBuffer(uint8 block_id, Buffe
*** 235,241 
  
  	/*
  	 * Check that this page hasn't already been registered with some other
! 	 * block_id.
  	 */
  #ifdef USE_ASSERT_CHECKING
  	{
--- 235,241 
  
  	/*
  	 * Check that this page hasn't already been registered with some other
! 	 * block_id, and check for different RelFileNodes in the WAL record.
  	 */
  #ifdef USE_ASSERT_CHECKING
  	{
*** XLogRegisterBuffer(uint8 block_id, Buffe
*** 248,255 
  			if (i == block_id || !regbuf_old->in_use)
  continue;
  
! 			Assert(!RelFileNodeEquals(regbuf_old->rnode, regbuf->rnode) ||
!    regbuf_old->forkno != regbuf->forkno ||
     regbuf_old->block != regbuf->block);
  		}
  	}
--- 248,267 
  			if (i == block_id || !regbuf_old->in_use)
  continue;
  
! 			/*
! 			 * The initialization vector (IV) is used for page-level
! 			 * encryption.  We use the LSN and page number as the IV, and IV
! 			 * values must never be reused since it is insecure.	It is safe
! 			 * to use the LSN on multiple pages in the same relation since
! 			 * the page number is part of the IV.  It is unsafe to reuse the
! 			 * LSN in different relations because the page number might be
! 			 * the same, and hence the IV.  Therefore, we check here that
! 			 * we don't have WAL records for different relations using the
! 			 * same LSN.
! 			 */
! 			Assert(RelFileNodeEquals(regbuf_old->rnode, regbuf->rnode));
! 
! 			Assert(regbuf_old->forkno != regbuf->forkno ||
     regbuf_old->block != regbuf->block);
  		}
  	}


Re: TopoSort() fix

2019-07-29 Thread Robert Haas
On Mon, Jul 29, 2019 at 5:55 PM Tom Lane  wrote:
> FYI, I just got done inventing a way to reach that code, and I have
> to suspect that it's impossible to do so in production, because under
> ordinary circumstances no parallel worker will take any exclusive lock
> that isn't already held by its leader.  (If you happen to know an
> easy counterexample, let's see it.)

I think the way you could make that happen would be to run a parallel
query that calls a user-defined function which does LOCK TABLE.

> Anyway, armed with this, I was able to prove that HEAD just hangs up
> on this test case; apparently the deadlock checker never detects that
> the additional holders of the advisory lock need to be rearranged.
> And removing that "break" fixes it.

Nice!

> So I'll go commit the break-ectomy, but what do people think about
> testing this better?

I think it's a great idea.  I was never very happy with the amount of
exercise I was able to give this code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Built-in connection pooler

2019-07-29 Thread Tomas Vondra

On Mon, Jul 29, 2019 at 07:14:27PM +0300, Konstantin Knizhnik wrote:



On 26.07.2019 23:24, Tomas Vondra wrote:

Hi Konstantin,

I've started reviewing this patch and experimenting with it, so let me
share some initial thoughts.


1) not handling session state (yet)

I understand handling session state would mean additional complexity, so
I'm OK with not having it in v1. That being said, I think this is the
primary issue with connection pooling on PostgreSQL - configuring and
running a separate pool is not free, of course, but when people complain
to us it's when they can't actually use a connection pool because of
this limitation.

So what are your plans regarding this feature? I think you mentioned
you already have the code in another product. Do you plan to submit it
in the pg13 cycle, or what's the plan? I'm willing to put some effort
into reviewing and testing that.


I completely agree with you. My original motivation for implementing a
built-in connection pooler was to be able to preserve session semantics
(prepared statements, GUCs, temporary tables) for pooled connections.
Almost all production systems have to use some kind of pooling. But when
using pgbouncer & Co we lose the ability to use prepared statements, which
can cause up to a two-times performance penalty (in simple OLTP queries).

So I have implemented such a version of the connection pooler in PgPro EE.
It requires many changes in the Postgres core, so I realized that there was
no chance of committing it to the community version (taking into account
that many other of my patches, like autoprepare and libpq compression, have
been postponed for a very long time, although they are much smaller and
less invasive).

Then Dimitri Fontaine proposed that I implement a much simpler version of
the pooler based on a traditional proxy approach.

This patch is the result of our conversation with Dimitri.
You are asking me about my plans... I think that it will be better to
first polish this version of the patch and commit it, and only after that
add more sophisticated features like saving/restoring session state.



Well, I understand the history of this patch, and I have no problem with
getting a v1 of a connection pool without this feature. After all,
that's the idea of incremental development. But that only works when v1
allows adding that feature in v2, and I can't quite judge that. Which
is why I've asked you about your plans, because you clearly have more
insight thanks to writing the pooler for PgPro EE.






FWIW it'd be nice to expose it as some sort of interface, so that other
connection pools can leverage it too. There are use cases that don't
work with a built-in connection pool (say, PAUSE/RESUME in pgbouncer
allows restarting the database) so projects like pgbouncer or odyssey
are unlikely to disappear anytime soon.


Obviously a built-in connection pooler will never completely replace
external poolers like pgbouncer, which provide more flexibility, e.g. they
make it possible to install the pooler on a separate host or on the client side.




Sure. But that wasn't really my point - I was suggesting to expose this
hypothetical feature (managing session state) as some sort of API usable
from other connection pools.



I also wonder if we could make it more permissive even in v1, without
implementing dump/restore of session state.

Consider for example patterns like this:

 BEGIN;
 SET LOCAL enable_nestloop = off;
 ...
 COMMIT;

or

 PREPARE x(int) AS SELECT ...;
 EXECUTE x(1);
 EXECUTE x(2);
 ...
 EXECUTE x(10);
 DEALLOCATE x;

or perhaps even

 CREATE FUNCTION f() AS $$ ... $$
 LANGUAGE sql
 SET enable_nestloop = off;

In all those cases (and I'm sure there are other similar examples) the
connection pool considers the session 'tainted': it marks it as tainted
and we never reset that. So even when an application tries to play nice,
it can't use pooling.

Would it be possible to maybe track this with more detail (number of
prepared statements, ignore SET LOCAL, ...)? That should allow us to do
pooling even without full support for restoring session state.


Sorry, I do not completely understand your idea (how to implement these 
features without maintaining session state).


My idea (sorry if it wasn't too clear) was that we might handle some
cases more gracefully.

For example, if we only switch between transactions, we don't quite care
about 'SET LOCAL' (but the current patch does set the tainted flag). The
same thing applies to GUCs set for a function. 


For prepared statements, we might count the number of statements we
prepared and deallocated, and treat it as 'not tainted' when there are no
statements. Maybe there's some risk I can't think of.

The same thing applies to temporary tables - if you create and drop a
temporary table, is there a reason to still treat the session as tainted?


To implement prepared statements we need to store them in the session 
context or at least add some session-specific prefix to the prepared 
statement name.
Temporary tables also require per-session te

Re: block-level incremental backup

2019-07-29 Thread Jeevan Ladhe
Hi Jeevan,

I reviewed the first two patches -

0001-Add-support-for-command-line-option-to-pass-LSN.patch and
0002-Add-TAP-test-to-test-LSN-option.patch

from the set of incremental backup patches, and the changes look good to me.

I had some concerns around the way we are working around the fact that
pg_lsn_in() accepts an lsn of 0 as a valid lsn, which I think is
contradictory to the definition of InvalidXLogRecPtr. I have started a
separate new thread[1] for the same.

Also, I observe that commit 21f428eb has already moved the lsn decoding
logic to a separate function pg_lsn_in_internal(), so the function
decode_lsn_internal() from patch 0001 will go away and the dependent code
needs to be modified.

I shall review the rest of the patches, and post the comments.

Regards,
Jeevan Ladhe

[1]
https://www.postgresql.org/message-id/CAOgcT0NOM9oR0Hag_3VpyW0uF3iCU=bdufspfk9jrwxrcwq...@mail.gmail.com

On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi Anastasia,
>
> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
>> > I'm volunteering to write a draft patch or, more likely, set of
>> > patches, which
>> > will allow us to discuss the subject in more detail.
>> > And to do that I wish we agree on the API and data format (at least
>> > broadly).
>> > Looking forward to hearing your thoughts.
>>
>> Though the previous discussion stalled,
>> I still hope that we could agree on basic points such as a map file
>> format and protocol extension,
>> which is necessary to start implementing the feature.
>>
>
> It's great that you too came up with a PoC patch. I didn't look at your
> changes in much detail, but we at EnterpriseDB are also working on this feature
> and have started implementing it.
>
> Attached is the series of patches I have so far... (which need further
> optimization and adjustments, though)
>
> Here is the overall design (as proposed by Robert) we are trying to
> implement:
>
> 1. Extend the BASE_BACKUP command that can be used with replication
> connections. Add a new [ LSN 'lsn' ] option.
>
> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
> the option added to the server in #1.
>
> Here are the implementation details when we have a valid LSN
>
> sendFile() in basebackup.c is the function which mostly does the thing for
> us. If the filename looks like a relation file, then we'll need to consider
> sending only a partial file. The way to do that is probably:
>
> A. Read the whole file into memory.
>
> B. Check the LSN of each block. Build a bitmap indicating which blocks
> have an LSN greater than or equal to the threshold LSN.
>
> C. If more than 90% of the bits in the bitmap are set, send the whole file
> just as if this were a full backup. This 90% is a constant now; we might
> make it a GUC later.
>
> D. Otherwise, send a file with .partial added to the name. The .partial
> file contains an indication of which blocks were changed at the beginning,
> followed by the data blocks. It also includes a checksum/CRC.
> Currently, a .partial file format looks like:
>  - start with a 4-byte magic number
>  - then store a 4-byte CRC covering the header
>  - then a 4-byte count of the number of blocks included in the file
>  - then the block numbers, each as a 4-byte quantity
>  - then the data blocks
>
>
> We are also working on combining these incremental back-ups with the full
> backup and for that, we are planning to add a new utility called
> pg_combinebackup. Will post the details on that later once we have on the
> same page for taking backup.
>
> Thanks
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>
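
For reference, the .partial layout quoted above (magic number, header CRC,
block count, block numbers, then the data blocks) can be sketched roughly as
follows; the magic value and the use of zlib's crc32 are illustrative
assumptions only:

import struct, zlib

BLCKSZ = 8192
MAGIC = 0x50415254                       # hypothetical "PART" magic

def write_partial(changed):              # changed: {block_number: 8kB page image}
    blocknos = sorted(changed)
    body = struct.pack(">II", MAGIC, len(blocknos))
    body += b"".join(struct.pack(">I", b) for b in blocknos)
    crc = zlib.crc32(body) & 0xFFFFFFFF  # CRC over the header fields
    header = body[:4] + struct.pack(">I", crc) + body[4:]
    return header + b"".join(changed[b] for b in blocknos)

blob = write_partial({1: b"\x01" * BLCKSZ, 3: b"\x03" * BLCKSZ, 5: b"\x05" * BLCKSZ})
magic, crc, count = struct.unpack(">III", blob[:12])
assert magic == MAGIC and count == 3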
>


Re: should there be a hard-limit on the number of transactions pending undo?

2019-07-29 Thread Peter Geoghegan
On Mon, Jul 29, 2019 at 2:52 PM Thomas Munro  wrote:
> Thanks for sharing that.  I see they're giving that paper at VLDB next
> month in LA... I hope the talk video will be published on the web.
> While we've been working on a hybrid vaccum/undo design, they've built
> a hybrid undo/vacuum system.

It seems that this will be in a stable release soon, so it's not
pie-in-the-sky stuff. AFAICT, they have indexes that always point to
the latest row version. Getting an old version always required working
backwards from the latest. Perhaps the constant time recovery stuff is
somewhat like Postgres heapam when it comes to SELECTs, INSERTs, and
DELETEs, but much less similar when it comes to UPDATEs. This seems
like it might be an important distinction.

As the MVCC survey paper out of CMU [1] from a couple of years back says:

"The main idea of using logical pointers is that the DBMS uses a fixed
identifier that does not change for each tuple in its index entry.
Then, as shown in Fig. 5a, the DBMS uses an indirection layer that
maps a tuple’s identifier to the HEAD of its version chain. This
avoids the problem of having to update all of a table’s indexes to
point to a new physical location whenever a tuple is modified. (even
if the indexed attributes were not changed)."

To me, this suggests that zheap ought to make heap TIDs "more logical"
than they are with heapam today (heap TIDs are hybrid physical/logical
identifiers today). "Row forwarding" across heap pages is the
traditional way of ensuring that TIDs in indexes are stable even in
the worst case, apparently, but other approaches also seem possible.
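
As a toy model of the indirection-layer idea quoted from the survey (stable
logical identifiers in index entries, plus a map to the head of each version
chain), consider the following sketch; it is purely conceptual and not tied to
zheap's actual design:

versions = {}       # physical storage: version_id -> (payload, previous_version_id)
chain_head = {}     # indirection layer: logical tuple id -> current version_id
index = {}          # an index entry stores only the stable logical tuple id
next_vid = 0

def insert(tuple_id, key, payload):
    global next_vid
    versions[next_vid] = (payload, None)
    chain_head[tuple_id] = next_vid
    index[key] = tuple_id            # written once, never touched by updates
    next_vid += 1

def update(tuple_id, payload):
    global next_vid
    versions[next_vid] = (payload, chain_head[tuple_id])  # link to the prior version
    chain_head[tuple_id] = next_vid                       # only the map changes
    next_vid += 1

def lookup(key):
    return versions[chain_head[index[key]]][0]

insert(1, "k", "v1")
update(1, "v2")
assert lookup("k") == "v2"           # index entry unchanged, still finds the latest version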

[1] http://www.vldb.org/pvldb/vol10/p781-Wu.pdf
-- 
Peter Geoghegan




Re: TopoSort() fix

2019-07-29 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jul 29, 2019 at 5:55 PM Tom Lane  wrote:
>> FYI, I just got done inventing a way to reach that code, and I have
>> to suspect that it's impossible to do so in production, because under
>> ordinary circumstances no parallel worker will take any exclusive lock
>> that isn't already held by its leader.  (If you happen to know an
>> easy counterexample, let's see it.)

> I think the way you could make that happen would be to run a parallel
> query that calls a user-defined function which does LOCK TABLE.

I tried that first.  There are backstops preventing doing LOCK TABLE
in a worker, just like for advisory locks.

I believe the only accessible route to taking any sort of new lock
in a parallel worker is catalog lookups causing AccessShareLock on
a catalog.  In principle, maybe you could make a deadlock situation
by combining parallel workers with something that takes
AccessExclusiveLock on a catalog ... but making that into a reliable
test case sounds about impossible, because AEL on a catalog will
have all sorts of unpleasant side-effects, such as blocking
isolationtester's own queries.  (Getting it to work in a
CLOBBER_CACHE_ALWAYS build seems right out, for instance.)

regards, tom lane




Re: TopoSort() fix

2019-07-29 Thread Robert Haas
On Mon, Jul 29, 2019 at 9:48 PM Tom Lane  wrote:
> I tried that first.  There are backstops preventing doing LOCK TABLE
> in a worker, just like for advisory locks.
>
> I believe the only accessible route to taking any sort of new lock
> in a parallel worker is catalog lookups causing AccessShareLock on
> a catalog.

Can't the worker just query a previously-untouched table, maybe by
constructing a string and then using EXECUTE to execute it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: TopoSort() fix

2019-07-29 Thread Michael Paquier
On Mon, Jul 29, 2019 at 10:56:05AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> +1.  Any volunteers?
> 
> If Robert doesn't weigh in pretty soon, I'll take responsibility for it.

Thanks Tom for taking care of it!
--
Michael




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Amit Kapila
On Tue, Jul 30, 2019 at 12:18 AM Robert Haas  wrote:
>
> On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar  
> wrote:
>
> > Yes, I also think that the function would error out only because of
> > can't-happen cases, like "too many locks taken" or "out of binary heap
> > slots" or "out of memory" (this last one is not such a can't happen
> > case). These cases probably happen due to bugs, I suppose. But I
> > was wondering: generally, when the code errors out with such
> > can't-happen elog() calls, the worst thing that happens is that the
> > transaction gets aborted. Whereas, in this case, the worst thing that
> > could happen is that the undo action would never get executed, which
> > means selects for this tuple will keep on accessing the undo log.
> > This does not sound like a data consistency issue, so we should be
> > fine after all?
>
> I don't think so.  Every XID present in undo has to be something we
> can look up in CLOG to figure out which transactions are aborted and
> which transactions are committed, so that we know which transactions
> need undo.  If we forget to undo the transaction, we can't discard it,
> which means we can't advance the CLOG transaction horizon, which means
> we'll eventually start failing to assign XIDs, leading to a refusal of
> all write transactions.  Oops.
>
> More generally, it's not OK for the generic undo layer to make
> assumptions about whether the operations performed by the undo
> handlers are essential or not.  We don't want to impose a design
> constraint that undo can only be used for things that are not actually
> critical, because that will make it hard to write AMs that use it.
> And there's no reason to live with such a design constraint anyway,
> because, as noted above, CLOG truncation requires it.
>
> More generally still, some can't-happen situations should be checked
> via Assert() and others via elog(). For example, consider some code
> that looks up a syscache tuple and pulls data from the returned tuple.
> If the code that handles DDL is written in such a way that the tuple
> should always exist, then this is a can't-happen situation, but
> generally the code checks this via elog(), not Assert(), because it
> could also happen due to the catalog contents being corrupted.  If
> Assert() were used, the checks would not run in production builds, and
> a corrupt catalog would lead to a seg fault. An elog() is much
> friendlier. As a general principle, when a certain thing ought to
> always be true, but it being true depends on a whole lot of
> assumptions elsewhere in the code, and especially if it also depends
> on assumptions like "the database is not corrupted," I think elog() is
> preferable.  Assert() is better for things that are more localized and
> that really can't go wrong for any reason other than a bug.  In this
> case, I think I would tend towards elog(PANIC), but it's arguable.
>

Agreed, elog(PANIC) seems like a better way for this as compared to Assert.
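For a concrete picture of the syscache example Robert gives above, here is a
minimal sketch of the usual backend idiom (the function itself is invented for
illustration; the cache calls are the standard ones):

    #include "postgres.h"
    #include "access/htup_details.h"    /* GETSTRUCT */
    #include "catalog/pg_class.h"       /* Form_pg_class */
    #include "utils/syscache.h"

    /* Look up pg_class.relkind for a relation that "should" exist. */
    static char
    get_relkind(Oid relid)
    {
        HeapTuple   tup;
        char        relkind;

        tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
        if (!HeapTupleIsValid(tup))
            elog(ERROR, "cache lookup failed for relation %u", relid);
        relkind = ((Form_pg_class) GETSTRUCT(tup))->relkind;
        ReleaseSysCache(tup);
        return relkind;
    }

With an Assert(HeapTupleIsValid(tup)) instead of the elog(), a production
(non-assert) build would crash on the GETSTRUCT() dereference when the catalog
is corrupted, whereas the elog() reports an error and aborts only the
transaction.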

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread Amit Kapila
On Tue, Jul 30, 2019 at 2:34 AM Alvaro Herrera  wrote:
>
> On 2019-Jul-29, vignesh C wrote:
>
> > On Mon, Jul 29, 2019 at 8:24 PM Robert Haas  wrote:
> > >
> > > On Mon, Jul 29, 2019 at 4:10 AM vignesh C  wrote:
> > > > I could not locate the caller of ParsePrepareRecord function in 
> > > > twophase.c.
> > > > Any idea how it gets called?
> > > > or
> > > > Is it a dead function?
> > >
> > > It looks like it's not only dead, but stillborn.  Commit
> > > 1eb6d6527aae264b3e0b9c95aa70bb7a594ad1cf introduced it but without
> > > introducing any code that called it, and nothing has changed since
> > > then.
> >
> > I feel the code can be safely removed.
> > Patch for the same is attached.
>
> I think there's a patch from Fujii Masao that touches that?  Might be
> premature to remove it.
>

Okay, can you point to that patch?  Recently, Robert/Thomas have raised
a comment on the undo machinery [1] wherein we are considering storing
FullTransactionId in the two-phase file.  So, in that connection, we need
to modify this function as well.  It is not impossible to test an
unused function (we can try it by calling it superficially somewhere in
the code just for testing), but it would be better if it were used
somewhere.

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmob1Oby7Wc5ryB_VBccU9N%2BuSKjXXocgT9dY_edfxqSA8Q%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Patch for SortSupport implementation on inet/cdir

2019-07-29 Thread Brandur Leach
And a slightly amended version of the last patch, with a bug
fixed where IPv4 abbreviated keys were not being
initialized correctly on big-endian machines.


v5-0001-SortSupport-for-inet-cidr-types.patch
Description: Binary data


Re: TopoSort() fix

2019-07-29 Thread Alvaro Herrera
On 2019-Jul-29, Tom Lane wrote:

> FYI, I just got done inventing a way to reach that code, and I have
> to suspect that it's impossible to do so in production, because under
> ordinary circumstances no parallel worker will take any exclusive lock
> that isn't already held by its leader.

Hmm, okay, so this wasn't a bug that would have bit anyone in practice,
yeah?  That's reassuring.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread Alvaro Herrera
On 2019-Jul-30, Amit Kapila wrote:

> On Tue, Jul 30, 2019 at 2:34 AM Alvaro Herrera  
> wrote:

> > I think there's a patch from Fujii Masao that touches that?  Might be
> > premature to remove it.
> 
> Okay, can you point to that patch?

https://postgr.es/m/cahgqgwfqgrwmoorfboohxy1vdgm-ykwdwvwr_bd0tqxftjd...@mail.gmail.com

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread Michael Paquier
On Tue, Jul 30, 2019 at 08:42:06AM +0530, Amit Kapila wrote:
> Okay, can you point to that patch?

Here you go:
https://commitfest.postgresql.org/23/2105/
The thread is mostly waiting on Fujii-san for an update.
--
Michael




Re: Is ParsePrepareRecord dead function

2019-07-29 Thread vignesh C
On Tue, Jul 30, 2019 at 2:34 AM Alvaro Herrera  wrote:
>
> On 2019-Jul-29, vignesh C wrote:
>
> > On Mon, Jul 29, 2019 at 8:24 PM Robert Haas  wrote:
> > >
> > > On Mon, Jul 29, 2019 at 4:10 AM vignesh C  wrote:
> > > > I could not locate the caller of ParsePrepareRecord function in 
> > > > twophase.c.
> > > > Any idea how it gets called?
> > > > or
> > > > Is it a dead function?
> > >
> > > It looks like it's not only dead, but stillborn.  Commit
> > > 1eb6d6527aae264b3e0b9c95aa70bb7a594ad1cf introduced it but without
> > > introducing any code that called it, and nothing has changed since
> > > then.
> >
> > I feel the code can be safely removed.
> > Patch for the same is attached.
>
> I think there's a patch from Fujii Masao that touches that?  Might be
> premature to remove it.
>
Ok, it makes sense not to remove it when there is some work being done on it
in a different thread.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: TopoSort() fix

2019-07-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Jul-29, Tom Lane wrote:
>> FYI, I just got done inventing a way to reach that code, and I have
>> to suspect that it's impossible to do so in production, because under
>> ordinary circumstances no parallel worker will take any exclusive lock
>> that isn't already held by its leader.

> Hmm, okay, so this wasn't a bug that would have bit anyone in practice,
> yeah?  That's reassuring.

At the least, you'd have to go well out of your way to make it happen.

regards, tom lane




Re: block-level incremental backup

2019-07-29 Thread Jeevan Chalke
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas  wrote:

>
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together.  I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
>

I had a look over Anastasia's PoC patch to understand the approach she has
taken and here are my observations.

1.
The patch first creates a .blockmap file for each relation file, containing
an array of all modified block numbers. This is done by reading all blocks
from the file (in chunks of 4 blocks, 32kb in total, in a loop) and checking
each page's LSN against the given LSN. Later, to create the .partial file,
the relation file is opened again and all blocks are read in chunks of 4 in
a loop. If a block is found to be modified, it is copied into another
buffer, and after scanning all 4 blocks, the copied blocks are sent to the
.partial file.

In this approach, each file is opened and read twice, which looks more
expensive to me, whereas in my patch I do that just once. However, I read
the entire file into memory to check which blocks are modified, while in
Anastasia's design at most TAR_SEND_SIZE (32kb) is read at a time, in a
loop. I need to read the whole file because we want to know how heavily the
file was modified, so that we can send the entire file if it was modified
beyond the threshold (currently 90%).
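To make the per-block check concrete, here is a minimal sketch of the kind of
loop both approaches rely on (simplified to one block per read, with no error
handling; fp is an already-opened file for the relation fork, threshold_lsn is
the given LSN, and mark_block_modified() is a hypothetical bitmap setter):

    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "storage/block.h"      /* BlockNumber */
    #include "storage/bufpage.h"    /* Page, PageGetLSN, BLCKSZ */

    static int
    count_modified_blocks(FILE *fp, XLogRecPtr threshold_lsn)
    {
        char        page[BLCKSZ];
        BlockNumber blkno = 0;
        int         nmodified = 0;

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            /* Block changed at or after the threshold LSN? */
            if (PageGetLSN((Page) page) >= threshold_lsn)
            {
                mark_block_modified(blkno);     /* hypothetical bitmap setter */
                nmodified++;
            }
            blkno++;
        }
        return nmodified;
    }

If nmodified ends up above 90% of the total block count, the whole file would
be sent instead of a .partial file, per the design above.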

2.
Also, while sending modified blocks, they are copied into another buffer;
instead, they could be sent directly from the file contents already read
(in BLCKSZ-sized blocks). Here, the .blockmap created earlier is not used.
In my implementation, we send just a .partial file with a header containing
all the required details, like the number of changed blocks along with the
block numbers and a CRC, followed by the blocks themselves.

3.
I tried compiling Anastasia's patch, but am getting an error, so I could not
see or test how it behaves. Also, like a normal backup, the incremental
backup option needs to verify the checksums if requested.

4.
While combining the full and incremental backups, files from the incremental
backup are just copied into the full backup directory. In the design I
posted earlier, we go the other way round to avoid overwriting and other
issues, as I explained earlier.

I am almost done writing the patch for pg_combinebackup and will post soon.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: concerns around pg_lsn

2019-07-29 Thread Michael Paquier
On Mon, Jul 29, 2019 at 10:55:29PM +0530, Jeevan Ladhe wrote:
> I am attaching a patch that makes sure that *have_error is set to false in
> pg_lsn_in_internal() itself, rather than being caller dependent.

Agreed about making the code more defensive as you do.  I would keep
the initialization in check_recovery_target_lsn and pg_lsn_in_internal
though.  That does not hurt and makes the code easier to understand,
aka we don't expect an error by default in those paths.

> IIUC, in the comment above we clearly want to mark 0 as an invalid lsn (also
> further IIUC the comment states - lsn would start from (walSegSize + 1)).
> Given this, should not it be invalid to allow "0/0" as the value of
> type pg_lsn, or for that matter any number < walSegSize?

You can rely on "0/0" as a base point to calculate the offset in a
segment, so my guess is that we could break applications by generating
an error.  Please note that the behavior is much older than the
introduction of pg_lsn, as the original parsing logic has been removed
in 6f289c2b with validate_xlog_location() in xlogfuncs.c. 
--
Michael




Re: Contribution to Perldoc for TestLib module in Postgres

2019-07-29 Thread Michael Paquier
On Fri, Jul 26, 2019 at 09:51:34AM -0400, Andrew Dunstan wrote:
> I've fixed the bitrot and some other infelicities on this patch. It's
> not commitable yet but I think it's more reviewable.

Thanks, I had a look at this version.

+  # Returns the real directory for a virtual path directory under msys
+  real_dir(dir)
real_dir() is no more.

perl2host() is missing.

+  #TODO
+  command_like_safe(cmd, expected_stdout, test_name)
[...]
+=pod
+
+=item command_like_safe(cmd, expected_stdout, test_name)
+
+TODO
+
+=cut
Update not to miss.

+Runs the command which is passed as argument to the function. On failure it
+abandons further tests and exits the program.
"On failure the test suite exits immediately."


I think that the SYNOPSIS could be shaped better.  As of now it is a
simple succession of commands listed without any link to each other,
which is contrary, for example, to what we do in PostgresNode.pm,
where it shows a set of small examples that are useful for
understanding how to shape the tests and the possible interactions
between the routines of the module.  My take would be to keep it
simple and minimal, as TestLib.pm is the lowest level of our TAP test
infrastructure.  So here are some simple suggestions, and we could go
with this set to begin with:
# Test basic output of a command.
program_help_ok('initdb');
program_version_ok('initdb');
program_options_handling_ok('initdb');

# Test option combinations
command_fails(['initdb', '--invalid-option'],
  'command fails with invalid option');
my $tempdir = TestLib::tempdir;
command_ok('initdb', '-D', $tempdir);

Another thing is that the examples should not overlap with what
PostgresNode.pm presents, and that it is not necessary to show all
the routines.  It also makes little sense to describe the routines in
the synopsis in a way that duplicates the descriptions on top of
each routine.
--
Michael




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Dilip Kumar
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila  wrote:
>
> On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar  wrote:
> >
> > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas  wrote:
> > >
> > > I don't like the fact that undoaccess.c has a new global,
> > > undo_compression_info.  I haven't read the code thoroughly, but do we
> > > really need that?  I think it's never modified (so it could just be
> > > declared const),
> >
> > Actually, this will get modified otherwise across undo record
> > insertion how we will know what was the values of the common fields in
> > the first record of the page.  Another option could be that every time
> > we insert the record, read the value from the first complete undo
> > record on the page but that will be costly because for every new
> > insertion we need to read the first undo record of the page.
> >
>
> This information won't be shared across transactions, so can't we keep
> it in top transaction's state?   It seems to me that will be better
> than to maintain it as a global state.

I think this idea is good for DO time, but during REDO time it will
not work as we will not have the transaction state.  Having said that,
the current idea of keeping it in a global variable will also not work
during REDO time, because WAL from different transactions can be
interleaved.  There are a few ideas to handle this issue:

1.  At DO time keep it in TopTransactionState as you suggested, and during
recovery read it from the first complete record on the page.
2.  Just to keep the code uniform, always read it from the first complete
record of the page.

After putting in more thought I am more inclined towards idea 2.  Because
we are inserting our current record into that page anyway, we have already
read the buffer and also hold an exclusive lock on it.
So reading a few extra bytes from the buffer will not hurt us IMHO.

If someone has a better solution please suggest.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: using explicit_bzero

2019-07-29 Thread Michael Paquier
On Mon, Jul 29, 2019 at 11:30:53AM +0200, Peter Eisentraut wrote:
> Another patch, with various fallback implementations.

I have spotted some issues with this patch:
1) The list of port files @pgportfiles in Mkvcbuild.pm has not been
updated with the new file explicit_bzero.c, so the compilation would
fail with MSVC.
2) pg_config.h.win32 does not include the two new flags (same as
https://www.postgresql.org/message-id/20190624050850.ge1...@paquier.xyz)
3) What about CreateRole() and AlterRole() which can manipulate a
password in plain format before hashing?  (same message as previous
point).

Nit: src/port/explicit_bzero.c misses its IDENTIFICATION tag.
--
Michael




Re: using explicit_bzero

2019-07-29 Thread Andres Freund
Hi,

On 2019-07-29 11:30:53 +0200, Peter Eisentraut wrote:
> For platforms that don't have explicit_bzero(), provide various
> fallback implementations.  (explicit_bzero() itself isn't standard,
> but as Linux/glibc and OpenBSD have it, it's the most common spelling,
> so it makes sense to make that the invocation point.)

I think it's better to have a pg_explicit_bzero or such, and implement
that via the various platform-dependent mechanisms.  It's considerably
harder to understand code when a function that's normally not available
gets called, the buildsystem part is really hard to follow (with runtime
and source filenames differing, etc.), and it invites API breakages.
And it's not really more work to have our own name.


> +/*-
> + *
> + * explicit_bzero.c
> + *
> + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *-
> + */
> +
> +#include "c.h"
> +
> +#if defined(HAVE_MEMSET_S)
> +
> +void
> +explicit_bzero(void *buf, size_t len)
> +{
> + (void) memset_s(buf, len, 0, len);
> +}
> +
> +#elif defined(WIN32)
> +
> +#include "c.h"

Hm?


> +/*
> + * Indirect call through a volatile pointer to hopefully avoid dead-store
> + * optimisation eliminating the call.  (Idea taken from OpenSSH.)  We can't
> + * assume bzero() is present either, so for simplicity we define our own.
> + */
> +
> +static void
> +bzero2(void *buf, size_t len)
> +{
> + memset(buf, 0, len);
> +}
> +
> +static void (* volatile bzero_p)(void *, size_t) = bzero2;

Hm, I'm not really sure that this does that much. Especially when the
call is via a function in the same translation unit.


> +void
> +explicit_bzero(void *buf, size_t len)
> +{
> + bzero_p(buf, len);

I've not followed this discussion. But why isn't the obvious
implementation here memset(...); pg_compiler_barrier()?

A quick web search indicates that that's what a bunch of projects in the
cryptography space also ended up with (well, __asm__ __volatile__("" :::
"memory"), which is what pg_compiler_barrier boils down to for
gcc/clang/compatibles).
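For reference, a minimal sketch of that alternative (not the patch's code; the
inline asm is the gcc/clang form that pg_compiler_barrier() expands to, as
noted above):

    #include <string.h>

    void
    explicit_bzero(void *buf, size_t len)
    {
        memset(buf, 0, len);
    #if defined(__GNUC__) || defined(__clang__)
        /*
         * Compiler-level memory barrier: tells the compiler that memory may
         * be read afterwards, so the memset() store is not dead.
         */
        __asm__ __volatile__("" ::: "memory");
    #endif
    }

A compiler barrier only constrains the compiler, not the hardware, but that is
all that is needed here, since the goal is merely to keep the store from being
optimized away.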

Greetings,

Andres Freund




Re: tap tests driving the database via psql

2019-07-29 Thread Craig Ringer
On Sun, 28 Jul 2019 at 03:15, Andres Freund  wrote:

> 1) Just depend on DBD::Pg being installed. It's fairly common, after
>all. It'd be somewhat annoying that we'd often end up using a
>different version of libpq than what we're testing against. But in
>most cases that'd not be particularly problematic.
>

I advocated for this in the past, and still think it's the best option.

>
> 4) We develop a fairly minimal pure perl database driver, that doesn't
>depend on DBI. Include it somewhere as part of the test code, instead
>of src/interfaces, so it's clearer that it's not ment as an actual
>official driver.
>

Why not write a new language interpreter while we're at it, and maybe a
compiler and runtime? :p

The community IMO wastes *so* much time on not-invented-here make-work and
on jumping through hoops to avoid depending on anything newer than the late
'90s. I'd rather not go deeper down that path. If someone on SunOS or SCO
OpenServer or whatever doesn't want to install DBD::Pg, have the TAP tests
just skip the tests on that platform and make it the platform owner's
problem.


> Craig, IIRC you'd thought about this before too?
>

Right. But IIRC Tom vetoed it on grounds of not wanting to expect buildfarm
operators to install it, something like that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: allow online change primary_conninfo

2019-07-29 Thread Michael Paquier
On Mon, Jul 01, 2019 at 02:33:39PM +0300, Sergei Kornilov wrote:
> Updated version attached. The merge conflict was about the test count in
> 001_stream_rep.pl. Nothing else was changed. My approach may still be
> incorrect; any redesign ideas are welcome. Thanks in advance!

It has been some time, and I am finally catching up with this patch.

+ 
+  WAL receiver will be restarted after 
primary_slot_name
+  was changed.
  
The sentence sounds strange.  Here is a suggestion:
The WAL receiver is restarted after an update of primary_slot_name (or
primary_conninfo).

The comment at the top of the call of ProcessStartupSigHup() in
HandleStartupProcInterrupts() needs to be updated as it mentions a
configuration file re-read, but that's not the case anymore.

pendingRestartSource's name does not represent what it does, as it is
only associated with the restart of a WAL receiver when in streaming
state, and that's a no-op for the archive mode and the local mode.

So, the patch splits the steps taken when checking for a WAL source by
adding an extra step after the failure handling that you are calling
the restart step.  When a failure happens for the stream mode
(shutdown of the WAL receiver, promotion, etc.), there is a switch to the
archive mode, and nothing is changed in this case in your patch.  So
when shutting down the WAL receiver after a parameter change, what
happens is that the startup process waits for wal_retrieve_retry_interval
before moving back to the archive mode.  Only after scanning the archives
again do we restart a new WAL receiver.  However, if the restart of
the WAL receiver is planned because of an update of primary_conninfo
(or slot), shouldn't the follow-up mode be XLOG_FROM_STREAM without
waiting for wal_retrieve_retry_interval ms for extra WAL to become
available?
--
Michael




Re: POC: Cleaning up orphaned files using undo logs

2019-07-29 Thread Thomas Munro
On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar  wrote:
> On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila  wrote:
> > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar  wrote:
> > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas  
> > > wrote:
> > > > I don't like the fact that undoaccess.c has a new global,
> > > > undo_compression_info.  I haven't read the code thoroughly, but do we
> > > > really need that?  I think it's never modified (so it could just be
> > > > declared const),
> > >
> > > Actually, this will get modified otherwise across undo record
> > > insertion how we will know what was the values of the common fields in
> > > the first record of the page.  Another option could be that every time
> > > we insert the record, read the value from the first complete undo
> > > record on the page but that will be costly because for every new
> > > insertion we need to read the first undo record of the page.
> > >
> >
> > This information won't be shared across transactions, so can't we keep
> > it in top transaction's state?   It seems to me that will be better
> > than to maintain it as a global state.
>
> I think this idea is good for DO time, but during REDO time it will
> not work as we will not have the transaction state.  Having said that,
> the current idea of keeping it in a global variable will also not work
> during REDO time, because WAL from different transactions can be
> interleaved.  There are a few ideas to handle this issue:
>
> 1.  At DO time keep it in TopTransactionState as you suggested, and during
> recovery read it from the first complete record on the page.
> 2.  Just to keep the code uniform, always read it from the first complete
> record of the page.
>
> After putting in more thought I am more inclined towards idea 2.  Because
> we are inserting our current record into that page anyway, we have already
> read the buffer and also hold an exclusive lock on it.
> So reading a few extra bytes from the buffer will not hurt us IMHO.
>
> If someone has a better solution please suggest.

Hi Dilip,

Here's some initial review of the following patch (from your public
undo_interface_v1 branch as of this morning).  I haven't tested this
version yet, because my means of testing this stuff involves waiting
for undoprocessing to be rebased, so that I can test it with my
orphaned files stuff and other test programs.  It contains another
suggestion for that problem you just mentioned (and also me pointing
out what you just pointed out, since I wrote it earlier) though I'm
not sure if it's better than your options above.

> commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e
> Author: Dilip Kumar 
> Date:   Thu May 2 11:28:13 2019 +0530
>
>Provide interfaces to store and fetch undo records.

+#include "commands/tablecmds.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"

"miscadmin.h" comes before "storage...".

+/*
+ * Compute the size of the partial record on the undo page.
+ *
+ * Compute the complete record size by uur_info and variable field length
+ * stored in the page header and then subtract the offset of the record so that
+ * we can get the exact size of partial record in this page.
+ */
+static inline Size
+UndoPagePartialRecSize(UndoPageHeader phdr)
+{
+Sizesize;

We decided to use size_t everywhere in new code (except perhaps
functions conforming to function pointer types that historically use
Size in their type).

+/*
+ * Compute the header size from undo record uur_info, stored in the page
+ * header.
+ */
+size = UndoRecordHeaderSize(phdr->uur_info);
+
+/*
+ * Add length of the variable part and undo length. Now, we know the
+ * complete length of the undo record.
+ */
+size += phdr->tuple_len + phdr->payload_len + sizeof(uint16);
+
+/*
+ * Subtract the size which is stored in the previous page to get the
+ * partial record size stored in this page.
+ */
+size -= phdr->record_offset;
+
+return size;

This is probably a stupid question but why isn't it enough to just
store the offset of the first record that begins on this page, or 0
for none yet?  Why do we need to worry about the partial record's
payload etc?

+UndoRecPtr
+PrepareUndoInsert(UndoRecordInsertContext *context,
+  UnpackedUndoRecord *urec,
+  Oid dbid)
+{
...
+/* Fetch compression info for the transaction. */
+compression_info = GetTopTransactionUndoCompressionInfo(category);

How can this work correctly in recovery?  [Edit: it doesn't, as you
just pointed out]

I had started reviewing an older version of your patch (the version
that had made it as far as the undoprocessing branch as of recently),
before I had the bright idea to look for a newer version.  I was going
to object to the global variable you had there in the earlier version.
It seems to me that you have to be able to reproduce the exact same
compressio