RE: libpq debug log

2019-03-04 Thread Iwata, Aya
Hi,

Thank you for your comments and advice.

I'd like to consider your suggestions.
I am planning to change libpq logging like this;

1. Expand PQtrace() facility and improve libpq logging.

2. Prepare "output level". There are 3 types of levels;
- TRADITIONAL   :  if set, outputs protocol messages
- LEVEL1        :  if set, outputs phase and time
- LEVEL2        :  if set, outputs both TRADITIONAL and LEVEL1 information

3. Add "output phase" information; this is the processing flow (e.g. when 
PQexec() starts and ends).

4. Change the output method to a callback style; the default callback writes to 
stdout, and other callback functions that will be used more frequently will 
also be prepared.

5. Initialization method; 
Current:  PQtrace(PGconn *conn, FILE *stream);
Proposed: PQtraceEx(PGconn *conn, FILE *stream, PQloggingProcessor callback_func,
          void *callback_arg, PGLogminlevel level);
PQtrace() can be used as it is to preserve compatibility with existing 
applications, so I leave PQtrace() in place and add a new function PQtraceEx(). 
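
For reference, here is a minimal sketch of how the existing PQtrace() facility 
is used today, for contrast with the proposed PQtraceEx(); the connection 
string and trace file name are just placeholders.

    /* Uses only the existing libpq tracing API (PQtrace/PQuntrace). */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("dbname=postgres");
        FILE   *trace = fopen("libpq_trace.log", "w");

        if (PQstatus(conn) == CONNECTION_OK && trace != NULL)
        {
            PQtrace(conn, trace);               /* protocol dump goes to the FILE */
            PQclear(PQexec(conn, "SELECT 1"));  /* traffic for this query is traced */
            PQuntrace(conn);
        }

        if (trace != NULL)
            fclose(trace);
        PQfinish(conn);
        return 0;
    }

The proposal above keeps this call working unchanged and adds PQtraceEx() with 
the callback, callback_arg and level arguments on top of it.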

After discussing the above, maybe we can then discuss more about enabling trace 
output and changing the output style.

What do you think? I would appreciate your comments and suggestions.
 
Regards,
Aya Iwata




Re: Minimal logical decoding on standbys

2019-03-04 Thread Amit Khandekar
On Fri, 14 Dec 2018 at 06:25, Andres Freund  wrote:
> I've a prototype attached, but let's discuss the details in a separate
> thread. This also needs to be changed for pluggable storage, as we don't
> know about table access methods in the startup process, so we can't
> determine which AM the heap is from during
> btree_xlog_delete_get_latestRemovedXid() (and sibling routines).

Attached is a WIP test patch
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
modified version of Craig Ringer's test cases
(012_logical_decoding_on_replica.pl) that he had attached in [1].
Here, I have also attached his original file
(Craigs_012_logical_decoding_on_replica.pl).

Also attached are rebased versions of a couple of Andres's implementation patches.

I have added a new test scenario:
DROP TABLE on the master *before* the logical records of the table
insertions are retrieved from the standby. The logical records should
still be retrieved successfully.


Regarding the test result failures, I could see that when we drop a
logical replication slot on the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; and that's the reason a couple of test scenarios are
failing:

ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
#   Failed test 'xmin and catalog_xmin equal after slot drop'
#   at t/016_logical_decoding_on_replica.pl line 272.
#  got:
# expected: 2584



Other than the above, there is this test scenario which I had to remove:

#
# Conflict with recovery: xmin cancels decoding session
#
#
# Start a transaction on the replica then perform work that should cause a
# recovery conflict with it. We'll check to make sure the client gets
# terminated with recovery conflict.
#
# Temporarily disable hs feedback so we can test recovery conflicts.
# It's fine to continue using a physical slot, the xmin should be
# cleared. We only check hot_standby_feedback when establishing
# a new decoding session so this approach circumvents the safeguards
# in place and forces a conflict.

This test starts pg_recvlogical, and expects it to be terminated due
to recovery conflict because hs feedback is disabled.
But that does not happen; instead, pg_recvlogical does not return.

But I am not sure why it does not terminate with Andres's patch; it
was expected to terminate with Craig Ringer's patch.

Further, there are subsequent test scenarios that test pg_recvlogical
with hs_feedback disabled, which I have removed because pg_recvlogical
does not return. I am yet to clearly understand why that happens. I
suspect that is only because hs_feedback is disabled.

Also, the testcases verify pg_controldata's oldestCatalogXmin values,
which are now not present with Andres's patch; so I removed tracking
of oldestCatalogXmin.

[1] 
https://www.postgresql.org/message-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=x...@mail.gmail.com

Thanks
-Amit Khandekar


0001-Logical-decoding-on-standby_rebased.patch
Description: Binary data
#!/usr/bin/env perl
# Demonstrate that logical decoding can follow timeline switches.
#
# Test logical decoding on a standby.
#
use strict;
use warnings;
use 5.8.0;

use PostgresNode;
use TestLib;
use Test::More tests => 77;
use RecursiveCopy;
use File::Copy;

my ($stdin, $stdout, $stderr, $ret, $handle, $return);
my $backup_name;

# Initialize master node
my $node_master = get_new_node('master');
$node_master->init(allows_streaming => 1, has_archiving => 1);
$node_master->append_conf('postgresql.conf', q{
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
log_min_messages = 'debug2'
log_error_verbosity = verbose
# send status rapidly so we promptly advance xmin on master
wal_receiver_status_interval = 1
# very promptly terminate conflicting backends
max_standby_streaming_delay = '2s'
});
$node_master->dump_info;
$node_master->start;

$node_master->psql('postgres', q[CREATE DATABASE testdb]);

$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
$backup_name = 'b1';
my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');

open(my $fh, "<", $backup_dir . "/recovery.conf")
  or die "can't open recovery.conf";

my $found = 0;
while (my $line = <$fh>)
{
	chomp($line);
	if ($line eq "primary_slot_name = 'decoding_standby'")
	{
		$found = 1;
		last;
	}
}
ok($found, "using physical slot for standby");

sub print_phys_xmin
{
	my $slot = $node_master->slot('decoding_standby');
	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});

RE: pgbench - doCustom cleanup

2019-03-04 Thread Fabien COELHO



Hello Kirk,


so I tried to apply the patch, but part of the chunk failed,
because of the unused line below which was already removed in the
recent related commit.

  PGresult   *res;

I removed the line and fixed the other trailing whitespaces.


Indeed. Thanks for the fix.


See the attached latest patch.
The attached patch applies, builds cleanly,
and passes the regression tests.


[...]


No problems on my part as I find the changes logical.
This also needs a confirmation from Alvaro.


Ok.

You switched the patch to "waiting on author": what are you waiting for from 
me?


If you think that it is ok and that it should be considered by a 
committer, probably Alvaro, it can be marked as "ready".


--
Fabien.



Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-03-04 Thread Kyotaro HORIGUCHI
Hello.

At Sun, 2 Dec 2018 16:41:06 -0800, Noah Misch  wrote in 
<20181203004106.ga2860...@rfd.leadboat.com>
> On Thu, Nov 29, 2018 at 10:51:40PM -0800, Noah Misch wrote:
> > On Thu, Nov 29, 2018 at 01:10:57PM +0100, Dmitry Dolgov wrote:
> > > As a side note, with this patch recovery tests are failing now on 
> > > 016_shm.pl
> > > 
> > > #   Failed test 'detected live backend via shared memory'
> > > #   at t/016_shm.pl line 87.
> > > #   '2018-11-28 13:08:08.504 UTC [21924] LOG:
> > > listening on Unix socket "/tmp/yV2oDNcG8e/gnat/.s.PGSQL.512"
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  database system was
> > > interrupted; last known up at 2018-11-28 13:08:08 UTC
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  database system was not
> > > properly shut down; automatic recovery in progress
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  invalid record length at
> > > 0/165FEF8: wanted 24, got 0
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  redo is not required
> > > # 2018-11-28 13:08:08.516 UTC [21924] LOG:  database system is ready
> > > to accept connections
> > > # '
> > > # doesn't match '(?^:pre-existing shared memory block)'
> > 
> > Thanks for the report.  Since commit cfdf4dc made pg_sleep() react to
> > postmaster death, the test will need a different way to stall a backend.  
> > This
> > doesn't affect non-test code, and the test still passes against cfdf4dc^ and
> > against REL_11_STABLE.  I've queued a task to update the test code, but 
> > review
> > can proceed in parallel.

I found that I have 65(h) segments left behind on my environment :p

At Sat, 11 Aug 2018 23:48:15 -0700, Noah Misch  wrote in 
<20180812064815.gb2301...@rfd.leadboat.com>
> still doesn't detect.  I could pursue a fix via the aforementioned
> sysv_shmem_key file, modulo the possibility of a DBA removing it.  I could
> also, when postmaster.pid is missing, make sysv_shmem.c check the first N
> (N=100?) keys applicable to the selected port.  My gut feeling is that neither
> thing is worthwhile, but I'm interested to hear other opinions.

# Though I don't get the meaning of the "modulo" there..

I think the only thing we must avoid here is sharing the same
shmem segment with a living-dead server. If we can do that
without the pid file, it would be better than relying on it. We
could remove orphaned segments automatically, but I don't think
we should go so far as relying on a dedicated file for that.
Also, I don't think it's worth stopping the shmem id scanning
at a certain number, since I can't come up with an appropriate
number for it. But looking at "port * 1000", it might be expected
that a usable segment id will be found while scanning that number
of ids (1000).

> Here's a version fixing that test for post-cfdf4dc backends.

This moves what was in PGSharedMemoryIsInUse into a new function,
IpcMemoryAnalyze, for reuse, then adds a distinction between
EEXISTS/FOREIGN in the not-in-use cases and ATTACHED/ANALYSIS_FAILURE
in the in-use cases. UNATTACHED is changed to a not-in-use case by
this patch.  As a result, PGSharedMemoryIsInUse() is changed so
that it returns "not-in-use" for the UNATTACHED case. It looks
fine.

PGSharedMemoryCreate is changed to choose a usable shmem id using
IpcMemoryAnalyze().  But some of the statuses from
IpcMemoryAnalyze() are concealed by a failure of
PGSharedMemoryAttach() and silently ignored, as opposed to what the
code intends to do. (By the way, SHMSTATE_EEXISTS seems to suggest
the opposite of EEXIST, which could be confusing.)

PGSharedMemoryCreate() repeats shmat/shmdt twice in every
iteration. It does not do much harm, but it would be better if we
could get rid of that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: speeding up planning with partitions

2019-03-04 Thread Imai, Yoshikazu
Amit-san,

On Fri, Mar 1, 2019 at 1:02 PM, Amit Langote wrote:
> Thanks for testing and sorry it took me a while to reply.

Thanks for working on this late at night. I know you have a lot of things to 
do.

> On 2019/02/25 15:24, Imai, Yoshikazu wrote:
> > [update_pt_with_joining_another_pt.sql]
> > update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;
> >
> > [pgbench]
> > pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60
> > postgres
> >
> > [results]
> > (part_num_rt, part_num_jrt)  master  patched(0001)
> > ---  --  -
> >   (3, 1024)8.06   5.89
> >   (3, 2048)1.52   0.87
> >   (6, 1024)4.11   1.77
> >
> >
> >
> > With HEAD, we add target inheritance and source inheritance to
> > parse->rtable in inheritance_planner and copy and adjust it for child
> > tables at beginning of each planning of child tables.
> >
> > With the 0001 patch, we add target inheritance to parse->rtable in
> > inheritance_planner and add source inheritance to parse->rtable in
> > make_one_rel (under grouping_planner()) during each planning of child
> > tables.
> > Adding source inheritance to parse->rtable may be the same process
> > between each planning of child tables and it might be useless. OTOH, from
> > the POV of making inheritance-expansion-at-bottom better, expanding
> > source inheritance in make_one_rel seems correct design to me.
> >
> > How should we do that...?
> 
> To solve this problem, I ended up teaching inheritance_planner to reuse
> the objects for *source* inheritance child relations (RTEs,
> AppendRelInfos, and PlanRowMarks) created during the planning of the 1st
> child query for the planning of subsequent child queries.  Set of source
> child relations don't change between different planning runs, so it's
> okay to do so.  Of course, I had to make sure that query_planner (which
> is not in the charge of adding source inheritance child objects) can notice
> that.

I did the above test again with the v25 patch and confirmed the problem is solved.

[results]
(part_num_rt, part_num_jrt)  master  patched(0001)
---  --  -
  (3, 1024)6.11   6.82
  (3, 2048)1.05   1.48
  (6, 1024)3.05   3.45

Sorry that I haven't checked the code related to this problem yet, but I'll 
check it later.


> Please find attached updated patches.  Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I've taken a glance at the code and have some minor comments about the 
patch.

* You changed the usage/arguments of get_relation_info, but why did you do it? I 
reverted those changes to the original code and checked that it passes make check. So 
ISTM there are no logical problems with not changing it. Or if you do change it, 
how about also changing the usage/arguments of get_relation_info_hook in the same 
way?

-get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
- RelOptInfo *rel)
+get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
 {
+   bool        inhparent = rte->inh;
-   relation = table_open(relationObjectId, NoLock);
+   relation = heap_open(rte->relid, NoLock);
  ...
if (get_relation_info_hook)
-   (*get_relation_info_hook) (root, relationObjectId, inhparent, 
rel);
+   (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);


@@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo 
*parent)
/* Check type of rtable entry */
switch (rte->rtekind)
{
case RTE_RELATION:
/* Table --- retrieve statistics from the system 
catalogs */
-   get_relation_info(root, rte->relid, rte->inh, rel);
+   get_relation_info(root, rte, rel);


* You moved the code that initializes an append rel's partitioned_child_rels 
from set_append_rel_size() to build_simple_rel(), but is that better? I also 
checked that the original code passes make check, as above.

@@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
Assert(IS_SIMPLE_REL(rel));
 
/*
-* Initialize partitioned_child_rels to contain this RT index.
-*
-* Note that during the set_append_rel_pathlist() phase, we will bubble 
up
-* the indexes of partitioned relations that appear down in the tree, so
-* that when we've created Paths for all the children, the root
-* partitioned table's list will contain all such indexes.
-*/
-   if (rte->relkind == RELKIND_PARTITIONED_TABLE)
-   rel->partitioned_child_rels = list_make1_int(rti);

@@ -274,55 +327,287 @@ build_simple_rel(PlannerInfo *root, int relid, 
RelOptInfo *parent)
   

Re: amcheck verification for GiST

2019-03-04 Thread Heikki Linnakangas

There's a little copy-pasto in gist_check_page_keys():


+   for (o = FirstOffsetNumber; o <= parent_maxoff; o = OffsetNumberNext(i))


Should be "OffsetNumberNext(o)".

I tested this patch with your testing patch from the other thread (after 
fixing the above), to leave behind incompletely split pages [1]. It 
seems that the amcheck code doesn't expect incomplete splits:


postgres=# SELECT gist_index_parent_check('x_c_idx');
ERROR:  index "x_c_idx" has inconsistent records

[1] 
https://www.postgresql.org/message-id/EB87A69B-EE5E-4259-9EEB-DA9DC1F7E265%40yandex-team.ru


- Heikki



Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 21:26, Andrey Borodin wrote:

On 3 Jan 2019, at 23:47, Andrey Borodin wrote:



* Bitmapset stores 32 bit signed integers, but BlockNumber is
unsigned. So this will fail with an index larger than 2^31
blocks. That's perhaps academical, I don't think anyone will try
to create a 16 TB GiST index any time soon. But it feels wrong to
introduce an arbitrary limitation like that.


Looks like bitmapset is unsuitable for collecting block numbers due
to the interface. Let's just create a custom container for this? Or
there is one other option: for each block number, throw away the sign
bit and consider a page potentially internal and potentially an empty
leaf if (blkno & 0x7FF) is in the bitmapset. And then we will have
to create at least one 17Tb GiST to check it actually works.


Heikki, what do you think, is implementing our own radix tree for this
a viable solution? I've written a working implementation with a 4-level
statically typed tree. If we follow this route, there should probably
be tests for this data structure.


Yeah, seems reasonable.

- Heikki



Re: [HACKERS] CLUSTER command progress monitor

2019-03-04 Thread Tatsuro Yamada

On 2019/03/02 4:15, Robert Haas wrote:

On Thu, Feb 28, 2019 at 11:54 PM Tatsuro Yamada
 wrote:

Attached patch is wip patch.



Thanks for your comments! :)
I revised the code and the document.



+   CLUSTER and VACUUM FULL,
showing current progress.

and -> or


Fixed.



+   certain commands during command execution.  Currently, the suppoted
+   progress reporting commands are VACUUM and
CLUSTER.

suppoted -> supported

But I'd just say: Currently, the only commands which support progress
reporting are VACUUM and
CLUSTER.


I choose the latter. Fixed.



+   Running VACUUM FULL is listed in
pg_stat_progress_cluster
+   view because it uses CLUSTER command
internally.  See .

How about: Running VACUUM FULL is listed in
pg_stat_progress_cluster because both
VACUUM FULL and CLUSTER rewrite
the table, while regular VACUUM only modifies it in
place.


Fixed.



+   Current processing command: CLUSTER/VACUUM FULL.

The command that is running.  Either CLUSTER or VACUUM FULL.


Fixed.



+   Current processing phase of cluster/vacuum full.  See  or .

Current processing phase of CLUSTER or VACUUM FULL.

Or maybe better, just abbreviate to: Current processing phase.


Fixed as you suggested.



+   Scan method of table: index scan/seq scan.

Eh, shouldn't this be gone now?  And likewise for the view definition?


Fixed. Sorry, it was an oversight.



+   OID of the index.

If the table is being scanned using an index, this is the OID of the
index being used; otherwise, it is zero.


Fixed.



+ heap_tuples_total

Leftovers.  Skipping over the rest of your documentation changes since
it looks like a bunch of things there still need to be updated.


I agree. Thanks a lot!
I'll divide the patch into two patches: code and documentation.



+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);

This now appears inside cluster_rel(), but also vacuum_rel() is still
doing the same thing.  That's wrong.


It was an oversight too. I fixed it.



+ if(OidIsValid(indexOid))

Missing space.  Please pgindent.


Fixed.
I will do pgindent later.


Please find attached files. :)

Thanks,
Tatsuro Yamada

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..0d0f8f0e31 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,30 @@ CREATE VIEW pg_stat_progress_vacuum AS
 FROM pg_stat_get_progress_info('VACUUM') AS S
 		LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_progress_cluster AS
+SELECT
+S.pid AS pid,
+S.datid AS datid,
+D.datname AS datname,
+S.relid AS relid,
+CASE S.param1 WHEN 1 THEN 'CLUSTER'
+  WHEN 2 THEN 'VACUUM FULL'
+  END AS command,
+CASE S.param2 WHEN 0 THEN 'initializing'
+  WHEN 1 THEN 'seq scanning heap'
+  WHEN 2 THEN 'index scanning heap'
+  WHEN 3 THEN 'sorting tuples'
+  WHEN 4 THEN 'writing new heap'
+  WHEN 5 THEN 'swapping relation files'
+  WHEN 6 THEN 'rebuilding index'
+  WHEN 7 THEN 'performing final cleanup'
+  END AS phase,
+S.param3 AS cluster_index_relid,
+S.param4 AS heap_tuples_scanned,
+S.param5 AS heap_tuples_vacuumed
+FROM pg_stat_get_progress_info('CLUSTER') AS S
+LEFT JOIN pg_database D ON S.datid = D.oid;
+
 CREATE VIEW pg_user_mappings AS
 SELECT
 U.oid   AS umid,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a74af4c171..f22ff590f0 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -35,10 +35,12 @@
 #include "catalog/objectaccess.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
+#include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -275,6 +277,8 @@ cluster_rel(Oid tableOid, Oid indexOid, int options)
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
 
+	pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
 	/*
 	 * We grab exclusive access to the target rel and index for the duration
 	 * of the transaction.  (This is redundant for the single-transaction
@@ -385,6 +389,18 @@ cluster_rel(Oid tableOid, Oid indexOid, int options)
 	 */
 	CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
 
+	/* Set command to column */
+	if (OidIsValid(indexOid))
+	{
+		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+	 PROGRESS_CLUSTER_COMMAND_CLUSTER);
+	}
+	else
+	{
+		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+	 PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+	}
+
 	/*

Re: unconstify equivalent for volatile

2019-03-04 Thread Peter Eisentraut
On 2019-02-25 09:29, Peter Eisentraut wrote:
> On 2019-02-22 21:31, Andres Freund wrote:
>> On 2019-02-22 12:38:35 +0100, Peter Eisentraut wrote:
>>> On 2019-02-19 18:02, Andres Freund wrote:
 But even if we were to decide we'd want to keep a volatile in SetLatch()
 - which I think really would only serve to hide bugs - that'd not mean
 it's a good idea to keep it on all the other functions in latch.c.
> 
>> Right. But we should never look/write into the contents of a latch
>> outside of latch.c, so I don't think that'd really be a problem, even if
>> we relied on volatiles.
> 
> So how about this patch?

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2019-03-04 Thread Tatsuro Yamada

On 2019/03/02 4:15, Robert Haas wrote:

On Thu, Feb 28, 2019 at 11:54 PM Tatsuro Yamada
 wrote:

Attached patch is wip patch.


I rewrote the current design of the progress monitor and also
wrote discussion points in the middle of this email. I'd like to
get any feedback from -hackers.


=== Current design ===

CLUSTER command uses Index Scan or Seq Scan when scanning the heap.
Depending on which one is chosen, the command will proceed in the
following sequence of phases:

  * Scan method: Seq Scan
0. initializing (*2)
1. seq scanning heap(*1)
3. sorting tuples   (*2)
4. writing new heap (*1)
5. swapping relation files  (*2)
6. rebuilding index (*2)
7. performing final cleanup (*2)

  * Scan method: Index Scan
0. initializing (*2)
2. index scanning heap  (*1)
5. swapping relation files  (*2)
6. rebuilding index (*2)
7. performing final cleanup (*2)

VACUUM FULL command will proceed in the following sequence of phases:

1. seq scanning heap(*1)
5. swapping relation files  (*2)
6. rebuilding index (*2)
7. performing final cleanup (*2)

(*1): increasing the value in heap_tuples_scanned column
(*2): only shows the phase in the phase column

The view provides the CLUSTER command progress details as follows:
# \d pg_stat_progress_cluster
  View "pg_catalog.pg_stat_progress_cluster"
  Column   |  Type   | Collation | Nullable | Default
---+-+---+--+-
 pid   | integer |   |  |
 datid | oid |   |  |
 datname   | name|   |  |
 relid | oid |   |  |
 command   | text|   |  |
 phase | text|   |  |
 cluster_index_relid   | bigint  |   |  |
 heap_tuples_scanned   | bigint  |   |  |
 heap_tuples_vacuumed  | bigint  |   |  |


=== Discussion points ===

 - Progress counter for "3. sorting tuples" phase
   - Should we add pgstat_progress_update_param() in tuplesort.c, like
     "trace_sort" does? (See the sketch after this list.)
     Thanks to Peter Geoghegan for the useful advice!

 - Progress counter for "6. rebuilding index" phase
   - Should we add "index_vacuum_count" to the view, like the vacuum progress
     monitor? If yes, I'll add pgstat_progress_update_param() to
     reindex_relation() in index.c. However, I'm not sure whether that is okay.

 - pg_stat_progress_rewrite
   - TBA
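
As a hypothetical sketch only, a counter for the sorting phase could look like 
the following. PROGRESS_CLUSTER_HEAP_TUPLES_SORTED and its index are assumptions 
(they are not in the patch); pgstat_progress_update_param() is the existing 
reporting call the other phases already use.

    #include "postgres.h"

    #include "commands/progress.h"
    #include "pgstat.h"

    /* assumed new progress-column index; would map to param6 of the view */
    #define PROGRESS_CLUSTER_HEAP_TUPLES_SORTED   5

    static void
    report_tuple_sorted(int64 tuples_sorted)
    {
        /* updates this backend's row in pg_stat_progress_cluster */
        pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SORTED,
                                     tuples_sorted);
    }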


=== My test case ===

I share my test case for the progress monitor.
If someone wants to watch the current progress monitor, you can use
this test case as an example.

[Terminal1]
Run this query on psql:

   select * from pg_stat_progress_cluster; \watch 0.05

[Terminal2]
Run these queries on psql:

drop table t1;

create table t1 as select a, random() * 1000 as b from generate_series(0, 
99) a;
create index idx_t1 on t1(a);
create index idx_t1_b on t1(b);
analyze t1;

-- index scan
set enable_seqscan to off;
cluster verbose t1 using idx_t1;

-- seq scan
set enable_seqscan to on;
set enable_indexscan to off;
cluster verbose t1 using idx_t1;

-- only given table name to cluster command
cluster verbose t1;

-- only cluster command
cluster verbose;

-- vacuum full
vacuum full t1;

-- vacuum full
vacuum full;




Thanks,
Tatsuro Yamada




Re: speeding up planning with partitions

2019-03-04 Thread Amit Langote
Hi,

On 2019/03/01 22:01, Amit Langote wrote:
> Please find attached updated patches.  Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I realized when "fine-tuning" that the patch 0001 contained too many
changes that seem logically separable.  I managed to divide it into the
following patches, which also amounts to a much shorter overall diff
against the master.  Also, the smaller individual patches made it easier
to spot a lot of useless diffs of inherit.c.

Attached patches are as follows:

1. Create the "otherrel" RelOptInfos of the appendrels as a separate step
of query_planner.  Newly added function add_other_rels_to_query() which
adds "other rel" RelOptInfos of child relations is called after
query_planner has finished distributing restrict clauses to baserels.
Child relations in this case include both those of flattened UNION ALL
subqueries and inheritance child tables.  Child RangeTblEntrys and
AppendRelInfos are added early in subquery_planner for both types of child
relations.  Of the two, we'd like to delay adding the inheritance
children, which is done in the next patch.

See patch 0001.

2. Defer inheritance expansion to add_other_rels_to_query().  Although the
purpose of doing this is to perform partition pruning before adding the
children, this patch doesn't change when the pruning occurs.  It deals
with other issues that must be taken care of due to adding children during
query_planner instead of during subquery_planner.  Especially,
inheritance_planner now has to add the child target relations on its own.
Also, delaying adding children also affects adding junk columns to the
query's targetlist based on PlanRowMarks, because preprocess_targetlist
can no longer finalize which junk columns to add for a "parent"
PlanRowMark; that must be delayed until all child PlanRowMarks are added
and their allMarkTypes propagated to the parent PlanRowMark.

See patch 0002.

3. Because inheritance_planner calls query_planner separately for each
target child relation and the patch 0002 above puts query_planner in
charge of inheritance expansion, that means child tables of source
inheritance sets will be added as many times as there are target children.
 This makes update/delete queries containing inherited source tables
somewhat slow.  This patch adjusts inheritance_planner to reuse source
inheritance children added during the planning of 1st child query for the
planning of subsequent child queries.

See patch 0003.

4. Now that all the issues arising due to late inheritance expansion have
been taken care of, this patch moves where partition pruning occurs.
Today it's in set_append_rel_size() and this patch moves it to
expand_partitioned_rtentry() where partitions are added.  Only the
partitions remaining after pruning are added, so some members of part_rels
can remain NULL.  Some of the places that access partition RelOptInfos
using that array needed to be made aware of that.

See patch 0004.

5. There are a few places which loop over *all* members of part_rels array
of a partitioned parent's RelOptInfo to do something with the partition
rels.   Some of those loops run even for point-lookup queries where only
one partition would be accessed, which is inefficient.  This patch adds a
Bitmapset member named 'live_parts' to RelOptInfo, whose value is the set
of indexes of unpruned partitions in the parent's RelOptInfo.  The
aforementioned loops are now replaced by looping over live_parts Bitmapset
instead.

See patch 0005.

6. set_relation_partition_info today copies the PartitionBoundInfo from
the relcache using partition_bounds_copy.  Doing partition_bounds_copy
gets expensive as the number of partitions increases and it really doesn't
seem necessary for the planner to create its own copy.  This patch removes
the partition_bounds_copy() and simply uses the relcache pointer.

See patch 0006.

Hope that the above division makes the changes easier to review.

Thanks,
Amit
From 1d83debe07040a93721f17785d075d26dab6f088 Mon Sep 17 00:00:00 2001
From: Amit 
Date: Sat, 2 Mar 2019 14:13:13 +0900
Subject: [PATCH v26 1/6] Build "other rels" of appendrel baserels in a
 separate step

add_other_rels_to_query() runs after query_planner has distributed
restrict clauses to base relations.  This will allow us to use
the clauses applied to a given partitioned baserel to perform partition
pruning, and add other rels for only the unpruned partitions.
---
 src/backend/optimizer/path/allpaths.c  |   2 +-
 src/backend/optimizer/plan/initsplan.c |  53 +--
 src/backend/optimizer/plan/planmain.c  |  15 +++--
 src/backend/optimizer/util/relnode.c   | 118 +
 src/include/optimizer/pathnode.h   |   2 +
 src/include/optimizer/planmain.h   |   1 +
 6 files changed, 140 insertions(+), 51 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..8d8a8f17d5 100644
---

Re: speeding up planning with partitions

2019-03-04 Thread Amit Langote
Imai-san,

Thanks for the review.

On 2019/03/04 18:14, Imai, Yoshikazu wrote:
> I've taken at glance the codes and there are some minor comments about the 
> patch.
> 
> * You changed the usage/arguments of get_relation_info, but why did you do it? 
> I reverted those changes to the original code and checked that it passes make 
> check. So ISTM there are no logical problems with not changing it. Or if you 
> do change it, how about also changing the usage/arguments of 
> get_relation_info_hook in the same way?
> 
> -get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
> -   RelOptInfo *rel)
> +get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
>  {
> +   bool        inhparent = rte->inh;
> - relation = table_open(relationObjectId, NoLock);
> + relation = heap_open(rte->relid, NoLock);
>   ...
>   if (get_relation_info_hook)
> - (*get_relation_info_hook) (root, relationObjectId, inhparent, 
> rel);
> + (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);
> 
> 
> @@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, 
> RelOptInfo *parent)
>   /* Check type of rtable entry */
>   switch (rte->rtekind)
>   {
>   case RTE_RELATION:
>   /* Table --- retrieve statistics from the system 
> catalogs */
> - get_relation_info(root, rte->relid, rte->inh, rel);
> + get_relation_info(root, rte, rel);
> 
> 
> * You moved the code that initializes an append rel's partitioned_child_rels 
> from set_append_rel_size() to build_simple_rel(), but is that better? I also 
> checked that the original code passes make check, as above.
> 
> @@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
>   Assert(IS_SIMPLE_REL(rel));
>  
>   /*
> -  * Initialize partitioned_child_rels to contain this RT index.
> -  *
> -  * Note that during the set_append_rel_pathlist() phase, we will bubble 
> up
> -  * the indexes of partitioned relations that appear down in the tree, so
> -  * that when we've created Paths for all the children, the root
> -  * partitioned table's list will contain all such indexes.
> -  */
> - if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> - rel->partitioned_child_rels = list_make1_int(rti);
> 
> @@ -274,55 +327,287 @@ build_simple_rel(PlannerInfo *root, int relid, 
> RelOptInfo *parent)
>   
> list_length(rte->securityQuals));
>  
>   /*
> -  * If this rel is an appendrel parent, recurse to build "other rel"
> -  * RelOptInfos for its children.  They are "other rels" because they are
> -  * not in the main join tree, but we will need RelOptInfos to plan 
> access
> -  * to them.
> +  * Add the parent to partitioned_child_rels.
> +  *
> +  * Note that during the set_append_rel_pathlist() phase, values of the
> +  * indexes of partitioned relations that appear down in the tree will
> +  * be bubbled up into root parent's list so that when we've created
> +  * Paths for all the children, the root table's list will contain all
> +  * such indexes.
>*/
> - if (rte->inh)
> + if (rel->part_scheme)
> + rel->partitioned_child_rels = list_make1_int(relid);

Neither of these changes is present in the latest patches I posted,
where I also got rid of a lot of unnecessary diffs.

Thanks,
Amit




Re: libpq debug log

2019-03-04 Thread Kyotaro HORIGUCHI
Hello.

I came up with some random comments.

At Mon, 4 Mar 2019 08:13:00 +, "Iwata, Aya"  
wrote in <71E660EB361DF14299875B198D4CE5423DEF1844@g01jpexmbkw25>
> Hi,
> 
> Thank you for your comments and advice.
> 
> I'd like to consider your suggestions.
> I am planning to change libpq logging like this;
> 
> 1. Expand PQtrace() facility and improve libpq logging.
> 
> 2. Prepare "output level". There are 3 types of levels;
>   - TRADITIONAL   :  if set, outputs protocol messages
>   - LEVEL1        :  if set, outputs phase and time
>   - LEVEL2        :  if set, outputs both TRADITIONAL and LEVEL1 information

You appear to want to segregate the "traditional" output from
what you want to emit. At least Tom is explicitly suggesting
throwing away the hypothetical use cases for it. You may sort out
what kind of information you/we want to emit as log messages from
scratch :p

You may organize log kinds into hierarchical levels like server
log messages, or into orthogonal types that are individually
turned on. But it does not necessarily have to be a parameter of
a logging processor (mentioned below).

> 3. Add "output phase" information; this is the processing flow (e.g. when 
> PQexec() starts and ends).

What is the advantage of it against just two independent messages
like PQexec_start and PQexec_end? (I don't see any advantage.)

> 4. Change the output method to a callback style; the default callback writes 
> to stdout, and other callback functions that will be used more frequently 
> will also be prepared.

Are you going to make something less-used the default behavior? I
think no one is opposing rich functionality as long as it is
replaceable.

> 5. Initialization method;
> In current one: PQtrace(PGconn *conn, FILE *stream);
> Proposed change: PQtraceEx(PGconn *conn, FILE *stream, PQloggingProcessor 
> callback_func , void *callback_arg, PGLogminlevel level);

I'm not sure we should add a new *EX() function. Couldn't we
just change the signature of PQtrace()?

callback_func seems to be a single function. I think it's better
to have an individual function for each message type.  Not
callback_func(PQLOG_EXEC_START, param_1, param_2, ...), but
PQloggingProcessor.PQexec_start(param_1, param_2, ...).

That is because the caller can simply pass values of its own types
to the function without casts or other transformations, and the
types are checked statically.
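
A rough sketch of that struct-of-callbacks shape might look like the following; 
every type and field name here is hypothetical, meant only to illustrate how 
statically typed per-message callbacks could replace a single variadic 
callback_func().

    #include <stddef.h>

    typedef struct PQloggingProcessor
    {
        void   *callback_arg;   /* logger-specific state, e.g. a FILE * */

        /* one strongly typed hook per event, so no casting is needed */
        void    (*PQexec_start) (void *arg, const char *query);
        void    (*PQexec_end) (void *arg, double elapsed_ms);
        void    (*protocol_message) (void *arg, char msg_type,
                                     const char *data, size_t len);
    } PQloggingProcessor;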

I also think it's better that logger-specific parameters are not
passed at this level.  Maybe stream and level are logger-specific
parameters, which can be combined into callback_arg, or can be
given via an API function.

> PQtrace() can be used as it is to preserve compatibility with existing 
> applications, so I leave PQtrace() in place and add a new function 
> PQtraceEx(). 
> 
> After discussing the above, maybe we can then discuss more about enabling 
> trace output and changing the output style.

I'm not sure what you mean by "output style", but you can change
everything by replacing the whole callback processor, which may
be a dynamically loaded file which is loaded by an instruction in
both ~/.libpqrc and some API, like PQloadLoggingProcessor()?


> What do you think? I would appreciate your comments and suggestions.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Minimal logical decoding on standbys

2019-03-04 Thread tushar

On 03/01/2019 11:16 PM, Andres Freund wrote:

So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?


Ok, this time around I performed it like this -

.) Master cluster (set wal_level=logical and hot_standby_feedback=on in 
postgresql.conf), start the server and create a physical replication slot


postgres=# SELECT * FROM 
pg_create_physical_replication_slot('decoding_standby');

    slot_name | lsn
--+-
 decoding_standby |
(1 row)

.) Perform pg_basebackup using --slot=decoding_standby with option -R. 
Modify port= , start the server


.) Connect to the slave and create a logical replication slot

postgres=# create table t(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction
postgres=#

postgres=# SELECT * FROM 
pg_create_logical_replication_slot('standby_slot', 'test_decoding');

  slot_name   |    lsn
--+---
 standby_slot | 0/260
(1 row)

Run pgbench (./pgbench -i -s 10 postgres) against the master and 
simultaneously start pg_recvlogical, providing port= (slave 
cluster) and specifying slot=standby_slot:
./pg_recvlogical -d postgres  -p  -s 1 -F 1  -v --slot=standby_slot  
--start -f -



[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres  -p  -s 1 -F 
1  -v --slot=standby_slot  --start -f -

pg_recvlogical: starting log streaming at 0/0 (slot standby_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot standby_slot)
pg_recvlogical: confirming write up to 0/30194E8, flush to 0/30194E8 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/301D558, flush to 0/301D558 
(slot standby_slot)

BEGIN 476
COMMIT 476
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)

BEGIN 477
COMMIT 477

If we do the same for the logical replication slot which was created on 
the master cluster:


[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres  -s 1 -F 1 -v 
--slot=master_slot  --start -f -

pg_recvlogical: starting log streaming at 0/0 (slot master_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot master_slot)
table public.pgbench_accounts: INSERT: aid[integer]:65057 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65058 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65059 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65060 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65061 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public

Re: Minimal logical decoding on standbys

2019-03-04 Thread tushar

On 03/04/2019 04:54 PM, tushar wrote:
.) Perform pg_basebackup using --slot=decoding_standby with option -R. 
Modify port= , start the server 


set primary_slot_name = 'decoding_standby' in the postgresql.conf file 
of the slave.


--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: amcheck verification for GiST

2019-03-04 Thread Heikki Linnakangas

On 04/03/2019 17:53, Heikki Linnakangas wrote:

I tested this patch with your testing patch from the other thread (after
fixing the above), to leave behind incompletely split pages [1]. It
seems that the amcheck code doesn't expect incomplete splits:

postgres=# SELECT gist_index_parent_check('x_c_idx');
ERROR:  index "x_c_idx" has inconsistent records


On closer look, I think that was because that testing patch to leave 
behind incomplete splits really did corrupt the index. It always 
inserted the downlink to the parent, but randomly skipped clearing the 
FOLLOW_RIGHT flag and updating the NSN in the child. That's not a valid 
combination. To test incomplete splits, you need to skip inserting the 
downlink to the parent, too.


- Heikki



Re: [HACKERS] Incomplete startup packet errors

2019-03-04 Thread Andrew Dunstan


On 3/3/19 3:52 PM, Tom Lane wrote:
> I wrote:
>> Andrew Dunstan  writes:
>>> Patch proposed by Christoph Berg is here:
>>> https://www.postgresql.org/message-id/20190228151336.GB7550%40msg.df7cb.de
>> Meh.  That doesn't silence only the zero-bytes case, and I'm also
>> rather afraid of the fact that it's changing COMMERROR to something
>> else.  I wonder whether (if client_min_messages <= DEBUG1) it could
>> result in trying to send the error message to the already-lost
>> connection.  It might be that that can't happen, but I think a fair
>> amount of rather subtle (and breakable) analysis may be needed.
> Concretely, what about doing the following instead?  This doesn't provide
> any mechanism for the DBA to adjust the logging behavior; but reducing
> log_min_messages to DEBUG1 would not be a very pleasant way to monitor for
> zero-data connections either, so I'm not that fussed about just dropping
> the message period for that case.  I kind of like that we no longer need
> the weird special case for SSLdone.
>
>   



Looks good to me.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: SQL statement PREPARE does not work in ECPG

2019-03-04 Thread Matsumura, Ryo
Hi Meskes-san


Thank you for your advice.

I attach a patch.
I didn't add additional tests to regression yet.


The patch allows the following:

  exec sql prepare st(int) as select $1;
  exec sql execute st(1) into :out;

  exec sql prepare st(text, text) as select $1 || $2;
  exec sql execute st('aaa', 'bbb') into :out;

But it doesn't allow using a host variable in the parameter clause of an 
EXECUTE statement, like the following.
I'm afraid that's not useful. I will research the standard and other RDBMSs.
If you have some information, please advise me.

  exec sql begin declare section;
  int var;
  exec sql end declare section;

  exec sql prepare st(int) as select $1;
  exec sql execute st(:var) into :out;

  SQL error: bind message supplies 1 parameters, but prepared statement "" 
requires 0



Let me explain the patch.

* PREPARE FROM or PREPARE AS without type clause
  It uses PQprepare(). It's not changed.

  [Preprocessor output]
  /* exec sql prepare st from "select ?"; */
  { ECPGprepare(__LINE__, NULL, 0, "st", "select ?");

  /* exec sql prepare st as select 1; */
  { ECPGprepare(__LINE__, NULL, 0, "st", " select 1 ");


* PREPARE AS with type clause
  It doesn't use PQprepare() but uses PQexecParams().

  [Preprocessor output]
  /* exec sql prepare st(text, text) as select $1 || '@2'; */
  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "prepare \"st\" ( text , 
text ) as select @1 || '@2'", ECPGt_EOIT, ECPGt_EORT);

  $1 in the AS clause is replaced by the preprocessor at the ecpg_param rule.
  @1 is replaced with $1 by ecpglib at the end of ecpg_build_params().


* EXECUTE without type clause
  It uses PQexecPrepared(). It's not changed.

  [Preprocessor output]
  /* exec sql execute st into :ovar using :var; */
  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "st",
  ECPGt_int,&(var),(long)1,.

* EXECUTE with parameter clause
  It uses PQexecParams().

  [Preprocessor output]
  /* exec sql execute st('abcde') into :ovar_s; */
  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "execute \"st\" ( 'abcde' 
)", ECPGt_EOIT,
  .

This approach causes the above constraint that users cannot use host variables 
in the parameter clause of an EXECUTE statement, because ecpglib sends a 'P' 
message with "execute \"st\" ($1)" and a 'B' message with one parameter, but 
the backend always regards the number of parameters of an EXECUTE statement as 
zero.
I don't have any other idea...


Regards
Ryo Matsumura


ecpg_prepare_as_v1_0.patch
Description: ecpg_prepare_as_v1_0.patch


Re: [HACKERS] Incomplete startup packet errors

2019-03-04 Thread Christoph Berg
Re: Andrew Dunstan 2019-03-04 
<7cc6d2c1-bd87-9890-259d-36739c247...@2ndquadrant.com>
> Looks good to me.

+1.

Christoph



Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 02:47, Andrey Borodin wrote:

On 2 Jan 2019, at 20:35, Heikki Linnakangas wrote:

In patch #1, to do the vacuum scan in physical order:
...
I think this is ready to be committed, except that I didn't do any testing. We 
discussed the need for testing earlier. Did you write some test scripts for 
this, or how have you been testing?

Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, simulating a crash of an 
insert. This fills the logs with "fixing incomplete split" messages. Search for "REMOVE 
THIS" to disable these ill-behavior triggers.
To trigger the NSN jump, GiST allocates an empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).


Thanks! As I noted at 
https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, 
the test patch left the index corrupt. I fixed it so that it leaves 
behind incompletely split pages, without the corruption, see attached. 
It's similar to yours, but let me recap what it does:


* Hacks gistbuild() to create 100 empty pages immediately after the root 
page. They are leaked, so they won't be reused until a VACUUM sees 
them and puts them in the FSM


* Hacks gistinserttuples() to leave the split incomplete with 50% 
probability


* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to 
yours, but it also tries to verify that the index returns correct 
results. It prints a result set like this:


   sum
-
 -364450
  364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should 
see "RESCAN" lines in the log, to show that jumping left happened. 
Because the behavior is random and racy, you may need to run the script 
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED 
BY FollowRight" cases. Especially since the "FollowRight" case happens less 
frequently than the NSN case, you might need to run the script > 10 
times to see it.


I also tried your amcheck tool with this. It did not report any errors.

Attached is also the latest version of the patch itself. It is the same as 
your latest patch v19, except for some tiny comment kibitzing. I'll mark 
this as Ready for Committer in the commitfest app, and will try to 
commit it in the next couple of days.


- Heikki


gist-vacuum-test.sh
Description: application/shellscript
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..cad4a2a46e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1187,6 +1187,8 @@ gistinserttuple(GISTInsertState *state, GISTInsertStack *stack,
 			InvalidBuffer, InvalidBuffer, false, false);
 }
 
+static bool HACK = false;
+
 /* 
  * An extended workhorse version of gistinserttuple(). This version allows
  * inserting multiple tuples, or replacing a single tuple with multiple tuples.
@@ -1230,6 +1232,14 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
 
 	/* Insert the tuple(s) to the page, splitting the page if necessary */
+	if (BufferIsValid(leftchild) && HACK)
+	{
+		/* skip actually inserting */
+		splitinfo = NULL;
+		is_split = false;
+	}
+	else
+	{
 	is_split = gistplacetopage(state->r, state->freespace, giststate,
 			   stack->buffer,
 			   tuples, ntup,
@@ -1238,6 +1248,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 			   &splitinfo,
 			   true,
 			   state->heapRel);
+	}
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1256,7 +1267,12 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	 * didn't have to split, release it ourselves.
 	 */
 	if (splitinfo)
+	{
+		if (random() % 2 == 0)
+			HACK = true;
 		gistfinishsplit(state, stack, giststate, splitinfo, unlockbuf);
+		HACK = false;
+	}
 	else if (unlockbuf)
 		LockBuffer(stack->buffer, GIST_UNLOCK);
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index bd142a3560..fdfc54b009 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -201,6 +201,9 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	for (int i = 0; i < 100; i++)
+		ReleaseBuffer(ReadBuffer(index, P_NEW));
+
 	/*
 	 * Do the heap scan.
 	 */
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 1cbcb038f7..c306621e1e 100644
--- a/src/backend/a

Re: Online verification of checksums

2019-03-04 Thread Tomas Vondra



On 3/4/19 4:09 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
> 
> In what is that different from VACUUM or a sequential scan?  It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact Postgres shared buffer cache.
> --

But Fabien was talking about the page cache, not shared buffers. And we
can't use a custom ring buffer there. OTOH I don't see why accessing the
file through a SQL function would behave any differently from direct
access (i.e. what the tool does now).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: \describe*

2019-03-04 Thread Ibrar Ahmed
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Thanks for the patch. I have reviewed it and have some comments. The review 
includes testing of the patch along with some code review. 

Here are my testing results: 

- Tab completion for \describe-verbose.
I know that \d+ tab completion is also not there, but I think we must have tab 
completion for \describe-verbose.

postgres=# \describe-
\describe-extension   \describe-replication-publication 
\describe-user-mapping
\describe-foreign-data-wrapper\describe-replication-subscription
\describe-view
\describe-foreign-server  \describe-role
\describe-window-function
\describe-foreign-table   \describe-rule  
 ...


- Error message in each command.
There is an error message after each command; here is an example. 
postgres=# \describe
List of relations
 Schema | Name | Type  |  Owner  
+--+---+-
 public | foo  | table | vagrant

(1 row)
Invalid command \describe. Try \? for help.


I think this status is causing the problem.



+   /* standard listing of interesting things */
+   success = listTables("tvmsE", NULL, show_verbose, show_system);
+   }
+   status = PSQL_CMD_UNKNOWN;




- Confusion about \desc and \desC
There is confusion while running the \desc command. I know the problem, but the 
user may be confused by this.
postgres=# \desC
   List of foreign servers
 Name | Owner | Foreign-data wrapper 
--+---+--
(0 rows)

postgres=# \desc
Invalid command \desc. Try \? for help.

- Auto-completion of commands.
There is some more confusion in the completion of commands. 

This command shows List of aggregates.
postgres=# \describe-aggregate-function 
 List of aggregate functions
 Schema | Name | Result data type | Argument data types | Description 
+--+--+-+-
(0 rows)



This command shows a list of relations ("\d").
postgres=# \describe-aggregatE-function 
List of relations
 Schema | Name | Type  |  Owner  
+--+---+-
 public | foo  | table | vagrant
(1 row)

This command also shows a list of relations ("\d").
postgres=# \describe-aggr 
List of relations
 Schema | Name | Type  |  Owner  
+--+---+-
 public | foo  | table | vagrant
(1 row)

This command shows error messages.
postgres=# \descr
Invalid command \descr. Try \? for help.

...


Code review. 
-

I have done a brief code review except for the documentation code. I don't like 
this code 

if (cmd_match(cmd, "describe-aggregate-function"))
    success = describeAggregates(pattern, show_verbose, show_system);
else if (cmd_match(cmd, "describe-access-method"))
    success = describeAccessMethods(pattern, show_verbose);
else if (cmd_match(cmd, "describe-tablespace"))
    success = describeTablespaces(pattern, show_verbose);
else if (cmd_match(cmd, "describe-conversion"))
    success = listConversions(pattern, show_verbose, show_system);
else if (cmd_match(cmd, "describe-cast"))
    success = listCasts(pattern, show_verbose


This can be achieved with a list/array/hash table, so I have changed that 
code in the attached patch just as a sample; if you want, I can do that for the 
whole code.
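
As a rough sketch of that table-driven approach (illustrative only, not taken
from the attached patch; the real describe.c handlers have differing
signatures, so each entry goes through a thin wrapper, and only two entries
are shown):

	typedef bool (*describe_handler) (const char *pattern,
									  bool show_verbose, bool show_system);

	static bool
	describe_aggregates_cmd(const char *pattern, bool show_verbose, bool show_system)
	{
		return describeAggregates(pattern, show_verbose, show_system);
	}

	static bool
	describe_access_methods_cmd(const char *pattern, bool show_verbose, bool show_system)
	{
		/* describeAccessMethods() has no "system" flag, so it is ignored */
		return describeAccessMethods(pattern, show_verbose);
	}

	static const struct
	{
		const char *name;
		describe_handler handler;
	} describe_commands[] = {
		{"describe-aggregate-function", describe_aggregates_cmd},
		{"describe-access-method", describe_access_methods_cmd},
		/* ... one entry per remaining \describe-* variant ... */
	};

	/* lookup loop replacing the if/else chain */
	for (int i = 0; i < lengthof(describe_commands); i++)
	{
		if (cmd_match(cmd, describe_commands[i].name))
		{
			success = describe_commands[i].handler(pattern, show_verbose, show_system);
			break;
		}
	}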

-- 
Ibrar Ahmed

The new status of this patch is: Waiting on Author


Re: Ordered Partitioned Table Scans

2019-03-04 Thread Antonin Houska
David Rowley  wrote:

> On Mon, 5 Nov 2018 at 10:46, David Rowley  
> wrote:
> > On 1 November 2018 at 22:05, Antonin Houska  wrote:
> > > I think these conditions are too restrictive:
> > >
> > > /*
> > >  * Determine if these pathkeys match the partition order, or 
> > > reverse
> > >  * partition order.  It can't match both, so only go to the 
> > > trouble of
> > >  * checking the reverse order when it's not in ascending partition
> > >  * order.
> > >  */
> > > partition_order = pathkeys_contained_in(pathkeys,
> > > partition_pathkeys);
> > > partition_order_desc = !partition_order &&
> > > pathkeys_contained_in(pathkeys,
> > > 
> > > partition_pathkeys_desc);
> > >
> 
> > The problem with doing that is that if the partition keys are better
> > than the pathkeys then we'll most likely fail to generate any
> > partition keys at all due to lack of any existing eclass to use for
> > the pathkeys. It's unsafe to use just the prefix in this case as the
> > eclass may not have been found due to, for example one of the
> > partition keys having a different collation than the required sort
> > order of the query. In other words, we can't rely on a failure to
> > create the pathkey meaning that a more strict sort order is not
> > required.
> 
> I had another look at this patch and it seems okay just to add a new
> flag to build_partition_pathkeys() to indicate if the pathkey List was
> truncated or not.  In generate_mergeappend_paths() we can then just
> check that flag before checking if the partiiton pathkeys are
> contained in pathkeys. It's fine if the partition keys were truncated
> for the reverse of that check.
> 
> I've done this in the attached and added additional regression tests
> for this case.

I agree with this approach and I'm also fine with your other comments / code
changes to address my review.

As for the latest version (v8-0001-...) I've only caught a small typo: "When
the first subpath needs sorted ...". It was probably meant "... needs sort
...".

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: \describe*

2019-03-04 Thread Ibrar Ahmed
Hi Corey,

Here is the modified patch (sample).



On Mon, Mar 4, 2019 at 7:02 PM Ibrar Ahmed  wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Thanks for the patch, I have reviewed the patch and have some comments
> about the patch. The review includes the testing of the patch along with
> some code review.
>
> Here are my testings results,
>
> - Tab completion for \describe-verbose.
> I know that \d+ tab completion is also not there, but I think we must have
> tab completion for \describe-verbose.
>
> postgres=# \describe-
> \describe-extension
>  \describe-replication-publication \describe-user-mapping
> \describe-foreign-data-wrapper
> \describe-replication-subscription\describe-view
> \describe-foreign-server  \describe-role
>   \describe-window-function
> \describe-foreign-table   \describe-rule
>  ...
>
>
> - Error message in each command.
> There is an error message after each command, here is the example.
> postgres=# \describe
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+-
>  public | foo  | table | vagrant
>
> (1 row)
> Invalid command \describe. Try \? for help.
>
>
> I think this status is causing the problem.
>
>
>
> +   /*
> standard listing of interesting things */
> +   success =
> listTables("tvmsE", NULL, show_verbose, show_system);
> +   }
> +   status = PSQL_CMD_UNKNOWN;
>
>
>
>
> - Confusion about \desc and \desC
> There is confusion while running the \desc command. I know the problem,
> but the user may confuse by this.
> postgres=# \desC
>List of foreign servers
>  Name | Owner | Foreign-data wrapper
> --+---+--
> (0 rows)
>
> postgres=# \desc
> Invalid command \desc. Try \? for help.
>
> - Auto-completion of commands.
> There is some more confusion in the completion of commands.
>
> This command shows List of aggregates.
> postgres=# \describe-aggregate-function
>  List of aggregate functions
>  Schema | Name | Result data type | Argument data types | Description
> +--+--+-+-
> (0 rows)
>
>
>
> This command shows a list of relation "\d"
> postgres=# \describe-aggregatE-function
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+-
>  public | foo  | table | vagrant
> (1 row)
>
> This command also shows a list of relations "\d".
> postgres=# \describe-aggr
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+-
>  public | foo  | table | vagrant
> (1 row)
>
> This command shows error messages.
> postgres=# \descr
> Invalid command \descr. Try \? for help.
>
> ...
>
>
> Code review.
> -
>
> I have done a brief code review except for the documentation code. I don't
> like this code
>
> if (cmd_match(cmd,"describe-aggregate-function"))
>
>  success = describeAggregates(pattern, show_verbose, show_system);
>  else if (cmd_match(cmd,
> "describe-access-method"))
>  success = describeAccessMethods(pattern,
> show_verbose);
>  else if (cmd_match(cmd,
> "describe-tablespace"))
>  success = describeTablespaces(pattern,
> show_verbose);
>  else if (cmd_match(cmd,
> "describe-conversion"))
>  success = listConversions(pattern,
> show_verbose, show_system);
>  else if (cmd_match(cmd, "describe-cast"))
>  success = listCasts(pattern, show_verbose
>
>
> This can be achieved with the list/array/hash table, so I have changed
> that code in the attached patch just for a sample if you want I can do that
> for whole code.
>
> --
> Ibrar Ahmed
>
> The new status of this patch is: Waiting on Author
>


-- 
Ibrar Ahmed


0001-Add-describe-commands-to-compliment-d-commands-ibrar-v2.patch
Description: Binary data


Re: Online verification of checksums

2019-03-04 Thread Tomas Vondra




On 3/4/19 2:00 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote:
>> You and Andres may be right that trying to verify checksums online
>> without close interaction with the server is ultimately futile (or at
>> least overly complex). But I'm not sure those issues (torn pages and
>> partial reads) are very good arguments, considering basebackup has to
>> deal with them too. Not sure.
> 
> FWIW, I don't think that the backend is right in its way of checking
> checksums the way it does currently either with warnings and a limited
> set of failures generated.  I raised concerns about that unfortunately
> after 11 has been GA'ed, which was too late, so this time, for this
> patch, I prefer raising them before the fact and I'd rather not spread
> this kind of methodology around the core code more and more.

I still don't understand what issue you see in how basebackup verifies
checksums. Can you point me to the explanation you've sent after 11 was
released?

> I work a lot with virtualization, and I have seen ESX hanging around
> I/O requests from time to time depending on the environment used
> (which is actually wrong, anyway, but a lot of tests happen on a
> daily basis on the stuff I work on).  What's presented on this thread
> is *never* going to be 100% safe, and would generate false positives
> which can be confusing for the user.  This is not a good sign.

So you have a workload/configuration that actually results in data
corruption yet we fail to detect that? Or we generate false positives?
Or what do you mean by "100% safe" here?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: proposal: variadic argument support for least, greatest function

2019-03-04 Thread David Steele

On 3/1/19 3:59 AM, Chapman Flack wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

For completeness, I'll mark this reviewed again. It passes installcheck-world, 
the latest patch addresses the suggestions from me, and is improved on the 
code-structure matters that Tom pointed out. I don't know if it will meet Tom's 
threshold for desirability overall, but that sounds like a committer call at 
this point, so I'll change it to RfC.


Both committers who have looked at this patch (Tom, and Andres in his 
patch roundup [1]) recommend that it be rejected.


If no committer steps up in the next week I think we should mark it as 
rejected.


Regards,
--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/20190214203752.t4hl574k6jlu4t25%40alap3.anarazel.de




Re: insensitive collations

2019-03-04 Thread Daniel Verite
Peter Eisentraut wrote:

[v7-0001-Collations-with-nondeterministic-comparison.patch]

+GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
+	if (collation && !lc_ctype_is_c(collation) && collation != DEFAULT_COLLATION_OID)
+	{
+		pg_locale_t	locale = pg_newlocale_from_collation(collation);
+
+		if (locale && !locale->deterministic)
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("nondeterministic collations are not supported for LIKE")));
+	}

This test gets side-stepped when pattern_fixed_prefix() in selfuncs.c
returns Pattern_Prefix_Exact, and the code optimizes the operation by
converting it to a bytewise equality test, or a bytewise range check
in the index with Pattern_Type_Prefix.

Here's a reproducer:

===
create collation ciai (locale='und-u-ks-level1', deterministic=false,
provider='icu');

create table w(t text collate "C");

insert into w select md5(i::text) from generate_series(1,1) as i;
insert into w values('abc');

create index indexname on w(t );

select t from w where t like 'ABC' collate ciai;
 t 
---
(0 rows)

select t from w where t like 'ABC%' collate ciai;
 t 
---
(0 rows)

===

For the LIKE operator, I think the fix should be that like_fixed_prefix() 
should always return Pattern_Prefix_None for non-deterministic collations.
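
A sketch of that guard near the top of like_fixed_prefix() in selfuncs.c
(illustrative only; it assumes the patch exposes a per-collation determinism
lookup such as get_collation_isdeterministic(), and output-parameter handling
would follow the function's existing no-prefix path):

	/*
	 * Sketch of the suggested guard: a nondeterministic collation can treat
	 * strings as equal that are not bytewise equal, so no fixed prefix can
	 * safely be extracted from the pattern.
	 */
	if (OidIsValid(collation) && !get_collation_isdeterministic(collation))
		return Pattern_Prefix_None;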

For regular expressions, pg_set_regex_collation() is called at some
point when checking for a potential prefix, and since it errors out with
non-deterministic collations, this issue is taken care of already.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: pg_dump multi VALUES INSERT

2019-03-04 Thread Alvaro Herrera
On 2019-Mar-02, Fabien COELHO wrote:

> Although I'm all in favor of checking the int associated to the option, I do
> not think that it warrants three checks and messages. I would suggest to
> factor them all as just one check and one (terse) message.

I suggest ("rows-per-insert must be in range 1..%d", INT_MAX), like
extra_float_digits and compression level.
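
For illustration, a single check along those lines in pg_dump's option
handling might look like this (a sketch only; the variable name and the
surrounding error-reporting style are assumptions, not taken from the patch):

	/*
	 * Sketch only: one range check for --rows-per-insert, with an
	 * illustrative variable name; error reporting follows pg_dump's
	 * existing write_msg()/exit_nicely() style.
	 */
	char	   *endptr;
	long		rowsPerInsert;

	errno = 0;
	rowsPerInsert = strtol(optarg, &endptr, 10);
	if (endptr == optarg || *endptr != '\0' || errno == ERANGE ||
		rowsPerInsert < 1 || rowsPerInsert > INT_MAX)
	{
		write_msg(NULL, "rows-per-insert must be in range 1..%d\n", INT_MAX);
		exit_nicely(1);
	}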

> About the output: I'd suggest to indent one line per row, something like:
> 
>   INSERT INTO foo VALUES
> (..., ..., ..., ...),
> (..., ..., ..., ...),
> (..., ..., ..., ...);

+1.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: query logging of prepared statements

2019-03-04 Thread Arthur Zakirov

Hello Justin,

On 27.02.2019 21:06, Justin Pryzby wrote:

I'm attaching a v2 patch which avoids repeated logging of PREPARE, rather than
making such logs conditional on log_error_verbosity=VERBOSE, since
log_error_verbosity is documented to control whether these are output:
DETAIL/HINT/QUERY/CONTEXT/SQLSTATE.

I looked the patch. I got interesting result with different parameters.

With log_statement='all' and log_min_duration_statement='0' I get:

=# execute test_ins(3);
LOG:  statement: execute test_ins(3);
LOG:  duration: 17.283 ms

But with log_statement='none' and log_min_duration_statement='0' I get:

=# execute test_ins(3);
LOG:  duration: 8.439 ms  statement: execute test_ins(3);
DETAIL:  prepare: prepare test_ins (int) as
insert into test values ($1);

Is it intended? In the second result I got the query details.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Online verification of checksums

2019-03-04 Thread Magnus Hagander
On Mon, Mar 4, 2019 at 3:02 PM Tomas Vondra 
wrote:

>
>
> On 3/4/19 4:09 AM, Michael Paquier wrote:
> > On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> >> I agree that having a server function (extension?) to do a full checksum
> >> verification, possibly bandwidth-controlled, would be a good thing.
> However
> >> it would have side effects, such as interfering deeply with the server
> page
> >> cache, which may or may not be desirable.
> >
> > In what is that different from VACUUM or a sequential scan?  It is
> > possible to use buffer ring replacement strategies in such cases using
> > the normal clock-sweep algorithm, so that scanning a range of pages
> > does not really impact Postgres shared buffer cache.
> > --
>
> But Fabien was talking about page cache, not shared buffers. And we
> can't use custom ring buffer there. OTOH I don't see why accessing the
> file through SQL function would behave any differently than direct
> access (i.e. what the tool does now).
>

It shouldn't.

One other thought that I had around this though, which if it's been covered
before and I missed it, please disregard :)

The *online* version of the tool is very similar to running pg_basebackup
to /dev/null, is it not? Except it doesn't set the cluster to backup mode.
Perhaps what we really want is a simpler way to do *that*. That wouldn't
necessarily make it a SQL callable function, but it would be a CLI tool
that would call a command on a walsender for example.

(We'd of course still need the standalone tool for offline checks)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-04 Thread Rahila Syed
Hi Alvaro,


> On 2019-Feb-13, Amit Langote wrote:
>
> > Doesn't the name amphasename sound a bit too generic, given that it can
> > only describe the phases of ambuild?  Maybe ambuildphase?
>
> Hmm, yeah, maybe it does.  I renamed it "ambuildphasename", since it's
> not about reporting the phase itself -- it's about translating the phase
> number to the string that's reported to the user.
>
> The attached patch does it that way.  Also, when an index build uses an
> AM that doesn't support progress reporting, it no longer reports a NULL
> phase name while building.  I also changed it to report the progress of
> phase 7 (heap scan validation) using block numbers rather than tuple
> counts.  I also tweaked the strings reported in the view.  They're
> clearer now IMO.
>
> One slight annoyance is that when parallel workers are used, the last
> block number reported in phase 3/subphase 2 (IndexBuildHeapScan stuff)
> is not necessarily accurate, since the tail of the table could well be
> scanned by a worker that's not the leader, and we only report in the
> leader when it gets a new block.
>
> When the AM does not support progress reporting, everything stays as
> zeros during the index build phase.  It's easy to notice how slow hash
> indexes are to build compared to btrees this way!  Maybe it'd be
> better fallback on reporting block numbers in IndexBuildHeapScan when
> this happens.  Thoughts?
>
> I added docs to the monitoring section -- that's the bulkiest part of
> the patch.
>

1. Thank you for incorporating review comments.
Can you please rebase the latest
0001-Report-progress-of-CREATE-INDEX-operations.patch on master? Currently
it does not apply on 754b90f657bd54b482524b73726dae4a9165031c


>  15:56:44.694 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 |     0 | 0 | 0
>  15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |     0 | 0 | 0
>  15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |     0 | 0 | 0
>  15:56:44.727 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 79057 | 0 | 0
>

2. In the above report, even though we are reporting progress in terms of
tuples_done for final btree sort & load phase we have not cleared
the blocks_done entry from previous phases. I think this can be confusing
as the blocks_done does not correspond to the tuples_done in the current
phase.
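
For illustration, clearing the stale counters at the phase switch could look
roughly like this (a sketch only; the PROGRESS_CREATEIDX_* index names are
placeholders, not the patch's actual constants):

	/*
	 * Sketch: zero the block counters carried over from the heap-scan
	 * subphase when switching to the "final btree sort & load" phase.
	 * The PROGRESS_CREATEIDX_* names are illustrative placeholders.
	 */
	const int	prog_index[] = {PROGRESS_CREATEIDX_BLOCKS_TOTAL,
								PROGRESS_CREATEIDX_BLOCKS_DONE};
	const int64 prog_val[] = {0, 0};

	pgstat_progress_update_multi_param(2, prog_index, prog_val);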


-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-04 Thread Rahila

Hi Alvaro,

Resending the email as earlier one didn't get sent on pgsql-hackers.

On 2/23/19 3:24 AM, Alvaro Herrera wrote:

On 2019-Feb-13, Amit Langote wrote:


Doesn't the name amphasename sound a bit too generic, given that it can
only describe the phases of ambuild?  Maybe ambuildphase?

Hmm, yeah, maybe it does.  I renamed it "ambuildphasename", since it's
not about reporting the phase itself -- it's about translating the phase
number to the string that's reported to the user.

The attached patch does it that way.  Also, when an index build uses an
AM that doesn't support progress reporting, it no longer reports a NULL
phase name while building.  I also changed it to report the progress of
phase 7 (heap scan validation) using block numbers rather than tuple
counts.  I also tweaked the strings reported in the view.  They're
clearer now IMO.

One slight annoyance is that when parallel workers are used, the last
block number reported in phase 3/subphase 2 (IndexBuildHeapScan stuff)
is not necessarily accurate, since the tail of the table could well be
scanned by a worker that's not the leader, and we only report in the
leader when it gets a new block.

When the AM does not support progress reporting, everything stays as
zeros during the index build phase.  It's easy to notice how slow hash
indexes are to build compared to btrees this way!  Maybe it'd be
better fallback on reporting block numbers in IndexBuildHeapScan when
this happens.  Thoughts?

I added docs to the monitoring section -- that's the bulkiest part of
the patch.


1. Thank you for incorporating review comments.
Can you please rebase the latest 0001-Report-progress-of-
CREATE-INDEX-operations.patch on master? Currently it does not apply on 
754b90f657bd54b482524b73726dae4a9165031c

  15:56:44.694 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 |      0 | 0 | 0
  15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |      0 | 0 | 0
  15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |      0 | 0 | 0
  15:56:44.727 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 |  79057 | 0 | 0
  15:56:44.738 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 217018 | 0 | 0
  15:56:44.75  | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 353804 | 0 | 0
2. In the above report, even though we are reporting progress in terms 
of tuples_done for final btree sort & load phase we have not cleared
the blocks_done entry from previous phases. I think this can be 
confusing as the blocks_done does not correspond to the tuples_done in 
the final btree sort & load phase.


Thank you,
Rahila Syed




Re: speeding up planning with partitions

2019-03-04 Thread Jesper Pedersen

Hi Amit,

Passes check-world.

On 3/4/19 5:38 AM, Amit Langote wrote:

See patch 0001.



+* members of any appendrels we find here are added built later when

s/built//


See patch 0002.



+   grouping_planner(root, false, 0.0 /* retrieve all tuples */);

Move comment out of function call.

+   if (root->simple_rte_array[childRTindex])
+   elog(ERROR, "rel %d already exists", childRTindex);
+   root->simple_rte_array[childRTindex] = childrte;
+   if (root->append_rel_array[childRTindex])
+   elog(ERROR, "child relation %d already exists", 
childRTindex);
+   root->append_rel_array[childRTindex] = appinfo;


Could the "if"s be made into Assert's instead ?
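
For illustration, the suggested form would be roughly (sketch only, mirroring
the lines quoted above):

	/* sketch of the suggested Assert form, replacing the elog() checks */
	Assert(root->simple_rte_array[childRTindex] == NULL);
	root->simple_rte_array[childRTindex] = childrte;
	Assert(root->append_rel_array[childRTindex] == NULL);
	root->append_rel_array[childRTindex] = appinfo;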

+ * the newly added bytes with zero

Extra spaces

+   if (rte->rtekind == RTE_RELATION &&  !root->contains_inherit_children)

s/TAB/space


See patch 0003.



+* because they correspond to flattneed UNION ALL subqueries.  
Especially,

s/flattneed/flatten


See patch 0004.



+* no need to make any new entries, because anything that'd need those

Use "would" explicit

+ * this case, since it needn't be scanned.

, since it doesn't need to be scanned


See patch 0005.

See patch 0006.



I'll run some tests using a hash partitioned setup.

Best regards,
 Jesper



Re: psql show URL with help

2019-03-04 Thread Magnus Hagander
On Sun, Mar 3, 2019 at 10:48 PM David Fetter  wrote:

> On Sun, Mar 03, 2019 at 09:57:25PM +0100, Magnus Hagander wrote:
> > On Sun, Mar 3, 2019 at 7:14 PM David Fetter  wrote:
> >
> > > On Wed, Feb 27, 2019 at 09:14:59AM +0100, Peter Eisentraut wrote:
> > > > + url = psprintf("
> > > https://www.postgresql.org/docs/%s/%s.html";,
> > > > +
> > > strstr(PG_VERSION, "devel") ? "devel" : PG_MAJORVERSION,
> > > > +
> > > QL_HELP[i].docbook_id);
> > >
> > > Do we need to make sure that the docs are published under the major
> > > version as soon as we get to alpha, or do we need something more like
> > > this?
> > >
> > > url = psprintf("https://www.postgresql.org/docs/%s/%s.html";,
> > > (strstr(PG_VERSION, "devel") || strstr(PG_VERSION, "beta")
> ||
> > >  strstr(PG_VERSION, "alpha")) : "devel" : PG_MAJORVERSION,
> > > QL_HELP[i].docbook_id);
> > >
> >
> > We don't really release alphas any more. And we do load the documentation
> > alongside the betas. (Last time we did an alpha was so long ago I don't
> > remember if we loaded docs)
>
> If the first thing we do when we move from devel to some other state
> (beta, RC, etc.) is to publish the docs under the major version
> number, then maybe this test should be more along the lines of looking
> for anything that's neither devel nor a number, extract the number,
> and use that.
>

Well, alpha versions do go under the numeric URL. Whether we load the docs
at that time or not we can just choose -- but there is no reason not to. So
yeah, that sounds like it would work better.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Minimal logical decoding on standbys

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 16:54:32 +0530, tushar wrote:
> On 03/01/2019 11:16 PM, Andres Freund wrote:
> > So, if I understand correctly you do*not*  have a phyiscal replication
> > slot for this standby? For the feature to work reliably that needs to
> > exist, and you need to have hot_standby_feedback enabled. Does having
> > that fix the issue?
> 
> Ok, This time around  - I performed like this -
> 
> .)Master cluster (set wal_level=logical and hot_standby_feedback=on in
> postgresql.conf) , start the server and create a physical replication slot

Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).

Thanks,

Andres



Re: Problem with default partition pruning

2019-03-04 Thread Ibrar Ahmed
Hi 

The patch works fine for me, but I have one test case where the default partition is still 
scanned. 

postgres=# explain select * from test1 where (id < 10) and true;
QUERY PLAN 
---
 Append  (cost=0.00..55.98 rows=846 width=36)
   ->  Seq Scan on test1_1  (cost=0.00..25.88 rows=423 width=36)
 Filter: (id < 10)
   ->  Seq Scan on test1_def  (cost=0.00..25.88 rows=423 width=36)
 Filter: (id < 10)
(5 rows)

Re: Problem with default partition pruning

2019-03-04 Thread Ibrar Ahmed
Hi  Yuzuko Hosoya,

Ignore my last message; I think this is also a legitimate scan on the default
partition.


On Mon, Mar 4, 2019 at 10:29 PM Ibrar Ahmed  wrote:

> Hi
>
> Patch work fine to me, but I have one test case where default partition
> still scanned.
>
> postgres=# explain select * from test1 where (id < 10) and true;
> QUERY PLAN
> ---
>  Append  (cost=0.00..55.98 rows=846 width=36)
>->  Seq Scan on test1_1  (cost=0.00..25.88 rows=423 width=36)
>  Filter: (id < 10)
>->  Seq Scan on test1_def  (cost=0.00..25.88 rows=423 width=36)
>  Filter: (id < 10)
> (5 rows)



-- 
Ibrar Ahmed


Re: POC: converting Lists into arrays

2019-03-04 Thread Robert Haas
On Sun, Mar 3, 2019 at 1:29 PM Tom Lane  wrote:
> > My main observation was from when the expression evaluation was using
> > lists all over. List iteration overhead was very substantial there. But
> > that's not a problem anymore, because all of those are gone now due to
> > the expression rewrite.  I personally wasn't actually advocating for a
> > new list implementation, I was/am advocating that we should move some
> > tasks over to a more optimized representation.
>
> I doubt that you'll get far with that; if this experiment is anything
> to go by, it's going to be really hard to make the case that twiddling
> the representation of widely-known data structures is worth the work
> and bug hazards.

I'm befuddled by this comment.  Andres is arguing that we shouldn't go
do a blind search-and-replace, but rather change certain things, and
you're saying that's going to be really hard because twiddling the
representation of widely-known data structures is really hard.  But if
we only change certain things, we don't *need* to twiddle the
representation of a widely-known data structure.  We just add a new
one and convert the things that benefit from it, like I proposed
upthread (and promptly got told I was wrong).

I think the reason why you're not seeing a performance benefit is
because the problem is not that lists are generically a more expensive
data structure than arrays, but that there are cases when they are
more expensive than arrays.  If you only ever push/pop at the front,
of course a list is going to be better.  If you often look up elements
by index, of course an array is going to be better.  If you change
every case where the code currently uses a list to use something else
instead, then you're changing both the winning and losing cases.
Yeah, changing things individually is more work, but that's how you
get the wins without incurring the losses.

I think David's results go in this direction, too.  Code that was
written on the assumption that list_nth() is slow is going to avoid
using it as much as possible, and therefore no benefit is to be
expected from making it fast.  If the author had written the same code
assuming that the underlying data structure was an array rather than a
list, they might have picked a different algorithm which, as David's
results show, could be a lot faster in some cases.  But it's not
going to come from just changing the way lists work internally; it's
going to come from redesigning the algorithms that are using lists to
do something better instead, as Andres's example of linearized
expression evaluation also shows.
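
As a tiny illustration of that point (not from the thread; assumes an integer
List called lst):

	/*
	 * With the current cons-cell List, list_nth_int() walks i cells on
	 * every call, so this loop is O(n^2) overall; with an array-backed
	 * List the same loop would be O(n).
	 */
	int			total = 0;

	for (int i = 0; i < list_length(lst); i++)
		total += list_nth_int(lst, i);

	/* Code written with the cons-cell cost in mind iterates cells instead: */
	ListCell   *lc;

	foreach(lc, lst)
		total += lfirst_int(lc);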

> The cases I've been looking at suggest to me that we'd make far
> more progress on the excessive-palloc'ing front if we could redesign
> things to reduce unnecessary copying of parsetrees.  Right now the
> planner does an awful lot of copying because of fear of unwanted
> modifications of multiply-linked subtrees.  I suspect that we could
> reduce that overhead with some consistently enforced rules about
> not scribbling on input data structures; but it'd take a lot of work
> to get there, and I'm afraid it'd be easily broken :-(

I think that's a separate but also promising thing to attack, and I
agree that it'd take a lot of work to get there.  I don't think that
the problem with either parse-tree-copying or list usage is that no
performance benefits are to be had; I think it's that the amount of
work required to get those benefits is pretty large.  If it were
otherwise, somebody likely would have done it before now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Laurenz Albe
Masahiko Sawada wrote:
> Why do people want to just encrypt everything? For satisfying some
> security compliance?

I'd say that TDE primarily protects you from masked ninjas that
break into your server room and rip out the disks with your database
on them.

Or from people stealing your file system backups that you leave
lying around in public.

My guess is that this requirement almost always comes from security
departments that don't know a lot about the typical security threats
that databases face, or (worse) from lawmakers.

And these are probably the people who will insist that *everything*
is encrypted, even your commit log (unencrypted log? everyone can
read the commits?).

Yours,
Laurenz Albe




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Chris Howard



Or on your laptop



On 3/4/19 11:55 AM, Laurenz Albe wrote:

Masahiko Sawada wrote:

Why do people want to just encrypt everything? For satisfying some
security compliance?

I'd say that TDE primarily protects you from masked ninjas that
break into your server room and rip out the disks with your database
on them.

Or from people stealing your file system backups that you leave
lying around in public.

My guess is that this requirement almost always comes from security
departments that don't know a lot about the typical security threats
that databases face, or (worse) from lawmakers.

And these are probably the people who will insist that *everything*
is encrypted, even your commit log (unencrypted log? everyone can
read the commits?).

Yours,
Laurenz Albe









Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
Robert Haas  writes:
> I think the reason why you're not seeing a performance benefit is
> because the problem is not that lists are generically a more expensive
> data structure than arrays, but that there are cases when they are
> more expensive than arrays.  If you only ever push/pop at the front,
> of course a list is going to be better.  If you often look up elements
> by index, of course an array is going to be better.  If you change
> every case where the code currently uses a list to use something else
> instead, then you're changing both the winning and losing cases.

I don't think this argument is especially on-point, because what I'm
actually seeing is just that there aren't any list operations that
are expensive enough to make much of an overall difference in
typical queries.  To the extent that an array reimplementation
reduces the palloc traffic, it'd take some load off that subsystem,
but apparently you need not-typical queries to really notice.
(And, if the real motivation is aggregate palloc savings, then yes you
really do want to replace everything...)

> Yeah, changing things individually is more work, but that's how you
> get the wins without incurring the losses.

The concern I have is mostly about the use of lists as core infrastructure
in parsetree, plantree, etc data structures.  I think any idea that we'd
replace those piecemeal is borderline insane: it's simply not worth it
from a notational and bug-risk standpoint to glue together some parts of
those structures differently from the way other parts are glued together.

regards, tom lane



Re: Segfault when restoring -Fd dump on current HEAD

2019-03-04 Thread Andres Freund
Hi,

On 2019-02-27 09:32:17 -0300, Alvaro Herrera wrote:
> On 2019-Feb-27, Dmitry Dolgov wrote:
> > But I hope there are no objections if I'll then submit the original
> > changes with more consistent null handling separately to make decision
> > about them more consciously.
> 
> I think we should save such a patch for whenever we next update the
> archive version number, which could take a couple of years given past
> history.  I'm inclined to add a comment near K_VERS_SELF to remind
> whoever next patches it.

The pluggable storage patchset contains exactly that... I've attached
the precursor patch (CREATE ACCESS METHOD ... TYPE TABLE), and the patch
for pg_dump support. They need a bit more cleanup, but it might be
useful information for this thread.

One thing I want to bring up here rather than in the pluggable storage
thread is that currently the pg_dump support for access methods deals
with table access methods in a manner similar to the way we deal with
tablespaces. Instead of specifying the AM on every table creation, we
set the default AM when needed.  That makes it easier to adjust dumps.

But it does basically require breaking archive compatibility.  I
personally am OK with that, but I thought it might be worth discussing.

I guess we could try to avoid the compat issue by only increasing the
archive format if there actually are any non-default AMs, but to me that
doesn't seem like an improvement worthy of the necessary complications.

Greetings,

Andres Freund
>From 2f37df9c3da0eb9cd75709a35d898269b0da2a0c Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 17 Jan 2019 14:11:35 -0800
Subject: [PATCH v14 1/2] tableam: introduce + minimal infrastructure.

Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion: https://postgr.es/m/
---
 src/backend/access/heap/Makefile  |   2 +-
 src/backend/access/heap/heapam_handler.c  |  45 ++
 src/backend/access/table/Makefile |   2 +-
 src/backend/access/table/tableam.c|  18 +++
 src/backend/access/table/tableamapi.c | 172 ++
 src/backend/bootstrap/bootparse.y |   2 +
 src/backend/catalog/genbki.pl |   4 +
 src/backend/catalog/heap.c|  22 +++
 src/backend/catalog/index.c   |   1 +
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/amcmds.c |  28 ++--
 src/backend/commands/cluster.c|   1 +
 src/backend/commands/createas.c   |   1 +
 src/backend/commands/tablecmds.c  |  33 +
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/parser/gram.y | 100 -
 src/backend/rewrite/rewriteDefine.c   |   1 +
 src/backend/utils/adt/pseudotypes.c   |   1 +
 src/backend/utils/cache/relcache.c| 125 +++-
 src/backend/utils/misc/guc.c  |  12 ++
 src/bin/psql/describe.c   |  25 +++-
 src/bin/psql/settings.h   |   1 +
 src/bin/psql/startup.c|   8 +
 src/include/access/tableam.h  |  48 ++
 src/include/catalog/heap.h|   2 +
 src/include/catalog/pg_am.dat |   3 +
 src/include/catalog/pg_am.h   |   1 +
 src/include/catalog/pg_class.dat  |   8 +-
 src/include/catalog/pg_class.h|   2 +-
 src/include/catalog/pg_proc.dat   |  13 ++
 src/include/catalog/pg_type.dat   |   5 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |   1 +
 src/include/utils/rel.h   |  15 +-
 src/include/utils/relcache.h  |   3 +
 src/test/regress/expected/create_am.out   |  79 ++
 src/test/regress/expected/opr_sanity.out  |  19 ++-
 src/test/regress/expected/psql.out|  40 +
 src/test/regress/expected/type_sanity.out |   6 +-
 src/test/regress/pg_regress_main.c|   7 +-
 src/test/regress/sql/create_am.sql|  47 ++
 src/test/regress/sql/opr_sanity.sql   |  16 +-
 src/test/regress/sql/psql.sql |  15 ++
 src/test/regress/sql/type_sanity.sql  |   7 +-
 src/tools/pgindent/typedefs.list  |   1 +
 46 files changed, 872 insertions(+), 74 deletions(-)
 create mode 100644 src/backend/access/heap/heapam_handler.c
 create mode 100644 src/backend/access/table/tableam.c
 create mode 100644 src/backend/access/table/tableamapi.c
 create mode 100644 src/include/access/tableam.h

diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index eae36fdbf40..b2a017249b8 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/heap
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heapam.o  heapam_visibility.o hio.o pruneheap.o rewriteheap.o \
+OBJS = heapam.o heapam_handler.o heapam_visibility.o hio.o p

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Tomas Vondra
On 3/4/19 6:55 PM, Laurenz Albe wrote:
> Masahiko Sawada wrote:
>> Why do people want to just encrypt everything? For satisfying some
>> security compliance?
> 
> I'd say that TDE primarily protects you from masked ninjas that
> break into your server room and rip out the disks with your database
> on them.
> 
> Or from people stealing your file system backups that you leave
> lying around in public.
> 
> My guess is that this requirement almost always comes from security
> departments that don't know a lot about the typical security threats
> that databases face, or (worse) from lawmakers.
> 
> And these are probably the people who will insist that *everything*
> is encrypted, even your commit log (unencrypted log? everyone can
> read the commits?).
> 

IMHO it's a sound design principle - deny access by default, then allow
for specific cases. It's much easier to reason about, and also to validate,
such solutions.

It's pretty much the same reason why firewall rules generally prohibit
everything by default, and then only allow access for specific ports,
from specific IP ranges, etc. Doing it the other way around is futile.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Segfault when restoring -Fd dump on current HEAD

2019-03-04 Thread Tom Lane
Andres Freund  writes:
> One thing I want to bring up here rather than in the pluggable storage
> thread is that currently the pg_dump support for access methods deals
> with table access methods in a manner similar to the way we deal with
> tablespaces. Instead of specifying the AM on every table creation, we
> set the default AM when needed.  That makes it easier to adjust dumps.

Hm.  I wonder if it'd make more sense to consider that an access method is
a property of a tablespace?  That is, all tables in a tablespace have the
same access method, so you don't need to label tables individually?

> But it does basically require breaking archive compatibility.  I
> personally am OK with that, but I thought it might be worth discussing.

I don't recall there being huge pushback when we did that in the past,
so I'm fine with it as long as there's an identifiable feature making
it necessary.

regards, tom lane



Re: GiST VACUUM

2019-03-04 Thread Andrey Borodin
Hi!

Thanks for fixing gist amcheck! The idea of using these two patches together 
seems so obvious now, but it never actually crossed my mind before.

> 4 марта 2019 г., в 18:04, Heikki Linnakangas  написал(а):
> Thanks! As I noted at 
> https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi,
>  the test patch left the index corrupt. I fixed it so that it leaves behind 
> incompletely split pages, without the corruption, see attached. It's similar 
> to yours, but let me recap what it does:
> 
> * Hacks gistbuild(), create 100 empty pages immediately after the root pages. 
> They are leaked, so they won't be reused until the a VACUUM sees them and 
> puts them to the FSM
> 
> * Hacks gistinserttuples(), to leave the split incompleted with 50% 
> probability
> 
> * In vacuum, print a line to the log whenever it needs to "jump left"
> 
> I used this patch, with the attached test script that's similar to yours, but 
> it also tries to verify that the index returns correct results. It prints a 
> result set like this:
> 
>   sum
> -
> -364450
>  364450
> (2 rows)
> 
> If the two numbers add up to 0, the index seems valid. And you should see 
> "RESCAN" lines in the log, to show that jumping left happened. Because the 
> behavior is random and racy, you may need to run the script many times to see 
> both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. 
> Especially the "FollowRight" case happens less frequently than the NSN case, 
> you might need to run the script > 10 times to see it.
Great! I've repeated your tests on my machine and observe similar frequencies for 
the causes of rescan left jumps.

> I also tried your amcheck tool with this. It did not report any errors.
> 
> Attached is also latest version of the patch itself. It is the same as your 
> latest patch v19, except for some tiny comment kibitzing. I'll mark this as 
> Ready for Committer in the commitfest app, and will try to commit it in the 
> next couple of days.

That's cool! I'll work on the 2nd step of this patchset to make the blockset data 
structure prettier and less hacky.

Best regards, Andrey Borodin.


Re: Segfault when restoring -Fd dump on current HEAD

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 13:25:40 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > One thing I want to bring up here rather than in the pluggable storage
> > thread is that currently the pg_dump support for access methods deals
> > with table access methods in a manner similar to the way we deal with
> > tablespaces. Instead of specifying the AM on every table creation, we
> > set the default AM when needed.  That makes it easier to adjust dumps.
> 
> Hm.  I wonder if it'd make more sense to consider that an access method is
> a property of a tablespace?  That is, all tables in a tablespace have the
> same access method, so you don't need to label tables individually?

I don't think that'd work well. That'd basically necessitate creating
multiple tablespaces just to create a table with a different AM -
creating tablespaces is a superuser only activity that makes backups etc
more complicated. It also doesn't correspond well to pg_class.relam etc.


> > But it does basically require breaking archive compatibility.  I
> > personally am OK with that, but I thought it might be worth discussing.
> 
> I don't recall there being huge pushback when we did that in the past,
> so I'm fine with it as long as there's an identifiable feature making
> it necessary.

Cool.

Greetings,

Andres Freund



Re: query logging of prepared statements

2019-03-04 Thread Justin Pryzby
On Mon, Mar 04, 2019 at 06:53:31PM +0300, Arthur Zakirov wrote:
> Hello Justin,
> 
> On 27.02.2019 21:06, Justin Pryzby wrote:
> >I'm attaching a v2 patch which avoids repeated logging of PREPARE, rather 
> >than
> >making such logs conditional on log_error_verbosity=VERBOSE, since
> >log_error_verbosity is documented to control whether these are output:
> >DETAIL/HINT/QUERY/CONTEXT/SQLSTATE.
> I looked the patch. I got interesting result with different parameters.
> 
> But with log_statement='none' and log_min_duration_statement='0' I get:
> 
> =# execute test_ins(3);
> LOG:  duration: 8.439 ms  statement: execute test_ins(3);
> DETAIL:  prepare: prepare test_ins (int) as
> insert into test values ($1);
> 
> Is it intended? In the second result I got the query details.

It wasn't intentional.  Find attached a v3 patch which handles that case by
removing the 2nd call to errdetail_execute(); since that function is then
otherwise unused, it is removed entirely.

|postgres=# execute test_ins(3);
|2019-03-04 11:56:15.997 EST [4044] LOG:  duration: 0.637 ms  statement: 
execute test_ins(3);

I also fixed a 2nd behavior using library call pqExecPrepared with
log_min_duration_statement=0.

It was logging:
|LOG:  duration: 0.264 ms  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  duration: 0.027 ms  bind q: SELECT 1; PREPARE q AS SELECT $1
|DETAIL:  parameters: $1 = '1'
|LOG:  duration: 0.006 ms  execute q: SELECT 1; PREPARE q AS SELECT $1
|DETAIL:  parameters: $1 = '1'

But now logs:

PGPORT=5679 PGHOST=/tmp PYTHONPATH=../PyGreSQL/build/lib.linux-x86_64-2.7/ 
python2.7 -c "import pg; d=pg.DB('postgres'); d.query('SET 
client_min_messages=error; SET log_error_verbosity=default; SET 
log_min_duration_statement=0; SET log_statement=\"none\"'); d.query('SELECT 1; 
PREPARE q AS SELECT \$1'); d.query_prepared('q',[1])"
|LOG:  duration: 0.479 ms  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  duration: 0.045 ms  bind q
|DETAIL:  parameters: $1 = '1'
|LOG:  duration: 0.008 ms  execute q
|DETAIL:  parameters: $1 = '1'

Thanks for reviewing.  I'm also interested in discussing whether this
change is undesirable for someone else for some reason.  For me, the existing
output seems duplicative and "denormalized".  :)

Justin
>From c04f2fe815a55babe6a9bdd53421d74fc283094b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 9 Feb 2019 19:20:43 -0500
Subject: [PATCH v3] Avoid repetitive log of PREPARE during EXECUTE of prepared
 statements

---
 src/backend/tcop/postgres.c | 54 +++--
 1 file changed, 8 insertions(+), 46 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..f348475ea3 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -182,7 +182,6 @@ static int	ReadCommand(StringInfo inBuf);
 static void forbidden_in_wal_sender(char firstchar);
 static List *pg_rewrite_query(Query *query);
 static bool check_log_statement(List *stmt_list);
-static int	errdetail_execute(List *raw_parsetree_list);
 static int	errdetail_params(ParamListInfo params);
 static int	errdetail_abort(void);
 static int	errdetail_recovery_conflict(void);
@@ -1041,8 +1040,7 @@ exec_simple_query(const char *query_string)
 	{
 		ereport(LOG,
 (errmsg("statement: %s", query_string),
- errhidestmt(true),
- errdetail_execute(parsetree_list)));
+ errhidestmt(true)));
 		was_logged = true;
 	}
 
@@ -1292,8 +1290,7 @@ exec_simple_query(const char *query_string)
 			ereport(LOG,
 	(errmsg("duration: %s ms  statement: %s",
 			msec_str, query_string),
-	 errhidestmt(true),
-	 errdetail_execute(parsetree_list)));
+	 errhidestmt(true)));
 			break;
 	}
 
@@ -1929,12 +1926,11 @@ exec_bind_message(StringInfo input_message)
 			break;
 		case 2:
 			ereport(LOG,
-	(errmsg("duration: %s ms  bind %s%s%s: %s",
+	(errmsg("duration: %s ms  bind %s%s%s",
 			msec_str,
 			*stmt_name ? stmt_name : "",
 			*portal_name ? "/" : "",
-			*portal_name ? portal_name : "",
-			psrc->query_string),
+			*portal_name ? portal_name : ""),
 	 errhidestmt(true),
 	 errdetail_params(params)));
 			break;
@@ -2062,14 +2058,13 @@ exec_execute_message(const char *portal_name, long max_rows)
 	if (check_log_statement(portal->stmts))
 	{
 		ereport(LOG,
-(errmsg("%s %s%s%s: %s",
+(errmsg("%s %s%s%s",
 		execute_is_fetch ?
 		_("execute fetch from") :
 		_("execute"),
 		prepStmtName,
 		*portal_name ? "/" : "",
-		*portal_name ? portal_name : "",
-		sourceText),
+		*portal_name ? portal_name : ""),
  errhidestmt(true),
  errdetail_params(portalParams)));
 		was_logged = true;
@@ -2150,15 +2145,14 @@ exec_execute_message(const char *portal_name, long max_rows)
 			break;
 		case 2:
 			ereport(LOG,
-	(errmsg("duration: %s ms  %s %s%s%s: %s",
+	(errmsg("duration: %s ms  %s %s%s%s",
 			msec_str,
 			execute_is_fetch ?
 			_("execute 

Re: \describe*

2019-03-04 Thread Corey Huinker
>
>
>> - Tab completion for \describe-verbose.
>> I know that \d+ tab completion is also not there, but I think we must
>> have tab completion for \describe-verbose.
>>
>> postgres=# \describe-
>> \describe-extension
>>  \describe-replication-publication \describe-user-mapping
>> \describe-foreign-data-wrapper
>> \describe-replication-subscription\describe-view
>> \describe-foreign-server  \describe-role
>>   \describe-window-function
>> \describe-foreign-table   \describe-rule
>>  ...
>>
>
I just confirmed that there isn't tab completion for the existing S/+
options, so it's hard to justify them for the equivalent verbose suffixes.



> (1 row)
>> Invalid command \describe. Try \? for help.
>>
>>
>> I think this status is causing the problem.
>>
>>
>>
>> +   /*
>> standard listing of interesting things */
>> +   success =
>> listTables("tvmsE", NULL, show_verbose, show_system);
>> +   }
>> +   status = PSQL_CMD_UNKNOWN;
>>
>>
I'll look into this, thanks!



> - Confusion about \desc and \desC
>> There is confusion while running the \desc command. I know the problem,
>> but the user may confuse by this.
>> postgres=# \desC
>>List of foreign servers
>>  Name | Owner | Foreign-data wrapper
>> --+---+--
>> (0 rows)
>>
>> postgres=# \desc
>> Invalid command \desc. Try \? for help.
>>
>> - Auto-completion of commands.
>> There is some more confusion in the completion of commands.
>>
>> This command shows List of aggregates.
>> postgres=# \describe-aggregate-function
>>  List of aggregate functions
>>  Schema | Name | Result data type | Argument data types | Description
>> +--+--+-+-
>> (0 rows)
>>
>>
>>
>> This command shows a list of relation "\d"
>> postgres=# \describe-aggregatE-function
>> List of relations
>>  Schema | Name | Type  |  Owner
>> +--+---+-
>>  public | foo  | table | vagrant
>> (1 row)
>>
>> This command also shows a list of relations "\d".
>> postgres=# \describe-aggr
>> List of relations
>>  Schema | Name | Type  |  Owner
>> +--+---+-
>>  public | foo  | table | vagrant
>> (1 row)
>>
>> This command shows error messages.
>> postgres=# \descr
>> Invalid command \descr. Try \? for help.
>>
>>
I will look into it.



>
>> I have done a brief code review except for the documentation code. I
>> don't like this code
>>
>> if (cmd_match(cmd,"describe-aggregate-function"))
>>
>>  success = describeAggregates(pattern, show_verbose, show_system);
>>  else if (cmd_match(cmd,
>> "describe-access-method"))
>>  success = describeAccessMethods(pattern,
>> show_verbose);
>>  else if (cmd_match(cmd,
>> "describe-tablespace"))
>>  success = describeTablespaces(pattern,
>> show_verbose);
>>  else if (cmd_match(cmd,
>> "describe-conversion"))
>>  success = listConversions(pattern,
>> show_verbose, show_system);
>>  else if (cmd_match(cmd, "describe-cast"))
>>  success = listCasts(pattern, show_verbose
>>
>>
>> This can be achieved with the list/array/hash table, so I have changed
>> that code in the attached patch just for a sample if you want I can do that
>> for whole code.
>>
>
There are some problems with a hash table. The function signatures vary quite
a lot, and some require an additional psql_scan_slash_option call.
The hash option, if implemented, should probably be expanded to all slash
commands, at which point maybe it belongs in psqlscanslash.l...

>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Tomas Vondra



On 3/4/19 6:40 AM, Masahiko Sawada wrote:
> On Sat, Mar 2, 2019 at 5:27 AM Robert Haas  wrote:
>>
>> On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada  wrote:
>>> WAL encryption will follow as an additional feature.
>>
>> I don't think WAL encryption is an optional feature.  You can argue
>> about whether it's useful to encrypt the disk files in the first place
>> given that there's no privilege boundary between the OS user and the
>> database, but a lot of people seem to think it is and maybe they're
>> right.  However, who can justify encrypting only SOME of the disk
>> files and not others?  I mean, maybe you could justify not encryption
>> the SLRU files on the grounds that they probably don't leak much in
>> the way of interesting information, but the WAL files certainly do --
>> your data is there, just as much as in the data files themselves.
>>
> 
> Agreed.
> 
>> To be honest, I think there is a lot to like about the patches
>> Cybertec has proposed.  Those patches don't have all of the fancy
>> key-management stuff that you are proposing here, but maybe that
>> stuff, if we want it, could be added, rather than starting over from
>> scratch.  It seems to me that those patches get a lot of things right.
>> In particular, it looked to me when I looked at them like they made a
>> pretty determined effort to encrypt every byte that might go down to
>> the disk.  It seems to me that you if you want encryption, you want
>> that.
>>
> 
> Agreed. I think the patch lacks the key management stuff: 2-tier key
> architecture and integration of postgres with key management systems.
> I'd like to work together and can propose the patch of key management
> stuff to the proposed patch.
> 

Sounds like a plan. It'd be nice to come up with a unified version of
those two patches, combining the good pieces from both.

I wonder how other databases deal with key management? Surely we're not
the first/only database that tries to do transparent encryption, so
perhaps we could learn something from others? For example, do they use
this 2-tier key architecture? How do they do key management? etc.

I don't say we should copy from them, but it'd allow us to (a) avoid
making the same mistakes and (b) build a solution the users are already
somewhat familiar with.

May I suggest creating a page on the PostgreSQL wiki, explaining the
design and updating it as the discussion develops? It's rather difficult
to follow all the different sub-threads, and IIRC some larger patches
used that successfully for this purpose.

See for example:

* https://wiki.postgresql.org/wiki/Parallel_External_Sort
* https://wiki.postgresql.org/wiki/Parallel_Internal_Sort


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_partition_tree crashes for a non-defined relation

2019-03-04 Thread Alvaro Herrera
On 2019-Feb-28, Alvaro Herrera wrote:

> What about legacy inheritance?  I see that pg_partition_tree handles
> that case perfectly well -- it seems to return the complete hierarchy
> rooted at the given relation.  However, it seems odd that it works at
> all, don't you think?  Consider this:
> [...]
> I would opt for returning the empty set for legacy inheritance too.

I added tests to immortalize the current behavior for legacy inheritance
relations too.  Thanks!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-02 18:11:43 -0500, Tom Lane wrote:
> On test cases like "pg_bench -S" it seems to be pretty much within the
> noise level of being the same speed as HEAD.

I think that might be because it's bottleneck is just elsewhere
(e.g. very context switch heavy, very few lists of any length).

FWIW, even just taking context switches out of the equation leads to
a ~5-6 %benefit in a simple statement:

DO $f$BEGIN FOR i IN 1..50 LOOP EXECUTE $s$SELECT aid, bid, abalance, 
filler FROM pgbench_accounts WHERE aid = 2045530;$s$;END LOOP;END;$f$;

master:
+6.05%  postgres  postgres[.] AllocSetAlloc
+5.52%  postgres  postgres[.] base_yyparse
+2.51%  postgres  postgres[.] palloc
+1.82%  postgres  postgres[.] hash_search_with_hash_value
+1.61%  postgres  postgres[.] core_yylex
+1.57%  postgres  postgres[.] SearchCatCache1
+1.43%  postgres  postgres[.] expression_tree_walker.part.4
+1.09%  postgres  postgres[.] check_stack_depth
+1.08%  postgres  postgres[.] MemoryContextAllocZeroAligned

patch v3:
+5.77%  postgres  postgres[.] base_yyparse
+4.88%  postgres  postgres[.] AllocSetAlloc
+1.95%  postgres  postgres[.] hash_search_with_hash_value
+1.89%  postgres  postgres[.] core_yylex
+1.64%  postgres  postgres[.] SearchCatCache1
+1.46%  postgres  postgres[.] expression_tree_walker.part.0
+1.45%  postgres  postgres[.] palloc
+1.18%  postgres  postgres[.] check_stack_depth
+1.13%  postgres  postgres[.] MemoryContextAllocZeroAligned
+1.04%  postgres  libc-2.28.so[.] _int_malloc
+1.01%  postgres  postgres[.] nocachegetattr

And even just pgbenching the EXECUTEd statement above gives me a
reproducible ~3.5% gain when using -M simple, and ~3% when using -M
prepared.

Note than when not using prepared statement (a pretty important
workload, especially as long as we don't have a pooling solution that
actually allows using prepared statement across connections), even after
the patch most of the allocator overhead is still from list allocations,
but it's near exclusively just the "create a new list" case:

+5.77%  postgres  postgres[.] base_yyparse
-4.88%  postgres  postgres[.] AllocSetAlloc
   - 80.67% AllocSetAlloc
  - 68.85% AllocSetAlloc
 - 57.65% palloc
- 50.30% new_list (inlined)
   - 37.34% lappend
  + 12.66% pull_var_clause_walker
  + 8.83% build_index_tlist (inlined)
  + 8.80% make_pathtarget_from_tlist
  + 8.73% get_quals_from_indexclauses (inlined)
  + 8.73% distribute_restrictinfo_to_rels
  + 8.68% RewriteQuery
  + 8.56% transformTargetList
  + 8.46% make_rel_from_joinlist
  + 4.36% pg_plan_queries
  + 4.30% add_rte_to_flat_rtable (inlined)
  + 4.29% build_index_paths
  + 4.23% match_clause_to_index (inlined)
  + 4.22% expression_tree_mutator
  + 4.14% transformFromClause
  + 1.02% get_index_paths
   + 17.35% list_make1_impl
   + 16.56% list_make1_impl (inlined)
   + 15.87% lcons
   + 11.31% list_copy (inlined)
   + 1.58% lappend_oid
+ 12.90% expression_tree_mutator
+ 9.73% get_relation_info
+ 4.71% bms_copy (inlined)
+ 2.44% downcase_identifier
+ 2.43% heap_tuple_untoast_attr
+ 2.37% add_rte_to_flat_rtable (inlined)
+ 1.69% btbeginscan
+ 1.65% CreateTemplateTupleDesc
+ 1.61% core_yyalloc (inlined)
+ 1.59% heap_copytuple
+ 1.54% text_to_cstring (inlined)
+ 0.84% ExprEvalPushStep (inlined)
+ 0.84% ExecInitRangeTable
+ 0.84% scanner_init
+ 0.83% ExecInitRangeTable
+ 0.81% CreateQueryDesc
+ 0.81% _bt_search
+ 0.77% ExecIndexBuildScanKeys
+ 0.66% RelationGetIndexScan
+ 0.65% make_pathtarget_from_tlist


Given how hard it is to improve performance with costs as flatly
distributed as in the above profiles, I actually think these are quite promising
results.

I'm not even convinced that it makes all that much sense to measure
end-to-end performance here; it might be worthwhile to measure with a
debugging function that allows exercising parsing, parse analysis,
rewriting etc. at configurable loop counts. Given the relatively evenly
distributed profiles, we're going to have to make a few different
improvements to make headway, and it's hard to see the benefits of
individual ones if you only look at the overall numbers.

Greetings,

Andres Freund

Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-03 13:29:04 -0500, Tom Lane wrote:
> The cases I've been looking at suggest to me that we'd make far
> more progress on the excessive-palloc'ing front if we could redesign
> things to reduce unnecessary copying of parsetrees.  Right now the
> planner does an awful lot of copying because of fear of unwanted
> modifications of multiply-linked subtrees.  I suspect that we could
> reduce that overhead with some consistently enforced rules about
> not scribbling on input data structures; but it'd take a lot of work
> to get there, and I'm afraid it'd be easily broken :-(

Given the difficulty of this task, isn't your patch actually a *good*
attack on the problem? It makes copying lists considerably cheaper. As
you say, a more principled answer to this problem is hard, so attacking
it from the "make the constant factors smaller" side doesn't seem crazy?

Greetings,

Andres Freund



Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 13:11:35 -0500, Tom Lane wrote:
> The concern I have is mostly about the use of lists as core infrastructure
> in parsetree, plantree, etc data structures.  I think any idea that we'd
> replace those piecemeal is borderline insane: it's simply not worth it
> from a notational and bug-risk standpoint to glue together some parts of
> those structures differently from the way other parts are glued together.

I don't buy this. I think e.g. redesigning the way we represent
targetlists would be good (it's e.g. insane that we recompute
descriptors out of them all the time), and would reduce their allocator
costs.

Greetings,

Andres Freund



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Paul Ramsey
So I am getting much closer to a working implementation in PostGIS,
but have just run into an issue which I am assuming is my
misunderstanding of something...

https://github.com/pramsey/postgis/blob/92268c94f3aa1fc63a2941f2b451be15b28662cf/postgis/gserialized_supportfn.c#L287

I had what seemed to be working code except for a couple rare cases,
but when I fixed those cases it turned out that I had a major problem:
building an <indexed column> OP <constant> expression works fine, but building a
<constant> OP <indexed column> expression returns me an error.

create table f as select st_makepoint(200*random() - 100, 200*random()
- 100) as g from generate_series(0, 10);
create index f_g_x on f using gist (g);
explain select * from f where st_coveredby('POINT(5 0)', g);
explain select * from f where st_coveredby(g, 'POINT(5 0)');

QUERY PLAN
---
 Bitmap Heap Scan on f  (cost=13.36..314.58 rows=4 width=32)
   Filter: st_coveredby(g,
'0101001440'::geometry)
   ->  Bitmap Index Scan on f_g_x  (cost=0.00..5.03 rows=100 width=0)
 Index Cond: (g @
'0101001440'::geometry)

postgis=# explain select * from f where st_coveredby('POINT(5 0)', g);

ERROR:  index key does not match expected index column

Any thoughts?

P



Re: Checksum errors in pg_stat_database

2019-03-04 Thread Julien Rouhaud
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander  wrote:
>
> It tracks things that happen in the general backends. Possibly we should also 
> consider counting the errors actually found when running base backups? OTOH, 
> that part of the code doesn't really track things like databases (as it 
> operates just on the raw data directory underneath), so that implementation 
> would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on the raw directory, I see that sendDir()
already sets up an isDbDir variable, and if it is true lastDir should
contain the OID of the underlying database.  Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?



Re: psql display of foreign keys

2019-03-04 Thread Alvaro Herrera
On 2019-Feb-28, Michael Paquier wrote:

> On Wed, Feb 27, 2019 at 03:37:23PM -0300, Alvaro Herrera wrote:

> > +pg_partition_ancestors(PG_FUNCTION_ARGS)
> > +{
> > +   Oid relid = PG_GETARG_OID(0);
> > +   FuncCallContext *funcctx;
> > +   ListCell  **next;
> > +
> > +   if (!check_rel_can_be_partition(relid))
> > +   PG_RETURN_NULL();
> 
> Not returning an empty set here? ;)

Yeah, I adapted to what was there then, but in the original coding I had
the SRF_RETURN_DONE that you committed for pg_partition_tree.
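
For anyone following along, here is a minimal sketch of the "empty set
rather than NULL" pattern being referred to; the demo function below is
made up for illustration, it is not code from either patch:

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(demo_empty_set);

/* Returns the integers 1..n, or an empty set -- zero rows, not one
 * NULL row -- when n is negative. */
Datum
demo_empty_set(PG_FUNCTION_ARGS)
{
	int32		n = PG_GETARG_INT32(0);
	FuncCallContext *funcctx;

	if (SRF_IS_FIRSTCALL())
	{
		funcctx = SRF_FIRSTCALL_INIT();

		/* the "return an empty set" exit: finish before producing any rows */
		if (n < 0)
			SRF_RETURN_DONE(funcctx);

		funcctx->max_calls = n;
	}

	funcctx = SRF_PERCALL_SETUP();

	if (funcctx->call_cntr < funcctx->max_calls)
	{
		/* compute the result before SRF_RETURN_NEXT bumps call_cntr */
		Datum		result = Int32GetDatum((int32) funcctx->call_cntr + 1);

		SRF_RETURN_NEXT(funcctx, result);
	}

	SRF_RETURN_DONE(funcctx);
}

Replacing that early SRF_RETURN_DONE() with a plain PG_RETURN_NULL() is
what produces a single NULL row instead of zero rows.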

> I would have added tests with pg_partition_ancestors(NULL) and
> pg_partition_ancestors(0) for consistency with the rest.

Done.

> Except that and the ancestor tracking for inheritance, the shape of
> the patch looks good to me.

Thanks for reviewing!  I have pushed with your proposed changes.

Here's the patch I'm really interested in :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 952d89b43543bb1c597a4fb8d4a8846a70daa0af Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 26 Feb 2019 18:57:25 -0300
Subject: [PATCH v5] fix psql display of FKs

---
 src/bin/psql/describe.c   | 134 --
 src/test/regress/expected/foreign_key.out |  26 ++---
 2 files changed, 114 insertions(+), 46 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 4da6719ce71..6dac5fa3009 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1479,6 +1479,7 @@ describeOneTableDetails(const char *schemaname,
 		bool		rowsecurity;
 		bool		forcerowsecurity;
 		bool		hasoids;
+		bool		ispartition;
 		Oid			tablespace;
 		char	   *reloptions;
 		char	   *reloftype;
@@ -1501,7 +1502,24 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "false AS relhasoids, %s, c.reltablespace, "
+		  "false AS relhasoids, c.relispartition, %s, c.reltablespace, "
+		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
+		  "c.relpersistence, c.relreplident\n"
+		  "FROM pg_catalog.pg_class c\n "
+		  "LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)\n"
+		  "WHERE c.oid = '%s';",
+		  (verbose ?
+		   "pg_catalog.array_to_string(c.reloptions || "
+		   "array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', ')\n"
+		   : "''"),
+		  oid);
+	}
+	else if (pset.sversion >= 11)
+	{
+		printfPQExpBuffer(&buf,
+		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
+		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
+		  "c.relhasoids, c.relispartition, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1518,7 +1536,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "c.relhasoids, %s, c.reltablespace, "
+		  "c.relhasoids, false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1535,7 +1553,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1552,7 +1570,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1569,7 +1587,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtyp

Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/3/19, 9:23 PM, "Masahiko Sawada"  wrote:
> FWIW, I agree that we have options for vacuum as vacuum
> command options. But for reloptions, I think we should not add them if
> the persistence of the setting could be problematic. According to the
> discussions so far, I think VACUUM_SHRINK_ENABLED is the one option
> that can be available as both a vacuum command option and a reloption.
> But I'm not sure there is a good use case even if we can set
> DISABLE_INDEX_CLEANUP as a reloption.

+1

The DISABLE_INDEX_CLEANUP option is intended to help avoid transaction
ID wraparound and should not be used as a long-term VACUUM strategy
for a table.

Nathan



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 20:03:37 +, Bossart, Nathan wrote:
> On 3/3/19, 9:23 PM, "Masahiko Sawada"  wrote:
> > FWIW, I agree that we have options for vacuum as vacuum
> > command options. But for reloptions, I think we should not add them if
> > the persistence of the setting could be problematic. According to the
> > discussions so far, I think VACUUM_SHRINK_ENABLED is the one option
> > that can be available as both a vacuum command option and a reloption.
> > But I'm not sure there is a good use case even if we can set
> > DISABLE_INDEX_CLEANUP as a reloption.
> 
> +1
> 
> The DISABLE_INDEX_CLEANUP option is intended to help avoid transaction
> ID wraparound and should not be used as a long-term VACUUM strategy
> for a table.

I'm not quite convinced this is right.  There are plenty of sites that
practically can't use autovacuum because it might decide to vacuum the
5TB index because of 300 dead tuples in the middle of busy periods.  And
without a reloption that's not controllable.

- Andres



Re: Rare SSL failures on eelpout

2019-03-04 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> That suggests that we could perhaps handle ECONNRESET both at startup
>> packet send time (for certificate rejection, eelpout's case) and at
>> initial query send (for idle timeout, bug #15598's case) by attempting
>> to read.  Does that make sense?

> Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> failure means there is nothing left to *read*.

After staring at the code for awhile, I am thinking that there may be
a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
specific to the OpenSSL version you're using on eelpout?  There is
not anything we could do differently in libpq, AFAICS, because it's
OpenSSL's responsibility to read any data that might be available.

I also looked into the idea that we're doing something wrong on the
server side, allowing the final error message to not get flushed out.
A plausible theory there is that SSL_shutdown is returning a WANT_READ
or WANT_WRITE error and we should retry it ... but that doesn't square
with your observation upthread that it's returning SSL_ERROR_SSL.

It's all very confusing, but I think there's a nontrivial chance
that this is an OpenSSL bug, especially since we haven't been able
to replicate it elsewhere.

regards, tom lane



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Tom Lane
Paul Ramsey  writes:
> I had what seemed to be working code except for a couple rare cases,
> but when I fixed those cases it turned out that I had a major problem:
> building an <indexed column> OP <constant> expression works fine, but building a
> <constant> OP <indexed column> expression returns me an error.

Yup, you're not supposed to do that.  The output expression *must* have
the index key on the left; it's up to you to commute the operator if
needed to make that happen.
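
For example, a rough sketch of the commute step (the helper name and
argument handling are invented here for illustration; this is not taken
from the PostGIS patch):

#include "postgres.h"
#include "catalog/pg_type.h"
#include "nodes/makefuncs.h"
#include "utils/lsyscache.h"

/*
 * Build "indexkey OP other" for a support-function index condition,
 * commuting the operator when the original call had the index key as
 * its second argument.  Returns NULL if the operator has no commutator.
 */
static Expr *
make_index_condition(Oid opno, Expr *indexkey, Expr *other,
					 bool indexkey_was_leftarg, Oid inputcollid)
{
	if (!indexkey_was_leftarg)
	{
		/* e.g. st_coveredby('POINT(5 0)', g): swap args, use commutator */
		opno = get_commutator(opno);
		if (!OidIsValid(opno))
			return NULL;
	}

	/* the index key always ends up as the left-hand operand */
	return make_opclause(opno, BOOLOID, false,
						 indexkey, other,
						 InvalidOid, inputcollid);
}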

regards, tom lane



Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
Andres Freund  writes:
> I don't buy this. I think e.g. redesigning the way we represent
> targetlists would be good (it's e.g. insane that we recompute
> descriptors out of them all the time), and would reduce their allocator
> costs.

Maybe we're not on the same page here, but it seems to me that that'd be
addressable with pretty localized changes (eg, adding more fields to
TargetEntry, or keeping a pre-instantiated output tupdesc in each Plan
node).  But if the concern is about the amount of palloc bandwidth going
into List cells, we're not going to be able to improve that with localized
data structure changes; it'll take something like the patch I've proposed.

I *have* actually done some tests of the sort you proposed, driving
just the planner and not any of the rest of the system, but I still
didn't find much evidence of big wins.  I find it interesting that
you get different results.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/4/19, 12:11 PM, "Andres Freund"  wrote:
> I'm not quite convinced this is right.  There are plenty of sites that
> practically can't use autovacuum because it might decide to vacuum the
> 5TB index because of 300 dead tuples in the middle of busy periods.  And
> without a reloption that's not controllable.

Wouldn't it be better to adjust the cost and threshold parameters or
to manually vacuum during quieter periods?  I suppose setting
DISABLE_INDEX_CLEANUP on a relation during busy periods could be
useful if you really need to continue reclaiming transaction IDs, but
that seems like an easy way to accidentally never vacuum indexes.

Nathan



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 21:40:53 +, Bossart, Nathan wrote:
> On 3/4/19, 12:11 PM, "Andres Freund"  wrote:
> > I'm not quite convinced this is right.  There are plenty of sites that
> > practically can't use autovacuum because it might decide to vacuum the
> > 5TB index because of 300 dead tuples in the middle of busy periods.  And
> > without a reloption that's not controllable.
>
> Wouldn't it be better to adjust the cost and threshold parameters or
> to manually vacuum during quieter periods?

No. (auto)vacuum is useful to reclaim space etc. It's just the
unnecessary index cleanup that's the problem...  Most of the space can
be reclaimed after all, the item pointer ain't that big...


> I suppose setting DISABLE_INDEX_CLEANUP on a relation during busy
> periods could be useful if you really need to continue reclaiming
> transaction IDs, but that seems like an easy way to accidentally never
> vacuum indexes.

Yea, I do think that's a danger. But we allow disabling autovacuum, so
I'm not sure it matters that much... And for indexes you'd still have
the index page-level vacuum that'd continue to work.

- Andres



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-04 Thread Peter Geoghegan
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas  wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.

Fixed in v14.

> > /*
> >  *_bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We want to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.

There are a few specific cross-version things that we need think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros a

Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Paul Ramsey



> On Mar 4, 2019, at 1:13 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> I had what seemed to be working code except for a couple rare cases,
>> but when I fixed those cases it turned out that I had a major problem:
>> building an <indexed column> OP <constant> expression works fine, but building a
>> <constant> OP <indexed column> expression returns me an error.
> 
> Yup, you're not supposed to do that.  The output expression *must* have
> the index key on the left, it's up to you to commute the operator if
> needed to make that happen.

Gotcha, done and now have an implementation that passes all our regression 
tests.

Thanks!

P


[Proposal] TOAST'ing in slices

2019-03-04 Thread Bruno Hass
Hello Hackers,

I'm intending to optimize some varlena data types as my GSoC proposal. That 
would be done by a smarter way of splitting the TOAST table chunks, depending 
on the data type. A JSONB would be split considering its internal tree
structure, keeping track of which keys are in each chunk. Arrays and text 
fields are candidates for optimization as well, as I outlined in my proposal 
draft (attached). Not being very familiar with the source code, I am writing to 
this list seeking some guidance on where to look in the source code in order to
detail my implementation. Furthermore, I have a couple of questions:

  *   Would it be a good idea to modify the TOAST table row to keep metadata on 
the data it stores?
  *   This proposal will modify how JSONB, array and text fields are TOASTed. 
Where can I find the code related to that?

Kind regards,

Bruno Hass



PostgreSQL Proposal.pdf
Description: PostgreSQL Proposal.pdf


Re: Rare SSL failures on eelpout

2019-03-04 Thread Thomas Munro
On Tue, Mar 5, 2019 at 10:08 AM Tom Lane  wrote:
> I wrote:
> > Thomas Munro  writes:
> >> That suggests that we could perhaps handle ECONNRESET both at startup
> >> packet send time (for certificate rejection, eelpout's case) and at
> >> initial query send (for idle timeout, bug #15598's case) by attempting
> >> to read.  Does that make sense?
>
> > Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> > failure means there is nothing left to *read*.
>
> After staring at the code for awhile, I am thinking that there may be
> a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
> specific to the OpenSSL version you're using on eelpout?  There is
> not anything we could do differently in libpq, AFAICS, because it's
> OpenSSL's responsibility to read any data that might be available.
>
> I also looked into the idea that we're doing something wrong on the
> server side, allowing the final error message to not get flushed out.
> A plausible theory there is that SSL_shutdown is returning a WANT_READ
> or WANT_WRITE error and we should retry it ... but that doesn't square
> with your observation upthread that it's returning SSL_ERROR_SSL.
>
> It's all very confusing, but I think there's a nontrivial chance
> that this is an OpenSSL bug, especially since we haven't been able
> to replicate it elsewhere.

Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
remember that very few animals are running the ssl tests; also it
requires particular timing to hit.

OK, here's something.  I can reproduce it quite easily on this
machine, and I can fix it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index f29202db5f..e9c137f1bd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2705,6 +2705,7 @@ keep_going:		/* We will come back to here until there is
 							  libpq_gettext("could not send startup packet: %s\n"),
 							  SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
 				free(startpacket);
+				pqHandleSendFailure(conn);
 				goto error_return;
 			}

If I add some printf debugging in there, I can see that block being
reached every hundred or so times I try to connect with a revoked
certificate, and with that extra call to pqHandleSendFailure() in
there the error comes out as it should:

psql: SSL error: sslv3 alert certificate revoked

Now I'm confused because we already have handling like that in
PQsendQuery(), so I can't explain bug #15598.

-- 
Thomas Munro
https://enterprisedb.com



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/4/19, 1:44 PM, "Andres Freund"  wrote:
> On 2019-03-04 21:40:53 +, Bossart, Nathan wrote:
>> On 3/4/19, 12:11 PM, "Andres Freund"  wrote:
>> > I'm not quite convinced this is right.  There are plenty of sites that
>> > practically can't use autovacuum because it might decide to vacuum the
>> > 5TB index because of 300 dead tuples in the middle of busy periods.  And
>> > without a reloption that's not controllable.
>>
>> Wouldn't it be better to adjust the cost and threshold parameters or
>> to manually vacuum during quieter periods?
>
> No. (auto)vacuum is useful to reclaim space etc. It's just the
> unnecessary index cleanup that's the problem...  Most of the space can
> be reclaimed after all, the item pointer ain't that big...

I see what you mean.

>> I suppose setting DISABLE_INDEX_CLEANUP on a relation during busy
>> periods could be useful if you really need to continue reclaiming
>> transaction IDs, but that seems like an easy way to accidentally never
>> vacuum indexes.
>
> Yea, I do think that's a danger. But we allow disabling autovacuum, so
> I'm not sure it matters that much... And for indexes you'd still have
> the index page-level vacuum that'd continue to work.

I think the difference here is that there isn't something like
autovacuum_freeze_max_age to force index cleanup at some point.
Granted, you can set autovacuum_freeze_max_age to 2B if you want, but
at least there's a fallback available.

Nathan



Re: POC: converting Lists into arrays

2019-03-04 Thread Bruce Momjian
On Mon, Mar  4, 2019 at 12:44:41PM -0500, Robert Haas wrote:
> I think that's a separate but also promising thing to attack, and I
> agree that it'd take a lot of work to get there.  I don't think that
> the problem with either parse-tree-copying or list usage is that no
> performance benefits are to be had; I think it's that the amount of
> work required to get those benefits is pretty large.  If it were
> otherwise, somebody likely would have done it before now.

Stupid question, but do we use any kind of reference counter to know if
two subsystems look at a structure, and a copy is required?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 22:00:47 +, Bossart, Nathan wrote:
> On 3/4/19, 1:44 PM, "Andres Freund"  wrote:
> > Yea, I do think that's a danger. But we allow disabling autovacuum, so
> > I'm not sure it matters that much... And for indexes you'd still have
> > the index page-level vacuum that'd continue to work.
> 
> I think the difference here is that there isn't something like
> autovacuum_freeze_max_age to force index cleanup at some point.
> Granted, you can set autovacuum_freeze_max_age to 2B if you want, but
> at least there's a fallback available.

Well, but your cluster doesn't suddenly shut down because of index bloat
(in contrast to xid wraparound). So I don't quite see an equivalent need
for an emergency valve.  I think we should just put a warning into the
reloption's docs, and leave it at that.

Greetings,

Andres Freund



Re: POC: converting Lists into arrays

2019-03-04 Thread Bruce Momjian
On Mon, Mar  4, 2019 at 01:11:35PM -0500, Tom Lane wrote:
> Robert Haas  writes:
> > I think the reason why you're not seeing a performance benefit is
> > because the problem is not that lists are generically a more expensive
> > data structure than arrays, but that there are cases when they are
> > more expensive than arrays.  If you only ever push/pop at the front,
> > of course a list is going to be better.  If you often look up elements
> > by index, of course an array is going to be better.  If you change
> > every case where the code currently uses a list to use something else
> > instead, then you're changing both the winning and losing cases.
> 
> I don't think this argument is especially on-point, because what I'm
> actually seeing is just that there aren't any list operations that
> are expensive enough to make much of an overall difference in
> typical queries.  To the extent that an array reimplementation
> reduces the palloc traffic, it'd take some load off that subsystem,
> but apparently you need not-typical queries to really notice.
> (And, if the real motivation is aggregate palloc savings, then yes you
> really do want to replace everything...)

Could it be that allocating List* structures near the structures they
point to is enough of a benefit in terms of cache hits that it becomes a
loss when moving to a List* array?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: POC: converting Lists into arrays

2019-03-04 Thread Peter Geoghegan
On Mon, Mar 4, 2019 at 2:04 PM Bruce Momjian  wrote:
> On Mon, Mar  4, 2019 at 12:44:41PM -0500, Robert Haas wrote:
> > I think that's a separate but also promising thing to attack, and I
> > agree that it'd take a lot of work to get there.  I don't think that
> > the problem with either parse-tree-copying or list usage is that no
> > performance benefits are to be had; I think it's that the amount of
> > work required to get those benefits is pretty large.  If it were
> > otherwise, somebody likely would have done it before now.
>
> Stupid question, but do we use any kind of reference counter to know if
> two subsystems look at a structure, and a copy is required?

No, but I wonder if we could use Valgrind to enforce rules about who
has the right to scribble on what, when. That could make it a lot
easier to impose a new rule.
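
Just to illustrate the idea (a hypothetical sketch, not a proposal --
the helper names are invented, and memcheck has no write-only
protection, so reads of a "frozen" node would be flagged too):

#include "postgres.h"
#include "utils/memdebug.h"

/* Mark a palloc'd chunk off-limits; any later access trips a memcheck
 * error, which would catch a subsystem scribbling on a tree it does
 * not own.  (These are no-ops unless built with USE_VALGRIND.) */
static void
freeze_chunk(void *chunk, Size size)
{
	VALGRIND_MAKE_MEM_NOACCESS(chunk, size);
}

static void
thaw_chunk(void *chunk, Size size)
{
	/* contents were valid before freezing, so mark them defined again */
	VALGRIND_MAKE_MEM_DEFINED(chunk, size);
}

The hard part would be knowing the extent of every node in a tree, so
this would presumably have to hook into the memory context machinery
rather than work chunk-by-chunk like this.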

-- 
Peter Geoghegan



Re: patch to allow disable of WAL recycling

2019-03-04 Thread Jerry Jelinek
Alvaro,

Thanks for taking a look at the new patch. I'll update the patch to change
the name of the tunable to match your suggestion and I'll also go through
the cleanup you suggested. Finally, I'll try to rewrite the doc to
eliminate the confusion around the wording about allocating new blocks on
every write. I'll send out a new draft of the patch once all of these
changes are done.

Thanks again,
Jerry


On Wed, Feb 27, 2019 at 4:12 PM Alvaro Herrera 
wrote:

> On 2019-Feb-05, Jerry Jelinek wrote:
>
> > First, since last fall, we have found another performance problem related
> > to initializing WAL files. I've described this issue in more detail
> below,
> > but in order to handle this new problem, I decided to generalize the
> patch
> > so the tunable refers to running on a Copy-On-Write filesystem instead of
> > just being specific to WAL recycling. Specifically, I renamed the GUC
> > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> > more obvious what is being tuned and will also be more flexible if there
> > are other problems in the future which are related to running on a COW
> > filesystem. I'm happy to choose a different name for the tunable if
> people
> > don't like 'wal_cow_fs'.
>
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
>
> I'm rewording your doc addition a little bit.  Here's my proposal:
>
>
> This parameter should only be set to on when the WAL
> resides on a Copy-On-Write (COW) filesystem.
> Enabling this option adjusts behavior to take advantage of the
> filesystem characteristics (for example, recycling WAL files and
> zero-filling new WAL files are disabled).
>
> This part sounds good enough to me -- further suggestions welcome.
>
> I'm less sure about this phrase:
>
> This setting is only appropriate for filesystems which
> allocate new disk blocks on every write.
>
> Is "... which allocate new disk blocks on every write" a technique
> distinct from CoW itself?  I'm confused as to what it means, or how can
> the user tell whether they are on such a filesystem.
>
> Obviously you're thinking that ZFS is such a filesystem and everybody
> who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
> -- should they turn this option on?  Browsing the wikipedia, I find that
> Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
> I don't think either Btrfs or ReFS are realistic options to put pg_wal
> on, so let's just list the common filesystems for which users are
> supposed to enable this option ... which I think nowadays is just ZFS.
> All in all, I would replace this phrase with something like: "This
> setting should be enabled when pg_wal resides on a ZFS filesystem or
> similar." That should be weasely enough that it's clear that we expect
> users to do the homework when on unusual systems, while actively pointing
> out the most common use case.
>
> > Finally, the patch now includes bypassing the zero-fill for new WAL files
> > when wal_cow_fs is true.
>
> That makes sense.  I think all these benchmarks Tomas Vondra ran are not
> valid anymore ...
>
> The attached v2 has assorted cosmetic cleanups.  If you can validate it,
> I would appreciate it.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 16:28:40 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I don't buy this. I think e.g. redesigning the way we represent
> > targetlists would be good (it's e.g. insane that we recompute
> > descriptors out of them all the time), and would reduce their allocator
> > costs.
> 
> Maybe we're not on the same page here, but it seems to me that that'd be
> addressable with pretty localized changes (eg, adding more fields to
> TargetEntry, or keeping a pre-instantiated output tupdesc in each Plan
> node).  But if the concern is about the amount of palloc bandwidth going
> into List cells, we're not going to be able to improve that with localized
> data structure changes; it'll take something like the patch I've proposed.

What I'm saying is that it'd be reasonable to replace the use of list
for targetlists with 'list2' without a wholesale replacement of all the
list code, and it'd give us benefits.


> I find it interesting that you get different results.

What I reported weren't vanilla pgbench -S results, so there's that
difference. If you measure the DO loop based test I posted, do you see a
difference?

Greetings,

Andres Freund



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/4/19, 2:05 PM, "Andres Freund"  wrote:
> On 2019-03-04 22:00:47 +, Bossart, Nathan wrote:
>> On 3/4/19, 1:44 PM, "Andres Freund"  wrote:
>> > Yea, I do think that's a danger. But we allow disabling autovacuum, so
>> > I'm not sure it matters that much... And for indexes you'd still have
>> > the index page-level vacuum that'd continue to work.
>> 
>> I think the difference here is that there isn't something like
>> autovacuum_freeze_max_age to force index cleanup at some point.
>> Granted, you can set autovacuum_freeze_max_age to 2B if you want, but
>> at least there's a fallback available.
>
> Well, but your cluster doesn't suddenly shut down because of index bloat
> (in contrast to xid wraparound). So I don't quite see an equivalent need
> for an emergency valve.  I think we should just put a warning into the
> reloption's docs, and leave it at that.

That seems reasonable to me.

Nathan



Re: Rare SSL failures on eelpout

2019-03-04 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Mar 5, 2019 at 10:08 AM Tom Lane  wrote:
>> It's all very confusing, but I think there's a nontrivial chance
>> that this is an OpenSSL bug, especially since we haven't been able
>> to replicate it elsewhere.

> Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
> remember that very few animals are running the ssl tests; also it
> requires particular timing to hit.

True.  I've spent some time today running the ssl tests on various
machines here, without any luck reproducing.

> OK, here's something.  I can reproduce it quite easily on this
> machine, and I can fix it like this:

>libpq_gettext("could not send startup packet: %s\n"),
>SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
> free(startpacket);
> +   pqHandleSendFailure(conn);
> goto error_return;
> }

Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
constraints on what state libpq has to be in, as a consequence of which
it's called from a bunch of ad-hoc places, and can't be called from
some others.  It's kind of accidental that it will work here.

I was toying with the idea of getting rid of that function in
favor of a design like this:

* Handle any socket-write failure at some low level of libpq by
recording the fact that the error occurred (plus whatever data we
need to produce a message about it), and then starting to just
discard output data.

* Eventually, we'll try to read.  Process any available input data
as usual (and, if we get a read error, report that).  When no more
input data is available, if a socket write failure has been recorded,
report that much as if it were an incoming ERROR message, and then
shut down the connection.
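
In sketch form, the write side might look about like this (the
write_failed/write_err_msg fields and the exact call site are invented
for illustration, this is not a worked-out patch):

static int
pqsecure_raw_write_sketch(PGconn *conn, const void *ptr, size_t len)
{
	ssize_t		n = send(conn->sock, ptr, len, 0);

	if (n < 0 && SOCK_ERRNO != EINTR && SOCK_ERRNO != EWOULDBLOCK)
	{
		char		sebuf[256];

		if (!conn->write_failed)
		{
			/* remember the first failure, but don't report it yet */
			conn->write_failed = true;
			conn->write_err_msg = strdup(SOCK_STRERROR(SOCK_ERRNO,
													   sebuf, sizeof(sebuf)));
		}
		/* pretend the bytes went out, so callers proceed to read */
		return (int) len;
	}
	return (int) n;
}

with the read side emitting conn->write_err_msg as if it had arrived as
an ERROR message once pqReadData() finds nothing more to consume.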

This would essentially give us pqHandleSendFailure-like behavior
across the board, which might be enough to fix this problem as well as
bug #15598.  Or not ... as you say, we haven't thoroughly understood
either issue, so it's possible this wouldn't do it.

This would have the effect of prioritizing reports of socket read
failures over those of write failures, which is an interesting
behavioral change, but offhand I can't see a problem with it.

One thing that isn't real clear to me is how much timing sensitivity
there is in "when no more input data is available".  Can we assume that
if we've gotten ECONNRESET or an allied error from a write, then any
data the far end might've wished to send us is already available to
read?  In principle, since TCP allows shutting down either transmission
direction independently, the server could close the read side of its
socket while continuing to transmit data.  In practice, PG servers
don't do that; but could network timing glitches create the same effect?
Even if it's possible, does it happen enough to worry about?

The reason I'm concerned is that I don't think it'd be bright to ignore a
send error until we see input EOF, which'd be the obvious way to solve a
timing problem if there is one.  If our send error was some transient
glitch and the connection is really still open, then we might not get EOF,
but we won't get a server reply either because no message went to the
server.  You could imagine waiting some small interval of time before
deciding that it's time to report the write failure, but ugh.

In any case, consuming immediately-available data before reporting the
write error would already be a step forward over what we're doing now,
I think.

Thoughts?

regards, tom lane



Patch to document base64 encoding

2019-03-04 Thread Karl O. Pinc
Hi,

Doc patch, against master.  Documents encode() and decode() base64
format.

Builds for me.

Attached: doc_base64_v1.patch

References RFC2045 section 6.8 to define base64.

Because encode() and decode() show up in both the string
functions section and the binary string functions section,
I documented it only in the string functions section and hyperlinked
"base64" in both sections to the new text.


Note that XML output can also generate base64 data.  I suspect
this is done via the (different, src/common/base64.c)
pg_b64_encode() function which does not limit line length.  
In any case this patch does not touch the XML documentation.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..f6b4b2c5a0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+ base64
+
 decode(string text,
 format text)

@@ -1769,13 +1772,16 @@
 
  encode
 
+
+ base64
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
+formats are: base64, hex, escape.
 escape converts zero bytes and high-bit-set bytes to
 octal sequences (\nnn) and
 doubles backslashes.
@@ -2365,6 +2371,20 @@
 format treats a NULL as a zero-element array.

 
+   
+ base64
+   
+
+   
+ The base64 encoding of the encode
+ and decode functions is that of RFC2045 section 6.8.
+ As per the RFC, encoded lines are broken at 76 characters.  However
+ instead of the MIME CRLF end-of-line marker, only a newline is used for
+ end-of-line.  The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is raised
+ when decode is supplied invalid base64 data.
+   
+

See also the aggregate function string_agg in
.
@@ -3577,13 +3597,16 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
+   formats are: base64, hex, escape.
escape converts zero bytes and high-bit-set bytes to
octal sequences (\nnn) and
doubles backslashes.


Re: Remove Deprecated Exclusive Backup Mode

2019-03-04 Thread Bruce Momjian
On Thu, Feb 28, 2019 at 05:08:24PM +0200, David Steele wrote:
> On 2/28/19 4:44 PM, Fujii Masao wrote:
> > On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe  
> > wrote:
> > > 
> > > Fujii Masao wrote:
> > > > So, let me clarify the situations;
> > > > 
> > > > (3) If backup_label.pending exists but recovery.signal doesn't, the 
> > > > server
> > > > ignores (or removes) backup_label.pending and does the recovery
> > > > starting from pg_control's REDO location. This case can happen if
> > > > the server crashes while an exclusive backup is in progress.
> > > > So crash-in-the-middle-of-backup doesn't prevent the recovery 
> > > > from
> > > > starting in this idea
> > > 
> > > That's where I see the problem with your idea.
> > > 
> > > It is fairly easy for someone to restore a backup and then just start
> > > the server without first creating "recovery.signal", and that would
> > > lead to data corruption.
> > 
> > Isn't this case problematic even when a backup was taken by pg_basebackup?
> > Because of the lack of recovery.signal, no archived WAL files are replayed and
> > the database may not reach the latest point.
> 
> It is problematic because we have a message encouraging users to delete
> backup_label when in fact they should be creating recovery.signal.

Should we issue a client NOTICE/WARNING message as part of
pg_start_backup to indicate what to do if the server crashes before
pg_stop_backup()?
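
Something along these lines, perhaps (the wording is invented here,
just to illustrate the idea):

	ereport(NOTICE,
			(errmsg("exclusive backup started"),
			 errhint("If the server crashes before pg_stop_backup() is called, "
					 "remove backup_label before restarting this server; when "
					 "restoring from the backup itself, keep backup_label and "
					 "create recovery.signal instead.")));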

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: pg_partition_tree crashes for a non-defined relation

2019-03-04 Thread Michael Paquier
On Mon, Mar 04, 2019 at 03:56:00PM -0300, Alvaro Herrera wrote:
> I added tests to immortalize the current behavior for legacy inheritance
> relations too.  Thanks!

Makes sense, thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Tom Lane
Paul Ramsey  writes:
> Gotcha, done and now have an implementation that passes all our regression 
> tests.

Very cool!  So the next step, I guess, is to address your original problem
by cranking up the cost estimates for these functions --- have you tried
that yet?  In principle you should be able to do that and not have any
bad planning side-effects, but this is all pretty new territory so maybe
some problems remain to be ironed out.

BTW, if you'd like me to review the code you added for this, I'd be happy
to do so.  I've never looked at PostGIS' innards, but probably I can make
sense of the code for this despite that.

regards, tom lane



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Paul Ramsey


> On Mar 4, 2019, at 2:52 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> Gotcha, done and now have an implementation that passes all our regression 
>> tests.
> 
> Very cool!  So the next step, I guess, is to address your original problem
> by cranking up the cost estimates for these functions --- have you tried
> that yet?  In principle you should be able to do that and not have any
> bad planning side-effects, but this is all pretty new territory so maybe
> some problems remain to be ironed out.
> 
> BTW, if you'd like me to review the code you added for this, I'd be happy
> to do so.  I've never looked at PostGIS' innards, but probably I can make
> sense of the code for this despite that.

I would be ecstatic for a review; I’m sure I’ve left a million loose threads
dangling.

P.

https://github.com/pramsey/postgis/blob/svn-trunk-supportfn/postgis/gserialized_supportfn.c#L191
 


https://github.com/pramsey/postgis/blob/svn-trunk-supportfn/postgis/postgis.sql.in#L4290
 





Re: Rare SSL failures on eelpout

2019-03-04 Thread Thomas Munro
On Tue, Mar 5, 2019 at 11:35 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > OK, here's something.  I can reproduce it quite easily on this
> > machine, and I can fix it like this:
>
> >libpq_gettext("could not send startup packet: %s\n"),
> >SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
> > free(startpacket);
> > +   pqHandleSendFailure(conn);
> > goto error_return;
> > }
>
> Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
> constraints on what state libpq has to be in, as a consequence of which
> it's called from a bunch of ad-hoc places, and can't be called from
> some others.  It's kind of accidental that it will work here.
>
> I was toying with the idea of getting rid of that function in
> favor of a design like this:
>
> * Handle any socket-write failure at some low level of libpq by
> recording the fact that the error occurred (plus whatever data we
> need to produce a message about it), and then starting to just
> discard output data.
>
> * Eventually, we'll try to read.  Process any available input data
> as usual (and, if we get a read error, report that).  When no more
> input data is available, if a socket write failure has been recorded,
> report that much as if it were an incoming ERROR message, and then
> shut down the connection.
>
> This would essentially give us pqHandleSendFailure-like behavior
> across the board, which might be enough to fix this problem as well as
> bug #15598.  Or not ... as you say, we haven't thoroughly understood
> either issue, so it's possible this wouldn't do it.
>
> This would have the effect of prioritizing reports of socket read
> failures over those of write failures, which is an interesting
> behavioral change, but offhand I can't see a problem with it.

That seems to recreate (and extend) the behaviour that we see on
other TCP stacks, or on Linux with a remote connection, namely that
the first sendto() succeeds (even though future sendto() calls may
fail, we don't usually get to that because we read and discover an
application-level error message or whatever).

> One thing that isn't real clear to me is how much timing sensitivity
> there is in "when no more input data is available".  Can we assume that
> if we've gotten ECONNRESET or an allied error from a write, then any
> data the far end might've wished to send us is already available to
> read?  In principle, since TCP allows shutting down either transmission
> direction independently, the server could close the read side of its
> socket while continuing to transmit data.  In practice, PG servers
> don't do that; but could network timing glitches create the same effect?
> Even if it's possible, does it happen enough to worry about?

That is beyond my knowledge of networking; I have CC'd Andrew Gierth
in case he has something to say about that.  It certainly seems
sensible to assume that eg RST must follow any data that the other end
sent before closing, if it did indeed call close(), and in our case we
know that it never calls shutdown(SHUT_RD), so the only other
possibility seems to be that it crashed or the connection was lost.
So it seems safe to assume that we can read the server's parting words
after we've had an EPIPE/ECONNRESET error.  Based on some quick
googling, it looks like SHUT_RD doesn't actually send anything anyway
(unlike SHUT_WR which sends FIN and would result in EOF at the other
end), so I'm not sure if SHUT_RD even "exists" outside the mind of the
TCP stack doing it and therefore I'm not sure if there is any "timing
glitch" that could resemble it.  But I don't know.

> The reason I'm concerned is that I don't think it'd be bright to ignore a
> send error until we see input EOF, which'd be the obvious way to solve a
> timing problem if there is one.  If our send error was some transient
> glitch and the connection is really still open, then we might not get EOF,
> but we won't get a server reply either because no message went to the
> server.  You could imagine waiting some small interval of time before
> deciding that it's time to report the write failure, but ugh.
>
> In any case, consuming immediately-available data before reporting the
> write error would already be a step forward over what we're doing now,
> I think.
>
> Thoughts?

It seems reasonable to me on the grounds that sending data doesn't
mean the other guy got it anyway and we can see that on other TCP
stacks.  Preferring errors resulting from reads seems sensible because
it's exactly what we want in these cases, and I can't immediately
think of a downside.  It's bigger surgery than I was thinking of but
it seems like it removes some pre-existing complicated code and
replaces it with something simple, so that seems to be a win.

-- 
Thomas Munro
https://enterprisedb.com



Re: New vacuum option to do only freezing

2019-03-04 Thread Bossart, Nathan
On 2/28/19, 12:13 AM, "Masahiko Sawada"  wrote:
> Attached is the updated version of the patch. I've incorporated all review
> comments I got and have changed the number of tuples being reported as
> 'removed tuples'. With this option, the only tuples that are completely
> removed are those marked as unused during HOT pruning; other dead tuples are
> left. So we count those tuples during HOT pruning and report them as
> removed tuples.

Thanks for the new patches.  Beyond the reloptions discussion, most of
my feedback is wording suggestions.

+  VACUUM removes dead tuples and prunes HOT-updated
+  tuples chain for live tuples on table. If the table has any dead tuple
+  it removes them from both table and indexes for re-use. With this
+  option VACUUM doesn't completely remove dead tuples
+  and disables removing dead tuples from indexes.  This is suitable for
+  avoiding transaction ID wraparound (see
+  ) but not sufficient for avoiding
+  index bloat. This option is ignored if the table doesn't have index.
+  This cannot be used in conjunction with FULL
+  option.

There are a couple of small changes I would make.  How does something
like this sound?

VACUUM removes dead tuples and prunes HOT-updated tuple chains for
live tuples on the table.  If the table has any dead tuples, it
removes them from both the table and its indexes and marks the
corresponding line pointers as available for re-use.  With this
option, VACUUM still removes dead tuples from the table, but it
does not process any indexes, and the line pointers are marked as
dead instead of available for re-use.  This is suitable for
avoiding transaction ID wraparound (see Section 24.1.5) but not
sufficient for avoiding index bloat.  This option is ignored if
the table does not have any indexes.  This cannot be used in
conjunction with the FULL option.

- * Returns the number of tuples deleted from the page and sets
- * latestRemovedXid.
+ * Returns the number of tuples deleted from the page and set latestRemoveXid
+ * and increment nunused.

I would say something like: "Returns the number of tuples deleted from
the page, sets latestRemovedXid, and updates nunused."

+   /*
+* hasindex = true means two-pass strategy; false means one-pass. But we
+* always use the one-pass strategy when index vacuum is disabled.
+*/

I think the added sentence should make it more clear that hasindex
will still be true when DISABLE_INDEX_CLEANUP is used.  Maybe
something like:

/*
 * hasindex = true means two-pass strategy; false means one-pass
 *
 * If DISABLE_INDEX_CLEANUP is used, hasindex may still be true,
 * but we'll always use the one-pass strategy.
 */

tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
-   
 &vacrelstats->latestRemovedXid);
+   
 &vacrelstats->latestRemovedXid,
+   
 &tups_pruned);

Why do we need a separate tups_pruned argument in heap_page_prune()?
Could we add the result of heap_page_prune() to tups_pruned instead,
then report the total number of removed tuples as tups_vacuumed +
tups_pruned elsewhere?

+* If there are no indexes or we skip index vacuum then we can 
vacuum
+* the page right now instead of doing a second scan.

How about:

If there are no indexes or index cleanup is disabled, we can
vacuum the page right now instead of doing a second scan.

+   /*
+* Here, we have indexes but index vacuum is 
disabled. We don't
+* vacuum dead tuples on heap but forget them 
as we skip index
+* vacuum. The vacrelstats->dead_tuples could 
have tuples which
+* became dead after checked at HOT-pruning 
time but aren't marked
+* as dead yet. We don't process them because 
it's a very rare
+* condition and the next vacuum will process 
them.
+*/

I would suggest a few small changes:

/*
 * Here, we have indexes but index vacuum is disabled.  Instead of
 * vacuuming the dead tuples on the heap, we just forget them.
 *
 * Note that vacrelstats->dead_tuples could include tuples which
 * became dead after HOT-pruning but are not marked dead yet.  We
 * do not process them because this is a very rare condition, and
 * the next vacuum will process them anyway.
 */

-   /* If no indexes, make log report that lazy_vacuum_heap would've made */
+   /*
+* If no index or disables index vacuum, make log report that lazy_vacuum_heap
+* would

Re: jsonpath

2019-03-04 Thread Tomas Vondra
A bunch of additional comments, after looking at the patch a bit today.
All are mostly minor, and sometime perhaps a matter of preference.


1) There's a mismatch between the comment and actual function name for
jsonb_path_match_opr and jsonb_path_exists_opr(). The comments say
"_novars" instead.

2) In a couple of switches the "default" case does a return with a
value after the elog(ERROR), so it's practically unreachable. AFAICS
it's fine without it, and we don't do this elsewhere. I don't get any
compiler warnings if I remove it, either.

Examples:

JsonbTypeName

default:
elog(ERROR, "unrecognized jsonb value type: %d", jbv->type);
return "unknown";

jspOperationName

default:
elog(ERROR, "unrecognized jsonpath item type: %d", type);
return NULL;

compareItems

default:
elog(ERROR, "unrecognized jsonpath operation: %d", op);
return jpbUnknown;

3) jsonpath_send is using makeStringInfo() for a value that is not
returned - IMHO it should use a regular stack-allocated variable and
initStringInfo() instead
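
Something like this is what I have in mind (untested sketch; the
variable name is only for illustration):

    StringInfoData jtext;       /* plain local, no palloc'd header */

    initStringInfo(&jtext);     /* only the data buffer is palloc'd */
    /* ... build the value in jtext, then use jtext.data / jtext.len ... */
    pfree(jtext.data);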

4) the version number should be defined/used as a constant, not as a
magic constant somewhere in the code

5) Why does jsonPathToCstring do this?

appendBinaryStringInfo(out, "strict ", 7);

Why not use the regular appendStringInfoString()? What am I missing?
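
(For the record, the two should be equivalent, since
appendStringInfoString() just does the strlen() itself:

    appendStringInfoString(out, "strict ");

and it avoids the hard-coded length.)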

6) comment typo: "Aling StringInfo"

7) alignStringInfoInt() should explain why we need this and why INTALIGN
is the right alignment.

8) I'm a bit puzzled by what flattenJsonPathParseItem does with the
'next' value here:

/*
 * Actual value will be recorded later, after next and children
 * processing.
 */
appendBinaryStringInfo(buf,
   (char *) &next, /* fake value */
   sizeof(next));

Perhaps a comment explaining it (why we need a fake value at all?) would
be a good idea here.

9) I see printJsonPathItem is only calling check_stack_depth() while
flattenJsonPathParseItem also calls CHECK_FOR_INTERRUPTS(). Why the
difference, considering they seem about equally expensive?

10) executeNumericItemMethod is missing a comment (unlike the other
executeXXX functions)

11) Wording of some of the error messages in the execute methods seems a
bit odd. For example executeNumericItemMethod may complain that it

... is applied to not a numeric value

but perhaps a more natural wording would be

... is applied to a non-numeric value

And similarly for the other execute methods. But I'm not a native
speaker, so perhaps the original wording is just fine.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: converting Lists into arrays

2019-03-04 Thread David Rowley
On Tue, 5 Mar 2019 at 11:11, Andres Freund  wrote:
> On 2019-03-04 16:28:40 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > I don't buy this. I think e.g. redisgning the way we represent
> > > targetlists would be good (it's e.g. insane that we recompute
> > > descriptors out of them all the time), and would reduce their allocator
> > > costs.
> >
> > Maybe we're not on the same page here, but it seems to me that that'd be
> > addressable with pretty localized changes (eg, adding more fields to
> > TargetEntry, or keeping a pre-instantiated output tupdesc in each Plan
> > node).  But if the concern is about the amount of palloc bandwidth going
> > into List cells, we're not going to be able to improve that with localized
> > data structure changes; it'll take something like the patch I've proposed.
>
> What I'm saying is that it'd be reasonable to replace the use of list
> for targetlists with 'list2' without a wholesale replacement of all the
> list code, and it'd give us benefits.

So you think targetlists are the only case to benefit from an array
based list? (Ignoring the fact that I already showed another case)
When we discover the next thing to benefit, then the replacement will
be piecemeal, just the way Tom would rather not do it.  I personally
don't want to be up against huge resistance when I discover that
turning a single usage of a List into List2 is better.  We'll need to
consider backpatching pain / API breakage *every single time*.

A while ago I did have a go at changing some List implementations for
my then proposed ArrayList and it was beyond a nightmare, as each time
I changed one I realised I needed to change another. In the end, I
just gave up. Think of all the places we have forboth() and
forthree(), we'll need to either provide a set of macros that take
various combinations of List and List2 or do some conversion
beforehand.  With respect, if you don't believe me, please take my
ArrayList patch [1] and have a go at changing targetlists to use
ArrayLists all the way from the parser through to the executor. I'll
be interested in the diff stat once you're done.

It's true that linked lists are certainly better for some stuff;
list_concat() is going to get slower, lcons() too, but likely we can
have a bonus lcons() elimination round at some point. I see quite a
few of them that look like they could be changed to lappend().  I also
just feel that if we insist on more here then we'll get about nothing.
I'm also blocked on my partition performance improvement goals on
list_nth() being O(N), so I'm keen to see progress here and do what I
can to help with that.  With list_concat() I find that pretty scary
anyway. Using it means we can have a valid list that does not get its
length updated when someone appends a new item. Most users of that do
list_copy() to sidestep that and other issues... which likely is
something we'd want to rip out with Tom's patch.
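
(For context on the list_nth() point: with the current cell-chain
representation a lookup has to walk the cells one by one, roughly like
the sketch below, whereas with an array-based list it would be a single
index into the element array.)

    /* roughly what today's list_nth() boils down to (error checks omitted) */
    static void *
    list_nth_slow(const List *list, int n)
    {
        ListCell   *cell = list->head;

        while (n-- > 0)
            cell = cell->next;  /* one pointer hop per skipped element */
        return lfirst(cell);
    }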

[1] 
https://www.postgresql.org/message-id/CAKJS1f_2SnXhPVa6eWjzy2O9A=ocwgd0cj-lqewpgtrwqbu...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Inheriting table AMs for partitioned tables

2019-03-04 Thread Andres Freund
Hi,

In the pluggable storage patch [1], one thing that I'm wondering about
is how exactly to inherit the storage AM across partitions. I think
that's potentially worthy of a discussion with a wider audience than I'd
get in that thread.  It seems also related to the recent discussion in [2]

Consider (excerpted from the tests):

CREATE TABLE tableam_parted_heap2 (a text, b int) PARTITION BY list (a) USING 
heap2;

SET default_table_access_method = 'heap';
CREATE TABLE tableam_parted_a_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('a');

SET default_table_access_method = 'heap2';
CREATE TABLE tableam_parted_b_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('b');

CREATE TABLE tableam_parted_c_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('c') USING heap;
CREATE TABLE tableam_parted_d_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('d') USING heap2;

It seems pretty clear that tableam_parted_heap2, tableam_parted_d_heap2
would be stored via heap2, and tableam_parted_c_heap2 via heap.

But for tableam_parted_a_heap2 tableam_parted_b_heap2 the answer isn't
quite as clear.  I think it'd both be sensible for new partitions to
inherit the AM from the root, but it'd also be sensible to use the
current default.

Out of laziness (it's how it works rn) I'm inclined to go with using
the current default, but I'd be curious if others disagree.

Greetings,

Andres Freund

[1] https://postgr.es/m/20180703070645.wchpu5muyto5n647%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/201902041630.gpadougzab7v%40alvherre.pgsql



Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-03-04 Thread Bruce Momjian
On Thu, Feb 28, 2019 at 10:28:44AM +0300, Sergei Kornilov wrote:
> Hello
> 
> postgresql.conf.sample was changed recently in REL_10_STABLE (commit 
> ab1d9f066aee4f9b81abde6136771debe0191ae8)
> So config will be changed in next minor release anyway. We have another 
> reason to not fix bgwriter_lru_maxpages comment?

Frankly, I was surprised postgresql.conf.sample was changed in a back
branch since it will cause people who diff $PGDATA/postgresql.conf with
share/postgresql.conf.sample to see differences they didn't make.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-05 12:42:47 +1300, David Rowley wrote:
> So you think targetlists are the only case to benefit from an array
> based list? (Ignoring the fact that I already showed another case)

No, that's not what I'm trying to say at all. I think there are plenty of
cases where it'd be beneficial. In this subthread we're just arguing
whether it's somewhat feasible to not change everything, and I'm still
fairly convinced that's possible; but I'm not arguing that that's the
best way.


> It's true that linked lists are certainly better for some stuff;
> list_concat() is going to get slower, lcons() too, but likely we can
> have a bonus lcons() elimination round at some point. I see quite a
> few of them that look like they could be changed to lappend().  I also
> just feel that if we insist on more here then we'll get about nothing.
> I'm also blocked on my partition performance improvement goals on
> list_nth() being O(N), so I'm keen to see progress here and do what I
> can to help with that.  With list_concat() I find that pretty scary
> anyway. Using it means we can have a valid list that does not get its
> length updated when someone appends a new item. Most users of that do
> list_copy() to sidestep that and other issues... which likely is
> something we'd want to rip out with Tom's patch.

Yes, I think you have a point that progress here would be good and that
it's worth some pain. But the names will make even less sense if we just
shunt in an array based approach under the already obscure list
API. Obviously the individual pain of that is fairly small, but over the
years and everybody reading PG code, it's also substantial.  So I'm
torn.

Greetings,

Andres Freund



RE: pgbench - doCustom cleanup

2019-03-04 Thread Jamison, Kirk
Hi Fabien, 

>> See the attached latest patch.
>> The attached patch applies, builds cleanly, and passes the regression 
>> tests.
>
> No problems on my part as I find the changes logical.
> This also needs a confirmation from Alvaro.
>
> Ok.
>
> You switched the patch to "waiting on author": What are you waiting from me?
>
> If you think that it is ok and that it should be considered by a committer,
> probably Alvaro, it can be marked as "ready".

Indeed, it was my mistake.
I have already marked it as "Ready for Committer".

Regards,
Kirk Jamison




Re: pg_dump multi VALUES INSERT

2019-03-04 Thread David Rowley
Thanks for looking at this again.

On Sat, 2 Mar 2019 at 20:01, Fabien COELHO  wrote:
> Although I'm all in favor of checking the int associated to the option, I
> do not think that it warrants three checks and messages. I would suggest
> to factor them all as just one check and one (terse) message.

Yeah. I've been trying to keep that area sane for a while, so I agree
that one message is fine. Done that way in the attached and put back
the missing ERANGE check.

> Option "--help" line: number of row*s* ?

Oops. Fixed.

> About the output: I'd suggest to indent one line per row, something like:
>
>INSERT INTO foo VALUES
>  (..., ..., ..., ...),
>  (..., ..., ..., ...),
>  (..., ..., ..., ...);

Reasonable. Changed it to that. I put a tab at the start of those
lines.  There's still the possibility that only 1 final row makes up
the final INSERT.  Such INSERTs will still span multiple lines. I don't think
there's anything that can reasonably be done about that.

> I'd suggest to add test tables with (1) no rows and (2) no columns but a
> few rows, with multiple --table options.

I didn't do that, partly because I think you're asking for tests of
existing behaviour, and partly because perl gives me a sore head.
Maybe Surafel want to do that?

v17 attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_dump-rows-per-insert-option-v17.patch
Description: Binary data


GSoC 2019 - TOAST'ing in slices idea

2019-03-04 Thread Bruno Hass

Hello everyone,

I am currently writing my proposal for GSoC 2019 for the TOAST'ing in slices 
idea.
 I already have a sketch of the description and approach outline, which I am 
sending in this e-mail. I would be happy to receive some feedback on it. I've 
been reading the PostgreSQL source code and documentation in order to 
understand better how to implement this idea and to write a detailed proposal. 
I would also be glad to receive recommendations of documentation or
material on TOAST'ing internals, as well as hints on where to look further
in the source code.


I am looking forward to your feedback!


PostgreSQL Proposal.pdf
Description: PostgreSQL Proposal.pdf


Re: Should we increase the default vacuum_cost_limit?

2019-03-04 Thread Peter Geoghegan
On Mon, Feb 25, 2019 at 8:48 AM Robert Haas  wrote:
> +1 for raising the default substantially.  In my experience, and it
> seems others are in a similar place, nobody ever gets into trouble
> because the default is too high, but sometimes people get in trouble
> because the default is too low.

Does anyone want to make an argument against the idea of raising the
default? They should speak up now.

-- 
Peter Geoghegan



Re: POC: converting Lists into arrays

2019-03-04 Thread David Rowley
On Tue, 5 Mar 2019 at 12:54, Andres Freund  wrote:
> Yes, I think you have a point that progress here would be good and that
> it's worth some pain. But the names will make even less sense if we just
> shunt in an array based approach under the already obscure list
> API.

If we feel strongly about fixing that then probably it would be as
simple as renaming the functions and adding some macros with the old
names and insisting that all new or changed code use the functions and
not the macro wrappers. That could be followed up by a final sweep in
N years time when the numbers have dwindled to a low enough level. All
that code mustn't be getting modified anyway, so not much chance
backpatching pain.

I see length() finally died in a similar way in Tom's patch.  Perhaps
doing this would have people consider lcons more carefully before they
use it over lappend.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Tom Lane
Paul Ramsey  writes:
>> On Mar 4, 2019, at 2:52 PM, Tom Lane  wrote:
>> BTW, if you'd like me to review the code you added for this, I'd be happy
>> to do so.  I've never looked at PostGIS' innards, but probably I can make
>> sense of the code for this despite that.

> I would be ecstatic for a review, I'm sure I've left a million loose threads 
> dangling.

I took a look, and saw that you'd neglected to check pseudoconstantness
of the non-index argument, so this'd fail on cases like ST_DWithin(x, y)
where x is indexed and y is another column in the same table.  Also
I thought the handling of commutation could be done better.  Attached is
a suggested patch atop your f731c1b7022381dbf627cae311c3d37791bf40c3 to
fix those and a couple of nitpicky other things.  (I haven't tested this,
mind you.)

One thing that makes me itch, but I didn't address in the attached,
is that expandFunctionOid() is looking up a function by name without
any schema-qualification.  That'll fail outright if PostGIS isn't in
the search path, and even if it is, you've got security issues there.
One way to address this is to assume that the expandfn is in the same
schema as the ST_XXX function you're attached to, so you could
do "get_namespace_name(get_func_namespace(funcid))" and then include
that in the list passed to LookupFuncName.
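
Concretely, something along these lines (untested, and I'm guessing at
the expand function's name and argument types, so treat those as
placeholders):

    /* look the expand function up in the ST_XXX function's own schema */
    char       *nspname = get_namespace_name(get_func_namespace(clause->funcid));
    List       *fname = list_make2(makeString(nspname),
                                   makeString(pstrdup("st_expand")));
    Oid         argtypes[2] = {rightdatatype, FLOAT8OID};
    Oid         expandfn_oid = LookupFuncName(fname, 2, argtypes, false);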

Also, this might be as-intended but I was wondering: I'd sort of expected
you to make, eg, _ST_DWithin() and ST_DWithin() into exact synonyms.
They aren't, since the former is not connected to the support function.
Is that intentional?  I guess if you had a situation where you wanted to
force non-use of an index, being able to use _ST_DWithin() for that would
be helpful.

regards, tom lane

diff --git a/postgis/gserialized_supportfn.c b/postgis/gserialized_supportfn.c
index 90c61f3..66b699b 100644
--- a/postgis/gserialized_supportfn.c
+++ b/postgis/gserialized_supportfn.c
@@ -60,7 +60,7 @@ Datum postgis_index_supportfn(PG_FUNCTION_ARGS);
 */
 typedef struct
 {
-	char *fn_name;
+	const char *fn_name;
 	int   strategy_number; /* Index strategy to add */
 	int   nargs;   /* Expected number of function arguments */
 	int   expand_arg;  /* Radius argument for "within distance" search */
@@ -230,22 +230,42 @@ Datum postgis_index_supportfn(PG_FUNCTION_ARGS)
 }
 
 /*
-* Somehow an indexed third argument has slipped in. That
-* should not happen.
+* We can only do something with index matches on the first or
+* second argument.
 */
 if (req->indexarg > 1)
 	PG_RETURN_POINTER((Node *)NULL);
 
 /*
-* Need the argument types (which should always be geometry or geography
-* since this function is only bound to those functions)
-* to use in the operator function lookup
+* Make sure we have enough arguments (just paranoia really).
 */
-if (nargs < 2)
+if (nargs < 2 || nargs < idxfn.expand_arg)
 	elog(ERROR, "%s: associated with function with %d arguments", __func__, nargs);
 
-leftarg = linitial(clause->args);
-rightarg = lsecond(clause->args);
+/*
+* Extract "leftarg" as the arg matching the index, and
+* "rightarg" as the other one, even if they were in the
+* opposite order in the call.  NOTE: the functions we deal
+* with here treat their first two arguments symmetrically
+* enough that we needn't distinguish the two cases beyond
+* this.  This might need more work later.
+*/
+if (req->indexarg == 0)
+{
+	leftarg = linitial(clause->args);
+	rightarg = lsecond(clause->args);
+}
+else
+{
+	rightarg = linitial(clause->args);
+	leftarg = lsecond(clause->args);
+}
+
+/*
+* Need the argument types (which should always be geometry or geography
+* since this function is only bound to those functions)
+* to use in the operator function lookup.
+*/
 leftdatatype = exprType(leftarg);
 rightdatatype = exprType(rightarg);
 
@@ -267,20 +287,27 @@ Datum postgis_index_supportfn(PG_FUNCTION_ARGS)
 */
 if (idxfn.expand_arg)
 {
-	Node *indexarg = req->indexarg ? rightarg : leftarg;
-	Node *otherarg = req->indexarg ? leftarg : rightarg;
 	Node *radiusarg = (Node *) list_nth(clause->args, idxfn.expand_arg-1);
-	// Oid indexdatatype = exprType(indexarg);
-	Oid otherdatatype = exprType(otherarg);
-	Oid expandfn_oid = expandFunctionOid(otherdatatype);
+	Oid expandfn_oid = expandFunctionOid(rightdatatype);
+
+	FuncExpr *expandexpr = makeFuncExpr(expandfn_oid, rightdatatype,
+	list_make2(rightarg, radiusarg),
+		InvalidOid, InvalidOid, COERCE_EXPLICIT_CALL);
 
-	FuncExpr *expandexpr = makeFuncExpr(expandfn_oid, otherdatatype,
-	list_make2(otherarg, radiusarg),
-		InvalidOid, req->indexcollation, COERCE_EXPLICIT_CALL);
+	/*
+	* The comparison expression has to be a pseudoconstant,
+	* ie not volatile nor dependent on the target index's
+	* tabl

Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
David Rowley  writes:
> ...  With list_concat() I find that pretty scary
> anyway. Using it means we can have a valid list that does not get its
> length updated when someone appends a new item. Most users of that do
> list_copy() to sidestep that and other issues... which likely is
> something we'd want to rip out with Tom's patch.

Yeah, it's a bit OT for this patch, but I'd noticed the prevalence of
locutions like list_concat(list_copy(list1), list2), and been thinking
of proposing that we add some new primitives with, er, less ad-hoc
behavior.  The patch at hand already changes the semantics of list_concat
in a somewhat saner direction, but I think there is room for a version
of list_concat that treats both its inputs as const Lists.
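
For instance, something like this (the name is made up, and it only
handles pointer lists, just to illustrate the semantics I'm after):

    /*
     * Form a new list holding all members of list1 followed by all
     * members of list2, without modifying either input.
     */
    List *
    list_concat_copy(const List *list1, const List *list2)
    {
        List       *result = list_copy(list1);
        ListCell   *lc;

        foreach(lc, list2)
            result = lappend(result, lfirst(lc));

        return result;
    }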

regards, tom lane



Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
David Rowley  writes:
> On Tue, 5 Mar 2019 at 12:54, Andres Freund  wrote:
>> Yes, I think you have a point that progress here would be good and that
>> it's worth some pain. But the names will make even less sense if we just
>> shunt in an array based approach under the already obscure list
>> API.

> If we feel strongly about fixing that then probably it would be as
> simple as renaming the functions and adding some macros with the old
> names and insisting that all new or changed code use the functions and
> not the macro wrappers.

Meh ... Neil Conway already did a round of that back in 2004 or whenever,
and I'm not especially excited about another round.  I'm not really
following Andres' aversion to the list API --- it's not any more obscure
than a whole lot of things in Postgres.  (Admittedly, as somebody who
dabbled in Lisp decades ago, I might be more used to it than some folks.)

regards, tom lane



Re: speeding up planning with partitions

2019-03-04 Thread Amit Langote
Hi Jesper,

Thanks for the review.  I've made all of the changes per your comments in
my local repository.  I'll post the updated patches after diagnosing what
I suspect is a memory over-allocation bug in one of the patches.  If you
configure the build with --enable-cassert, you'll see that throughput
decreases as the number of partitions runs into many thousands, but it
doesn't when asserts are turned off.

On 2019/03/05 1:20, Jesper Pedersen wrote:
> I'll run some tests using a hash partitioned setup.

Thanks.

Regards,
Amit




Re: Ordered Partitioned Table Scans

2019-03-04 Thread David Rowley
Thanks a lot for taking the time to look at this.

On Tue, 5 Mar 2019 at 03:03, Antonin Houska  wrote:
> As for the latest version (v8-0001-...) I've only caught a small typo: "When
> the first subpath needs sorted ...". It was probably meant "... needs sort
> ...".

That was a sort of short way of saying "needs [to be] sorted". I've
added in the missing "to be" in the attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch
Description: Binary data


Re: psql display of foreign keys

2019-03-04 Thread Amit Langote
On 2019/03/05 4:41, Alvaro Herrera wrote:
> Here's the patch I'm really interested about :-)

Thanks for the updated patch.  I applied it and rebased the
foreign-keys-referencing-partitioned-tables patch on top.  Here's
something I think you may have missed:

-- partitioned primary key table
create table p (a int primary key) partition by list (a);
create table p1 partition of p for values in (1) partition by list (a);
create table p11 partition of p1 for values in (1);

-- regular primary key table
create table pk (a int primary key);

-- another partitioned table to define FK on
create table q (a int) partition by list (a);
create table q1 partition of q for values in (1) partition by list (a);
create table q11 partition of q1 for values in (1);

-- FK on q referencing p
alter table q add foreign key (a) references p;

-- seems OK

\d p
   Partitioned table "public.p"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Partition key: LIST (a)
Indexes:
"p_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES p(a)
Number of partitions: 1 (Use \d+ to list them.)

\d p1
   Partitioned table "public.p1"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Partition of: p FOR VALUES IN (1)
Partition key: LIST (a)
Indexes:
"p1_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES p(a)
Number of partitions: 1 (Use \d+ to list them.)

\d p11
Table "public.p11"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Partition of: p1 FOR VALUES IN (1)
Indexes:
"p11_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES p(a)

-- change the FK to reference regular table
alter table q drop constraint q_a_fkey ;
alter table q add foreign key (a) references pk;

-- not OK?
\d pk
 Table "public.pk"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Indexes:
"pk_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)
TABLE "q1" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)
TABLE "q11" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)

Shouldn't the above only list the constraint on q as follows?

Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)


Maybe:

@@ -2488,7 +2488,8 @@ describeOneTableDetails(const char *schemaname,
                   "SELECT conname, conrelid::pg_catalog.regclass,\n"
                   "       pg_catalog.pg_get_constraintdef(c.oid, true) as condef\n"
                   "FROM pg_catalog.pg_constraint c\n"
-                  "WHERE c.confrelid = '%s' AND c.contype = 'f' ORDER BY 1;",
+                  "WHERE c.confrelid = '%s' AND c.contype = 'f' AND conparentid = 0\n"
+                  "ORDER BY conname;",

Thanks,
Amit




Re: Inheriting table AMs for partitioned tables

2019-03-04 Thread Amit Kapila
On Tue, Mar 5, 2019 at 5:17 AM Andres Freund  wrote:
>
> Hi,
>
> In the pluggable storage patch [1], one thing that I'm wondering about
> is how exactly to inherit the storage AM across partitions. I think
> that's potentially worthy of a discussion with a wider audience than I'd
> get in that thread.  It seems also related to the recent discussion in [2]
>
> Consider (excerpted from the tests):
>
> CREATE TABLE tableam_parted_heap2 (a text, b int) PARTITION BY list (a) USING 
> heap2;
>
> SET default_table_access_method = 'heap';
> CREATE TABLE tableam_parted_a_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('a');
>
> SET default_table_access_method = 'heap2';
> CREATE TABLE tableam_parted_b_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('b');
>
> CREATE TABLE tableam_parted_c_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('c') USING heap;
> CREATE TABLE tableam_parted_d_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('d') USING heap2;
>
> It seems pretty clear that tableam_parted_heap2, tableam_parted_d_heap2
> would be stored via heap2, and tableam_parted_c_heap2 via heap.
>
> But for tableam_parted_a_heap2 tableam_parted_b_heap2 the answer isn't
> quite as clear.  I think it'd both be sensible for new partitions to
> inherit the AM from the root, but it'd also be sensible to use the
> current default.
>

Yeah, we can go either way.

> Out of laziness (it's how it works rn) I'm inclined to go with using
> the current default, but I'd be curious if others disagree.
>

I think using the current default should be okay as that will be the
behavior for non-partitioned tables as well.  However, if people have
good reasons to go other way, then that is fine too.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


