Re: speeding up planning with partitions

2019-03-30 Thread Amit Langote
On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu  wrote:
> On 2019/03/31 1:06, Amit Langote wrote:
>  > On Sun, Mar 31, 2019 at 12:11 AM Tom Lane  wrote:
>  >> I am curious as to why there seems to be more degradation
>  >> for hash cases, as per Yoshikazu-san's results in
>  >> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
>  >> but whatever's accounting for the difference probably
>  >> is not that.
>  >
>  > I suspected it may have been the lack of bitmapsets, but maybe only
>  > Imai-san could've confirmed that by applying the live_parts patch too.
>
> Yeah, I forgot to apply the live_parts patch. I did the same test again that
> I did for hash before.
> (BTW, thanks for committing speeding up patches!)

Thanks a lot for committing, Tom.  I wish you had listed yourself as
an author though.

I will send the patch for get_relation_constraints() mentioned
upthread tomorrow.

> [HEAD(428b260)]
> nparts  TPS
> ======  =====
> 2:  13134 (13240, 13290, 13071, 13172, 12896)
> 1024:   12627 (12489, 12635, 12716, 12732, 12562)
> 8192:   10289 (10216, 10265, 10171, 10278, 10514)
>
> [HEAD(428b260) + live_parts.diff]
> nparts  TPS
> ======  =====
> 2:  13277 (13112, 13290, 13241, 13360, 13382)
> 1024:   12821 (12930, 12849, 12909, 12700, 12716)
> 8192:   11102 (11134, 11158, 4, 10997, 11109)
>
>
> Degradations of performance are below.
>
>
> My test results from above (with live_parts, HEAD(428b260) +
> live_parts.diff)
> nparts   live_parts   HEAD
> ======   ==========   =====
> 2:       13277        13134
> 1024:    12821        12627
> 8192:    11102        10289
>
> 11102/13277 = 83.6 %
>
>
> Amit-san's test results (with live_parts)
>  > nparts   v38    HEAD
>  > ======   ====   ====
>  > 2        2971   2969
>  > 8        2980   1949
>  > 32       2955    733
>  > 128      2946    145
>  > 512      2924     11
>  > 1024     2986      3
>  > 4096     2702      0
>  > 8192     2531    OOM
>
> 2531/2971 = 85.2 %
>
>
> My test results I posted before (without live_parts)
>  > nparts   v38     HEAD
>  > ======   =====   =====
>  > 0:       10538   10487
>  > 2:        6942    7028
>  > 4:        7043    5645
>  > 8:        6981    3954
>  > 16:       6932    2440
>  > 32:       6897    1243
>  > 64:       6897     309
>  > 128:      6753     120
>  > 256:      6727      46
>  > 512:      6708      12
>  > 1024:     6063       3
>  > 2048:     5894       1
>  > 4096:     5374     OOM
>  > 8192:     4572     OOM
>
> 4572/6942 = 65.9 %
>
>
> Certainly, using a bitmapset contributes to performance when scanning
> one partition (or a few partitions) out of many partitions.
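
To illustrate what live_parts buys at high partition counts, the
iteration pattern changes roughly as below (an illustrative fragment
only, not the exact committed code; rel, i and the "per-partition work"
stand in for the surrounding planner code):

/* Before: every loop visits all nparts entries, including the ones
 * that pruning already turned into NULL / dummy part_rels. */
for (i = 0; i < rel->nparts; i++)
{
    RelOptInfo *child_rel = rel->part_rels[i];

    if (child_rel == NULL || IS_DUMMY_REL(child_rel))
        continue;
    /* ... do per-partition work ... */
}

/* After: live_parts is a Bitmapset holding only the surviving
 * partition indexes, so the loop cost is proportional to the number
 * of live partitions rather than the total partition count. */
i = -1;
while ((i = bms_next_member(rel->live_parts, i)) >= 0)
{
    RelOptInfo *child_rel = rel->part_rels[i];

    /* ... do per-partition work ... */
}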

Thanks Imai-san for testing.

Regards,
Amit




Re: [HACKERS] generated columns

2019-03-30 Thread Erik Rijkers

On 2019-01-16 22:40, Erik Rijkers wrote:


If you add a generated column to a file_fdw foreign table, it works OK
with VIRTUAL (the default), but with STORED it adds an empty column,
silently.  I would say it would make more sense to get an error.


VIRTUAL is gone, but that other issue is still there:  STORED in a 
file_fdw foreign table still silently creates the column which then 
turns out to be useless on SELECT, with an error like:


"ERROR:  column some_column_name is a generated column
DETAIL:  Generated columns cannot be used in COPY."

Maybe it'd be possible to get an error earlier, i.e., while trying to 
create such a useless column?



thanks,

Erik Rijkers










Re: Psql patch to show access methods info

2019-03-30 Thread s . cherkashin

Thanks for review.


With + it shows description:
# \dA+
                           List of access methods
  Name  | Type  |       Handler        |         Description
--------+-------+----------------------+-----------------------------------------
 brin   | index | brinhandler          | block range index (BRIN) access method
 btree  | index | bthandler            | b-tree index access method
 gin    | index | ginhandler           | GIN index access method
 gist   | index | gisthandler          | GiST index access method
 hash   | index | hashhandler          | hash index access method
 heap   | table | heap_tableam_handler | heap table access method
 spgist | index | spghandler           | SP-GiST index access method
(7 rows)


Looks nice, but this fails for a 9.4 or 9.5 server. I'm not sure
how far back we should support server versions, though.


The command \dA initially displayed an error message when working
on a server version below 9.6, and I did not change this logic.
I'm not sure, but it probably makes sense for versions 9.4 and 9.5
to output something like this query does:
SELECT
 a.amname AS "AM",
 d.description AS "Description"
FROM pg_am a
JOIN pg_description d ON a.oid = d.objoid
ORDER BY 1;

# \dA
   AM    |         Description
---------+-----------------------------
  btree  | b-tree index access method
  gin    | GIN index access method
  gist   | GiST index access method
  hash   | hash index access method
  spgist | SP-GiST index access method

SELECT
 a.amname AS "AM",
 CASE WHEN a.amcanorder THEN 'yes' ELSE 'no' END AS "Ordering",
 CASE WHEN a.amcanunique THEN 'yes' ELSE 'no' END AS "Unique indexes",
 CASE WHEN a.amcanmulticol THEN 'yes' ELSE 'no' END AS "Multicol indexes",
 CASE WHEN a.amsearchnulls THEN 'yes' ELSE 'no' END AS "Searching NULLs",
 CASE WHEN a.amclusterable THEN 'yes' ELSE 'no' END AS "Clusterable"
FROM pg_am a
JOIN pg_description d ON a.oid = d.objoid
ORDER BY 1;

# \dA NAME
   AM    | Ordering | Unique indexes | Multicol indexes | Searching NULLs | Clusterable
---------+----------+----------------+------------------+-----------------+-------------
  btree  | yes      | yes            | yes              | yes             | yes
  gin    | no       | no             | yes              | no              | no
  gist   | no       | no             | yes              | yes             | yes
  hash   | no       | no             | no               | no              | no
  spgist | no       | no             | no               | yes             | no
(5 rows)




The functionality of the \dAp command has been moved to \dA NAME.
Now the user can query the properties of a particular AM (or several,
using the search pattern) as follows:

# \dA h*
                                      Index access method properties
  AM  | Can order | Support unique indexes | Support indexes with multiple columns | Support exclusion constraints | Can include non-key columns
------+-----------+------------------------+----------------------------------------+-------------------------------+-----------------------------
 hash | no        | no                     | no                                     | yes                           | no
(1 row)


In the earlier patches they were "Can order", "Can unique", "Can
multi col", "Can exclude", and they indeed look too short.
Nevertheless, the current column names occupy the top four places on
the podium by their length; "Foreign-data wrapper" is in fifth place.
Most of them are just one noun. Some of them are two-or-three-word
nouns. Some of them are a single-word adjective followed by '?'.
\dicp uses single-word adverbs or a-few-word nouns without a trailing
'?'. How about the following?

 8  Ordering                 yes/no
14  Unique indexes           yes/no
16  Multicol indexes         yes/no
21  Exclusion constraints    yes/no
23  Include non-key columns  yes/no
===================================
20  Foreign-data wrapper


Does anyone have better wordings? Or, are the current wordings OK?


I like this version.



# \dAo gin jsonb_ops
 List operators of family related to access method
 AM  | Opfamily Schema | Opfamily Name |  Operator
-----+-----------------+---------------+--------------------
 gin | pg_catalog  | jsonb_ops | @> (jsonb, jsonb)
 gin | pg_catalog  | jsonb_ops | ? (jsonb, text)
 gin | pg_catalog  | jsonb_ops | ?| (jsonb, text[])
 gin | pg_catalog  | jsonb_ops | ?& (jsonb, text[])
(4 rows)


I'm not sure, but couldn't we show the opfamily name fully
qualified? The schema is not a property of the AM.

Now the Opfamily Schema is shown if the opfamily name is not visible
in the current schema search path (checked by pg_opfamily_is_visible()).




# \dAo+ gist circle_ops
                  List operators of family related to access method
  AM  | Opfamily Schema | Opfamily Name |   Operator   | Strategy | Purpose | Sort family
------+-----------------+---------------+--------------+----------+---------+-------------
 gist | pg_catalog      | circle_ops    | << 

Re: Why does ExecComputeStoredGenerated() form a heap tuple

2019-03-30 Thread Andres Freund
Hi,

On 2019-03-30 19:57:44 -0700, Andres Freund wrote:
> while rebasing the remaining tableam patches (luckily a pretty small set
> now!), I had a few conflicts with ExecComputeStoredGenerated(). While
> resolving I noticed:
> 
>   oldtuple = ExecFetchSlotHeapTuple(slot, true, _free);
>   newtuple = heap_modify_tuple(oldtuple, tupdesc, values, nulls, 
> replaces);
>   ExecForceStoreHeapTuple(newtuple, slot);
>   if (should_free)
>   heap_freetuple(oldtuple);
> 
>   MemoryContextSwitchTo(oldContext);
> 
> First off, I'm not convinced this is correct:
> 
> ISTM you'd need at least an ExecMaterializeSlot() before the
> MemoryContextSwitchTo() in ExecComputeStoredGenerated().
> 
> But what actually brought me to reply was that it seems like it'll cause
> unnecessary slowdowns for !heap AMs. First, it'll form a heaptuple if
> the slot isn't in that form, and then it'll cause a conversion by
> storing a heap tuple even if the target doesn't use heap representation.
> 
> ISTM the above would be much more efficiently - even more efficient if
> only heap is used - implemented as something roughly akin to:
> 
> slot_getallattrs(slot);
> memcpy(values, slot->tts_values, ...);
> memcpy(nulls, slot->tts_isnull, ...);
> 
> for (int i = 0; i < natts; i++)
> {
> if (TupleDescAttr(tupdesc, i)->attgenerated == 
> ATTRIBUTE_GENERATED_STORED)
> {
> values[i] = ...
> }
> else
> values[i] = datumCopy(...);
> }
> 
> ExecClearTuple(slot);
> memcpy(slot->tts_values, values, ...);
> memcpy(slot->tts_isnull, nulls, ...);
> ExecStoreVirtualTuple(slot);
> ExecMaterializeSlot(slot);
> 
> that's not perfect, but more efficient than your version...

Also, have you actually benchmarked this code? ISTM that adding a
stored generated column would cause quite noticable slowdowns in the
COPY path based on this code.

Greetings,

Andres Freund




Why does ExecComputeStoredGenerated() form a heap tuple

2019-03-30 Thread Andres Freund
Hi,

while rebasing the remaining tableam patches (luckily a pretty small set
now!), I had a few conflicts with ExecComputeStoredGenerated(). While
resolving I noticed:

oldtuple = ExecFetchSlotHeapTuple(slot, true, _free);
newtuple = heap_modify_tuple(oldtuple, tupdesc, values, nulls, 
replaces);
ExecForceStoreHeapTuple(newtuple, slot);
if (should_free)
heap_freetuple(oldtuple);

MemoryContextSwitchTo(oldContext);

First off, I'm not convinced this is correct:

ISTM you'd need at least an ExecMaterializeSlot() before the
MemoryContextSwitchTo() in ExecComputeStoredGenerated().

But what actually brought me to reply was that it seems like it'll cause
unnecessary slowdowns for !heap AMs. First, it'll form a heaptuple if
the slot isn't in that form, and then it'll cause a conversion by
storing a heap tuple even if the target doesn't use heap representation.

ISTM the above would be much more efficiently - even more efficient if
only heap is used - implemented as something roughly akin to:

slot_getallattrs(slot);
memcpy(values, slot->tts_values, ...);
memcpy(nulls, slot->tts_isnull, ...);

for (int i = 0; i < natts; i++)
{
if (TupleDescAttr(tupdesc, i)->attgenerated == 
ATTRIBUTE_GENERATED_STORED)
{
values[i] = ...
}
else
values[i] = datumCopy(...);
}

ExecClearTuple(slot);
memcpy(slot->tts_values, values, ...);
memcpy(slot->tts_isnull, nulls, ...);
ExecStoreVirtualTuple(slot);
ExecMaterializeSlot(slot);

that's not perfect, but more efficient than your version...
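
Purely for illustration, filling in the elided pieces (the expression
evaluation is left as a placeholder, and the memcpy sizes / datumCopy
handling are assumptions, not committed code), the shape would be
roughly:

static void
compute_stored_generated_sketch(TupleTableSlot *slot, TupleDesc tupdesc)
{
    int         natts = tupdesc->natts;
    Datum      *values = palloc(sizeof(Datum) * natts);
    bool       *nulls = palloc(sizeof(bool) * natts);

    /* make all source attributes available in tts_values / tts_isnull */
    slot_getallattrs(slot);
    memcpy(values, slot->tts_values, sizeof(Datum) * natts);
    memcpy(nulls, slot->tts_isnull, sizeof(bool) * natts);

    for (int i = 0; i < natts; i++)
    {
        Form_pg_attribute attr = TupleDescAttr(tupdesc, i);

        if (attr->attgenerated == ATTRIBUTE_GENERATED_STORED)
        {
            /* evaluate the generation expression here, e.g. via ExecEvalExpr() */
            values[i] = (Datum) 0;      /* placeholder */
            nulls[i] = true;            /* placeholder */
        }
        else if (!nulls[i])
        {
            /* copy pass-by-reference datums into the current context */
            values[i] = datumCopy(values[i], attr->attbyval, attr->attlen);
        }
    }

    /* store back as a virtual tuple, then let the slot's own AM materialize it */
    ExecClearTuple(slot);
    memcpy(slot->tts_values, values, sizeof(Datum) * natts);
    memcpy(slot->tts_isnull, nulls, sizeof(bool) * natts);
    ExecStoreVirtualTuple(slot);
    ExecMaterializeSlot(slot);
}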

Greetings,

Andres Freund




Re: speeding up planning with partitions

2019-03-30 Thread Imai Yoshikazu
On 2019/03/31 1:06, Amit Langote wrote:
 > On Sun, Mar 31, 2019 at 12:11 AM Tom Lane  wrote:
 >> Amit Langote  writes:
 >>> I think the performance results did prove that degradation due to
 >>> those loops over part_rels becomes significant for very large
 >>> partition counts.  Is there a better solution than the bitmapset that
 >>> you have in mind?
 >>
 >> Hm, I didn't see much degradation in what you posted in
 >> <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>.
 >
 > Sorry that I didn't mention the link to begin with, but I meant to
 > point to numbers that I reported on Monday this week.
 >
 > 
https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp
 >
 > You were complaining of the bitmapset being useless overhead for small
 > partition counts, but the numbers I get tend to suggest that any
 > degradation in performance is within noise range, whereas the
 > performance benefit from having them looks pretty significant for very
 > large partition counts.
 >
 >> I am curious as to why there seems to be more degradation
 >> for hash cases, as per Yoshikazu-san's results in
 >> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
 >> but whatever's accounting for the difference probably
 >> is not that.
 >
 > I suspected it may have been the lack of bitmapsets, but maybe only
 > Imai-san could've confirmed that by applying the live_parts patch too.

Yeah, I forgot to apply the live_parts patch. I did the same test again that
I did for hash before.
(BTW, thanks for committing speeding up patches!)

[HEAD(428b260)]
nparts  TPS
======  =====
2:  13134 (13240, 13290, 13071, 13172, 12896)
1024:   12627 (12489, 12635, 12716, 12732, 12562)
8192:   10289 (10216, 10265, 10171, 10278, 10514)

[HEAD(428b260) + live_parts.diff]
nparts  TPS
======  =====
2:  13277 (13112, 13290, 13241, 13360, 13382)
1024:   12821 (12930, 12849, 12909, 12700, 12716)
8192:   11102 (11134, 11158, 4, 10997, 11109)


Degradations of performance are below.


My test results from above (with live_parts, HEAD(428b260) + 
live_parts.diff)
nparts   live_parts   HEAD
======   ==========   =====
2:       13277        13134
1024:    12821        12627
8192:    11102        10289

11102/13277 = 83.6 %


Amit-san's test results (with live_parts)
 > nparts   v38    HEAD
 > ======   ====   ====
 > 2        2971   2969
 > 8        2980   1949
 > 32       2955    733
 > 128      2946    145
 > 512      2924     11
 > 1024     2986      3
 > 4096     2702      0
 > 8192     2531    OOM

2531/2971 = 85.2 %


My test results I posted before (without live_parts)
 > nparts   v38     HEAD
 > ======   =====   =====
 > 0:       10538   10487
 > 2:        6942    7028
 > 4:        7043    5645
 > 8:        6981    3954
 > 16:       6932    2440
 > 32:       6897    1243
 > 64:       6897     309
 > 128:      6753     120
 > 256:      6727      46
 > 512:      6708      12
 > 1024:     6063       3
 > 2048:     5894       1
 > 4096:     5374     OOM
 > 8192:     4572     OOM

4572/6942 = 65.9 %


Certainly, using a bitmapset contributes to performance when scanning
one partition (or a few partitions) out of many partitions.


Thanks
--
Imai Yoshikazu
diff --git a/src/backend/optimizer/path/joinrels.c 
b/src/backend/optimizer/path/joinrels.c
index 34cc7da..e847655 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1516,6 +1516,9 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo 
*rel1, RelOptInfo *rel2,
populate_joinrel_with_paths(root, child_rel1, child_rel2,

child_joinrel, child_sjinfo,

child_restrictlist);
+   if(!IS_DUMMY_REL(child_joinrel))
+   joinrel->live_parts = 
bms_add_member(joinrel->live_parts,
+   
cnt_parts);
}
 }
 
diff --git a/src/backend/optimizer/plan/planner.c 
b/src/backend/optimizer/plan/planner.c
index f08f1cd..9ddf42a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7107,7 +7107,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
int partition_idx;
 
/* Adjust each partition. */
-   for (partition_idx = 0; partition_idx < rel->nparts; 
partition_idx++)
+   partition_idx = -1;
+   while ((partition_idx = bms_next_member(rel->live_parts,
+   
partition_idx)) >= 0)
{
RelOptInfo *child_rel = rel->part_rels[partition_idx];
AppendRelInfo **appinfos;
@@ -7115,9 +7117,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
List 

Re: dropdb --force

2019-03-30 Thread Andres Freund
Hi,

On 2019-03-10 11:20:42 +0100, Filip Rembiałkowski wrote:
>  bool
> -CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
> +CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared, bool 
> force_terminate)
>  {

That doesn't seem like a decent API to me.

Greetings,

Andres Freund




Re: dropdb --force

2019-03-30 Thread Ryan Lambert
Hello,
This is a feature I have wanted for a long time; thank you for your work on
this.
The latest patch [1] applied cleanly for me.  In dbcommands.c the comment
references a 5 second delay; I don't see where that happens. Am I missing
something?

I tested both the dropdb program and the in-database commands.  Without
FORCE I get the expected error message about active connections.

postgres=# DROP DATABASE testdb;
ERROR:  source database "testdb" is being accessed by other users
DETAIL:  There is 1 other session using the database.

With FORCE the database drops cleanly.

 postgres=# DROP DATABASE testdb FORCE;
 DROP DATABASE

The active connections get terminated as expected.  Thanks,

[1]
https://www.postgresql.org/message-id/attachment/99536/drop-database-force-20190310_01.patch

*Ryan Lambert*
RustProof Labs


On Sun, Mar 10, 2019 at 12:54 PM Filip Rembiałkowski <
filip.rembialkow...@gmail.com> wrote:

> Thank you. Updated patch attached.
>
> On Sat, Mar 9, 2019 at 2:53 AM Thomas Munro 
> wrote:
> >
> > On Wed, Mar 6, 2019 at 1:39 PM Filip Rembiałkowski
> >  wrote:
> > > Here is Pavel's patch rebased to master branch, added the dropdb
> > > --force option, a test case & documentation.
> >
> > Hello,
> >
> > cfbot.cputube.org says this fails on Windows, due to a missing
> semicolon here:
> >
> > #ifdef HAVE_SETSID
> > kill(-(proc->pid), SIGTERM);
> > #else
> > kill(proc->pid, SIGTERM)
> > #endif
> >
> > The test case failed on Linux, I didn't check why exactly:
> >
> > Test Summary Report
> > ---
> > t/050_dropdb.pl (Wstat: 65280 Tests: 13 Failed: 2)
> > Failed tests: 12-13
> > Non-zero exit status: 255
> > Parse errors: Bad plan. You planned 11 tests but ran 13.
> >
> > +/* Time to sleep after isuing SIGTERM to backends */
> > +#define TERMINATE_SLEEP_TIME 1
> >
> > s/isuing/issuing/
> >
> > But, hmm, this macro doesn't actually seem to be used in the patch.
> > Wait, is that because the retry loop forgot to actually include the
> > sleep?
> >
> > +/* without "force" flag raise exception immediately, or after
> > 5 minutes */
> >
> > Normally we call it an "error", not an "exception".
> >
> > --
> > Thomas Munro
> > https://enterprisedb.com
>


Re: patch to allow disable of WAL recycling

2019-03-30 Thread Tomas Vondra

On Fri, Mar 29, 2019 at 01:09:46PM +1300, Thomas Munro wrote:


...

I still don't know why exactly this happens, but it's clearly a real
phenomenon.  As for why Tomas Vondra couldn't see it, I'm guessing
that stacks more RAM and ~500k IOPS help a lot (essentially the
opposite end of the memory, CPU, IO spectrum from this little
machine), and Joyent's systems may be somewhere in between?



If needed, I guess I can rerun the tests with data on the SATA RAID, but
this time limit the amount of RAM to something much lower.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-30 Thread Tomas Vondra

On Sun, Mar 31, 2019 at 08:50:53AM +0800, John Naylor wrote:

I believe I found a typo in mcv.c, fix attached.



Thanks, pushed.

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-30 Thread John Naylor
I believe I found a typo in mcv.c, fix attached.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


mcv-comment-fix.patch
Description: Binary data


Re: Re: FETCH FIRST clause WITH TIES option

2019-03-30 Thread Tomas Vondra

On Sun, Mar 31, 2019 at 01:14:46AM +0100, Tomas Vondra wrote:

On Fri, Mar 29, 2019 at 01:56:48AM +0100, Tomas Vondra wrote:

On Tue, Mar 26, 2019 at 10:46:00AM +0300, Surafel Temesgen wrote:

On Mon, Mar 25, 2019 at 11:56 AM David Steele  wrote:


This patch no longer passes testing so marked Waiting on Author.



Thank  you for informing. Fixed


Thanks for the updated patch.  I do have this on my list of patches that
I'd like to commit in this CF - likely tomorrow after one more round of
review, or so.



Hi,

I got to look at the patch today, with the intent to commit, but sadly I
ran into a couple of minor issues that I don't feel comfortable fixing
on my own. Attached is a patch highlighting some of the places (0001 is
your v7 patch, to keep the cfbot happy).


1) the docs documented this as

 ... [ ONLY | WITH TIES ]

but that's wrong, because it implies those options are optional (i.e.
the user may not specify anything). That's not the case, exactly one
of those options needs to be specified, so it should have been

 ... { ONLY | WITH TIES }


2) The comment in ExecLimit() needs to be updated to explain that WITH
TIES changes the behavior.


3) Minor code style issues (no space before * on comment lines, {}
around single-line if statements, ...).


4) The ExecLimit() does this

  if (node->limitOption == WITH_TIES)
  ExecCopySlot(node->last_slot, slot);

but I think we only really need to do that for the last tuple in the
window, no? Would it be a useful optimization?


5) Two issues in _outLimit(). Firstly, when printing uniqCollations the
code actually prints uniqOperators. Secondly, why does the code use
these loops at all, instead of using WRITE_ATTRNUMBER_ARRAY and
WRITE_OID_ARRAY, like other places? Perhaps there's an issue with empty
arrays? I haven't tested this, but looking at the READ_ counterparts, I
don't see why that would be the case.



Actually, two more minor comments:

6) There's some confusing naming - in plannodes.h the fields added to
the Limit node are called uniqSomething, but in other places the patch
uses sortSomething, ordSomething. I suggest more consistent naming.

7) The LimitOption enum has two items - WITH_ONLY and WITH_TIES. That's
a bit strange, because there's nothing like "WITH ONLY".


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Re: FETCH FIRST clause WITH TIES option

2019-03-30 Thread Tomas Vondra

On Fri, Mar 29, 2019 at 01:56:48AM +0100, Tomas Vondra wrote:

On Tue, Mar 26, 2019 at 10:46:00AM +0300, Surafel Temesgen wrote:

On Mon, Mar 25, 2019 at 11:56 AM David Steele  wrote:


This patch no longer passes testing so marked Waiting on Author.



Thank  you for informing. Fixed


Thanks for the updated patch.  I do have this on my list of patches that
I'd like to commit in this CF - likely tomorrow after one more round of
review, or so.



Hi,

I got to look at the patch today, with the intent to commit, but sadly I
ran into a couple of minor issues that I don't feel comfortable fixing
on my own. Attached is a patch highlighting some of the places (0001 is
your v7 patch, to keep the cfbot happy).


1) the docs documented this as

  ... [ ONLY | WITH TIES ]

but that's wrong, because it implies those options are optional (i.e.
the user may not specify anything). That's not the case, exactly one
of those options needs to be specified, so it should have been

  ... { ONLY | WITH TIES }


2) The comment in ExecLimit() needs to be updated to explain that WITH
TIES changes the behavior.


3) Minor code style issues (no space before * on comment lines, {}
around single-line if statements, ...).


4) The ExecLimit() does this

   if (node->limitOption == WITH_TIES)
   ExecCopySlot(node->last_slot, slot);

but I think we only really need to do that for the last tuple in the
window, no? Would it be a useful optimization?


5) Two issues in _outLimit(). Firstly, when printing uniqCollations the
code actually prints uniqOperators. Secondly, why does the code use
these loops at all, instead of using WRITE_ATTRNUMBER_ARRAY and
WRITE_OID_ARRAY, like other places? Perhaps there's an issue with empty
arrays? I haven't tested this, but looking at the READ_ counterparts, I
don't see why that would be the case.
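
Regarding (4) above, the suggested optimization would look roughly like
this (illustrative only; position, offset and count are assumed to be
the existing LimitState bookkeeping fields):

   /*
    * Only remember the slot when handing back the final in-window tuple;
    * that is the only tuple later candidates need to be compared against
    * for WITH TIES.
    */
   if (node->limitOption == WITH_TIES &&
       node->position - node->offset == node->count)
       ExecCopySlot(node->last_slot, slot);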


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
>From af9b8b8edcad84f88fc846ab3ce3f77cfb94230e Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 31 Mar 2019 00:04:31 +0100
Subject: [PATCH 1/2] fetch_first_with_ties_v7

---
 doc/src/sgml/ref/select.sgml|  9 ++-
 src/backend/executor/nodeLimit.c| 91 ++---
 src/backend/nodes/copyfuncs.c   |  7 ++
 src/backend/nodes/equalfuncs.c  |  2 +
 src/backend/nodes/outfuncs.c| 31 +
 src/backend/nodes/readfuncs.c   |  6 ++
 src/backend/optimizer/plan/createplan.c | 44 +++-
 src/backend/optimizer/plan/planner.c|  1 +
 src/backend/optimizer/util/pathnode.c   | 16 +
 src/backend/parser/analyze.c|  3 +
 src/backend/parser/gram.y   | 49 +
 src/include/nodes/execnodes.h   |  3 +
 src/include/nodes/nodes.h   | 12 
 src/include/nodes/parsenodes.h  |  2 +
 src/include/nodes/pathnodes.h   |  1 +
 src/include/nodes/plannodes.h   |  5 ++
 src/include/optimizer/pathnode.h|  1 +
 src/include/optimizer/planmain.h|  3 +-
 src/test/regress/expected/limit.out | 35 ++
 src/test/regress/sql/limit.sql  | 17 +
 20 files changed, 307 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 06d611b64c..b3b045ea87 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | 
DESC | USING operator ] [ NULLS { 
FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] 
]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } [ ONLY | WITH TIES ] ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] 
[...] ]
 
 where from_item can be 
one of:
@@ -1430,7 +1430,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] 
{ ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] 
{ ROW | ROWS } [ ONLY | WITH TIES ]
 
 In this syntax, the start
 or count value is required by
@@ -1440,7 +1440,10 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
-ROW
+ROW .
+WITH TIES option is used to return two or more rows
+that tie for last place in the limit results set according to 
ORDER BY
+clause (ORDER BY clause must be specified in this case).
 and ROWS as well as FIRST
 and NEXT are noise words that don't influence
 the effects of these clauses.
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index baa669abe8..e8aed10177 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -41,6 +41,7 @@ static TupleTableSlot *   /* return: a 
tuple or 

clean up docs for v12

2019-03-30 Thread Justin Pryzby
I reviewed docs like this:
git log -p remotes/origin/REL_11_STABLE..HEAD -- doc

And split some into separate patches, which may be useful at least for
reviewing.

I'm mailing now rather than after feature freeze to avoid duplicative work and
see if there's any issue.

Note, I also/already mailed this one separately:
|Clean up docs for log_statement_sample_rate..
https://www.postgresql.org/message-id/flat/20190328135918.GA27808%40telsasoft.com

Justin
>From fb712dfe3cb3d64ac7d297ca5217193327f8b547 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 28 Mar 2019 08:53:26 -0500
Subject: [PATCH v1 1/8] Clean up docs for log_statement_sample_rate..

..which was added at commit 88bdbd3f746049834ae3cc972e6e650586ec3c9d
---
 doc/src/sgml/config.sgml  | 18 +-
 src/backend/utils/misc/guc.c  |  4 ++--
 src/backend/utils/misc/postgresql.conf.sample |  6 +++---
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2..9b66e7f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5786,9 +5786,9 @@ local0.*/var/log/postgresql
  Causes the duration of each completed statement to be logged
  if the statement ran for at least the specified number of
  milliseconds, modulated by log_statement_sample_rate.
- Setting this to zero prints all statement durations. Minus-one (the default)
- disables logging statement durations. For example, if you set it to
- 250ms then all SQL statements that run 250ms or longer
+ Setting this to zero prints all statement durations. -1 (the default)
+ disables logging statements due to exceeding duration threshold. For example, if you set it to
+ 250ms, then all SQL statements that run 250ms or longer
  will be logged. Enabling this parameter can be helpful in tracking down
  unoptimized queries in your applications.
  Only superusers can change this setting.
@@ -5824,13 +5824,13 @@ local0.*/var/log/postgresql
   

 
- Determines the fraction of the statements that exceed
-  which to log.
- The default is 1, meaning log to all such
+ Determines the fraction of statements that exceed
+  to be logged.
+ The default is 1.0, meaning log all such
  statements.
- Setting this to zero disables logging, same as setting
+ Setting this to zero disables logging by duration, same as setting
  log_min_duration_statement
- to minus-one. log_statement_sample_rate
+ to -1. log_statement_sample_rate
  is helpful when the traffic is too high to log all queries.
 

@@ -6083,7 +6083,7 @@ local0.*/var/log/postgresql
 

 
- The difference between setting this option and setting
+ The difference between enabling log_duration and setting
   to zero is that
  exceeding log_min_duration_statement forces the text of
  the query to be logged, but this option doesn't.  Thus, if
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d1..415cd78 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3357,8 +3357,8 @@ static struct config_real ConfigureNamesReal[] =
 
 	{
 		{"log_statement_sample_rate", PGC_SUSET, LOGGING_WHEN,
-			gettext_noop("Fraction of statements over log_min_duration_statement to log."),
-			gettext_noop("If you only want a sample, use a value between 0 (never "
+			gettext_noop("Fraction of statements exceeding log_min_duration_statement to be logged."),
+			gettext_noop("If you only want a sample, use a value between 0.0 (never "
 		 "log) and 1.0 (always log).")
 		},
 		_statement_sample_rate,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f1..684f5e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -489,9 +489,9 @@
 	# 0 logs all statement, > 0 logs only statements running at
 	# least this number of milliseconds.
 
-#log_statement_sample_rate = 1	# Fraction of logged statements over
-	# log_min_duration_statement. 1.0 logs all statements,
-	# 0 never logs.
+#log_statement_sample_rate = 1.0	# Fraction of logged statements exceeding
+	# log_min_duration_statement to be logged
+	# 1.0 logs all statements, 0.0 never logs
 
 # - What to Log -
 
-- 
2.1.4

>From 4b5cc00e25c3abc601f48bfae0cc69a8729cd041 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 28 Mar 2019 18:50:03 -0500
Subject: [PATCH v1 2/8] review docs for pg12dev

---
 doc/src/sgml/bloom.sgml|  2 +-
 doc/src/sgml/catalogs.sgml |  4 ++--
 doc/src/sgml/config.sgml   | 13 ++-
 doc/src/sgml/ecpg.sgml | 18 +++---
 

Re: Teach pg_upgrade test to honor NO_TEMP_INSTALL

2019-03-30 Thread Tom Lane
Andrew Dunstan  writes:
> On some machines (*cough* Mingw *cough*) installs are very slow. We've
> ameliorated this by allowing temp installs to be reused, but the
> pg_upgrade Makefile never got the message. Here's a patch that does
> that. I'd like to backpatch it, at least to 9.5 where we switched the
> pg_upgrade location. The risk seems appropriately low and it only
> affects our test regime.

I haven't tested this, but it looks reasonable.

I suspect you need double-quotes around the path values, as in
the adjacent usage of EXTRA_REGRESS_OPTS.

regards, tom lane




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Andres Freund
Hi,

On March 30, 2019 5:33:12 PM EDT, Thomas Munro  wrote:
>On Sun, Mar 31, 2019 at 8:20 AM Peter Geoghegan  wrote:
>> On Sat, Mar 30, 2019 at 8:44 AM Robert Haas 
>wrote:
>> > Overall I'm inclined to think that we're making the same mistake
>here
>> > that we did with work_mem, namely, assuming that you can control a
>> > bunch of different prefetching behaviors with a single GUC and
>things
>> > will be OK.  Let's just create a new GUC for this and default it to
>10
>> > or something and go home.
>>
>> I agree. If you invent a new GUC, then everybody notices, and it
>> usually has to be justified quite rigorously. There is a strong
>> incentive to use an existing GUC, if only because the problem that
>> this creates is harder to measure than the supposed problem that it
>> avoids. This can perversely work against the goal of making the
>system
>> easy to use. Stretching the original definition of a GUC is bad.
>>
>> I take issue with the general assumption that not adding a GUC at
>> least makes things easier for users. In reality, it depends entirely
>> on the situation at hand.
>
>I'm not sure I understand why this is any different from the bitmap
>heapscan case though, or in fact why we are adding 10 in this case.
>In both cases we will soon be reading the referenced buffers, and it
>makes sense to queue up prefetch requests for the blocks if they
>aren't already in shared buffers.  In both cases, the number of
>prefetch requests we want to send to the OS is somehow linked to the
>amount of IO requests we think the OS can handle concurrently at once
>(since that's one factor determining how fast it drains them), but
>it's not necessarily the same as that number, AFAICS.  It's useful to
>queue some number of prefetch requests even if you have no IO
>concurrency at all (a single old school spindle), just because the OS
>will chew on that queue in the background while we're also doing
>stuff, which is probably what that "+ 10" is expressing.  But that
>seems to apply to bitmap heapscan too, doesn't it?

The index page deletion code does work on behalf of multiple backends; bitmap 
scans don't. If your system is busy it makes sense to limit resource usage of 
per-backend work, but not really of work on shared resources like page reuse. A 
bit like work_mem vs. maintenance_work_mem.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Thomas Munro
On Sun, Mar 31, 2019 at 8:20 AM Peter Geoghegan  wrote:
> On Sat, Mar 30, 2019 at 8:44 AM Robert Haas  wrote:
> > Overall I'm inclined to think that we're making the same mistake here
> > that we did with work_mem, namely, assuming that you can control a
> > bunch of different prefetching behaviors with a single GUC and things
> > will be OK.  Let's just create a new GUC for this and default it to 10
> > or something and go home.
>
> I agree. If you invent a new GUC, then everybody notices, and it
> usually has to be justified quite rigorously. There is a strong
> incentive to use an existing GUC, if only because the problem that
> this creates is harder to measure than the supposed problem that it
> avoids. This can perversely work against the goal of making the system
> easy to use. Stretching the original definition of a GUC is bad.
>
> I take issue with the general assumption that not adding a GUC at
> least makes things easier for users. In reality, it depends entirely
> on the situation at hand.

I'm not sure I understand why this is any different from the bitmap
heapscan case though, or in fact why we are adding 10 in this case.
In both cases we will soon be reading the referenced buffers, and it
makes sense to queue up prefetch requests for the blocks if they
aren't already in shared buffers.  In both cases, the number of
prefetch requests we want to send to the OS is somehow linked to the
amount of IO requests we think the OS can handle concurrently at once
(since that's one factor determining how fast it drains them), but
it's not necessarily the same as that number, AFAICS.  It's useful to
queue some number of prefetch requests even if you have no IO
concurrency at all (a single old school spindle), just because the OS
will chew on that queue in the background while we're also doing
stuff, which is probably what that "+ 10" is expressing.  But that
seems to apply to bitmap heapscan too, doesn't it?
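
For readers following along, the prefetch distance under discussion just
controls how far ahead of the block currently being read we issue
prefetch hints; schematically it is something like the fragment below
(illustrative only, not the committed heapam code; rel, blocks, nblocks
and prefetch_distance are assumed to be set up by the caller):

    int     prefetch_target = 0;

    for (int i = 0; i < nblocks; i++)
    {
        Buffer  buf;
        int     limit = Min(i + prefetch_distance, nblocks);

        /* keep a window of prefetch_distance requests ahead of us */
        while (prefetch_target < limit)
            PrefetchBuffer(rel, MAIN_FORKNUM, blocks[prefetch_target++]);

        buf = ReadBuffer(rel, blocks[i]);
        /* ... inspect the tuples on this page to compute the horizon ... */
        ReleaseBuffer(buf);
    }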

-- 
Thomas Munro
https://enterprisedb.com




Re: Teach pg_upgrade test to honor NO_TEMP_INSTALL

2019-03-30 Thread Daniel Gustafsson
On Saturday, March 30, 2019 9:42 PM, Andrew Dunstan 
 wrote:

> On some machines (cough Mingw cough) installs are very slow. We've
> ameliorated this by allowing temp installs to be reused, but the
> pg_upgrade Makefile never got the message. Here's a patch that does
> that. I'd like to backpatch it, at least to 9.5 where we switched the
> pg_upgrade location. The risk seems appropriately low and it only
> affects our test regime.

While I haven't tried the patch (yet), reading it, it makes sense, so +1
on the fix. Nice catch!

cheers ./daniel




Re: Compressed TOAST Slicing

2019-03-30 Thread Stephen Frost
Greetings,

* Paul Ramsey (pram...@cleverelephant.ca) wrote:
> > On Mar 19, 2019, at 4:47 AM, Stephen Frost  wrote:
> > * Paul Ramsey (pram...@cleverelephant.ca) wrote:
> >>> On Mar 18, 2019, at 7:34 AM, Robert Haas  wrote:
> >>> +1.  I think Paul had it right originally.
> >> 
> >> In that spirit, here is a “one pglz_decompress function, new parameter” 
> >> version for commit.
> > 
> > Alright, I've been working through this and have made a few improvements
> > (the big comment block at the top of pg_lzcompress.c needed updating,
> > among a couple other minor things), but I was trying to wrap my head
> > around this:
> > 
> > 
> > Specifically, the two SET_VARSIZE() calls, do we really need both..?
> > Are we sure that we're setting the length correctly there..?  Is there
> > any cross-check we can do?
> 
> Well, we don’t need to do the two SET_VARSIZE() calls, but we *do* need to 
> use rawsize in the call before the return, since we cannot be sure that the 
> size of the uncompressed bit is as large as the requested slice (even though 
> it will be 99 times out of 100)

Sure, of course, that makes sense; what didn't make much sense was
setting it and then setting it again to something different.

I'll pull out the extra one then.

> > I have to admit that I find the new argument to pglz_decompress() a bit
> > awkward to describe and document; if you have any thoughts as to how
> > that could be improved, that'd be great.
> 
> The only thing I can see is loosening the integrity check in pglz_decompress 
> which is a guardrail on something I’m not sure we ever hit. Instead of 
> checking that both the src and dst buffers are fully used up, a test that at 
> least one of them is used up should come up true in all error-free-happy 
> cases.

Hrmpf.  I don't really like loosening up the integrity check in the
cases where we should be using up everything though.  As such, I'll go
with what you've proposed here.  We can adjust it later if we end up
deciding that reducing the error-checking is reasonable.

I'll plan to push this tomorrow with the above change (and a few
additional comments to explain what all is going on..).
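
To make the discussion concrete, here is a caller-side sketch of slice
decompression as described in this thread (the macro names and the new
boolean parameter are assumptions for illustration, not necessarily the
final committed API):

static struct varlena *
toast_decompress_slice_sketch(struct varlena *attr, int32 slicelength)
{
    struct varlena *result;
    int32       rawsize;

    result = (struct varlena *) palloc(slicelength + VARHDRSZ);

    /*
     * Passing false for the last argument relaxes the "source and
     * destination fully consumed" integrity check discussed above,
     * since we deliberately stop early.
     */
    rawsize = pglz_decompress(TOAST_COMPRESS_RAWDATA(attr),
                              VARSIZE(attr) - TOAST_COMPRESS_HDRSZ,
                              VARDATA(result),
                              slicelength,
                              false);
    if (rawsize < 0)
        elog(ERROR, "compressed data is corrupted");

    /*
     * The decompressed data may be shorter than the requested slice, so
     * set the varlena size from what was actually produced (the rawsize
     * point made above), not from slicelength.
     */
    SET_VARSIZE(result, rawsize + VARHDRSZ);

    return result;
}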

Thanks!

Stephen


signature.asc
Description: PGP signature


Teach pg_upgrade test to honor NO_TEMP_INSTALL

2019-03-30 Thread Andrew Dunstan

On some machines (*cough* Mingw *cough*) installs are very slow. We've
ameliorated this by allowing temp installs to be reused, but the
pg_upgrade Makefile never got the message. Here's a patch that does
that. I'd like to backpatch it, at least to 9.5 where we switched the
pg_upgrade location. The risk seems appropriately low and it only
affects our test regime.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index adb0d5d707..c3fa40e6a8 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -14,6 +14,15 @@ OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
 override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
 
+ifdef NO_TEMP_INSTALL
+	tbindir=$(abs_top_builddir)/tmp_install/$(bindir)
+	tlibdir=$(abs_top_builddir)/tmp_install/$(libdir)
+	DOINST =
+else
+	tbindir=$(bindir)
+	tlibdir=$(libdir)
+	DOINST = --install
+endif
 
 all: pg_upgrade
 
@@ -37,7 +46,7 @@ clean distclean maintainer-clean:
 	   pg_upgrade_dump_*.custom pg_upgrade_*.log
 
 check: test.sh all
-	MAKE=$(MAKE) bindir=$(bindir) libdir=$(libdir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) $< --install
+	MAKE=$(MAKE) bindir=$(tbindir) libdir=$(tlibdir) EXTRA_REGRESS_OPTS="$(EXTRA_REGRESS_OPTS)" $(SHELL) $< $(DOINST)
 
 # installcheck is not supported because there's no meaningful way to test
 # pg_upgrade against a single already-running server


Re: Enable data checksums by default

2019-03-30 Thread Andres Freund



On March 30, 2019 3:25:43 PM EDT, Tomas Vondra  
wrote:
>On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:
>>Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd.ca...@oopsware.de>
>>> On Friday, 2019-03-29 at 23:10 +0900, Michael Paquier wrote:
>>> >
>>> > I can't really believe that many people set up shared_buffers at
>>> > 128kB
>>> > which would cause such a large number of page evictions, but I can
>>> > believe that many users have shared_buffers set to its default value
>>> > and that we are going to get complaints about "performance drop after
>>> > upgrade to v12" if we switch data checksums to on by default.
>>>
>>> Yeah, I think Christoph's benchmark is based on this thinking. I assume
>>> this very unrealistic scenario should emulate the worst case (many
>>> buffer_reads, high checksum calculation load).
>>
>>It's not unrealistic to have large seqscans that are all buffer
>>misses, the table just has to be big enough. The idea in my benchmark
>>was that if I make shared buffers really small, and the table still
>>fits in to RAM, I should be seeing only buffer misses, but without any
>>delay for actually reading from disk.
>>
>>Christoph
>>
>
>FWIW I think it's a mistake to focus solely on CPU utilization, which
>all the benchmarks performed on this thread do because they look at tps
>of in-memory read-only workloads. Checksums have other costs too, not
>just the additional CPU time. Most importantly they require
>wal_log_hints to be set (which people may or may not want anyway).
>
>I've done a simple benchmark, that does read-only (-S) and read-write
>(-N) pgbench runs with different scales, but also measures duration of
>the pgbench init and amount of WAL produced during the tests.
>
>On a small machine (i5, 8GB RAM, SSD RAID) the results are these:
>
>     scale  config    | init     tps    wal
>    ==================|=====================
> ro     10  no-hints  |    2  117038    130
>            hints     |    2  116378    146
>            checksums |    2  115619    147
>            ----------|---------------------
>       200  no-hints  |   32   88340   2407
>            hints     |   37   86154   2628
>            checksums |   36   83336   2624
>            ----------|---------------------
>      2000  no-hints  |  365   38680   1967
>            hints     |  423   38670   2123
>            checksums |  504   37510   2046
>    ------------------|---------------------
> rw     10  no-hints  |    2   19691    437
>            hints     |    2   19712    437
>            checksums |    2   19654    437
>            ----------|---------------------
>       200  no-hints  |   32   15839   2745
>            hints     |   37   15735   2783
>            checksums |   36   15646   2775
>            ----------|---------------------
>      2000  no-hints  |  365    5371   3721
>            hints     |  423    5270   3671
>            checksums |  504    5094   3574
>
>The no-hints config is default (wal_log_hints=off, data_checksums=off),
>hints sets wal_log_hints=on and checksums enables data checksums. All
>the configs were somewhat tuned (1GB shared buffers, max_wal_size high
>enough not to hit checkpoints very often, etc.).
>
>I've also done the tests on a larger machine (2x E5-2620v4, 32GB of
>RAM, NVMe SSD), and the general pattern is about the same - while the
>tps and amount of WAL (not covering the init) do not change, the time
>for initialization increases significantly (by 20-40%).
>
>This effect is even clearer when using slower storage (SATA-based RAID).
>The results then look like this:
>
>     scale  config    | init     tps    wal
>    ==================|=====================
> ro    100  no-hints  |   49  229459    122
>            hints     |  101  167983    190
>            checksums |  103  156307    190
>            ----------|---------------------
>      1000  no-hints  |  580  152167    109
>            hints     | 1047  122814    142
>            checksums | 1080  118586    141
>            ----------|---------------------
>      6000  no-hints  | 4035     508      1
>            hints     |11193     502      1
>            checksums |11376     506      1
>

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-30 Thread Tomas Vondra

On Wed, Mar 27, 2019 at 08:55:07PM +0100, Tomas Vondra wrote:

Hi,

I've now committed the MCV part, ...


Hmmm, what's the right status in the CF app when a part of a patch was
committed and the rest should be moved to the next CF? Committed, Moved
to next CF, or something else?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Progress reporting for pg_verify_checksums

2019-03-30 Thread Michael Banck
Hi,

so we are basically back at the original patch as written by Bernd :)

> +/*
> + * Report current progress status. Parts borrowed from
> + * PostgreSQLs' src/bin/pg_basebackup.c
> + */
> +static void
> +progress_report(bool force)
> +{
> + int percent;
> + chartotal_size_str[32];
> + charcurrent_size_str[32];
> + pg_time_t   now;
> +
> + if (!showprogress)
> + return;

The way you've now changed this is that there's a function call into
progress_report() for every block that's being read, even if no
progress reporting was requested. That looks like a pretty severe
performance problem, so I suggest at least sticking with checking
showprogress before calling progress_report(), and not the other way
round.

I guess there is still some noticeable performance penalty to pay when
reporting progress, even more so now that you reverted to only doing it
once per second - if we report progress, we presumably call
progress_report() thousands of times and return early 99.9% of the time.

So my vote is in favour of only calling progress_report() every once in
a while - I saw quite a speedup (or removal of slowdown) due to this in
my tests; this was not just some unwarranted micro-optimization.
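
Concretely, the call-site pattern being suggested would look something
like this (an illustrative fragment only; the variable names follow the
quoted patch context and may differ in the actual patch):

    /* in the per-block loop of scan_file() */
    current_size += r;          /* bytes read for this block */

    if (showprogress && (blockno % 1024) == 0)
        progress_report(false);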

> + fprintf(stderr, "\r");

I think the isatty() check that was in our versions of the patch is
useful here; did you check how this looks when piping the output to a
file?

> @@ -195,12 +254,23 @@ scan_file(const char *fn, BlockNumber segmentno)
>   _("%s: checksums enabled in file
\"%s\"\n"), progname, fn);
>   }
>  
> + /* Make sure progress is reported at least once per file */
> + progress_report(false);
> +
>   close(f);
>  }

This hunk is from the performance optimization of calling
progress_report only every 1024 blocks and could be removed if we stick
with calling it for every block anyway (but see above).

> -static void
> -scan_directory(const char *basedir, const char *subdir)
> +/*
> + * Scan the given directory for items which can be checksummed and
> + * operate on each one of them.  If "sizeonly" is true, the size of
> + * all the items which have checksums is computed and returned back

"computed" is maybe a bit strong a word here, how about "added up"?


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following terms:
https://www.credativ.de/datenschutz




Re: Enable data checksums by default

2019-03-30 Thread Tomas Vondra

On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:

Re: Bernd Helmle 2019-03-29 
<3586bb9345a59bfc8d13a50a7c729be1ee6759fd.ca...@oopsware.de>

On Friday, 2019-03-29 at 23:10 +0900, Michael Paquier wrote:
>
> I can't really believe that many people set up shared_buffers at
> 128kB
> which would cause such a large number of page evictions, but I can
> believe that many users have shared_buffers set to its default value
> and that we are going to get complaints about "performance drop after
> upgrade to v12" if we switch data checksums to on by default.

Yeah, I think Christoph's benchmark is based on this thinking. I assume
this very unrealistic scenario should emulate the worst case (many
buffer_reads, high checksum calculation load).


It's not unrealistic to have large seqscans that are all buffer
misses, the table just has to be big enough. The idea in my benchmark
was that if I make shared buffers really small, and the table still
fits in to RAM, I should be seeing only buffer misses, but without any
delay for actually reading from disk.

Christoph



FWIW I think it's a mistake to focus solely on CPU utilization, which
all the benchmarks performed on this thread do because they look at tps
of in-memory read-only workloads. Checksums have other costs too, not
just the additional CPU time. Most importantly they require wal_log_hints
to be set (which people may or may not want anyway).

I've done a simple benchmark, that does read-only (-S) and read-write
(-N) pgbench runs with different scales, but also measures duration of
the pgbench init and amount of WAL produced during the tests.

On a small machine (i5, 8GB RAM, SSD RAID) the results are these:

     scale  config    | init     tps    wal
    ==================|=====================
 ro     10  no-hints  |    2  117038    130
            hints     |    2  116378    146
            checksums |    2  115619    147
            ----------|---------------------
       200  no-hints  |   32   88340   2407
            hints     |   37   86154   2628
            checksums |   36   83336   2624
            ----------|---------------------
      2000  no-hints  |  365   38680   1967
            hints     |  423   38670   2123
            checksums |  504   37510   2046
    ------------------|---------------------
 rw     10  no-hints  |    2   19691    437
            hints     |    2   19712    437
            checksums |    2   19654    437
            ----------|---------------------
       200  no-hints  |   32   15839   2745
            hints     |   37   15735   2783
            checksums |   36   15646   2775
            ----------|---------------------
      2000  no-hints  |  365    5371   3721
            hints     |  423    5270   3671
            checksums |  504    5094   3574

The no-hints config is default (wal_log_hints=off, data_checksums=off),
hints sets wal_log_hints=on and checksums enables data checksums. All
the configs were somewhat tuned (1GB shared buffers, max_wal_size high
enough not to hit checkpoints very often, etc.).

I've also done the tests on a larger machine (2x E5-2620v4, 32GB of
RAM, NVMe SSD), and the general pattern is about the same - while the
tps and amount of WAL (not covering the init) do not change, the time
for initialization increases significantly (by 20-40%).

This effect is even clearer when using slower storage (SATA-based RAID).
The results then look like this:

     scale  config    | init     tps    wal
    ==================|=====================
 ro    100  no-hints  |   49  229459    122
            hints     |  101  167983    190
            checksums |  103  156307    190
            ----------|---------------------
      1000  no-hints  |  580  152167    109
            hints     | 1047  122814    142
            checksums | 1080  118586    141
            ----------|---------------------
      6000  no-hints  | 4035     508      1
            hints     |11193     502      1
            checksums |11376     506      1
    ------------------|---------------------
 rw    100  no-hints  |   49     279    192
            hints     |  101     275    190
            checksums |  103     275

Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Peter Geoghegan
On Sat, Mar 30, 2019 at 8:44 AM Robert Haas  wrote:
> Overall I'm inclined to think that we're making the same mistake here
> that we did with work_mem, namely, assuming that you can control a
> bunch of different prefetching behaviors with a single GUC and things
> will be OK.  Let's just create a new GUC for this and default it to 10
> or something and go home.

I agree. If you invent a new GUC, then everybody notices, and it
usually has to be justified quite rigorously. There is a strong
incentive to use an existing GUC, if only because the problem that
this creates is harder to measure than the supposed problem that it
avoids. This can perversely work against the goal of making the system
easy to use. Stretching the original definition of a GUC is bad.

I take issue with the general assumption that not adding a GUC at
least makes things easier for users. In reality, it depends entirely
on the situation at hand.

-- 
Peter Geoghegan




Re: Progress reporting for pg_verify_checksums

2019-03-30 Thread Fabien COELHO


Bonjour Michaël,


Getting to know the total size and the current size are the two
important factors that matter when it comes to do progress reporting
in my opinion.  I have read the patch, and I am not really convinced
by the need to show the progress report based on an interval of 250ms
as we talk about an operation which could take dozens of minutes.


I do not think that it matters. I like to see things moving, and the 
performance impact is nil.


So I have simplified the patch to only show a progress report every 
second.  This also removes the include for the time-related APIs from 
portability/.


I do not think that it is a good idea, because Michael is thinking of 
adding some throttling capability, which would be a very good thing, but 
which will need something precise, so better use the precise stuff from 
the start. Also, the per second stuff induces rounding effects at the 
beginning.



A second thing is that I don't think that the speed is much useful.


Hmmm. I like this information because this is where I have expectations, 
whereas I'm not sure whether 1234 seconds for 12.3 GB is good or bad, but 
I know that 10 MB/s on my SSD is not very good.


I would expect the speed to be steady, still there is a risk to show 
incorrect information if the speed of the operation is spiky or 
irregular leading to an incorrect estimation of the remaining time.


Hmmm. That is life, I'd say; I'm used to it.


In short, I would like to commit the first patch as attached, which is
much more simple than what has been sent previously, still it provides
the progress information which is useful.


I would prefer that you keep the patch with the initial precision & 
features for the reasons outlined above, but you are the committer and I'm 
only a reviewer.


--
Fabien.

Re: Checksum errors in pg_stat_database

2019-03-30 Thread Julien Rouhaud
Sorry for delay, I had to catch a train.

On Sat, Mar 30, 2019 at 4:02 PM Magnus Hagander  wrote:
>
> My vote is still to drop it completely, but if we're keeping it, it has to go 
> in both paths.

Ok.  For now I'm attaching v2, which drops this field, renames the view
to pg_stat_checksums (terminal s), and uses the policy of choosing a
random oid in the 8000.. range for new functions.

I'd also have to get more feedback on this.  For now, I'll add this
thread to the pg12 open items, as a follow up of the initial code
drop.

> Technically, that should be in pg_stat_progress_checksums to be consistent :) 
> So whichever way we turn, it's going to be inconsistent with something.

Indeed :)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f1df14bdea..30674f61ce 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -384,6 +384,14 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
  
 
+ 
+  pg_stat_checksumspg_stat_checksums
+  One row per database, plus one for the shared objects, showing
+  database-wide checksums statistics. See
+for details.
+  
+ 
+
  
   pg_stat_databasepg_stat_database
   One row per database, showing database-wide statistics. See
@@ -2418,6 +2426,54 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
single row, containing global data for the cluster.
   
 
+  
+   pg_stat_checksums View
+   
+
+
+  Column
+  Type
+  Description
+ 
+
+
+   
+
+ datid
+ oid
+ OID of a database, or 0 for objects belonging to a shared relation
+
+
+ datname
+ name
+ Name of this database, or shared_objects
+
+
+ checksum_failures
+ bigint
+ Number of data page checksum failures detected in this
+ database
+
+
+ checksum_last_failure
+ timestamp with time zone
+ Time at which the last data page checksum failures was detected in
+ this database
+
+
+ stats_reset
+ timestamp with time zone
+ Time at which these statistics were last reset
+
+   
+   
+  
+
+  
+   The pg_stat_database view will contain one row
+   for each database in the cluster, showing database-wide statistics.
+  
+
   
pg_stat_database View

@@ -2529,11 +2585,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  bigint
  Number of deadlocks detected in this database
 
-
- checksum_failures
- bigint
- Number of data page checksum failures detected in this database
-
 
  blk_read_time
  double precision
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 84fb37c293..e063dd972e 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -218,7 +218,9 @@ PostgreSQL documentation
 I/O system that would otherwise be silent. Enabling checksums
 may incur a noticeable performance penalty. This option can only
 be set during initialization, and cannot be changed later. If
-set, checksums are calculated for all objects, in all databases.
+set, checksums are calculated for all objects, in all databases. All
+checksum failures will be reported in the  view.

   
  
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c4f3950e5b..8179db5f2a 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -531,7 +531,8 @@ PostgreSQL documentation
 By default, checksums are verified and checksum failures will result
 in a non-zero exit status. However, the base backup will not be
 removed in such a case, as if the --no-clean option
-had been used.
+had been used.  Checksum verification failures will also be reported
+in the  view.

   
  
diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 70339eaec9..65e067a8aa 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -39,10 +39,11 @@ PostgreSQL documentation
pg_checksums checks, enables or disables data
checksums in a PostgreSQL cluster.  The server
must be shut down cleanly before running
-   pg_checksums. The exit status is zero if there
-   are no checksum errors when checking them, and nonzero if at least one
-   checksum failure is detected. If enabling or disabling checksums, the
-   exit status is nonzero if the operation failed.
+   pg_checksums. As a consequence, the
+   pg_stat_checksums view won't reflect this activity.
+   The exit status is zero if there are no checksum errors when checking them,
+   and nonzero if at least one checksum failure is detected. If enabling or
+   disabling checksums, the exit status is nonzero if the operation failed.
   
 
   
diff 

Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-30 Thread Vik Fearing
On 27/03/2019 21:54, Darafei "Komяpa" Praliaskouski wrote:
> Hi hackers,
> 
> Attached is sketch of small patch that fixes several edge cases with
> autovacuum. Long story short autovacuum never comes to append only
> tables, killing large productions.
> 
> First case, mine.
>  
> https://www.postgresql.org/message-id/CAC8Q8tLBeAxR%2BBXWuKK%2BHP5m8tEVYn270CVrDvKXt%3D0PkJTY9g%40mail.gmail.com
> 
> We had a table we were appending and wanted Index Only Scan to work. For
> it to work, you need to call VACUUM manually, since VACUUM is the only
> way to mark pages all visible, and autovacuum never comes to append only
> tables. We were clever to invent a workflow without dead tuples and it
> painfully bit us.
> 
> Second case, just read in the news.
> https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/
> 
> Mandrill has 6TB append only table that autovacuum probably never
> vacuumed. Then anti-wraparound came and production went down. If
> autovacuum did its job before that last moment, it would probably be okay.
> 
> Idea: look not on dead tuples, but on changes, just like ANALYZE does.
> It's my first patch on Postgres, it's probably all wrong but I hope it
> helps you get the idea.

This was suggested and rejected years ago:
https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30...@2ndquadrant.fr
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: speeding up planning with partitions

2019-03-30 Thread Robert Haas
On Sat, Mar 30, 2019 at 12:16 PM Amit Langote  wrote:
> Fwiw, I had complained when reviewing the run-time pruning patch that
> creating those maps in the planner and putting them in
> PartitionPruneInfo might not be a good idea, but David insisted that
> it'd be good for performance (in the context of using cached plans) to
> compute this information during planning.

Well, he's not wrong about that, I expect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-30 Thread Amit Langote
On Sun, Mar 31, 2019 at 1:11 AM Robert Haas  wrote:
>
> On Wed, Mar 27, 2019 at 5:32 PM Alvaro Herrera  
> wrote:
> > * certain tables would have some sort of partial scan that sets the
> >   visibility map.  There's no reason to invoke the whole vacuuming
> >   machinery.  I don't think this is limited to append-only tables, but
> >   rather those are just the ones that are affected the most.
>
> I think this is a really good idea, but in order for it to work well I
> think we would need to have some kind of estimate of autovacuum
> pressure.
>
> If we know that we're currently fairly on top of things, and there is
> not much for autovacuum to do, periodically vacuuming a chunk of some
> table that has a lot of unset visibility-map bits is probably a good
> idea.  However, we can't possibly guess how aggressively to do this if
> we have no idea how long it's going to be before we need to vacuum
> that table for real.  If the number of XIDs remaining until the table
> gets a wraparound vacuum is X, and the number of XIDs being consumed
> per day is Y, we can estimate that in roughly X/Y days, we're going to
> need to do a wraparound vacuum.  That value might be in the range of
> months, or in the range of hours.
>
> If it's months, we probably want to limit vacuum to working at a pretty
> slow rate, say 1% of the table size per hour or something.  If it's in
> hours, we need to be a lot more aggressive.  Right now we have no
> information to tell us which of those things is the case, so we'd just
> be shooting in the dark.

Sawada-san presented some ideas in his PGCon 2018 talk that may be related.

https://www.pgcon.org/2018/schedule/attachments/488_Vacuum_More_Efficient_Than_Ever

(slide 32~)

Thanks,
Amit




Re: speeding up planning with partitions

2019-03-30 Thread Amit Langote
On Sun, Mar 31, 2019 at 12:59 AM Robert Haas  wrote:
>
> On Sat, Mar 30, 2019 at 11:46 AM Tom Lane  wrote:
> > > The only problem with PartitionPruneInfo structures of which I am
> > > aware is that they rely on PartitionDesc offsets not changing. But I
> > > added code in that commit in ExecCreatePartitionPruneState to handle
> > > that exact problem.  See also paragraph 5 of the commit message, which
> > > begins with "Although in general..."
> >
> > Ah.  Grotty, but I guess it will cover the issue.
>
> I suppose it is.  I am a little suspicious of the decision to make
> PartitionPruneInfo structures depend on PartitionDesc indexes.

Fwiw, I had complained when reviewing the run-time pruning patch that
creating those maps in the planner and putting them in
PartitionPruneInfo might not be a good idea, but David insisted that
it'd be good for performance (in the context of using cached plans) to
compute this information during planning.

Thanks,
Amit




Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-30 Thread Robert Haas
On Wed, Mar 27, 2019 at 5:32 PM Alvaro Herrera  wrote:
> * certain tables would have some sort of partial scan that sets the
>   visibility map.  There's no reason to invoke the whole vacuuming
>   machinery.  I don't think this is limited to append-only tables, but
>   rather those are just the ones that are affected the most.

I think this is a really good idea, but in order for it to work well I
think we would need to have some kind of estimate of autovacuum
pressure.

If we know that we're currently fairly on top of things, and there is
not much for autovacuum to do, periodically vacuuming a chunk of some
table that has a lot of unset visibility-map bits is probably a good
idea.  However, we can't possibly guess how aggressively to do this if
we have no idea how long it's going to be before we need to vacuum
that table for real.  If the number of XIDs remaining until the table
gets a wraparound vacuum is X, and the number of XIDs being consumed
per day is Y, we can estimate that in roughly X/Y days, we're going to
need to do a wraparound vacuum.  That value might be in the range of
months, or in the range of hours.

If it's months, we probably want to limit vacuum to working at a pretty
slow rate, say 1% of the table size per hour or something.  If it's in
hours, we need to be a lot more aggressive.  Right now we have no
information to tell us which of those things is the case, so we'd just
be shooting in the dark.
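
To make the X/Y arithmetic concrete, here is a rough sketch (not a
proposal, and assuming the anti-wraparound trigger point is
autovacuum_freeze_max_age): X can be read from the catalogs today, while Y
has to be sampled externally, e.g. by recording txid_current() at intervals
and dividing the delta by the elapsed days.

-- X: XIDs remaining until each table is due for an anti-wraparound vacuum
SELECT c.oid::regclass AS table_name,
       current_setting('autovacuum_freeze_max_age')::bigint
         - age(c.relfrozenxid) AS xids_until_wraparound_vacuum
  FROM pg_class c
 WHERE c.relkind IN ('r', 'm', 't')
 ORDER BY 2;
-- Y: sample txid_current() periodically and divide the delta by elapsed days.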

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speeding up planning with partitions

2019-03-30 Thread Amit Langote
On Sun, Mar 31, 2019 at 12:11 AM Tom Lane  wrote:
> Amit Langote  writes:
> > I think the performance results did prove that degradation due to
> > those loops over part_rels becomes significant for very large
> > partition counts.  Is there a better solution than the bitmapset that
> > you have in mind?
>
> Hm, I didn't see much degradation in what you posted in
> <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>.

Sorry that I didn't mention the link to begin with, but I meant to
point to numbers that I reported on Monday this week.

https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp

You were complaining of the bitmapset being useless overhead for small
partition counts, but the numbers I get tend to suggest that any
degradation in performance is within noise range, whereas the
performance benefit from having them looks pretty significant for very
large partition counts.

> I am curious as to why there seems to be more degradation
> for hash cases, as per Yoshikazu-san's results in
> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
> but whatever's accounting for the difference probably
> is not that.

I suspected it may have been the lack of bitmapsets, but maybe only
Imai-san could've confirmed that by applying the live_parts patch too.

Thanks,
Amit




Re: speeding up planning with partitions

2019-03-30 Thread Robert Haas
On Sat, Mar 30, 2019 at 11:46 AM Tom Lane  wrote:
> > The only problem with PartitionPruneInfo structures of which I am
> > aware is that they rely on PartitionDesc offsets not changing. But I
> > added code in that commit in ExecCreatePartitionPruneState to handle
> > that exact problem.  See also paragraph 5 of the commit message, which
> > begins with "Although in general..."
>
> Ah.  Grotty, but I guess it will cover the issue.

I suppose it is.  I am a little suspicious of the decision to make
PartitionPruneInfo structures depend on PartitionDesc indexes.  First,
it's really non-obvious that the dependency exists, and I do not think
I would have spotted it had not Alvaro pointed the problem out.
Second, I wonder whether it is really a good idea in general to make a
plan depend on array indexes when the array is not stored in the plan.
In one sense, we do that all the time, because attnums are arguably
just indexes into what is conceptually an array of attributes.
However, I feel that's not quite the same, because the attnum is
explicitly stored in the catalogs, and PartitionDesc array indexes are
not stored anywhere, but rather are the result of a fairly complex
calculation.  Now I guess it's probably OK because we will probably
have lots of other problems if we don't get the same answer every time
we do that calculation, but it still makes me a little nervous.  I
would try to propose something better but I don't have a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speeding up planning with partitions

2019-03-30 Thread Tom Lane
Robert Haas  writes:
> On Sat, Mar 30, 2019 at 11:11 AM Tom Lane  wrote:
>> Before that, though, I remain concerned that the PartitionPruneInfo
>> data structure the planner is transmitting to the executor is unsafe
>> against concurrent ATTACH PARTITION operations.  The comment for
>> PartitionedRelPruneInfo says in so many words that it's relying on
>> indexes in the table's PartitionDesc; how is that not broken by
>> 898e5e329?

> The only problem with PartitionPruneInfo structures of which I am
> aware is that they rely on PartitionDesc offsets not changing. But I
> added code in that commit in ExecCreatePartitionPruneState to handle
> that exact problem.  See also paragraph 5 of the commit message, which
> begins with "Although in general..."

Ah.  Grotty, but I guess it will cover the issue.

regards, tom lane




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Robert Haas
On Sat, Mar 30, 2019 at 6:33 AM Thomas Munro  wrote:
> I didn't understand that last sentence.
>
> Here's an attempt to write a suitable comment for the quick fix.  And
> I suppose effective_io_concurrency is a reasonable default.
>
> It's pretty hard to think of a good way to get your hands on the real
> value safely from here.  I wondered if there was a way to narrow this
> to just GLOBALTABLESPACE_OID since that's where pg_tablespace lives,
> but that doesn't work, we access other catalog too in that path.
>
> Hmm, it seems a bit odd that 0 is supposed to mean "disable issuance
> of asynchronous I/O requests" according to config.sgml, but here 0
> will prefetch 10 buffers.

Mmmph.  I'm starting to think we're not going to get a satisfactory
result here unless we make this controlled by something other than
effective_io_concurrency.  There's just no reason to suppose that the
same setting that we use to control prefetching for bitmap index scans
is also going to be right for what's basically a bulk operation.

Interestingly, Dilip Kumar ran into similar issues recently while
working on bulk processing for undo records for zheap.  In that case,
you definitely want to prefetch the undo aggressively, because you're
reading it front to back and backwards scans suck without prefetching.
And you possibly also want to prefetch the data pages to which the
undo that you are prefetching applies, but maybe not as aggressively
because you're going to be doing a WAL write for each data page and
flooding the system with too many reads could be counterproductive, at
least if pg_wal and the rest of $PGDATA are not on separate spindles.
And even if they are, it's possible that as you suck in undo pages and
the zheap pages that they need to update, you might evict dirty pages,
generating write activity against the data directory.

Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK.  Let's just create a new GUC for this and default it to 10
or something and go home.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speeding up planning with partitions

2019-03-30 Thread Robert Haas
On Sat, Mar 30, 2019 at 11:11 AM Tom Lane  wrote:
> Before that, though, I remain concerned that the PartitionPruneInfo
> data structure the planner is transmitting to the executor is unsafe
> against concurrent ATTACH PARTITION operations.  The comment for
> PartitionedRelPruneInfo says in so many words that it's relying on
> indexes in the table's PartitionDesc; how is that not broken by
> 898e5e329?

The only problem with PartitionPruneInfo structures of which I am
aware is that they rely on PartitionDesc offsets not changing. But I
added code in that commit in ExecCreatePartitionPruneState to handle
that exact problem.  See also paragraph 5 of the commit message, which
begins with "Although in general..."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: jsonpath

2019-03-30 Thread Alexander Korotkov
On Fri, Mar 29, 2019 at 4:15 PM Alexander Korotkov
 wrote:
> On Thu, Mar 28, 2019 at 7:43 PM Andrew Dunstan
>  wrote:
> > On 3/28/19 9:50 AM, Tom Lane wrote:
> > > Andres Freund  writes:
> > >> On March 28, 2019 9:31:14 AM EDT, Tom Lane  wrote:
> > >>> Has anybody gotten through a valgrind run on this code yet?
> > >> Skink has successfully passed since - but that's x86...
> > > Yeah, there is a depressingly high chance that this is somehow specific
> > > to the bison version, flex version, and/or compiler in use on jacana.
> >
> > lousyjack has also passed it (x64).
> >
> > git bisect on jacana blames commit 550b9d26f.
>
> Hmm... 550b9d26f just makes jsonpath_gram.y and jsonpath_scan.l
> compile at once.  I've re-read this commit and didn't find anything
> suspicious.
> I've asked Andrew for access to jacana in order to investigate this myself.

I'm going to push there 3 attached patches for jsonpath.

The 1st one is a revised patch implementing GIN index support for
jsonpath.  Based on feedback from Jonathan Katz, I decided to restrict
jsonb_ops from the key-only query case.  Now, jsonb_ops also works only
if an "accessors_chain = const" statement is found.  Keys are frequently
not selective.  Given that we have no statistics for them, purely
key-based GIN queries are likely to confuse the optimizer, making it
select an inefficient plan.  Actually, jsonb_ops may be used for a pure
key query when the ? operator is used, but in that case the user
explicitly searches for a key.  With jsonpath, purely key-based GIN
searches can easily happen unintentionally.  So, restrict that.
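
As a rough illustration of that behaviour (my reading of the above; the
table and data are made up, this is not an excerpt from the patch):

CREATE INDEX ON js_tab USING gin (js);                      -- default jsonb_ops

-- accessor chain compared to a constant: can use the GIN index
SELECT * FROM js_tab WHERE js @@ '$.tags[*] == "postgres"';

-- purely key-based jsonpath search: deliberately not accelerated by jsonb_ops
SELECT * FROM js_tab WHERE js @? '$.tags';

-- explicit key-existence search with ? can still use the index
SELECT * FROM js_tab WHERE js ? 'tags';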

The 2nd and 3rd patches are from Nikita Glukhov upthread.  The 2nd
restricts some cases in parsing numerics.  The 3rd makes the
jsonb_path_match() function throw an error when the result is not a
single boolean and silent mode is off.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-jsonpath-gin-support.patch
Description: Binary data


0002-restrict-some-cases-in-jsonpath-numerics-parsing.patch
Description: Binary data


0003-make-jsonb_path_match-throw-error.patch
Description: Binary data


Re: speeding up planning with partitions

2019-03-30 Thread Tom Lane
Amit Langote  writes:
> On Sat, Mar 30, 2019 at 9:17 AM Tom Lane  wrote:
>> What I propose we do about the GEQO problem is shown in 0001 attached
>> (which would need to be back-patched into v11).
>> ...
>> That's just dumb.  What we *ought* to be doing in such degenerate
>> outer-join cases is just emitting the non-dummy side, ie

> Fwiw, I agree that we should fix join planning so that we get the
> ProjectionPath atop scan path of non-nullable relation instead of a
> full-fledged join path with dummy path on the nullable side.  It seems
> to me that the "fix" would be mostly be localized to
> try_partitionwise_join() at least insofar as detecting whether we
> should generate a join or the other plan shape is concerned, right?

Well, if we're going to do something about that, I would like to see
it work for non-partition cases too, ie we're not smart about this
either:

regression=# explain select * from tenk1 left join (select 1 where false) ss(x) 
on unique1=x;
QUERY PLAN 
---
 Nested Loop Left Join  (cost=0.00..570.00 rows=1 width=248)
   Join Filter: (tenk1.unique1 = 1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=1 width=244)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
 One-Time Filter: false
(5 rows)
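
Semantically, a left join whose nullable side is provably empty just returns
the outer rows with NULLs for the inner columns, i.e. the above amounts to
something like (illustration, not planner output):

select tenk1.*, null::integer as x from tenk1;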

A general solution would presumably involve new logic in
populate_joinrel_with_paths for the case where the nullable side is
dummy.  I'm not sure whether that leaves anything special to do in
try_partitionwise_join or not.  Maybe it would, since that would
have a requirement to build the joinrel without any RHS input
RelOptInfo, but I don't think that's the place to begin working on
this.

> By the way, does it make sense to remove the tests whose output
> changes altogether and reintroduce them when we fix join planning?
> Especially, in partitionwise_aggregate.out, there are comments near
> changed plans which are no longer true.

Good point about the comments, but we shouldn't just remove those test
cases; they're useful to exercise the give-up-on-partitionwise-join
code paths.  I'll tweak the comments.

>> 0002 attached is then the rest of the partition-planning patch;
>> it doesn't need to mess with joinrels.c at all.  I've addressed
>> the other points discussed today in that, except for the business
>> about whether we want your 0003 bitmap-of-live-partitions patch.
>> I'm still inclined to think that that's not really worth it,
>> especially in view of your performance results.

> I think the performance results did prove that degradation due to
> those loops over part_rels becomes significant for very large
> partition counts.  Is there a better solution than the bitmapset that
> you have in mind?

Hm, I didn't see much degradation in what you posted in
<5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>.
I am curious as to why there seems to be more degradation
for hash cases, as per Yoshikazu-san's results in
<0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
but whatever's accounting for the difference probably
is not that.  Anyway I still believe that getting rid of
these sparse arrays would be a better answer.

Before that, though, I remain concerned that the PartitionPruneInfo
data structure the planner is transmitting to the executor is unsafe
against concurrent ATTACH PARTITION operations.  The comment for
PartitionedRelPruneInfo says in so many words that it's relying on
indexes in the table's PartitionDesc; how is that not broken by
898e5e329?

regards, tom lane




Re: Checksum errors in pg_stat_database

2019-03-30 Thread Magnus Hagander
On Sat, Mar 30, 2019 at 3:55 PM Julien Rouhaud  wrote:

> On Sat, Mar 30, 2019 at 2:33 PM Magnus Hagander 
> wrote:
> >
> > On Wed, Mar 13, 2019 at 4:54 PM Julien Rouhaud 
> wrote:
> >>
> >> On Wed, Mar 13, 2019 at 4:53 PM Julien Rouhaud 
> wrote:
> >> >
> >> > As a result I ended up simply adding counters for the number of total
> >> > checks and the timestamp of the last failure in PgStat_StatDBEntry,
> >> > making attached patch very lightweight.  I moved all the checksum
> >> > related counters out of pg_stat_database in a new pg_stat_checksum
> >> > view.  It avoids to make pg_stat_database too wide, and also allows to
> >> > display information about shared object in this new view (some of the
> >> > other counters don't really make sense for shared objects or could
> >> > break existing monitoring query).  While at it, I tried to add a
> >> > little bit of documentation wrt. checksum monitoring.
> >>
> >> and of course I forgot to attach the patch.
> >
> >
> > Does it really make any sense to track "number of checksum checks"? In
> any sort of interesting database that's just going to be an insanely high
> number, isn't it? (And also, to stay consistent with checksum failures, we
> should of course also count the checks done in base backups, which is not
> in the patch. But I'm more thinking we should drop it)
>
> Thanks for looking at it!
>
> It's surely going to be a huge number on databases with a large number
> of buffer eviction and/or frequent pg_basebackup.  The idea was to be
> able to know if the possible lack of failure was due to lack of check
> at all or because the server appears to be healthy, without spamming
> gettimeofday calls.  If having a last_check() is better, I'm fine with
> it.  If it's useless, let's drop it.
>

I'm not sure either of them are really useful, but would be happy to take
input from others :)


The number of checks was supposed to also be tracked in base_backups, with
>

Oh, that's a sloppy review. I see it's there. However, it doesn't appear to
count up in the *normal* backend path...

My vote is still to drop it completely, but if we're keeping it, it has to
go in both paths.


> Having thought some more about this, I wonder if the right thing to do is
> to actually add a row to pg_stat_database for the global stats, rather than
> invent a separate view. I can see the argument going both ways, but
> particularly with the name pg_stat_checksums we are setting a pattern that
> will create one view for each counter. That's not very good, I think.
> >
> > In the end I'm somewhat split on the idea of pg_stat_database with a
> NULL row or pg_stat_checkpoints. What do others think?
>
> I agree that having a separate view for each counter is a bad idea.
> But what I was thinking is that we'll probably end up with a view to
> track per-db online checksum activation progress/activity/status at
> some point (similar to pg_stat_progress_vacuum), so why not starting
> with this dedicated view right now and add new counters later, either
> in pgstat and/or some shmem, as long as we keep the view name as SQL
> interface.
>

Technically, that should be in pg_stat_progress_checksums to be consistent
:) So whichever way we turn, it's going to be inconsistent with something.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Checksum errors in pg_stat_database

2019-03-30 Thread Julien Rouhaud
On Sat, Mar 30, 2019 at 2:33 PM Magnus Hagander  wrote:
>
> On Wed, Mar 13, 2019 at 4:54 PM Julien Rouhaud  wrote:
>>
>> On Wed, Mar 13, 2019 at 4:53 PM Julien Rouhaud  wrote:
>> >
>> > As a result I ended up simply adding counters for the number of total
>> > checks and the timestamp of the last failure in PgStat_StatDBEntry,
>> > making attached patch very lightweight.  I moved all the checksum
>> > related counters out of pg_stat_database in a new pg_stat_checksum
>> > view.  It avoids to make pg_stat_database too wide, and also allows to
>> > display information about shared object in this new view (some of the
>> > other counters don't really make sense for shared objects or could
>> > break existing monitoring query).  While at it, I tried to add a
>> > little bit of documentation wrt. checksum monitoring.
>>
>> and of course I forgot to attach the patch.
>
>
> Does it really make any sense to track "number of checksum checks"? In any 
> sort of interesting database that's just going to be an insanely high number, 
> isn't it? (And also, to stay consistent with checksum failures, we should of 
> course also count the checks done in base backups, which is not in the patch. 
> But I'm more thinking we should drop it)

Thanks for looking at it!

It's surely going to be a huge number on databases with a large number
of buffer eviction and/or frequent pg_basebackup.  The idea was to be
able to know if the possible lack of failure was due to lack of check
at all or because the server appears to be healthy, without spamming
gettimeofday calls.  If having a last_check() is better, I'm fine with
it.  If it's useless, let's drop it.

The number of checks was supposed to also be tracked in base_backups, with

@@ -1527,6 +1527,8 @@ sendFile(const char *readfilename, const char
*tarfilename, struct stat *statbuf
 "failures in file \"%s\" will not "
 "be reported", readfilename)));
 }
+else if (block_retry == false)
+checksum_checks++;

> Having thought some more about this, I wonder if the right thing to do is to 
> actually add a row to pg_stat_database for the global stats, rather than 
> invent a separate view. I can see the argument going both ways, but 
> particularly with the name pg_stat_checksums we are setting a pattern that 
> will create one view for each counter. That's not very good, I think.
>
> In the end I'm somewhat split on the idea of pg_stat_database with a NULL row 
> or pg_stat_checkpoints. What do others think?

I agree that having a separate view for each counter is a bad idea.
But what I was thinking is that we'll probably end up with a view to
track per-db online checksum activation progress/activity/status at
some point (similar to pg_stat_progress_vacuum), so why not starting
with this dedicated view right now and add new counters later, either
in pgstat and/or some shmem, as long as we keep the view name as SQL
interface.

Anyway, I don't have a strong preference for any implementation, so
I'll be happy to send an updated patch with what ends up being
preferred.




Re: Progress reporting for pg_verify_checksums

2019-03-30 Thread Michael Paquier
On Thu, Mar 28, 2019 at 03:53:59PM +0100, Fabien COELHO wrote:
> I think that it is good to show the overall impact of the signal stuff, in
> particular the fact that the size must always be computed if the progress
> may be activated.

The total size and the current size are the two
important factors that matter when it comes to doing progress reporting
in my opinion.  I have read the patch, and I am not really convinced
by the need to show the progress report based on an interval of 250ms
as we talk about an operation which could take dozens of minutes.  So
I have simplified the patch to only show a progress report every
second.  This also removes the include for the time-related APIs from 
portability/.  A second thing is that I don't think that the speed is
very useful.  I would expect the speed to be steady, still there is a
risk to show incorrect information if the speed of the operation is
spiky or irregular leading to an incorrect estimation of the remaining
time.

In short, I would like to commit the first patch as attached, which is
much more simple than what has been sent previously, still it provides
the progress information which is useful.
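
For instance, an illustrative invocation and mid-run output (the data
directory path is made up; the output line follows the format string used
in progress_report() below):

$ pg_checksums --progress -D /path/to/data
 4096/8192 MB (50%) done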
--
Michael
diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 70339eaec9..29507ff98e 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -135,6 +135,17 @@ PostgreSQL documentation
   
  
 
+ 
+  -P
+  --progress
+  
+   
+Enable progress reporting. Turning this on will deliver a progress
+report while checking or enabling checksums.
+   
+  
+ 
+
  
-V
--version
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 5c185ebed8..f9903a441c 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -15,6 +15,7 @@
 #include "postgres_fe.h"
 
 #include 
+#include 
 #include 
 #include 
 
@@ -37,6 +38,7 @@ static ControlFileData *ControlFile;
 static char *only_relfilenode = NULL;
 static bool do_sync = true;
 static bool verbose = false;
+static bool showprogress = false;
 
 typedef enum
 {
@@ -59,6 +61,13 @@ static PgChecksumMode mode = PG_MODE_CHECK;
 
 static const char *progname;
 
+/*
+ * Progress status information.
+ */
+int64		total_size = 0;
+int64		current_size = 0;
+static pg_time_t last_progress_report = 0;
+
 static void
 usage(void)
 {
@@ -71,6 +80,7 @@ usage(void)
 	printf(_("  -d, --disable  disable data checksums\n"));
 	printf(_("  -e, --enable   enable data checksums\n"));
 	printf(_("  -N, --no-sync  do not wait for changes to be written safely to disk\n"));
+	printf(_("  -P, --progress show progress information\n"));
 	printf(_("  -v, --verbose  output verbose messages\n"));
 	printf(_("  -r RELFILENODE check only relation with specified relfilenode\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
@@ -97,6 +107,52 @@ static const char *const skip[] = {
 	NULL,
 };
 
+/*
+ * Report current progress status. Parts borrowed from
+ * PostgreSQL's src/bin/pg_basebackup.c
+ */
+static void
+progress_report(bool force)
+{
+	int			percent;
+	char		total_size_str[32];
+	char		current_size_str[32];
+	pg_time_t	now;
+
+	if (!showprogress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force)
+		return;	/* Max once per second */
+
+	/* Save current time */
+	last_progress_report = now;
+
+	/* Adjust total size if current_size is larger */
+	if (current_size > total_size)
+		total_size = current_size;
+
+	/* Calculate current percentage of size done */
+	percent = total_size ? (int) ((current_size) * 100 / total_size) : 0;
+
+	snprintf(total_size_str, sizeof(total_size_str), INT64_FORMAT,
+			 total_size / (1024 * 1024));
+	snprintf(current_size_str, sizeof(current_size_str), INT64_FORMAT,
+			 current_size / (1024 * 1024));
+
+	/*
+	 * Separate step to keep platform-dependent format code out of
+	 * translatable strings.  And we only test for INT64_FORMAT availability
+	 * in snprintf, not fprintf.
+	 */
+	fprintf(stderr, "%*s/%s MB (%d%%) done",
+			(int) strlen(current_size_str), current_size_str, total_size_str,
+			percent);
+
+	fprintf(stderr, "\r");
+}
+
 static bool
 skipfile(const char *fn)
 {
@@ -153,6 +209,7 @@ scan_file(const char *fn, BlockNumber segmentno)
 			continue;
 
 		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
+		current_size += r;
 		if (mode == PG_MODE_CHECK)
 		{
 			if (csum != header->pd_checksum)
@@ -183,6 +240,8 @@ scan_file(const char *fn, BlockNumber segmentno)
 exit(1);
 			}
 		}
+
+		progress_report(false);
 	}
 
 	if (verbose)
@@ -195,12 +254,23 @@ scan_file(const char *fn, BlockNumber segmentno)
 	_("%s: checksums enabled in file \"%s\"\n"), progname, fn);
 	}
 
+	/* Make sure progress is reported at least once per file */
+	

Re: Column lookup in a row performance

2019-03-30 Thread Tom Lane
Павлухин Иван  writes:
> Does anyone know why the format is still the same?

(1) Backwards compatibility, and (2) it's not clear that a different
layout would be a win for all cases.

regards, tom lane




[PATCH v22] GSSAPI encryption support

2019-03-30 Thread Stephen Frost
Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Stephen Frost  writes:
> 
> > One of the things that I really didn't care for in this patch was the
> > use of the string buffers, without any real checks (except for "oh,
> > you tried to allocated over 1G"...) to make sure that the other side
> > of the connection wasn't feeding us ridiculous packets, and with the
> > resizing of the buffers, etc, that really shouldn't be necessary.
> > After chatting with Robbie about these concerns while reading through
> > the code, we agreed that we should be able to use fixed buffer sizes
> > and use the quite helpful gss_wrap_size_limit() to figure out how much
> > data we can encrypt without going over our fixed buffer size.  As
> > Robbie didn't have time to implement those changes this past week, I
> > did so, and added a bunch more comments and such too, and have now
> > gone through and done more testing.  Robbie has said that he should
> > have time this upcoming week to review the changes that I made, and
> > I'm planning to go back and review other parts of the patch more
> > closely now as well.
> 
> In general this looks good - there are a couple minor comments inline,
> but it's fine.

Thanks for reviewing!

> I wanted to note a couple things about this approach.  It now uses one
> more buffer than before (in contrast to the previous approach, which
> reused a buffer for received data that was encrypted and decrypted).

Yeah, I don't see that as too much of an issue and it certainly seems
cleaner and simpler to reason about to me, which seems worth the modest
additional buffer cost.  That's certainly something we could change if
others feel differently.

> Since these are static fixed size buffers, this increases the total
> steady-state memory usage by 16k as opposed to re-using the buffer.
> This may be fine; I don't know how tight RAM is here.

It seems unlikely to be an issue to me- and I would contend that the
prior implementation didn't actually take any steps to prevent the other
side from sending packets of nearly arbitrary size (up to 1G), so while
the steady-state memory usage of the prior implementation was less when
everyone was playing nicely, it could have certainly been abused.  I'm a
lot happier having an explicit cap on how much memory will be used, even
if that cap is a bit higher.
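
As a sketch of the mechanism being described (not code from the patch;
the buffer size and function name below are illustrative, only
gss_wrap_size_limit() itself is the standard GSSAPI call):

#include <gssapi/gssapi.h>

#define ENC_BUFFER_SIZE 16384        /* illustrative fixed buffer size */

/*
 * Largest plaintext chunk whose gss_wrap() output is guaranteed to fit
 * in ENC_BUFFER_SIZE bytes; returns 0 on error.  Sketch only.
 */
static OM_uint32
max_plaintext_for_buffer(gss_ctx_id_t gctx)
{
    OM_uint32   major,
                minor,
                max_plain = 0;

    major = gss_wrap_size_limit(&minor, gctx,
                                1,               /* request confidentiality */
                                GSS_C_QOP_DEFAULT,
                                ENC_BUFFER_SIZE, /* desired max token size */
                                &max_plain);
    if (major != GSS_S_COMPLETE)
        return 0;
    return max_plain;
}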

> > Note that there's an issue with exporting the context to get the
> > encryption algorithm used that I've asked Robbie to look into, so
> > that's no longer done and instead we just print that the connection is
> > encrypted, if it is.  If we can't figure out a way to make that work
> > then obviously I'll pull out that code, and if we can get it to work
> > then I'll update it to be done through libpq properly, as I had
> > suggested earlier.  That's really more of a nice to have in any case
> > though, so I may just exclude it for now anyway if it ends up adding
> > any complications.
> 
> Correct.  Unfortunately I'd overlooked that the lucid interface won't
> meet our needs (destroys the context).  So the two options here are:
> SASL SSF (and I'll separately push more mechs to add support for that),
> or nothing at all.

I went ahead and ripped out all of that, including the SASL SSF (which
really didn't seem like it added all that much...  but if others feel
like it's useful, then we can add it back in later).  That also let me
get rid of the configure/configure.in changes, which was nice.  I did
add in a simple pg_stat_gssapi view, modeled on pg_stat_ssl, so that you
can check server-side if GSSAPI was used for authentication and/or
encryption, and what principal was used if GSSAPI was used for
authentication.

I also changed the libpq function to return a boolean to indicate if
encryption was used or not; I don't think it's appropriate to have a
libpq function that will just print stuff to stdout, that should be the
client program's job, so that's handled by psql now.

There were some other bits that had been forgotten to get removed from
when I moved away from using the string buffers, but I went through and
I'm pretty confident that I've cleaned all of that out now.

> If you want a patch for that I can make one, but I think there was code
> already... just needed a ./configure check program for whether the OID
> is defined.
> 
> > +ssize_t
> > +be_gssapi_write(Port *port, void *ptr, size_t len)
> > +{
> > +   size_t  bytes_to_encrypt = len;
> > +   size_t  bytes_encrypted = 0;
> > +
> > +   /*
> > +* Loop through encrypting data and sending it out until
> > +* secure_raw_write() complains (which would likely mean that the socket
> > +* is non-blocking and the requested send() would block, or there was 
> > some
> > +* kind of actual error) and then return.
> > +*/
> > +   while (bytes_to_encrypt || PqGSSSendPointer)
> > +   {
> 
> I guess it's not a view everyone will share, but I think this block is
> too long.  Maybe a helper function around 

Re: Indexscan failed assert caused by using index without lock

2019-03-30 Thread Tom Lane
高增琦  writes:
> Following example can reproduce the problem:

Yeah, this is being discussed at
https://www.postgresql.org/message-id/flat/19465.1541636...@sss.pgh.pa.us

regards, tom lane




Re: Online verification of checksums

2019-03-30 Thread Andres Freund
Hi,

On 2019-03-30 12:56:21 +0100, Magnus Hagander wrote:
> > ISTM that the fact that we had to teach it about different segment files
> > for checksum verification by splitting up the filename at "." implies
> > that it is not the correct level of abstraction (but maybe it could get
> > schooled some more about Postgres internals, e.g. by passing it a
> > RelFileNode struct and not a filename).
> >
> 
> But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
> we fix it there, we only have to fix it once...

I'm not understanding the problem here. We already need to know all of
this? sendFile() determines whether the file is checksummed, and
computes the segment number:

if (is_checksummed_file(readfilename, filename))
{
verify_checksum = true;
...
checksum = pg_checksum_page((char *) 
page, blkno + segmentno * RELSEG_SIZE);
phdr = (PageHeader) page;

I agree that the way checksumming works is a bit of a layering
violation. In my opinion it belongs in the smgr level, not bufmgr.c etc,
so different storage methods can store it differently. But that seems
fairly indepedent of this problem.

Greetings,

Andres Freund




Re: Checksum errors in pg_stat_database

2019-03-30 Thread Magnus Hagander
On Wed, Mar 13, 2019 at 4:54 PM Julien Rouhaud  wrote:

> On Wed, Mar 13, 2019 at 4:53 PM Julien Rouhaud  wrote:
> >
> > On Sun, Mar 10, 2019 at 1:13 PM Julien Rouhaud 
> wrote:
> > >
> > > On Sat, Mar 9, 2019 at 7:58 PM Julien Rouhaud 
> wrote:
> > > >
> > > > On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander 
> wrote:
> > > > >
> > > > > On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud 
> wrote:
> > > > >>
> > > > >> Sorry, I have again new comments after a little bit more thinking.
> > > > >> I'm wondering if we can do something about shared objects while
> we're
> > > > >> at it.  They don't belong to any database, so it's a little bit
> > > > >> orthogonal to this proposal, but it seems quite important to track
> > > > >> error on those too!
> > > > >>
> > > > >> What about adding a new field in PgStat_GlobalStats for that?  We
> can
> > > > >> use the same lastDir to easily detect such objects and slightly
> adapt
> > > > >> sendFile again, which seems quite straightforward.
> > > >
> > > > > Question is then what number that should show -- only the checksum
> counter in non-database-fields, or the total number across the cluster?
> > > >
> > > > I'd say only for non-database-fields errors, especially if we can
> > > > reset each counters separately.  If necessary, we can add a new view
> > > > to give a global overview of checksum errors for DBA convenience.
> > >
> > > I'm considering adding a new PgStat_ChecksumStats for that purpose
> > > instead, but I don't know if that's acceptable to do so in the last
> > > commitfest.  It seems worthwhile to add it eventually, since we'll
> > > probably end up having more things to report to users related to
> > > checksum.  Online enabling of checksum could be the most immediate
> > > potential target.
> >
> > I wasn't aware that we were already storing informations about shared
> > objects in PgStat_StatDBEntry, with an InvalidOid as databaseid
> > (though we don't have any system view that are actually showing
> > information for such objects).
> >
> > As a result I ended up simply adding counters for the number of total
> > checks and the timestamp of the last failure in PgStat_StatDBEntry,
> > making attached patch very lightweight.  I moved all the checksum
> > related counters out of pg_stat_database in a new pg_stat_checksum
> > view.  It avoids to make pg_stat_database too wide, and also allows to
> > display information about shared object in this new view (some of the
> > other counters don't really make sense for shared objects or could
> > break existing monitoring query).  While at it, I tried to add a
> > little bit of documentation wrt. checksum monitoring.
>
> and of course I forgot to attach the patch.
>

Does it really make any sense to track "number of checksum checks"? In any
sort of interesting database that's just going to be an insanely high
number, isn't it? (And also, to stay consistent with checksum failures, we
should of course also count the checks done in base backups, which is not
in the patch. But I'm more thinking we should drop it)

I do like the addition of the "last failure" column, that's really useful.

Having thought some more about this, I wonder if the right thing to do is
to actually add a row to pg_stat_database for the global stats, rather than
invent a separate view. I can see the argument going both ways, but
particularly with the name pg_stat_checksums we are setting a pattern that
will create one view for each counter. That's not very good, I think.

In the end I'm somewhat split on the idea of pg_stat_database with a NULL
row or pg_stat_checkpoints. What do others think?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2019-03-30 Thread Andrew Dunstan


On 3/29/19 9:08 PM, Michael Paquier wrote:
> On Fri, Mar 29, 2019 at 11:22:55AM -0300, Alvaro Herrera wrote:
>> Yeah, that looks good to me too.  I wonder if we really need it as LOG
>> though; we don't say anything for actions unless they take more than the
>> min duration, so why say something for a no-op that takes almost no time?
>> Maybe make it DEBUG1.
> I think that this does not justify a WARNING, as that's harmless for
> the user even if we use WARNING for other skips (see
> vacuum_is_relation_owner).  However DEBUG1 is also too low in my
> opinion as this log can be used as an indicator that autovacuum is too
> aggressive because there are too many workers, for example.  I
> have seen that matter in some CPU-bound environments.  I won't fight
> hard if the consensus is to use DEBUG1 though.  So, more opinions?
> Andrew perhaps?
>
>


It's really just a matter of housekeeping as I see it, so probably
DEBUG1 is right.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Online verification of checksums

2019-03-30 Thread Magnus Hagander
On Fri, Mar 29, 2019 at 10:08 PM Michael Banck 
wrote:

> Hi,
>
> Am Freitag, den 29.03.2019, 16:52 +0100 schrieb Magnus Hagander:
> > On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost 
> wrote:
> > > * Magnus Hagander (mag...@hagander.net) wrote:
> > > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> > > > wrote:
> > > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > > >> I agree that the current patch might have some corner-cases
> where it
> > > > > >> does not guarantee 100% accuracy in online mode, but I hope the
> current
> > > > > >> version at least has no more false negatives.
> > > > > >
> > > > > >False positives are *bad*. We shouldn't integrate code that has
> them.
> > > > >
> > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the
> online mode
> > > > > communicate with the server, which would presumably address these
> issues.
> > > > > Can someone explain why not to do that?
> > > >
> > > > I agree that this effort seems better spent on fixing those issues
> there
> > > > (of which many are the same), and then re-use that.
> > >
> > > This really seems like it depends on which of the options we're talking
> > > about..   Connecting to the server and asking what the current insert
> > > point is, so we can check that the LSN isn't completely insane, seems
> > > reasonable, but at least one option being discussed was to have
> > > pg_basebackup actually *lock the page* (even if just for I/O..) and
> then
> > > re-read it, and having an external tool doing that instead of the
> > > backend seems like a whole different level to me.  That would involve
> > > having an SQL function for "lock this page against I/O" and then
> another
> > > for "unlock this page", wouldn't it?
> >
> > Right.
> >
> > But what if we just added a flag to the BASE_BACKUP command in the
> > replication protocol that said "meh, I really just want to verify the
> > checksums, so please send the data to devnull and only feed me regular
> > status updates on this connection"?
>
> I don't know whether BASE_BACKUP is the best interface for that (at
> least right now) - backend/replication/basebackup.c's sendFile() gets
> only an absolute filename to send, which is not adequate for more in-
> depth server-based things like locking a particular page in a particular
> relation of some particular tablespace.


> ISTM that the fact that we had to teach it about different segment files
> for checksum verification by splitting up the filename at "." implies
> that it is not the correct level of abstraction (but maybe it could get
> schooled some more about Postgres internals, e.g. by passing it a
> RelFileNode struct and not a filename).
>

But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
we fix it there, we only have to fix it once...


//Magnus


Re: PostgreSQL pollutes the file system

2019-03-30 Thread Peter Eisentraut
On 2019-03-29 20:32, Joe Conway wrote:
>   pg_util  

How is that better than just renaming to pg_$oldname?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PostgreSQL pollutes the file system

2019-03-30 Thread Peter Eisentraut
On 2019-03-29 16:41, Tom Lane wrote:
> Or perhaps better, allow pg_ctl to grow new
> subcommands for those tasks?

pg_ctl is a tool to control the server; the commands being complained
about are client-side things.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX CONCURRENTLY 2.0

2019-03-30 Thread Peter Eisentraut
On 2019-03-29 16:10, Shinoda, Noriyoshi (PN Japan A Delivery) wrote:
> postgres=> CREATE TABLE part1(c1 INT) PARTITION BY RANGE(c1);
> CREATE TABLE
> postgres=> CREATE TABLE part1v1 PARTITION OF part1 FOR VALUES FROM (0) TO 
> (100);
> CREATE TABLE
> postgres=> CREATE INDEX idx1_part1 ON part1(c1);
> CREATE INDEX
> postgres=> REINDEX TABLE CONCURRENTLY part1v1;
> ERROR:  cannot drop index part1v1_c1_idx_ccold because index idx1_part1 
> requires it
> HINT:  You can drop index idx1_part1 instead.

The attached patch fixes this.  The issue was that we didn't move all
dependencies from the index (only in the other direction).  Maybe that
was sufficient when the patch was originally written, before partitioned
indexes.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 66a5ce802ede0bd28a6f4881e063ac92a35c0c8c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 30 Mar 2019 09:37:19 +0100
Subject: [PATCH] Fix REINDEX CONCURRENTLY of partitions

When swapping the old and new index, we not only need to change
dependencies referencing the index, but also dependencies of the
index referencing something else.  The previous code did this only
specifically for a constraint, but we also need to do this for
partitioned indexes.  So instead write a generic function that does it
for all dependencies.
---
 src/backend/catalog/index.c| 24 +-
 src/backend/catalog/pg_depend.c| 56 ++
 src/include/catalog/dependency.h   |  3 ++
 src/test/regress/expected/create_index.out |  6 +++
 src/test/regress/sql/create_index.sql  |  6 +++
 5 files changed, 73 insertions(+), 22 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0d9d405c54..77e2088034 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1555,29 +1555,9 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, 
const char *oldName)
}
 
/*
-* Move all dependencies on the old index to the new one
+* Move all dependencies of and on the old index to the new one
 */
-
-   if (OidIsValid(indexConstraintOid))
-   {
-   ObjectAddress myself,
-   referenced;
-
-   /* Change to having the new index depend on the constraint */
-   deleteDependencyRecordsForClass(RelationRelationId, oldIndexId,
-   
ConstraintRelationId, DEPENDENCY_INTERNAL);
-
-   myself.classId = RelationRelationId;
-   myself.objectId = newIndexId;
-   myself.objectSubId = 0;
-
-   referenced.classId = ConstraintRelationId;
-   referenced.objectId = indexConstraintOid;
-   referenced.objectSubId = 0;
-
-   recordDependencyOn(&myself, &referenced, DEPENDENCY_INTERNAL);
-   }
-
+   changeDependenciesOf(RelationRelationId, oldIndexId, newIndexId);
changeDependenciesOn(RelationRelationId, oldIndexId, newIndexId);
 
/*
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index d63bf5e56d..f7caedcc02 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -395,6 +395,62 @@ changeDependencyFor(Oid classId, Oid objectId,
return count;
 }
 
+/*
+ * Adjust all dependency records to come from a different object of the same 
type
+ *
+ * classId/oldObjectId specify the old referencing object.
+ * newObjectId is the new referencing object (must be of class classId).
+ *
+ * Returns the number of records updated.
+ */
+long
+changeDependenciesOf(Oid classId, Oid oldObjectId,
+Oid newObjectId)
+{
+   longcount = 0;
+   RelationdepRel;
+   ScanKeyData key[2];
+   SysScanDesc scan;
+   HeapTuple   tup;
+
+   depRel = table_open(DependRelationId, RowExclusiveLock);
+
+   ScanKeyInit(&key[0],
+   Anum_pg_depend_classid,
+   BTEqualStrategyNumber, F_OIDEQ,
+   ObjectIdGetDatum(classId));
+   ScanKeyInit(&key[1],
+   Anum_pg_depend_objid,
+   BTEqualStrategyNumber, F_OIDEQ,
+   ObjectIdGetDatum(oldObjectId));
+
+   scan = systable_beginscan(depRel, DependDependerIndexId, true,
+ NULL, 2, key);
+
+   while (HeapTupleIsValid((tup = systable_getnext(scan
+   {
+   Form_pg_depend depform = (Form_pg_depend) GETSTRUCT(tup);
+
+   /* make a modifiable copy */
+   tup = heap_copytuple(tup);
+   depform = (Form_pg_depend) GETSTRUCT(tup);
+
+   depform->objid = newObjectId;
+
+   

Re: PostgreSQL pollutes the file system

2019-03-30 Thread Fred .Flintstone
I think the proposal you put forward is great, and would love to see
it go ahead and get implemented.

On Fri, Mar 29, 2019 at 5:35 PM Alvaro Herrera  wrote:
>
> On 2019-Mar-29, Tom Lane wrote:
>
> > Christoph Berg  writes:
> > > What might possibly make sense is to add options to psql to
> > > facilitate common tasks:
> >
> > > psql --createdb foo
> > > psql --createuser bar --superuser
> > > psql --reindex foo
> >
> > That's a thought.  Or perhaps better, allow pg_ctl to grow new
> > subcommands for those tasks?
>
> +1, as I proposed in 2016:
> https://www.postgresql.org/message-id/20160826202911.GA320593@alvherre.pgsql
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unix socket dir, an idea

2019-03-30 Thread Danylo Hlynskyi
Hi Tom, and much thanks for reply!

> I would also like to point out the extreme Unix-centricity (and
> even particular-distribution-centricity) of the alternative locations
> you mention

Yes! The /run/user and /var/run directories are absent on MacOS. That's why
I **don't** propose to change
the default directory to /var/run. Each distribution **may** set it on its own
or use default /tmp
- Ubuntu/Debian can set to /var/run/postgresql
- generic systemd distro can set to /run/postgresql
- /tmp everywhere else, including MacOS. Actually, a default
- (I think this is unrelated to Windows, but maybe Windows has its own
notion for runtime directories)

All those won't be hardcoded in PG source, it is build time param and
distribution holds all the responsibility
for changing the default.

> as well as the fact that all those locations are unfriendly
> to running an unprivileged postmaster (i.e. one that hasn't been
> explicitly blessed by whoever has root on the box).

Yes! That's why I propose to use **user runtime directory** first, when
it's available. Systemd distros do
have one (think of user's private /tmp), which is denoted by
XDG_RUNTIME_DIR envvar. No need
for server to be root, and no way for other users to hijack server socket
(which is currently possible
with 0777 /tmp)

If you are talking about two regular users, one of which runs the server
and the other the client - they will now have
to agree on which socket directory to use, yes. And what is nice, they won't
be able to override system-level
postgresql without having root rights (currently it is possible to do
between pg restarts).

> Uh, how is a client supposed to know what UID the postmaster is running
under?

It doesn't have to. It first looks under the current user's runtime directory
(XDG_RUNTIME_DIR or /run/user/$(id -u)),
and if it can't find a socket there, it searches in CONFIG_PGSOCKET_DIR
(which is common for both server and client).
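
A rough sketch of that client-side lookup order (purely illustrative, not
libpq code; the function name is made up and CONFIG_PGSOCKET_DIR stands in
for the proposed build-time default):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

#define CONFIG_PGSOCKET_DIR "/tmp"    /* proposed build-time default */

/* Prefer the per-user runtime directory if a socket for this port already
 * exists there, else fall back to the directory both server and client
 * were built with. */
void
choose_socket_dir(char *dir, size_t dirlen, int port)
{
    char        rundir[512];
    char        sockpath[1024];
    const char *xdg = getenv("XDG_RUNTIME_DIR");
    struct stat st;

    if (xdg != NULL)
        snprintf(rundir, sizeof(rundir), "%s", xdg);
    else
        snprintf(rundir, sizeof(rundir), "/run/user/%ld", (long) getuid());

    snprintf(sockpath, sizeof(sockpath), "%s/.s.PGSQL.%d", rundir, port);
    if (stat(sockpath, &st) == 0)
        snprintf(dir, dirlen, "%s", rundir);
    else
        snprintf(dir, dirlen, "%s", CONFIG_PGSOCKET_DIR);
}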

> we're going to have a Babel of servers and clients that can't talk to
each other.

I'd like to note that exactly the current Babel of servers and clients made
me write this email.
1. Debian/Ubuntu care about security, so they moved the socket directory from
a 0777 directory to a 0755 directory
(/var/run/postgresql)
2. PG in the Nix distro package set used the default setting (/tmp), and thus `psql`
installed via Nix on Ubuntu didn't connect
to the Ubuntu server by default
3. Because Debian did change the default directory, `pg_ctl start` doesn't work
with default params:
```
~$ /usr/lib/postgresql/9.6/bin/pg_ctl -D temppg -o "-p 5400" start
server starting
FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5400.lock":
Permission denied
```

Thanks again for reading this!

сб, 30 бер. 2019 о 02:40 Tom Lane  пише:

> Danylo Hlynskyi  writes:
> > The problem (as I see it) is that everybody would like to move `/tmp`
> > socket dir to `/var/run`, or even `/var/run/postgresql` (or even
> > `/run/postgresql`), but compatibility with old clients (which connect to
> > /tmp by default) is a concern.
>
> *Some* people would like to move the default socket location.  Others
> of us see that as a recipe for chaos.  If it's really easy to change
> that, we're going to have a Babel of servers and clients that can't
> talk to each other.
>
> I would also like to point out the extreme Unix-centricity (and
> even particular-distribution-centricity) of the alternative locations
> you mention, as well as the fact that all those locations are unfriendly
> to running an unprivileged postmaster (i.e. one that hasn't been
> explicitly blessed by whoever has root on the box).
>
> > 1. Add a Makefile parameter CONFIG_PGSOCKET_DIR to explicitly switch to
> new
> > unix socket directory, and let distributions decide if they want this,
> and
> > if they want, they should handle socket dir change on their own.
>
> We already have DEFAULT_PGSOCKET_DIR in pg_config_manual.h, and distros
> that want to change it typically carry a patch to adjust that header.
> I'm not sure we really want to make it any easier than that.
>
> > 2. The new socket directory shouldn't be hardcoded to single directory,
> > instead it should be detected dynamically.
>
> This idea is just nuts.  It makes each of the problems I mentioned above
> about ten times worse.
>
> > For client:
> > - if host explicitly set, use it
> > - if not, check if /run/user/$(id -u) exists and socket file exists
> there.
> > If yes, use it as socket
>
> Uh, how is a client supposed to know what UID the postmaster is running
> under?
>
> regards, tom lane
>


PostgreSQL 12 Release Management Team & Feature Freeze

2019-03-30 Thread Michael Paquier
Hi,

The Release Management Team (RMT) for the PostgreSQL 12 release
has been assembled and has determined that the feature freeze date
for the PostgreSQL 12 release will be April 7, 2019.  This means that
any feature that will be going into the PostgreSQL 12 release must be
committed before 2019-04-08 00:00:00 AoE [1].

The exception to this are any patches related to pgindent rules which
are purposefully being committed later on, and of course bug fixes.
After the freeze is in effect, any open feature in the current commit
fest will be moved into the subsequent one.

Open items for the PostgreSQL 12 release will be tracked here:
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items

For the PostgreSQL 12 release, the release management team is composed
of:

Andres Freund 
Michael Paquier 
Tomas Vondra 

For the time being, if you have any questions about the process,
please feel free to email any member of the RMT.  We will send out
notes with updates and additional guidance in the near future.

[1]: https://en.wikipedia.org/wiki/Anywhere_on_Earth

Thanks!
--
Michael




Re: [HACKERS] generated columns

2019-03-30 Thread Justin Pryzby
On Sat, Mar 30, 2019 at 09:03:03AM +0100, Peter Eisentraut wrote:
> On 2019-03-26 20:50, Pavel Stehule wrote:
> > It is great feature and I'll mark this feature as ready for commit
> 
> Committed, thanks.

create_table.sgml now has this:

https://www.postgresql.org/docs/devel/sql-createtable.html#id-1.9.3.85.6.2.18.1.2
+ 
+  The keyword STORED is required to signify that the
+  column will be computed on write and will be stored on disk.  default.
+ 

What does "default." mean?

Also, this is working but not documented as valid:
postgres=# CREATE TABLE t (j int, i int GENERATED BY DEFAULT AS (j*j+1) STORED);

Justin




Removing a few more lseek() calls

2019-03-30 Thread Thomas Munro
Hello,

Patch 0001 gets rid of the unconditional lseek() calls for SLRU I/O,
as a small follow-up to commit c24dcd0c.  Patch 0002 removes a few
places that carry code for avoiding lseek() calls while reading and
writing WAL; that code usually does a good job, but it seems better
to have no such code at all.

-- 
Thomas Munro
https://enterprisedb.com
From 68c1ec0531f802d39c80b8fa5c310c3c70d795d9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 30 Mar 2019 17:04:08 +1300
Subject: [PATCH 1/2] Use pg_pread() and pg_pwrite() in slru.c.

This avoids lseek() system calls at every SLRU I/O, as was
done for relation files in commit c24dcd0c.

Author: Thomas Munro
Discussion:
---
 src/backend/access/transam/slru.c | 25 -
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 974d42fc86..6527cfce3b 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -647,7 +647,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	SlruShared	shared = ctl->shared;
 	int			segno = pageno / SLRU_PAGES_PER_SEGMENT;
 	int			rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
-	int			offset = rpageno * BLCKSZ;
+	off_t		offset = rpageno * BLCKSZ;
 	char		path[MAXPGPATH];
 	int			fd;
 
@@ -677,17 +677,9 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 		return true;
 	}
 
-	if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
-	{
-		slru_errcause = SLRU_SEEK_FAILED;
-		slru_errno = errno;
-		CloseTransientFile(fd);
-		return false;
-	}
-
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
-	if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
+	if (pg_pread(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
 	{
 		pgstat_report_wait_end();
 		slru_errcause = SLRU_READ_FAILED;
@@ -727,7 +719,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 	SlruShared	shared = ctl->shared;
 	int			segno = pageno / SLRU_PAGES_PER_SEGMENT;
 	int			rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
-	int			offset = rpageno * BLCKSZ;
+	off_t		offset = rpageno * BLCKSZ;
 	char		path[MAXPGPATH];
 	int			fd = -1;
 
@@ -837,18 +829,9 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
 		}
 	}
 
-	if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
-	{
-		slru_errcause = SLRU_SEEK_FAILED;
-		slru_errno = errno;
-		if (!fdata)
-			CloseTransientFile(fd);
-		return false;
-	}
-
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_SLRU_WRITE);
-	if (write(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
+	if (pg_pwrite(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
 	{
 		pgstat_report_wait_end();
 		/* if write didn't set errno, assume problem is no disk space */
-- 
2.21.0

From 446007aed7fb1bafb125eaca5569f62acdf41069 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 30 Mar 2019 17:33:30 +1300
Subject: [PATCH 2/2] Use pg_pread() and pg_pwrite() in various places.

Remove several copies of some code that usually does a good job of
avoiding lseek() calls, but it seems better to have no code at all.
Also remove a simple case of an unconditional lseek() call.

Author: Thomas Munro
Discussion:
---
 src/backend/access/heap/rewriteheap.c  |  9 +
 src/backend/access/transam/xlogutils.c | 25 ++---
 src/backend/replication/walreceiver.c  | 21 +++--
 src/backend/replication/walsender.c| 23 ---
 src/bin/pg_waldump/pg_waldump.c| 25 +++--
 5 files changed, 13 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..c121e44719 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1165,13 +1165,6 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 		path, (uint32) xlrec->offset)));
 	pgstat_report_wait_end();
 
-	/* now seek to the position we want to write our data to */
-	if (lseek(fd, xlrec->offset, SEEK_SET) != xlrec->offset)
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not seek to end of file \"%s\": %m",
-		path)));
-
 	data = XLogRecGetData(r) + sizeof(*xlrec);
 
 	len = xlrec->num_mappings * sizeof(LogicalRewriteMappingData);
@@ -1179,7 +1172,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	/* write out tail end of mapping file (again) */
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE);
-	if (write(fd, data, len) != len)
+	if (pg_pwrite(fd, data, len, xlrec->offset) != len)
 	{
 		/* if write didn't set errno, assume problem is no disk space */
 		if (errno == 0)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..a63bfed3a6 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -664,7 +664,6 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr 

Re: [HACKERS] generated columns

2019-03-30 Thread Pavel Stehule
On Sat, Mar 30, 2019 at 9:03 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-26 20:50, Pavel Stehule wrote:
> > It is great feature and I'll mark this feature as ready for commit
>
> Committed, thanks.
>

Great feature; it reduces the need for triggers in some cases.

Regards

Pavel



> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] generated columns

2019-03-30 Thread Peter Eisentraut
On 2019-03-26 20:50, Pavel Stehule wrote:
> It is great feature and I'll mark this feature as ready for commit

Committed, thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Column lookup in a row performance

2019-03-30 Thread Павлухин Иван
Hi PostgresSQL developers,

I already asked my question on the pgsql-general list and did not find an
explanation. Below is the question, mostly copied from [0].


I am learning in depth how tuples are organized and how column values are
accessed in different databases. As far as I understood, Postgres does not
store all column positions in a tuple (e.g. in a header or footer). In
contrast, MySQL InnoDB stores column lengths in a record header [1].
At first glance it seems that the Postgres format can carry a
significant performance penalty when accessing a single column that
is located after multiple variable-length columns, because finding a
column value's position in a row requires multiple jumps. In InnoDB,
the position of a particular column can be found right after reading the
header.

I found several related threads in the pgsql-hackers archives [2,3]
describing significant performance wins in a prototype.

Does anyone know why the format is still the same? Perhaps InnoDB and
similar formats are not so good after all?

Please respond if you have a clue!


I did a rough experiment to check whether the difference is visible. I used the
following table (200 pairs of columns):
create table layout(
i0 int,
s0 varchar(255),
...
i199 int,
s199 varchar(255)
);

I populated it with 1 million rows and ran the following queries:
SELECT AVG(i0) FROM layout;
SELECT AVG(i199) FROM layout;

On my machine, calculating the average over column i0 took about 1
second, versus about 2.5 seconds for column i199. Similar observations
were described in the threads mentioned before. Quite a significant
difference!

I ran a similar experiment for MySQL (InnoDB) as well, and there the results
are the same for the first and last columns.

Some details on how it is stored in InnoDB: varlen column
lengths are stored only in the tuple header (there is no length prefix before
the column data itself). Having a tuple descriptor and the lengths in the tuple
header, it is always possible to calculate each column's position without
jumping through the entire record. And it seems that the space requirements
are the same as in PostgreSQL.
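
To illustrate the difference being described, here is a toy sketch (neither
PostgreSQL nor InnoDB code) of the two lookup loops, assuming for simplicity
that every column is variable-length with a 2-byte length, and ignoring NULLs,
alignment and TOAST:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lengths stored in the header: only small header integers are touched. */
static size_t
offset_from_length_array(const uint16_t *lengths, int attnum)
{
	size_t		off = 0;

	for (int i = 0; i < attnum; i++)
		off += lengths[i];
	return off;
}

/* Length stored in front of each datum: every preceding datum must be visited. */
static size_t
offset_by_walking_data(const uint8_t *data, int attnum)
{
	size_t		off = 0;

	for (int i = 0; i < attnum; i++)
	{
		uint16_t	len;

		/* The length of datum i is only known once we have reached it. */
		memcpy(&len, data + off, sizeof(len));
		off += sizeof(len) + len;
	}
	return off;
}

Both loops are O(attnum), but the first reads only a small contiguous array,
while the second walks through the whole data area of the row (and the real
heap-tuple code additionally consults each attribute's typlen/typalign on the
way), which matches the i0 vs. i199 numbers above.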

It seems that the InnoDB layout is better, at least for reading. So it
is still unclear to me why PostgreSQL does not employ a similar layout
if it can give significant benefits.


[0] 
https://www.postgresql.org/message-id/flat/CAOykqKc8Uoi3NKVfd5DpTmUzD4rJBWG9Gjo3pr7eaUGLtrstvw%40mail.gmail.com
[1] 
https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html#innodb-row-format-compact
[2] 
https://www.postgresql.org/message-id/flat/c58979e50702201307w64b12892uf8dfc3d8bf117ec0%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/87irj16umm.fsf%40enterprisedb.com

-- 
Best regards,
Ivan Pavlukhin




Indexscan failed assert caused by using index without lock

2019-03-30 Thread 高增琦
The following example reproduces the problem:

```
create table d(a int);
create index di on d(a);
set enable_seqscan=off;
set enable_bitmapscan to off;
prepare p as delete from d where a=3;
execute p;
execute p;
```

The reason is that ExecInitIndexScan will not lock the index, because it assumes
InitPlan has already write-locked it. But in some cases, such as DELETE with a
cached plan, InitPlan does not lock the index, and so the assert fails.

Some thoughts on how to fix it:
1. Disable the optimization in ExecInitModifyTable; don't skip
ExecOpenIndices for DELETE.
2. For DELETE, instead of opening the indexes, just lock them (a rough
sketch follows below).
3. Lock the index of the target rel in ExecInitIndexScan for DELETE.
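
For what it's worth, a rough sketch of what option 2 might look like, taking
the same RowExclusiveLock that ExecOpenIndices would but without building the
index descriptors. The helper name and the exact call site are only
illustrative, and whether this lock level is right for all cases would need
discussion:

```
#include "postgres.h"

#include "nodes/pg_list.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
#include "utils/relcache.h"

/* Hypothetical helper: lock the target relation's indexes without opening them. */
static void
LockTargetIndexesForDelete(Relation resultRel)
{
	List	   *indexoidlist = RelationGetIndexList(resultRel);
	ListCell   *lc;

	foreach(lc, indexoidlist)
	{
		Oid			indexoid = lfirst_oid(lc);

		/* Take the lock only; no index descriptor is built or kept open. */
		LockRelationOid(indexoid, RowExclusiveLock);
	}

	list_free(indexoidlist);
}
```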

PS: another question: why does ExecCloseIndices release the index locks
instead of keeping them?

-- 
GaoZengqi
pgf...@gmail.com
zengqi...@gmail.com