RE: Implement UNLOGGED clause for COPY FROM

2020-07-21 Thread osumi.takami...@fujitsu.com
Hi. Amit-san


> If you are going to suggest users not to replicate such tables then why
> can't you suggest them to create such tables as UNLOGGED in the first
> place?  Another idea could be that you create an 'unlogged' table, copy
> the data to it.  Then perform Alter Table .. SET Logged and attach it to
> the main table.  I think for this you need the main table to be
> partitioned but I think if that is possible then it might be better than
> all the hacking you are proposing to do in the server for this special
> operation.
Thank you for your comment.

I should have mentioned at the beginning that this feature is intended
for data warehouses, where you need to load large amounts of data in the
shortest possible time.
Sorry for the poor explanation.

Because a data warehouse in practice cannot be separated from
applications such as B.I. tools, we cannot easily define the tables as
unlogged in the first place.
As far as I know, such tools generally do not support creating unlogged tables.

And if you want to do so, you need a *modification of the existing application*,
which is implemented by a third party and sold commercially for data
analytics.
In other words, to make CREATE UNLOGGED TABLE available in that application,
you would have to revise the product's source code directly,
which would void the warranty from the B.I. tool vendor.
In my opinion, that is unrealistic for everyone.
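
For reference, the workflow suggested above would look roughly like this,
assuming a partitioned main table (the table, column, and path names here
are only illustrative):

    -- load into a separate unlogged table first (no WAL for the data load)
    CREATE UNLOGGED TABLE orders_202007 (LIKE orders INCLUDING ALL);
    COPY orders_202007 FROM '/path/to/orders_202007.csv' (FORMAT csv);

    -- make it crash-safe, then attach it to the partitioned main table
    ALTER TABLE orders_202007 SET LOGGED;
    ALTER TABLE orders ATTACH PARTITION orders_202007
        FOR VALUES FROM ('2020-07-01') TO ('2020-08-01');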

Best,
Takamichi Osumi


Re: OpenSSL randomness seeding

2020-07-21 Thread Michael Paquier
On Tue, Jul 21, 2020 at 10:00:20PM -0700, Noah Misch wrote:
> These look good.  I'll push them on Saturday or later.  I wondered whether to
> do both RAND_cleanup() and RAND_poll(), to purge all traces of the old seed on
> versions supporting both.  Since that would strictly (albeit negligibly)
> increase seed predictability, I like your decision here.

Thanks Noah for taking care of it.  No plans for a backpatch, right?
--
Michael




Re: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread David Rowley
On Wed, 22 Jul 2020 at 16:40, k.jami...@fujitsu.com
 wrote:
> I used the default max_parallel_workers & max_worker_proceses which is 8 by 
> default in postgresql.conf.
> IOW, I ran all those tests with maximum of 8 processes set. But my query 
> planner capped both the
> Workers Planned and Launched at 6 for some reason when increasing the value 
> for
> max_parallel_workers_per_gather.

max_parallel_workers_per_gather just imposes a limit on the planner as
to the maximum number of parallel workers it may choose for a given
parallel portion of a plan. The actual number of workers the planner
will decide is best to use is based on the size of the relation. More
pages = more workers. It sounds like in this case the planner didn't
think it was worth using more than 6 workers.

The parallel_workers reloption, when not set to -1, overrides the
planner's decision on how many workers to use.  It'll just always try
to use "parallel_workers".

> However, when I used the ALTER TABLE SET (parallel_workers = N) based from 
> your suggestion,
> the query planner acquired that set value only for "Workers Planned", but not 
> for "Workers Launched".
> The behavior of query planner is also different when I also set the value of 
> max_worker_processes
> and max_parallel_workers to parallel_workers + 1.

When it comes to execution, the executor is limited to how many
parallel worker processes are available to execute the plan. If all
workers happen to be busy with other tasks then it may find itself
having to process the entire query by itself without any help from
workers.  Or there may be workers available, just not as many as the
planner picked to execute the query.

The number of available workers is configured with the
"max_parallel_workers". That's set in postgresql.conf.   PostgreSQL
won't complain if you try to set a relation's parallel_workers
reloption to a number higher than the "max_parallel_workers" GUC.
"max_parallel_workers" is further limited by "max_worker_processes".
Likely you'll want to set both those to at least 32 for this test,
then just adjust the relation's parallel_workers setting for each
test.
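
For instance, a minimal sketch of the settings described above (the GUCs
and syntax are standard; the table name and values are only illustrative):

    -- postgresql.conf (max_worker_processes needs a server restart):
    --   max_worker_processes = 32
    --   max_parallel_workers = 32

    -- per-relation override of the planner's choice of worker count
    ALTER TABLE t SET (parallel_workers = 8);

    -- per-session ceiling on workers per Gather node
    SET max_parallel_workers_per_gather = 8;

    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t;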

David




Re: OpenSSL randomness seeding

2020-07-21 Thread Noah Misch
On Tue, Jul 21, 2020 at 02:13:32PM +0200, Daniel Gustafsson wrote:
> The silver lining here is that while OpenSSL nooped RAND_cleanup, they also
> changed what is mixed into seeding so we are still not sharing a sequence.  To
> fix this, changing the RAND_cleanup call to RAND_poll should be enough to
> ensure re-seeding after forking across all supported OpenSSL versions.  Patch
> 0001 implements this along with a comment referencing when it can be removed
> (which most likely won't be for quite some time).
> 
> Another thing that stood out when reviewing this code is that we optimize for
> RAND_poll failing in pg_strong_random, when we already have RAND_status
> checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
> code by letting RAND_status do the work as per 0002, and also (while unlikely)
> survive any transient failures in RAND_poll by allowing all the retries we've
> defined for the loop.

> Thoughts on these?

These look good.  I'll push them on Saturday or later.  I wondered whether to
do both RAND_cleanup() and RAND_poll(), to purge all traces of the old seed on
versions supporting both.  Since that would strictly (albeit negligibly)
increase seed predictability, I like your decision here.

Do you happen to know how OpenSSL 1.1.1 changed its reaction to forks in order
to make the RAND_poll() superfluous?  (No need to research it if you don't.)




Re: xl_heap_header alignment?

2020-07-21 Thread Antonin Houska
Tom Lane  wrote:

> I don't particularly want to remove the field, but we ought to
> change or remove the comment.

I'm not concerned about the existence of the field either. The comment just
made me worried that I might be missing some fundamental concept. Thanks for
your opinion.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread Thomas Munro
On Wed, Jul 22, 2020 at 3:57 PM Amit Kapila  wrote:
> Yeah, that is true but every time before the test the same amount of
> data should be present in shared buffers (or OS cache) if any which
> will help in getting consistent results.  However, it is fine to
> reboot the machine as well if that is a convenient way.

We really should have an extension (pg_prewarm?) that knows how to
kick stuff out of PostgreSQL's shared buffers and the page cache
(POSIX_FADV_DONTNEED).




Re: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread Amit Kapila
On Wed, Jul 22, 2020 at 5:25 AM David Rowley  wrote:
>
> I understand that Amit wrote:
>
> On Fri, 17 Jul 2020 at 21:18, Amit Kapila  wrote:
> > I think recreating the database and restarting the server after each
> > run might help in getting consistent results.  Also, you might want to
> > take median of three runs.
>
> Please also remember, if you're recreating the database after having
> restarted the machine that creating the table will likely end up
> caching some of the pages either in shared buffers or the Kernel's
> cache.
>

Yeah, that is true but every time before the test the same amount of
data should be present in shared buffers (or OS cache) if any which
will help in getting consistent results.  However, it is fine to
reboot the machine as well if that is a convenient way.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-21 Thread Amit Kapila
On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar  wrote:
>
> There was one warning in release mode in the last version in 0004 so
> attaching a new version.
>

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+ /*
+  * Execute the invalidations for xid-less transactions,
+  * otherwise, accumulate them so that they can be processed at
+  * the commit time.
+  */
+ if (!ctx->fast_forward)
+ {
+     if (TransactionIdIsValid(xid))
+     {
+         ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+                                       invals->nmsgs, invals->msgs);
+         ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+                                           buf->origptr);
+     }

I think we need to call ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true because we are dependent on that flag for
snapshot build (see SnapBuildCommitTxn).  We are already doing it that
way in DecodeCommit: even though we skip adding invalidations for
fast-forward cases, we do set the flag to indicate that this txn has
catalog changes.  Is there any reason to do things differently here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




RE: archive status ".ready" files may be created too early

2020-07-21 Thread matsumura....@fujitsu.com
Hello,

> At Mon, 13 Jul 2020 01:57:36 +, "Kyotaro Horiguchi 
> " wrote in
> Am I missing something here?

Let me describe it in more detail (*).

  Record-A and Record-B are records that cross a segment border.
  Record-A spans segments X and X+1.
  Record-B spans segments X+2 and X+3.
  If both records have been inserted into the WAL buffer,
  lastSegContRecStart/End points to Record-B.
* If a writer flushes segment X and a part of X+1, but Record-A has not
  been flushed completely, NotifyStableSegments() allows the writer to
  notify segment X.

Then Record-A may be invalidated by crash recovery and overwritten by a
new WAL record, so segment X would no longer match the archived copy.

Regards
Ryo Matsumura




Re: Stale external URL in doc?

2020-07-21 Thread Kyotaro Horiguchi
At Sat, 18 Jul 2020 22:48:47 +0900, Michael Paquier  wrote 
in 
> On Fri, Jul 17, 2020 at 02:03:18PM +0900, Michael Paquier wrote:
> > It would be better to get all that fixed and backpatched. Is somebody
> > already looking into that?
> 
> I have been through this set, and applied the changes as of 045d03f & 
> friends.  There was an extra URL broken in 9.5 and 9.6 related to the
> passphrase FAQ.

Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: OpenSSL randomness seeding

2020-07-21 Thread Michael Paquier
On Tue, Jul 21, 2020 at 10:36:53PM +0200, Daniel Gustafsson wrote:
> I think the original intention was to handle older OpenSSL versions where
> multiple successful RAND_poll calls were required for RAND_status to succeed,
> the check working as an optimization since a failing RAND_poll would render 
> all
> efforts useless anyway.  I'm not sure this is true for the OpenSSL versions we
> support in HEAD, and/or for modern platforms, but without proof of it not 
> being
> useful I would opt for keeping it.

Yeah, the retry loop refers to this part of the past discussion on the
matter:
https://www.postgresql.org/message-id/CAEZATCWYs6rAp36VKm4W7Sb3EF_7tNcRuhcnJC1P8=8w9nb...@mail.gmail.com

During the rewrite of the RNG engines, there was also a retry logic
introduced in 75e2c87, then removed in c16de9d for 1.1.1.  In short,
we may be able to live without that once we cut more support for
OpenSSL versions (minimum version support of 1.1.1 is a couple of
years ahead at least for us), but I see no reasons to not leave that
in place either.  And this visibly solved one problem for us.  I don't
see either a reason to not simplify the loop to fall back to
RAND_status() for the validation.

In short, the proposed patch set looks like a good idea to me to stick
with the recommendations of upstream's wiki to use RAND_poll() after a
fork, but only do that on HEAD (OpenSSL 1.1.0 mixes the current
timestamp and the PID in the random seed of the default engine, 1.0.2
only the PID but RAND_cleanup is a no-op only after 1.1.0).
--
Michael




Re: Comment referencing incorrect algorithm

2020-07-21 Thread Michael Paquier
On Tue, Jul 21, 2020 at 01:57:11PM +0200, Daniel Gustafsson wrote:
> While poking around our crypto code, I noticed that a comment in sha2.h was
> referencing sha-1 which is an algorithm not supported by the code.  The
> attached fixes the comment aligning it with other comments in the file.

Thanks, fixed.  The style of the surroundings is to not use a hyphen,
so fine by me to stick with that.
--
Michael




Re: v13 planner ERROR: could not determine which collation to use for string comparison

2020-07-21 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Jul 21, 2020 at 06:25:00PM -0400, Tom Lane wrote:
>> Ugh.  It's clear from your stack trace that neqjoinsel() has forgotten to
>> pass through collation to eqjoinsel().  Will fix.

> Why didn't you include a regression test in bd0d893?

Didn't really see much point.  It's not like anybody's likely to
take out the collation handling now that it's there.

regards, tom lane




Re: v13 planner ERROR: could not determine which collation to use for string comparison

2020-07-21 Thread Michael Paquier
On Tue, Jul 21, 2020 at 06:25:00PM -0400, Tom Lane wrote:
> Ugh.  It's clear from your stack trace that neqjoinsel() has forgotten to
> pass through collation to eqjoinsel().  Will fix.

Why didn't you include a regression test in bd0d893?
--
Michael




Re: Which SET TYPE don't actually require a rewrite

2020-07-21 Thread Michael Paquier
On Tue, Jul 21, 2020 at 04:55:37PM -0400, Bruce Momjian wrote:
> I know Tom put a wink on that, but I actually do feel that the only
> clean way to do this is to give users a way to issue the query in a
> non-executing way that will report if a rewrite is going to happen.

Yeah, when doing a schema upgrade for an application, that's the usual
performance pain point, and people used to systems other than Postgres
write their queries without being aware of it.  We have something able
to track that with the table_rewrite event trigger, but there is no easy
option to store the event and bypass its execution.  I think that
wrapping an ALTER TABLE query in a plpgsql function with an exception
block, and having a table_rewrite event trigger raise an error, makes
that possible, though.
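
For illustration, a rough sketch of that approach (object names are made
up, and this is only a sketch, not a polished recipe):

    CREATE FUNCTION report_rewrite() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        RAISE EXCEPTION 'rewrite of table %',
            pg_event_trigger_table_rewrite_oid()::regclass;
    END;
    $$;

    CREATE EVENT TRIGGER stop_rewrites ON table_rewrite
        EXECUTE FUNCTION report_rewrite();

    -- probe a DDL command without keeping its effects
    BEGIN;
    DO $$
    BEGIN
        ALTER TABLE my_table ALTER COLUMN c TYPE numeric(12,2);
        RAISE NOTICE 'no rewrite required';
    EXCEPTION WHEN OTHERS THEN
        RAISE NOTICE 'rewrite detected: %', SQLERRM;
    END;
    $$;
    ROLLBACK;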
--
Michael




Re: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread David Rowley
Hi Kirk,

Thank you for doing some testing on this. It's very useful to get some
samples from other hardware / filesystem / os combinations.

On Tue, 21 Jul 2020 at 21:38, k.jami...@fujitsu.com
 wrote:
> Query Planner I/O Timings (ms):
> | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> |--------|-------------------|------------------|--------------------|-------------------|
> | 0      | "1,130.78"        | "1,250.82"       | "1,698.05"         | "1,733.44"        |


> | Worker | Buffers                                           |
> |--------|---------------------------------------------------|
> | 0      | shared read=442478 dirtied=442478 written=442446  |

I'm thinking the scale of this test might be a bit too small for the
machine you're using to test.  When you see "shared read" in the
EXPLAIN (ANALYZE, BUFFERS) output, it does not necessarily mean that
the page had to be read from disk.  We use buffered I/O, so the page
could just have been fetched from the kernel's cache.

If we do some maths here on the timing: it took 1130.78 milliseconds
to read 442478 pages which, assuming the standard page size of 8192
bytes, is 3457 MB in 1130.78 milliseconds, or about 3057 MB/sec.  Is
that a realistic throughput for this machine in terms of I/O? Or do
you think that some of these pages might be coming from the Kernel's
cache?
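
As a quick back-of-the-envelope check of those numbers (purely illustrative):

    SELECT pg_size_pretty(442478::bigint * 8192) AS data_read,                  -- ~3457 MB
           round(442478::bigint * 8192 / 1.13078 / 1024 / 1024) AS mb_per_sec;  -- ~3057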

I understand that Amit wrote:

On Fri, 17 Jul 2020 at 21:18, Amit Kapila  wrote:
> I think recreating the database and restarting the server after each
> run might help in getting consistent results.  Also, you might want to
> take median of three runs.

Please also remember, if you're recreating the database after having
restarted the machine that creating the table will likely end up
caching some of the pages either in shared buffers or the Kernel's
cache. It would be better to leave the database intact and just reboot
the machine.  I didn't really like that option with my tests so I just
increased the size of the table beyond any size that my machines could
have cached.  With the 16GB RAM Windows laptop, I used a 100GB table
and with the 64GB workstation, I used an 800GB table.  I think my test
using SELECT * FROM t WHERE a < 0; with a table that has a padding
column is likely going to be a more accurate test. Providing there is
no rows with a < 0 in the table then the executor will spend almost
all of the time in nodeSeqscan.c trying to find a row with a < 0.
There's no additional overhead of aggregation doing the count(*).
Having the additional padding column means that we read more data per
evaluation of the a < 0 expression.  Also, having a single column
table is not that realistic.
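
A sketch of that kind of setup (the row count and padding width below are
placeholders; scale them so the table is at least twice the machine's RAM):

    CREATE TABLE t (a int, padding text);
    INSERT INTO t
        SELECT g, repeat('x', 500)
        FROM generate_series(1, 100000000) g;   -- adjust to exceed RAM
    VACUUM ANALYZE t;

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a < 0;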

I'm pretty keen to see this machine running something closer to the
test I mentioned in [1] but the benchmark query I mentioned in [2]
with the "t" table being at least twice the size of RAM in the
machine. Larger would be better though. With such a scaled test, I
don't think there's much need to reboot the machine in between. Just
run a single query first to warm up the cache before timing anything.
Having the table a few times larger than RAM will mean that we can be
certain that the disk was actually used during the test. The more data
we can be certain came from disk the more we can trust that the
results are meaningful.

Thanks again for testing this.

David

[1] 
https://www.postgresql.org/message-id/caaphdvrfjfyh51_wy-iqqpw8ygr4fdotxaqkqn+sa7ntkev...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/caaphdvo+legkmcavoipyk8nebgp-lrxns2tj1n_xnrjve9x...@mail.gmail.com




Re: Index Skip Scan (new UniqueKeys)

2020-07-21 Thread Peter Geoghegan
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > +   currItem = &so->currPos.items[so->currPos.lastItem];
> > > +   itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> > > +   nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of posting tuple format?

Yeah. There is a TID array at the end of the index tuple when the tuple
is a posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't
safe to assume that t_tid is a heap TID for this reason, even in code
that only ever considers data items (that is, non high-key tuples AKA
non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot
be posting list tuples either, so you'll see the assumption that t_tid
is just a heap TID in parts of nbtinsert.c -- though only for the
new/incoming item.)

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case if I'm not mistaken probably we can compare item
> pointers). I've got this question when I was about to post a new version
> with changes to address feedback from Andy, now I'll combine them and
> send a cumulative patch.

This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recording heap TIDs
in local and the time _bt_killitems() tried to match on those heap
TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.

Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)

Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no
longer okay to use raw LockBuffer() calls in nbtree, so you're going
to have to fix that up when you next rebase -- you must use the new
_bt_lockbuf() wrapper function instead, so that the new Valgrind
instrumentation is used. This shouldn't be hard.

Perhaps you can use Valgrind to verify that this patch doesn't have
any unsafe buffer accesses. I recall problems like that in earlier
versions of the patch series. Valgrind should be able to detect most
bugs like that (though see comments within _bt_lockbuf() for details
of a bug in this area that Valgrind cannot detect even now).

-- 
Peter Geoghegan




Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)

2020-07-21 Thread Peter Geoghegan
On Fri, Jul 17, 2020 at 5:53 PM Peter Geoghegan  wrote:
> Pushed the first patch just now, and intend to push the other one soon. 
> Thanks!

Pushed the second piece of this (the nbtree patch) just now.

Thanks for the review!
-- 
Peter Geoghegan




Re: v13 planner ERROR: could not determine which collation to use for string comparison

2020-07-21 Thread Tom Lane
Justin Pryzby  writes:
> We hit this on v13b2 and verified it fails on today's HEAD (ac25e7b039).
> explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE sites.config_site_name 
> != sectors.sect_name ;
> ERROR:  could not determine which collation to use for string comparison 

> I can workaround the issue by DELETEing stats for either column.

Ugh.  It's clear from your stack trace that neqjoinsel() has forgotten to
pass through collation to eqjoinsel().  Will fix.

regards, tom lane




Re: Infinities in type numeric

2020-07-21 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> I had a look at this, and I think it's mostly in good shape. It looks
>> like everything from the first message in this thread has been
>> resolved, except I don't know about the jsonpath stuff, because I
>> haven't been following that.

> Thanks for the careful review!

Here's a v4 that syncs numeric in_range() with the new behavior of
float in_range(), and addresses your other comments too.

regards, tom lane

diff --git a/contrib/jsonb_plperl/jsonb_plperl.c b/contrib/jsonb_plperl/jsonb_plperl.c
index ed361efbe2..b81ba54b80 100644
--- a/contrib/jsonb_plperl/jsonb_plperl.c
+++ b/contrib/jsonb_plperl/jsonb_plperl.c
@@ -227,10 +227,8 @@ SV_to_JsonbValue(SV *in, JsonbParseState **jsonb_state, bool is_elem)
 /*
  * jsonb doesn't allow infinity or NaN (per JSON
  * specification), but the numeric type that is used for the
- * storage accepts NaN, so we have to prevent it here
- * explicitly.  We don't really have to check for isinf()
- * here, as numeric doesn't allow it and it would be caught
- * later, but it makes for a nicer error message.
+ * storage accepts those, so we have to reject them here
+ * explicitly.
  */
 if (isinf(nval))
 	ereport(ERROR,
diff --git a/contrib/jsonb_plpython/jsonb_plpython.c b/contrib/jsonb_plpython/jsonb_plpython.c
index e09308daf0..836c178770 100644
--- a/contrib/jsonb_plpython/jsonb_plpython.c
+++ b/contrib/jsonb_plpython/jsonb_plpython.c
@@ -387,14 +387,17 @@ PLyNumber_ToJsonbValue(PyObject *obj, JsonbValue *jbvNum)
 	pfree(str);
 
 	/*
-	 * jsonb doesn't allow NaN (per JSON specification), so we have to prevent
-	 * it here explicitly.  (Infinity is also not allowed in jsonb, but
-	 * numeric_in above already catches that.)
+	 * jsonb doesn't allow NaN or infinity (per JSON specification), so we
+	 * have to reject those here explicitly.
 	 */
 	if (numeric_is_nan(num))
 		ereport(ERROR,
 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
  errmsg("cannot convert NaN to jsonb")));
+	if (numeric_is_inf(num))
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("cannot convert infinity to jsonb")));
 
 	jbvNum->type = jbvNumeric;
 	jbvNum->val.numeric = num;
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 7027758d28..50e370cae4 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -554,9 +554,9 @@ NUMERIC(precision)
 
 NUMERIC
 
- without any precision or scale creates a column in which numeric
- values of any precision and scale can be stored, up to the
- implementation limit on precision.  A column of this kind will
+ without any precision or scale creates an unconstrained
+ numeric column in which numeric values of any length can be
+ stored, up to the implementation limits.  A column of this kind will
  not coerce input values to any particular scale, whereas
  numeric columns with a declared scale will coerce
  input values to that scale.  (The SQL standard
@@ -568,10 +568,10 @@ NUMERIC
 
 
  
-  The maximum allowed precision when explicitly specified in the
-  type declaration is 1000; NUMERIC without a specified
-  precision is subject to the limits described in .
+  The maximum precision that can be explicitly specified in
+  a NUMERIC type declaration is 1000.  An
+  unconstrained NUMERIC column is subject to the limits
+  described in .
  
 
 
@@ -593,6 +593,11 @@ NUMERIC
  plus three to eight bytes overhead.
 
 
+
+ infinity
+ numeric (data type)
+
+
 
  NaN
  not a number
@@ -604,13 +609,44 @@ NUMERIC
 
 
 
- In addition to ordinary numeric values, the numeric
- type allows the special value NaN, meaning
- not-a-number.  Any operation on NaN
- yields another NaN.  When writing this value
- as a constant in an SQL command, you must put quotes around it,
- for example UPDATE table SET x = 'NaN'.  On input,
- the string NaN is recognized in a case-insensitive manner.
+ In addition to ordinary numeric values, the numeric type
+ has several special values:
+
+Infinity
+-Infinity
+NaN
+
+ These are adapted from the IEEE 754 standard, and represent
+ infinity, negative infinity, and
+ not-a-number, respectively. When writing these values
+ as constants in an SQL command, you must put quotes around them,
+ for example UPDATE table SET x = '-Infinity'.
+ On input, these strings are recognized in a case-insensitive manner.
+ The infinity values can alternatively be spelled inf
+ and -inf.
+
+
+
+ The infinity values behave as per mathematical expectations.  For
+ example, Infinity plus any finite value equals
+ Infinity, as does Infinity
+ plus Infinity; but Infinity
+ minus Infinity yields NaN (not a
+ number), because it has no well-defined interpretation.  Note 

Re: Using Valgrind to detect faulty buffer accesses (no pin or buffer content lock held)

2020-07-21 Thread Peter Geoghegan
On Mon, Jul 6, 2020 at 1:35 AM Georgios Kokolatos
 wrote:
> As a general overview, the series of patches in the mail thread do match 
> their description. The addition of the stricter, explicit use of 
> instrumentation does improve the design as the distinction of the use cases 
> requiring a pin or a lock is made more clear. The added commentary is 
> descriptive and appears grammatically correct, at least to a non native 
> speaker.

I didn't see this review until now because it ended up in gmail's spam
folder. :-(

Thanks for taking a look at it!

--
Peter Geoghegan




Re: Which SET TYPE don't actually require a rewrite

2020-07-21 Thread Bruce Momjian
On Fri, Jul 17, 2020 at 11:26:56AM -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > As Amit mentions it is also triggered by some store parameter changes. But
> > not all. So looking at it the other way, the part that the end user really
> > cares about it "which ALTER TABLE operations will rewrite the table and
> > which will not". Maybe what we need is a section specifically on this that
> > summarizes all the different ways that it can happen.
> 
> No, what we need is EXPLAIN for DDL ;-).  Trying to keep such
> documentation in sync with the actual code behavior would be impossible.
> (For one thing, some aspects can be affected by extension datatype
> behaviors.)

I know Tom put a wink on that, but I actually do feel that the only
clean way to do this is to give users a way to issue the query in a
non-executing way that will report if a rewrite is going to happen.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Improving psql slash usage help message

2020-07-21 Thread Peter Eisentraut

On 2020-07-21 17:10, Tom Lane wrote:
> Hamid Akhtar  writes:
>> So are you suggesting to not fix this or do a more detailed review and
>> assess what other psql messages can be grouped together.
>
> I was just imagining merging the entries for the commands that are
> implemented by listTables().  If you see something else that would
> be worth doing, feel free to suggest it, but that wasn't what I was
> thinking of.

It used to be like that, but it was changed here:
9491c82f7103d62824d3132b8c7dc44609f2f56b

I'm not sure which way is better.  Right now, a question like "how do I
list all indexes" is easily answered by the help output.  Under the
other scheme, it's hidden behind a layer of metasyntax.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: OpenSSL randomness seeding

2020-07-21 Thread Daniel Gustafsson
> On 21 Jul 2020, at 22:00, David Steele  wrote:
> 
> On 7/21/20 3:44 PM, Daniel Gustafsson wrote:
>>> On 21 Jul 2020, at 17:31, David Steele  wrote:
>>> On 7/21/20 8:13 AM, Daniel Gustafsson wrote:
>>>> Another thing that stood out when reviewing this code is that we optimize for
>>>> RAND_poll failing in pg_strong_random, when we already have RAND_status
>>>> checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
>>>> code by letting RAND_status do the work as per 0002, and also (while unlikely)
>>>> survive any transient failures in RAND_poll by allowing all the retries we've
>>>> defined for the loop.
>>> 
>>> I wonder how effective the retries are going to be if they happen 
>>> immediately. However, most of the code paths I followed ended in a hard 
>>> error when pg_strong_random() failed so it may not hurt to try. I just 
>>> worry that some caller is depending on a faster failure here.
>> There is that, but I'm not convinced that relying on specific timing for
>> anything RNG or similarly cryptographic-related is especially sane.
> 
> I wasn't thinking specific timing -- just that the caller might be expecting 
> it to give up quickly if it doesn't work. That's what the code is trying to 
> do and I wonder if there is a reason for it.

I think the original intention was to handle older OpenSSL versions where
multiple successful RAND_poll calls were required for RAND_status to succeed,
the check working as an optimization since a failing RAND_poll would render all
efforts useless anyway.  I'm not sure this is true for the OpenSSL versions we
support in HEAD, and/or for modern platforms, but without proof of it not being
useful I would opt for keeping it.

cheers ./daniel



Re: Default setting for enable_hashagg_disk

2020-07-21 Thread Bruce Momjian
On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> Maybe I missed your point here. The problem is not so much that we'll
> get HashAggs that spill -- there is nothing intrinsically wrong with
> that. While it's true that the I/O pattern is not as sequential as a
> similar group agg + sort, that doesn't seem like the really important
> factor here. The really important factor is that in-memory HashAggs
> can be blazingly fast relative to *any* alternative strategy -- be it
> a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> whatever. We're mostly concerned about keeping the one available fast
> strategy than we are about getting a new, generally slow strategy.

Do we have any data that in-memory HashAggs are "blazingly fast relative
to *any* alternative strategy?"  The data I have tested myself and what
I saw from Tomas was that spilling sort or spilling hash are both 2.5x
slower.  Are we sure the quoted statement is true?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: OpenSSL randomness seeding

2020-07-21 Thread David Steele

On 7/21/20 3:44 PM, Daniel Gustafsson wrote:
>> On 21 Jul 2020, at 17:31, David Steele  wrote:
>> On 7/21/20 8:13 AM, Daniel Gustafsson wrote:
>>> Another thing that stood out when reviewing this code is that we optimize for
>>> RAND_poll failing in pg_strong_random, when we already have RAND_status
>>> checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
>>> code by letting RAND_status do the work as per 0002, and also (while unlikely)
>>> survive any transient failures in RAND_poll by allowing all the retries we've
>>> defined for the loop.
>>
>> I wonder how effective the retries are going to be if they happen immediately.
>> However, most of the code paths I followed ended in a hard error when
>> pg_strong_random() failed so it may not hurt to try. I just worry that some
>> caller is depending on a faster failure here.
>
> There is that, but I'm not convinced that relying on specific timing for
> anything RNG or similarly cryptographic-related is especially sane.

I wasn't thinking specific timing -- just that the caller might be
expecting it to give up quickly if it doesn't work. That's what the code
is trying to do and I wonder if there is a reason for it.


But you are probably correct and I'm just overthinking it.

Regards,
--
-David
da...@pgmasters.net




Re: v13 planner ERROR: could not determine which collation to use for string comparison

2020-07-21 Thread Justin Pryzby
Reproducer:

postgres=# CREATE TABLE t AS SELECT ''a FROM generate_series(1,99); CREATE TABLE u AS SELECT ''a FROM generate_series(1,99) ; VACUUM ANALYZE t,u;
postgres=# explain SELECT * FROM t JOIN u ON t.a!=u.a;
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.
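
For reference, a rough sketch of the "delete the stats" workaround mentioned
in the report (superuser only, purely illustrative; a later ANALYZE will
repopulate the row):

    DELETE FROM pg_statistic
    WHERE starelid = 't'::regclass
      AND staattnum = (SELECT attnum FROM pg_attribute
                       WHERE attrelid = 't'::regclass AND attname = 'a');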




Re: OpenSSL randomness seeding

2020-07-21 Thread Daniel Gustafsson
> On 21 Jul 2020, at 17:31, David Steele  wrote:
> On 7/21/20 8:13 AM, Daniel Gustafsson wrote:

>> Another thing that stood out when reviewing this code is that we optimize for
>> RAND_poll failing in pg_strong_random, when we already have RAND_status
>> checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
>> code by letting RAND_status do the work as per 0002, and also (while 
>> unlikely)
>> survive any transient failures in RAND_poll by allowing all the retries we've
>> defined for the loop.
> 
> I wonder how effective the retries are going to be if they happen 
> immediately. However, most of the code paths I followed ended in a hard error 
> when pg_strong_random() failed so it may not hurt to try. I just worry that 
> some caller is depending on a faster failure here.

There is that, but I'm not convinced that relying on specific timing for
anything RNG or similarly cryptographic-related is especially sane.

cheers ./daniel



Re: v13 planner ERROR: could not determine which collation to use for string comparison

2020-07-21 Thread David G. Johnston
On Tuesday, July 21, 2020, Justin Pryzby  wrote:

> We hit this on v13b2 and verified it fails on today's HEAD (ac25e7b039).
>
> explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE
> sites.config_site_name != sectors.sect_name ;
> ERROR:  could not determine which collation to use for string comparison
>
> I can workaround the issue by DELETEing stats for either column.
>
> It's possible we're doing something wrong and I need to revisit docs..but this
> was working in v12.
>

This sounds suspiciously like a side-effect of:


https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=022cd0bfd33968f2b004106cfeaa3b2951e7f322

David J.


v13 planner ERROR: could not determine which collation to use for string comparison

2020-07-21 Thread Justin Pryzby
We hit this on v13b2 and verified it fails on today's HEAD (ac25e7b039).

explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE sites.config_site_name 
!= sectors.sect_name ;
ERROR:  could not determine which collation to use for string comparison 

I can workaround the issue by DELETEing stats for either column.

It's possible we're doing something wrong and I need to revisit docs..but this
was working in v12.

ts=# SELECT * FROM pg_stats WHERE tablename='sites' AND 
attname='config_site_name'; 
-[ RECORD 1 ]--+-
schemaname | public
tablename  | sites
attname| config_site_name
inherited  | f
null_frac  | 0
avg_width  | 1
n_distinct | 1
most_common_vals   | {""}
most_common_freqs  | {1}
histogram_bounds   | 
correlation| 1
most_common_elems  | 
most_common_elem_freqs | 
elem_count_histogram   | 

#1  0x00ab2993 in errfinish (filename=0xcaae40 "varlena.c", 
lineno=1476, funcname=0xcab7b0 <__func__.18296> "check_collation_set") at 
elog.c:502
#2  0x00a783ae in check_collation_set (collid=0) at varlena.c:1473
#3  0x00a78857 in texteq (fcinfo=0x7fff1ecae590) at varlena.c:1740
#4  0x00a4248c in eqjoinsel_inner (opfuncoid=67, collation=0, 
vardata1=0x7fff1ecae7a0, vardata2=0x7fff1ecae770, nd1=1, nd2=1, 
isdefault1=false, isdefault2=false, sslot1=0x7fff1ecae720, 
sslot2=0x7fff1ecae6e0, stats1=0x1a97c00, stats2=0x1a98230, have_mcvs1=true, 
have_mcvs2=true) at selfuncs.c:2466
#5  0x00a41f66 in eqjoinsel (fcinfo=0x7fff1ecae8a0) at selfuncs.c:2298
#6  0x00abb63c in DirectFunctionCall5Coll (func=0xa41caf , 
collation=0, arg1=28313248, arg2=98, arg3=28315832, arg4=0, 
arg5=140733710004032) at fmgr.c:908
#7  0x00a43197 in neqjoinsel (fcinfo=0x7fff1ecaea40) at selfuncs.c:2824
#8  0x00abc4a0 in FunctionCall5Coll (flinfo=0x7fff1ecaeb00, 
collation=100, arg1=28313248, arg2=531, arg3=28315832, arg4=0, 
arg5=140733710004032) at fmgr.c:1245
#9  0x00abcd1c in OidFunctionCall5Coll (functionId=106, collation=100, 
arg1=28313248, arg2=531, arg3=28315832, arg4=0, arg5=140733710004032) at 
fmgr.c:1463
#10 0x0084b2c2 in join_selectivity (root=0x1b006a0, operatorid=531, 
args=0x1b010b8, inputcollid=100, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40) at 
plancat.c:1822
#11 0x007dba29 in clause_selectivity (root=0x1b006a0, clause=0x1b01168, 
varRelid=0, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40) at clausesel.c:765
#12 0x007dacf4 in clauselist_selectivity_simple (root=0x1b006a0, 
clauses=0x1b05fe8, varRelid=0, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40, 
estimatedclauses=0x0) at clausesel.c:169
#13 0x007dac33 in clauselist_selectivity (root=0x1b006a0, 
clauses=0x1b05fe8, varRelid=0, jointype=JOIN_INNER, sjinfo=0x7fff1ecaef40) at 
clausesel.c:102
#14 0x007e44e3 in calc_joinrel_size_estimate (root=0x1b006a0, 
joinrel=0x1b02ce0, outer_rel=0x1afd4f0, inner_rel=0x1b01cf0, outer_rows=311, 
inner_rows=1047, sjinfo=0x7fff1ecaef40, restrictlist_in=0x1b05de0)
at costsize.c:4857
#15 0x007e41eb in set_joinrel_size_estimates (root=0x1b006a0, 
rel=0x1b02ce0, outer_rel=0x1afd4f0, inner_rel=0x1b01cf0, sjinfo=0x7fff1ecaef40, 
restrictlist=0x1b05de0) at costsize.c:4712
#16 0x008507a6 in build_join_rel (root=0x1b006a0, joinrelids=0x1b05c08, 
outer_rel=0x1afd4f0, inner_rel=0x1b01cf0, sjinfo=0x7fff1ecaef40, 
restrictlist_ptr=0x7fff1ecaef38) at relnode.c:728
#17 0x007f5ecb in make_join_rel (root=0x1b006a0, rel1=0x1afd4f0, 
rel2=0x1b01cf0) at joinrels.c:746
#18 0x007f542e in make_rels_by_clause_joins (root=0x1b006a0, 
old_rel=0x1afd4f0, other_rels_list=0x1b05d08, other_rels=0x1b05d28) at 
joinrels.c:312
#19 0x007f4f04 in join_search_one_level (root=0x1b006a0, level=2) at 
joinrels.c:123
#20 0x007d96a5 in standard_join_search (root=0x1b006a0, 
levels_needed=2, initial_rels=0x1b05d08) at allpaths.c:3097
#21 0x007d961e in make_rel_from_joinlist (root=0x1b006a0, 
joinlist=0x1b03b28) at allpaths.c:3028
#22 0x007d4f82 in make_one_rel (root=0x1b006a0, joinlist=0x1b03b28) at 
allpaths.c:227
#23 0x0080f835 in query_planner (root=0x1b006a0, qp_callback=0x816525 
, qp_extra=0x7fff1ecaf320) at planmain.c:269
#24 0x00813406 in grouping_planner (root=0x1b006a0, 
inheritance_update=false, tuple_fraction=0) at planner.c:2058
#25 0x008115b7 in subquery_planner (glob=0x1b00588, parse=0x1afdc48, 
parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1015
#26 0x0080fe34 in standard_planner (parse=0x1afdc48, 
query_string=0x1938e90 "explain SELECT 1 FROM sites NATURAL JOIN sectors WHERE 
sites. config_site_name != sectors.sect_name ;", cursorOptions=256, 
boundParams=0x0) at planner.c:405




Re: xl_heap_header alignment?

2020-07-21 Thread Tom Lane
Andres Freund  writes:
> On July 21, 2020 10:45:37 AM PDT, Antonin Houska  wrote:
>> I don't quite understand this part of the comment of the xl_heap_header
>> structure:
>> * NOTE: t_hoff could be recomputed, but we may as well store it because
>> * it will come for free due to alignment considerations.

> Unless you declare them as packed, structs will add padding to align members 
> correctly (if, and only if, the whole struct is stored well aligned).

I think that comment may be out of date, because what's there now is

 * NOTE: t_hoff could be recomputed, but we may as well store it because
 * it will come for free due to alignment considerations.
 */
typedef struct xl_heap_header
{
uint16  t_infomask2;
uint16  t_infomask;
uint8   t_hoff;
} xl_heap_header;

I find it hard to see how tacking t_hoff onto what would have been a
4-byte struct is "free".  Maybe sometime in the dim past there was
another field in this struct?  (But I checked back as far as 7.4
without finding one.)

I don't particularly want to remove the field, but we ought to
change or remove the comment.

regards, tom lane




Re: xl_heap_header alignment?

2020-07-21 Thread Andres Freund
Hi, 

On July 21, 2020 10:45:37 AM PDT, Antonin Houska  wrote:
>I don't quite understand this part of the comment of the xl_heap_header
>structure:
>
>* NOTE: t_hoff could be recomputed, but we may as well store it because
> * it will come for free due to alignment considerations.
>
>What are the alignment considerations? The WAL code does not appear to
>assume
>any alignment, and therefore it uses memcpy() to copy the structure
>into a
>local variable before accessing its fields. For example,
>heap_xlog_insert().

Unless you declare them as packed, structs will add padding to align members 
correctly (if, and only if, the whole struct is stored well aligned).

Regards,

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




xl_heap_header alignment?

2020-07-21 Thread Antonin Houska
I don't quite understand this part of the comment of the xl_heap_header
structure:

 * NOTE: t_hoff could be recomputed, but we may as well store it because
 * it will come for free due to alignment considerations.

What are the alignment considerations? The WAL code does not appear to assume
any alignment, and therefore it uses memcpy() to copy the structure into a
local variable before accessing its fields. For example, heap_xlog_insert().

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




RE: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread k.jami...@fujitsu.com
On Friday, July 17, 2020 6:18 PM (GMT+9), Amit Kapila wrote:

> On Fri, Jul 17, 2020 at 11:35 AM k.jami...@fujitsu.com 
> wrote:
> >
> > On Wednesday, July 15, 2020 12:52 PM (GMT+9), David Rowley wrote:
> >
> > >On Wed, 15 Jul 2020 at 14:51, Amit Kapila  wrote:
> > >>
> > >> On Wed, Jul 15, 2020 at 5:55 AM David Rowley 
> wrote:
> > >>> If we've not seen any performance regressions within 1 week, then
> > >>> I propose that we (pending final review) push this to allow wider
> > >>> testing.
> > >>
> > >> I think Soumyadeep has reported a regression case [1] with the
> > >> earlier version of the patch.  I am not sure if we have verified
> > >> that the situation improves with the latest version of the patch.
> > >> I request Soumyadeep to please try once with the latest patch.
> > >...
> > >Yeah, it would be good to see some more data points on that test.
> > >Jumping from 2 up to 6 workers just leaves us to guess where the
> > >performance started to become bad. >It would be good to know if the
> > >regression is repeatable or if it was affected by some other process.
> > >...
> > >It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > >track_io_timing = on; for each value of >max_parallel_workers.
> >
> > Hi,
> >
> > If I'm following the thread correctly, we may have gains on this patch
> > of Thomas and David, but we need to test its effects on different
> > filesystems. It's also been clarified by David through benchmark tests
> > that synchronize_seqscans is not affected as long as the set cap per
> > chunk size of parallel scan is at 8192.
> >
> > I also agree that having a control on this through GUC can be
> > beneficial for users, however, that can be discussed in another thread
> > or development in the future.
> >
> > David Rowley wrote:
> > >I'd like to propose that if anyone wants to do further testing on
> > >other operating systems with SSDs or HDDs then it would be good if
> > >that could be done within a 1 week from this email. There are various
> > >benchmarking ideas on this thread for inspiration.
> >
> > I'd like to join on testing it, this one using HDD, and at the bottom
> > are the results. Due to my machine limitations, I only tested
> > 0~6 workers, that even if I increase max_parallel_workers_per_gather
> > more than that, the query planner would still cap the workers at 6.
> > I also set the track_io_timing to on as per David's recommendation.
> >
> > Tested on:
> > XFS filesystem, HDD virtual machine
> > RHEL4, 64-bit,
> > 4 CPUs, Intel Core Processor (Haswell, IBRS) PostgreSQL 14devel on
> > x86_64-pc-linux-gnu
> >
> >
> > Test Case (Soumyadeep's) [1]
> >
> > shared_buffers = 32MB (to use OS page cache)
> >
> > create table t_heap as select generate_series(1, 1) i;   --about
> 3.4GB size
> >
> > SET track_io_timing = on;
> >
> > \timing
> >
> > set max_parallel_workers_per_gather = 0;  --0 to 6
> >
> > SELECT count(*) from t_heap;
> > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
> >
> > [Summary]
> > I used the same query from the thread. However, the sql query
> > execution time and query planner execution time results between the
> > master and patched do not vary much.
> > OTOH, in terms of I/O stats, I observed similar regression in both
> > master and patched as we increase max_parallel_workers_per_gather.
> >
> > It could also be possible that each benchmark setting for
> > max_parallel_workers_per_gather is affected by previous result . IOW,
> > later benchmark runs benefit from the data cached by previous runs on OS 
> > level.
> >
> 
> Yeah, I think to some extent that is visible in results because, after patch, 
> at 0
> workers, the execution time is reduced significantly whereas there is not much
> difference at other worker counts.  I think for non-parallel case (0 workers),
> there shouldn't be any difference.
> Also, I am not sure if there is any reason why after patch the number of 
> shared hits
> is improved, probably due to caching effects?
> 
> > Any advice?
> 
> I think recreating the database and restarting the server after each run 
> might help
> in getting consistent results.  Also, you might want to take median of three 
> runs.

Thank you for the advice. I repeated the test as per your advice, taking
the average of 3 runs per number of planned workers.
It still shows the following similar performance results between Master
and Patch V2.
I wonder why there's no difference though.

The test on my machine is roughly like this:

createdb test
psql -d test
create table t_heap as select generate_series(1, 1) i;
\q

pg_ctl restart
psql -d test
SET track_io_timing = on;
SET max_parallel_workers_per_gather = 0;
SHOW max_parallel_workers_per_gather;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
\timing
SELECT count(*) from t_heap;

drop table t_heap;
\q
dropdb test
pg_ctl restart

Below are the results. Again, almost no discernible difference between the 
master and patch.
Also, the results when 

RE: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread k.jami...@fujitsu.com
On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com 
> wrote:
> >
> > Thank you for the advice. I repeated the test as per your advice and
> > average of 3 runs per worker/s planned.
> > It still shows the following similar performance results between Master and
> Patch V2.
> > I wonder why there's no difference though.
> >
> > The test on my machine is roughly like this:
> >
> > createdb test
> > psql -d test
> > create table t_heap as select generate_series(1, 1) i; \q
> >
> > pg_ctl restart
> > psql -d test
> > SET track_io_timing = on;
> > SET max_parallel_workers_per_gather = 0; SHOW
> > max_parallel_workers_per_gather; EXPLAIN (ANALYZE, BUFFERS) SELECT
> > count(*) from t_heap; \timing SELECT count(*) from t_heap;
> >
> > drop table t_heap;
> > \q
> > dropdb test
> > pg_ctl restart
> >
> > Below are the results. Again, almost no discernible difference between the
> master and patch.
> > Also, the results when max_parallel_workers_per_gather is more than 4
> > could be inaccurate due to my machine's limitation of only having v4
> > CPUs. Even so, query planner capped it at
> > 6 workers.
> >
> > Query Planner I/O Timings (track_io_timing = on) in ms :
> > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> > |--------|-------------------|------------------|--------------------|-------------------|
> > | 0      | "1,130.777"       | "1,250.821"      | "01,698.051"       | "01,733.439"      |
> > | 1      | "1,603.016"       | "1,660.767"      | "02,312.248"       | "02,291.661"      |
> > | 2      | "2,036.269"       | "2,107.066"      | "02,698.216"       | "02,796.893"      |
> > | 3      | "2,298.811"       | "2,307.254"      | "05,695.991"       | "05,894.183"      |
> > | 4      | "2,098.642"       | "2,135.960"      | "23,837.088"       | "26,537.158"      |
> > | 5      | "1,956.536"       | "1,997.464"      | "45,891.851"       | "48,049.338"      |
> > | 6      | "2,201.816"       | "2,219.001"      | "61,937.828"       | "67,809.486"      |
> >
> > Query Planner Execution Time (ms):
> > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
> > ||---|--|
> > | 0.000  | "40,454.252"  | "40,521.578" |
> > | 1.000  | "21,332.067"  | "21,205.068" |
> > | 2.000  | "14,266.756"  | "14,385.539" |
> > | 3.000  | "11,597.936"  | "11,722.055" |
> > | 4.000  | "12,937.468"  | "13,439.247" |
> > | 5.000  | "14,383.083"  | "14,782.866" |
> > | 6.000  | "14,671.336"  | "15,507.581" |
> >
> > Based from the results above, the I/O latency increases as number of
> > workers also increase. Despite that, the query planner execution time
> > is almost closely same when 2 or more workers are used (14~11s). Same
> results between Master and Patch V2.
> >
> > As for buffers, same results are shown per worker (both Master and Patch).
> > | Worker | Buffers  |
> > ||--|
> > | 0  | shared read=442478 dirtied=442478 written=442446 |
> > | 1  | shared read=442478 dirtied=442478 written=442414 |
> > | 2  | shared read=442478 dirtied=442478 written=442382 |
> > | 3  | shared read=442478 dirtied=442478 written=442350 |
> > | 4  | shared read=442478 dirtied=442478 written=442318 |
> > | 5  | shared read=442478 dirtied=442478 written=442286 |
> > | 6  | shared read=442478 dirtied=442478 written=442254 |
> >
> >
> > SQL Query Execution Time (ms) :
> > | Worker | SQL (Master) | SQL (Patch)  |
> > ||--|--|
> > | 0  | "10,418.606" | "10,377.377" |
> > | 1  | "05,427.460"  | "05,402.727" |
> > | 2  | "03,662.998"  | "03,650.277" |
> > | 3  | "02,718.837"  | "02,692.871" |
> > | 4  | "02,759.802"  | "02,693.370" |
> > | 5  | "02,761.834"  | "02,682.590" |
> > | 6  | "02,711.434"  | "02,726.332" |
> >
> > The SQL query execution time definitely benefitted from previous run
> > of query planner, so the results are faster. But again, both Master and 
> > Patched
> have almost the same results.
> > Nonetheless, the execution time is almost consistent when
> > max_parallel_workers_per_gather is 2 (default) and above.
> >
> > I am definitely missing something. Perhaps I think I could not
> > understand why there's no I/O difference between the Master and
> > Patched (V2). Or has it been already improved even without this patch?
> >
> 
> I don't think it is strange that you are not seeing much difference because 
> as per
> the initial email by Thomas this patch is not supposed to give benefits on all
> systems.  I think we wanted to check that the patch should not regress
> performance in cases where it doesn't give benefits.  I think it might be 
> okay to
> run 

Re: pg_subscription.subslotname is wrongly marked NOT NULL

2020-07-21 Thread Tom Lane
I wrote:
> * On the other side of the ledger, if we don't fix these markings
> we cannot back-patch the additional assertions I proposed at [1].

> I'm kind of leaning to committing this as shown and back-patching
> the patch at [1], but certainly a case could be made in the other
> direction.  Thoughts?

After further thought about that I realized that the assertion patch
could be kluged in the same way as we did in llvmjit_deform.c, and
that that would really be the only safe way to do it pre-v13.
Otherwise the assertions would trip in pre-existing databases,
which would not be nice.

So what I've done is to back-patch the assertions that way, and
*not* apply BKI_FORCE_NULL in the back branches.  The possible
downsides of doing that seem to outweigh the upside of making
the catalog state cleaner in new installations.

regards, tom lane




Re: Improve handling of pg_stat_statements handling of bind "IN" variables

2020-07-21 Thread Dmitry Dolgov
> On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov  
> wrote:
>
>> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane  wrote:
>>
>> Greg Stark  writes:
>> > Actually thinking about this for two more seconds the question is what it
>> > would do with a query like
>> > WHERE col = ANY '1,2,3'::integer[]
>> > Or
>> > WHERE col = ANY ARRAY[1,2,3]
>> > Whichever the language binding that is failing to do parameterized queries
>> > is doing to emulate them.
>>
>> Yeah, one interesting question is whether this is actually modeling
>> what happens with real-world applications --- are they sending Consts,
>> or Params?
>>
>> I resolutely dislike the idea of marking arrays that came from IN
>> differently from other ones; that's just a hack and it's going to give
>> rise to unexplainable behavioral differences for logically-equivalent
>> queries.
>
> Thanks for your input.
>
> As for real-world applications – being a founder of a server monitoring saas
> (okmeter) I have access to stats on hundreds of postgres installations.
>
> It shows that IN with a variable number of params is ~7 times more used than
> ANY(array).

Hi,

I would like to do some archaeology and inquire about this thread, since
unfortunately there was no patch presented as far as I see.

IIUC the ideas suggested in this thread mostly revolve around modifying the
parser:

> On Fri, Jun 14, 2019 at 2:46 AM Tom Lane  wrote:
>
> I do not think you need new expression infrastructure. IMO what's going on
> here is that we're indulging in premature optimization in the parser. It
> would be better from a structural standpoint if the output of parse analysis
> were closer to what the user wrote, and then the business of separating Vars
> from Consts and reducing the Consts to an array were handled in the planner's
> expression preprocessing phase.
>
> So maybe what you should be thinking about is a preliminary patch that's
> mostly in the nature of refactoring, to move that processing to where it
> should be.
>
> Of course, life is never quite that simple; there are at least two
> issues you'd have to think about.
>
> * The parsing phase is responsible for determining the semantics of
> the query, in particular resolving the data types of the IN-list items
> and choosing the comparison operators that will be used.  The planner
> is not allowed to rethink that.  What I'm not clear about offhand is
> whether the existing coding in parse analysis might lead to different
> choices of data types/operators than a more straightforward approach
> does.  If so, we'd have to decide whether that's okay.
>
> * Postponing this work might make things slower overall, which wouldn't
> matter too much for short IN-lists, but you can bet that people who
> throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
> keep an eye on efficiency and make sure you don't end up repeating
> similar processing over and over.

This puzzles me, since the original issue sounds like a "representation"
problem: we want to calculate the jumble hash in such a way that obviously
repeating parameters or constants are hashed into one value. I see the point in
ideas like this:

>> One idea that comes to me after looking at the code involved is that
>> it might be an improvement across-the-board if transformAExprIn were to
>> reduce the generated ArrayExpr to an array Const immediately, in cases
>> where all the inputs are Consts.  That is going to happen anyway come
>> plan time, so it'd have zero impact on semantics or query performance.
>> Doing it earlier would cost nothing, and could even be a net win, by
>> reducing per-parse-node overhead in places like the rewriter.
>>
>> The advantage for the problem at hand is that a Const that's an array
>> of 2 elements is going to look the same as a Const that's any other
>> number of elements so far as pg_stat_statements is concerned.
>>
>> This doesn't help of course in cases where the values aren't all
>> Consts.  Since we eliminated Vars already, the main practical case
>> would be that they're Params, leading us back to the previous
>> question of whether apps are binding queries with different numbers
>> of parameter markers in an IN, and how hard pg_stat_statements should
>> try to fuzz that if they are.
>>
>> Then, to Greg's point, there's a question of whether transformArrayExpr
>> should do likewise, ie try to produce an array Const immediately.
>> I'm a bit less excited about that, but consistency suggests that
>> we should have it act the same as the IN case.

Interestingly enough, something similar was already mentioned in [1]. But no
one jumped on it, probably due to its relative complexity, lack of personal
time, or the absence of a clear way to handle Params (I'm actually not sure
about the statistics for Consts vs Params myself and need to check this, but I
can easily imagine both being a frequent problem).
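
To make the problem concrete, here is a minimal illustration (table name and
values are made up, and it assumes the pg_stat_statements extension is
installed).  Since every Const in the IN list is jumbled individually, two
logically equivalent queries that differ only in the number of constants end
up as separate entries:

CREATE TABLE t (col int);

SELECT count(*) FROM t WHERE col IN (1, 2);
SELECT count(*) FROM t WHERE col IN (1, 2, 3);

-- the constants are normalized to $1, $2, ... but the number of
-- placeholders differs, so the two statements get distinct queryids
SELECT queryid, query, calls
  FROM pg_stat_statements
 WHERE query LIKE '%col IN%';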

Another idea also was mentioned in [1]:

> I wonder whether we could improve this by arranging things so that both
> Consts and 

Re: OpenSSL randomness seeding

2020-07-21 Thread David Steele

On 7/21/20 8:13 AM, Daniel Gustafsson wrote:

After forking we call RAND_cleanup in fork_process.c to force a re-seed to
ensure that two backends cannot share a sequence.  OpenSSL 1.1.0 deprecated
RAND_cleanup, and contrary to how they usually leave deprecated APIs working
until removed, they decided to silently make this call a noop like below:

#   define RAND_cleanup() while(0) continue

This leaves our defence against pool sharing seemingly useless, and also runs
counter to the recommendations of OpenSSL for versions > 1.1.0 and < 1.1.1 where
the RNG was rewritten:

 https://wiki.openssl.org/index.php/Random_fork-safety

The silver lining here is that while OpenSSL nooped RAND_cleanup, they also
changed what is mixed into seeding so we are still not sharing a sequence.  To
fix this, changing the RAND_cleanup call to RAND_poll should be enough to
ensure re-seeding after forking across all supported OpenSSL versions.  Patch
0001 implements this along with a comment referencing when it can be removed
(which most likely won't be for quite some time).


This looks reasonable to me based on your explanation and the OpenSSL wiki.


Another thing that stood out when reviewing this code is that we optimize for
RAND_poll failing in pg_strong_random, when we already have RAND_status
checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
code by letting RAND_status do the work as per 0002, and also (while unlikely)
survive any transient failures in RAND_poll by allowing all the retries we've
defined for the loop.


I wonder how effective the retries are going to be if they happen 
immediately. However, most of the code paths I followed ended in a hard 
error when pg_strong_random() failed so it may not hurt to try. I just 
worry that some caller is depending on a faster failure here.


Regards,
--
-David
da...@pgmasters.net




Re: Improving psql slash usage help message

2020-07-21 Thread Tom Lane
Hamid Akhtar  writes:
> So are you suggesting not to fix this, or to do a more detailed review and
> assess which other psql messages can be grouped together?

I was just imagining merging the entries for the commands that are
implemented by listTables().  If you see something else that would
be worth doing, feel free to suggest it, but that wasn't what I was
thinking of.

regards, tom lane




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-07-21 Thread Dilip Kumar
On Tue, Jul 21, 2020 at 4:08 PM Dilip Kumar  wrote:
>
> On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar  wrote:
> >
> > On Tue, Jul 21, 2020 at 2:00 AM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > > > The attached patch allows the vacuum to continue by emitting WARNING
> > > > for the corrupted tuple instead of immediately error out as discussed
> > > > at [1].
> > > >
> > > > Basically, it provides a new GUC called vacuum_tolerate_damage, to
> > > > control whether to continue the vacuum or to stop on the occurrence of
> > > > a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> > > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > > > detected, it will emit a warning and return that nothing is changed in
> > > > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > > > Since we are returning false the caller will not try to freeze such
> > > > tuple and the tuple_totally_frozen is also set to false so that the
> > > > page will not be marked to all frozen even if all other tuples in the
> > > > page are frozen.
> > >
> > > I'm extremely doubtful this is a good idea. In all likelihood this will
> > > just exacerbate corruption.
> > >
> > > You cannot just stop freezing tuples, that'll lead to relfrozenxid
> > > getting *further* out of sync with the actual table contents. And you
> > > cannot just freeze such tuples, because that has a good chance of making
> > > deleted tuples suddenly visible, leading to unique constraint violations
> > > etc. Which will then subsequently lead to clog lookup errors and such.
> >
> > I agree with the point.  But if we keep giving the ERROR in such
> > cases, the situation is not any better: we neither freeze such a
> > tuple nor can we advance the relfrozenxid.  So we follow the same
> > rule, i.e. we don't freeze those tuples and don't advance the
> > relfrozenxid; the only difference is that the vacuum is allowed to
> > continue with the other tuples.
> >
> > > At the very least you'd need to signal up that relfrozenxid/relminmxid
> > > cannot be increased. Without that I think it's entirely unacceptable to
> > > do this.
> >
> > I agree with that point.  I was just unsure whether we should disallow
> > advancing the relfrozenxid in all such cases, or whether, in cases where
> > the xid already precedes the relfrozenxid, we could allow it to advance
> > since it cannot become any worse.  But, as Alvaro pointed out, there
> > is no point in optimizing such cases.  I will update the patch to
> > stop advancing the relfrozenxid if we find any corrupted xid during
> > tuple freeze.
> >
> > > If we really were to do something like this the option would need to be
> > > called vacuum_allow_making_corruption_worse or such. It needs to be
> > > *exceedingly* clear that it will likely lead to making everything much
> > > worse.
> > >
> > Maybe we can clearly describe this in the document.
> >
> > > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > > >   frz->t_infomask = tuple->t_infomask;
> > > >   frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
> > >
> > > I don't think it can be right to just update heap_prepare_freeze_tuple()
> > > but not FreezeMultiXactId().
> >
> > oh, I missed this part.  I will fix it.
>
> Please find the updated patch.  In this version, we don't allow the
> relfrozenxid and relminmxid to advance if corruption is detected,
> and we have also added the handling in FreezeMultiXactId.

In the previous version, the feature was enabled for the CLUSTER/VACUUM
FULL commands as well.  In the attached patch I have enabled it only
when running the VACUUM command; it will not be enabled during a
table rewrite.  If we think it should also be enabled for VACUUM FULL,
then we would need to pass a flag from cluster_rel all the way down to
heap_freeze_tuple.
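
To be clear about the intended usage, with the patch applied a run would look
roughly like the following (the GUC name is just the one proposed here and may
well be renamed, per the comments above; the table name is an example):

SET vacuum_tolerate_damage = on;
VACUUM sometab;   -- emits WARNINGs for corrupted xids and continues,
                  -- without advancing relfrozenxid/relminmxid
RESET vacuum_tolerate_damage;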

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v3-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch
Description: Binary data


Re: Postgres-native method to identify if a tuple is frozen

2020-07-21 Thread Amit Kapila
On Mon, Jul 20, 2020 at 9:07 PM Lawrence Jones  wrote:
>
>
> So we hit the question: how can we identify if a tuple is frozen? I know the 
> tuple has both committed and aborted hint bits set, but accessing those bits 
> seems to require superuser functions and is unlikely to be that fast.
>
> Are there system columns (similar to xmin, tid, cid) that we don't know about?
>

I think the way to get that information is to use the pageinspect
extension with a query like the one below, but you are right that you
need superuser privileges for that:

SELECT t_ctid, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('pg_class', 0)),
   LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
 WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
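
If you only want to see the frozen tuples, you can additionally filter on the
combined flag, for example (this still requires superuser and the pageinspect
extension):

SELECT t_ctid, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('pg_class', 0)),
   LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
 WHERE t_infomask IS NOT NULL
   AND 'HEAP_XMIN_FROZEN' = ANY(combined_flags);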

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




OpenSSL randomness seeding

2020-07-21 Thread Daniel Gustafsson
After forking we call RAND_cleanup in fork_process.c to force a re-seed to
ensure that two backends cannot share a sequence.  OpenSSL 1.1.0 deprecated
RAND_cleanup, and contrary to how they usually leave deprecated APIs working
until removed, they decided to silently make this call a noop like below:

#   define RAND_cleanup() while(0) continue

This leaves our defence against pool sharing seemingly useless, and also runs
counter to the recommendations of OpenSSL for versions > 1.1.0 and < 1.1.1 where
the RNG was rewritten:

https://wiki.openssl.org/index.php/Random_fork-safety

The silver lining here is that while OpenSSL nooped RAND_cleanup, they also
changed what is mixed into seeding so we are still not sharing a sequence.  To
fix this, changing the RAND_cleanup call to RAND_poll should be enough to
ensure re-seeding after forking across all supported OpenSSL versions.  Patch
0001 implements this along with a comment referencing when it can be removed
(which most likely won't be for quite some time).

Another thing that stood out when reviewing this code is that we optimize for
RAND_poll failing in pg_strong_random, when we already have RAND_status
checking for a sufficiently seeded RNG for us.  ISTM that we can simplify the
code by letting RAND_status do the work as per 0002, and also (while unlikely)
survive any transient failures in RAND_poll by allowing all the retries we've
defined for the loop.

Also, as a disclaimer, this was brought up with the PostgreSQL security team
first, who have given permission to discuss this in public.

Thoughts on these?

cheers ./daniel



0002-Remove-optimization-for-RAND_poll-failing.patch
Description: Binary data


0001-Use-RAND_poll-for-seeding-randomness-after-fork.patch
Description: Binary data


--
VMware





Comment referencing incorrect algorithm

2020-07-21 Thread Daniel Gustafsson
While poking around our crypto code, I noticed that a comment in sha2.h was
referencing SHA-1, an algorithm not supported by that code.  The attached
fixes the comment, aligning it with the other comments in the file.

cheers ./daniel



sha1_comment.diff
Description: Binary data


Re: Improving psql slash usage help message

2020-07-21 Thread Hamid Akhtar
So are you suggesting not to fix this, or to do a more detailed review and
assess which other psql messages can be grouped together?

On Sun, Jul 12, 2020 at 8:15 PM Tom Lane  wrote:

> Hamid Akhtar  writes:
> > On Sun, Apr 26, 2020 at 1:03 AM David G. Johnston <
> > david.g.johns...@gmail.com> wrote:
> >> On Sat, Apr 25, 2020 at 12:29 PM Hamid Akhtar 
> >> wrote:
> >>> "\dE" displays the list with a "List of relations" heading whereas
> "\det"
> >>> displays "List of foreign tables". So, to differentiate the two, I
> suggest
> >>> to change the help message for "\dE" to:
> >>> \dE[S+] [PATTERN]  list foreign relations
>
> >> help.c and the documentation need to be synchronized a bit more than
> this
> >> single issue.
> >> Calling it "foreign relation" for \dE and "foreign table" for \det does
> >> convey that there is a difference - not sure it a huge improvement
> though.
> >> The "\d[Eimstv]" family of meta-commands should, in the help, probably
> be
> >> moved together to show the fact that they are basically "list relation
> >> names [of this type only]" while "\det" is "list foreign table info".
>
> FWIW, I agree with David on this point.  I find it bizarre and unhelpful
> that slashUsage shows \dt, \di, etc as separate commands and fails to
> indicate that they can be combined.  We could merge these entries into
>
> fprintf(output, _("  \\d[tivmsE][S+] [PATRN] list relations of
> specified type(s)\n"));
>
> which would both remind people that they can give more than one type,
> and shorten the already-overly-long list.
>
> > I think from a user perspective, grouping these wouldn't be helpful. For
> > example, it may cause FDW related commands to be spread through out the
> > help.
>
> That seems quite irrelevant to this proposal.
>
> regards, tom lane
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-07-21 Thread Dilip Kumar
On Tue, Jul 21, 2020 at 11:00 AM Dilip Kumar  wrote:
>
> On Tue, Jul 21, 2020 at 2:00 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2020-07-17 16:16:23 +0530, Dilip Kumar wrote:
> > > The attached patch allows the vacuum to continue by emitting WARNING
> > > for the corrupted tuple instead of immediately error out as discussed
> > > at [1].
> > >
> > > Basically, it provides a new GUC called vacuum_tolerate_damage, to
> > > control whether to continue the vacuum or to stop on the occurrence of
> > > a corrupted tuple.  So if the vacuum_tolerate_damage is set then in
> > > all the cases in heap_prepare_freeze_tuple where the corrupted xid is
> > > detected, it will emit a warning and return that nothing is changed in
> > > the tuple and the 'tuple_totally_frozen' will also be set to false.
> > > Since we are returning false the caller will not try to freeze such
> > > tuple and the tuple_totally_frozen is also set to false so that the
> > > page will not be marked to all frozen even if all other tuples in the
> > > page are frozen.
> >
> > I'm extremely doubtful this is a good idea. In all likelihood this will
> > just exacerbate corruption.
> >
> > You cannot just stop freezing tuples, that'll lead to relfrozenxid
> > getting *further* out of sync with the actual table contents. And you
> > cannot just freeze such tuples, because that has a good chance of making
> > deleted tuples suddenly visible, leading to unique constraint violations
> > etc. Which will then subsequently lead to clog lookup errors and such.
>
> I agree with the point.  But if we keep giving the ERROR in such
> cases, the situation is not any better: we neither freeze such a
> tuple nor can we advance the relfrozenxid.  So we follow the same
> rule, i.e. we don't freeze those tuples and don't advance the
> relfrozenxid; the only difference is that the vacuum is allowed to
> continue with the other tuples.
>
> > At the very least you'd need to signal up that relfrozenxid/relminmxid
> > cannot be increased. Without that I think it's entirely unacceptable to
> > do this.
>
> I agree with that point.  I was just unsure whether we should disallow
> advancing the relfrozenxid in all such cases, or whether, in cases where
> the xid already precedes the relfrozenxid, we could allow it to advance
> since it cannot become any worse.  But, as Alvaro pointed out, there
> is no point in optimizing such cases.  I will update the patch to
> stop advancing the relfrozenxid if we find any corrupted xid during
> tuple freeze.
>
> > If we really were to do something like this the option would need to be
> > called vacuum_allow_making_corruption_worse or such. It needs to be
> > *exceedingly* clear that it will likely lead to making everything much
> > worse.
> >
> Maybe we can clearly describe this in the document.
>
> > > @@ -6123,6 +6124,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > >   frz->t_infomask = tuple->t_infomask;
> > >   frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
> >
> > I don't think it can be right to just update heap_prepare_freeze_tuple()
> > but not FreezeMultiXactId().
>
> oh, I missed this part.  I will fix it.

Please find the updated patch.  In this version, we don't allow the
relfrozenxid and relminmxid to advance if corruption is detected,
and we have also added the handling in FreezeMultiXactId.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch
Description: Binary data


Re: Parallel Seq Scan vs kernel read ahead

2020-07-21 Thread Amit Kapila
On Tue, Jul 21, 2020 at 3:08 PM k.jami...@fujitsu.com
 wrote:
>
> On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> > On Tue, Jul 21, 2020 at 8:06 AM k.jami...@fujitsu.com 
> > 
> > wrote:
> > >
> > > I am definitely missing something. Perhaps I think I could not
> > > understand why there's no I/O difference between the Master and
> > > Patched (V2). Or has it been already improved even without this patch?
> > >
> >
> > I don't think it is strange that you are not seeing much difference because 
> > as per
> > the initial email by Thomas this patch is not supposed to give benefits on 
> > all
> > systems.  I think we wanted to check that the patch should not regress
> > performance in cases where it doesn't give benefits.  I think it might be 
> > okay to
> > run with a higher number of workers than you have CPUs in the system as we
> > wanted to check if such cases regress as shown by Soumyadeep above [1].  Can
> > you once try with
> > 8 and or 10 workers as well?
> >
>
> You are right. Kindly excuse me on that part; it only means the patch may or
> may not have any benefits on the filesystem I am using. But for other
> filesystems, as we can see from David's benchmarks, there are significant benefits.
>
> Following your advice on regression test case, I increased the number of 
> workers,
> but the query planner still capped the workers at 6, so the results from 6 
> workers onwards
> are almost the same.
>

I am slightly confused: if the number of workers is capped at 6, then
what exactly does the data at the 32-worker count mean?  If you want the
query planner to choose more workers, then I think you either need to
increase the amount of data or use ALTER TABLE <table_name> SET
(parallel_workers = <num_workers>);
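
For example, something like this (table name and worker counts are just
placeholders):

ALTER TABLE test_tbl SET (parallel_workers = 8);
SET max_parallel_workers_per_gather = 8;
EXPLAIN (ANALYZE) SELECT count(*) FROM test_tbl;  -- check Workers Planned/Launched
ALTER TABLE test_tbl RESET (parallel_workers);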

> I don't see significant difference between master and patched on my machine
> as per my test results below. (Just for reconfirmation)
>

I see a difference of about 7-8% at the higher (32) client count.  Am I
missing something?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-07-21 Thread Amit Kapila
On Fri, Jul 17, 2020 at 2:09 PM vignesh C  wrote:
>
> >
> > Please find the updated patch with the fixes included.
> >
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>

Please make sure to use a version number with each patch series, as that
makes it easier for the reviewer to verify the changes done in the latest
version of the patch.  One way is to use a command like "git
format-patch -6 -v <version>", or you can add the
version number manually.

Review comments:
===

0001-Copy-code-readjustment-to-support-parallel-copy
1.
@@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
  else
  nbytes = 0; /* no data need be saved */

+ if (cstate->copy_dest == COPY_NEW_FE)
+ minread = RAW_BUF_SIZE - nbytes;
+
  inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
-   1, RAW_BUF_SIZE - nbytes);
+   minread, RAW_BUF_SIZE - nbytes);

No comment to explain why this change is done?

0002-Framework-for-leader-worker-in-parallel-copy
2.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset &
the size of
+ * the record in DSM for the workers to copy the data into the relation.
+ * This is protected by the following sequence in the leader & worker. If they
+ * don't follow this order the worker might process wrong line_size and leader
+ * might populate the information which worker has not yet processed or in the
+ * process of processing.
+ * Leader should operate in the following order:
+ * 1) check if line_size is -1, if not wait, it means worker is still
+ * processing.
+ * 2) set line_state to LINE_LEADER_POPULATING.
+ * 3) update first_block, start_offset & cur_lineno in any order.
+ * 4) update line_size.
+ * 5) update line_state to LINE_LEADER_POPULATED.
+ * Worker should operate in the following order:
+ * 1) check line_state is LINE_LEADER_POPULATED, if not it means
leader is still
+ * populating the data.
+ * 2) read line_size.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *using pg_atomic_compare_exchange_u32, worker will change the state to
+ *LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 4) read first_block, start_offset & cur_lineno in any order.
+ * 5) process line_size data.
+ * 6) update line_size to -1.
+ */
+typedef struct ParallelCopyLineBoundary

Are we doing all this state management to avoid using locks while
processing lines?  If so, I think we can use either spinlock or LWLock
to keep the main patch simple and then provide a later patch to make
it lock-less.  This will allow us to first focus on the main design of
the patch rather than trying to make this datastructure processing
lock-less in the best possible way.

3.
+ /*
+ * Actual lines inserted by worker (some records will be filtered based on
+ * where condition).
+ */
+ pg_atomic_uint64 processed;
+ pg_atomic_uint64 total_worker_processed; /* total processed records
by the workers */

The difference between processed and total_worker_processed is not
clear.  Can we expand the comments a bit?

4.
+ * SerializeList - Insert a list into shared memory.
+ */
+static void
+SerializeList(ParallelContext *pcxt, int key, List *inputlist,
+   Size est_list_size)
+{
+ if (inputlist != NIL)
+ {
+ ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo
*)shm_toc_allocate(pcxt->toc,
+ est_list_size);
+ CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
+ shm_toc_insert(pcxt->toc, key, sharedlistinfo);
+ }
+}

Why do we need to write a special mechanism (CopyListSharedMemory) to
serialize a list?  Why can't we use nodeToString?  It should be able
to take care of the List datatype; see outNode, which is called from
nodeToString.  Once you do that, I think you won't even need
EstimateLineKeysList; strlen should work instead.

Please check whether you have any similar special handling for other types
that could be dealt with using nodeToString.

5.
+ MemSet(shared_info_ptr, 0, est_shared_info);
+ shared_info_ptr->is_read_in_progress = true;
+ shared_info_ptr->cur_block_pos = -1;
+ shared_info_ptr->full_transaction_id = full_transaction_id;
+ shared_info_ptr->mycid = GetCurrentCommandId(true);
+ for (count = 0; count < RINGSIZE; count++)
+ {
+ ParallelCopyLineBoundary *lineInfo =
&shared_info_ptr->line_boundaries.ring[count];
+ pg_atomic_init_u32(&(lineInfo->line_size), -1);
+ }
+

You can move this initialization in a separate function.

6.
In function BeginParallelCopy(), you need to keep a provision to
collect wal_usage and buf_usage stats.  See _bt_begin_parallel for
reference.  Those will be required for pg_stat_statements.

7.
DeserializeString() -- it is better to name this function as RestoreString.
ParallelWorkerInitialization() -- it is better to name this function
as InitializeParallelCopyInfo or something like that, the current name
is quite confusing.
ParallelCopyLeader() -- how about 

Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-07-21 Thread Amit Khandekar
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar  wrote:
> I tried this in utils/adt/Makefile :
> +
> +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
> +
> and it works.
>
> CFLAGS_VECTOR also includes the -funroll-loops option, which, I
> believe, had shown improvements in the checksum.c runs ( [1] ). This
> option makes the object file a bit bigger. For numeric.o, its size
> increased by 15K, from 116672 to 131360 bytes. I ran the
> multiplication test, and didn't see any additional speed-up with this
> option. Also, it does not seem to be related to vectorization. So I
> was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
> CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and
> numeric.c can use only CFLAGS_VECTOR.

I did as above. Attached is the v2 patch.

As with the existing CFLAGS_VECTOR, an environment variable of that name
can also be set when running configure. I did the same for
CFLAGS_UNROLL_LOOPS.

Now, developers who are already setting the CFLAGS_VECTOR environment
variable when running configure are presumably doing so because their
compilers don't support these particular options, and so they supply some
equivalent flags instead. numeric.c will now be compiled with CFLAGS_VECTOR,
so for them it will be compiled with their equivalent of the vectorize and
unroll-loops options, which is OK, I think. The only effect is that the
numeric.o size will increase, that's it.
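
In case anyone wants to reproduce the numbers quickly, a crude micro-benchmark
is something like the one below (the digit counts and the row count are
arbitrary; run it with \timing on, before and after the patch):

SELECT count(x * y)
FROM (SELECT ('1.' || repeat('123456789', 28))::numeric AS x,
             ('2.' || repeat('987654321', 28))::numeric AS y) AS v,
     generate_series(1, 1000000);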

>
> [1] 
> https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e




-- 
Thanks,
-Amit Khandekar
Huawei Technologies
From 029373ab2e9d2e6802326871f248ba2f21ffb204 Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Tue, 21 Jul 2020 16:42:51 +0800
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric
 product

A 'for' loop in mul_var() runs backwards by decrementing two
variables. This prevents the gcc compiler from auto-vectorizing the
for loop. So make it a forward loop with a single variable. This gives
performance benefits for product of numeric types with large
precision, with speedups becoming noticeable from values
with precisions starting from 20-40. Typical pattern of benefit is :
precision 50: 5%; precision 60: 11%; 120 : 50%; 240: 2.2x; and so on.
On some CPU architectures, the speedup starts from 20 precision
onwards.
With the precisions used in the numeric_big regression test, the
multiplication speeds up by 2.5 to 2.7 times.

Auto-vectorization happens with the gcc -O3 flag or -ftree-loop-vectorize.
So arrange for the -ftree-loop-vectorize flag specifically when compiling
numeric.c. CFLAGS_VECTOR was already present for similar
functionality in checksum.c, but CFLAGS_VECTOR also includes
-funroll-loops, which unnecessarily makes numeric.o larger. So split
CFLAGS_VECTOR into CFLAGS_UNROLL_LOOPS and CFLAGS_VECTOR.
---
 configure | 17 +++--
 configure.in  |  9 +++--
 src/Makefile.global.in|  1 +
 src/backend/storage/page/Makefile |  2 +-
 src/backend/utils/adt/Makefile|  3 +++
 src/backend/utils/adt/numeric.c   | 11 ---
 6 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/configure b/configure
index cb8fbe1051..36ca734ad5 100755
--- a/configure
+++ b/configure
@@ -734,6 +734,7 @@ CPP
 CFLAGS_SL
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
+CFLAGS_UNROLL_LOOPS
 CFLAGS_VECTOR
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
@@ -5266,10 +5267,13 @@ BITCODE_CFLAGS=""
 user_BITCODE_CXXFLAGS=$BITCODE_CXXFLAGS
 BITCODE_CXXFLAGS=""
 
-# set CFLAGS_VECTOR from the environment, if available
+# set CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS from the environment, if available
 if test "$ac_env_CFLAGS_VECTOR_set" = set; then
   CFLAGS_VECTOR=$ac_env_CFLAGS_VECTOR_value
 fi
+if test "$ac_env_CFLAGS_UNROLL_LOOPS_set" = set; then
+  CFLAGS_UNROLL_LOOPS=$ac_env_CFLAGS_UNROLL_LOOPS_value
+fi
 
 # Some versions of GCC support some additional useful warning flags.
 # Check whether they are supported, and add them to CFLAGS if so.
@@ -6102,16 +6106,16 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then
 fi
 
 
-  # Optimization flags for specific files that benefit from vectorization
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR" >&5
-$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR... " >&6; }
+  # Optimization flags for specific files that benefit from loop unrolling
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS" >&5
+$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS... " >&6; }
 if ${pgac_cv_prog_CC_cflags__funroll_loops+:} false; then :
   $as_echo_n "(cached) " >&6
 else
   pgac_save_CFLAGS=$CFLAGS
 pgac_save_CC=$CC
 CC=${CC}
-CFLAGS="${CFLAGS_VECTOR} -funroll-loops"
+CFLAGS="${CFLAGS_UNROLL_LOOPS} -funroll-loops"
 ac_save_c_werror_flag=$ac_c_werror_flag
 

RE: Performing partition pruning using row value

2020-07-21 Thread kato-...@fujitsu.com
>So, after looking at these functions and modifying this patch, I would like to 
>add this patch to the next

I updated this patch and registered it for the next CF.

https://commitfest.postgresql.org/29/2654/
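
For reference, the kind of query this patch targets is a row-wise comparison
on the partition key, for example (a hypothetical range-partitioned table):

CREATE TABLE t (a int, b int) PARTITION BY RANGE (a, b);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (0, 0) TO (10, 10);
CREATE TABLE t2 PARTITION OF t FOR VALUES FROM (10, 10) TO (20, 20);

-- with the patch, partitions that cannot match should be pruned
EXPLAIN SELECT * FROM t WHERE (a, b) >= (10, 10);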

regards, 
sho kato


pruning-with-row-wise-comparison-v2.patch
Description: pruning-with-row-wise-comparison-v2.patch


Re: Wrong results from in_range() tests with infinite offset

2020-07-21 Thread Dean Rasheed
On Tue, 21 Jul 2020 at 03:06, Tom Lane  wrote:
>
> Pushed, but I chickened out of back-patching.  The improvement in what
> happens for finite comparison values seems somewhat counterbalanced by
> the possibility that someone might not like the definition we arrived
> at for infinities.  So, it's not quite an open-and-shut bug fix, so
> I just put it in HEAD (for now anyway).
>

Yeah, that seems fair enough, and it's quite an obscure corner-case
that has gone unnoticed for quite some time.
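
For the archives, the corner case in question is a window frame defined with an
infinite offset, something along these lines:

SELECT x, count(*) OVER (ORDER BY x
                         RANGE BETWEEN 'infinity'::float8 PRECEDING
                                   AND CURRENT ROW)
FROM (VALUES (1.0::float8), (2.0), ('infinity')) AS v(x);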

Regards,
Dean




WAL segment switch on pg_start_backup()

2020-07-21 Thread @usernamedt
Hi,

I am currently exploring the pg_start_backup() and pg_stop_backup() functions.

In the documentation
(https://www.postgresql.org/docs/9.0/functions-admin.html), it is stated that
after calling pg_stop_backup() Postgres switches to a new WAL segment file,
but it doesn't say the same for pg_start_backup().

However, I found the following comment regarding pg_start_backup() in the 
source code:

Excerpt from Postgres source code 
https://doxygen.postgresql.org/xlog_8c_source.html#l10595

* Force an XLOG file switch before the checkpoint, to ensure that the
* WAL segment the checkpoint is written to doesn't contain pages with
* old timeline IDs. That would otherwise happen if you called
* pg_start_backup() right after restoring from a PITR archive: the
* first WAL segment containing the startup checkpoint has pages in
* the beginning with the old timeline ID. That can cause trouble at
* recovery: we won't have a history file covering the old timeline if
* pg_wal directory was not included in the base backup and the WAL
* archive was cleared too before starting the backup.

So does it mean that Postgres always switches to a new WAL segment file on the
pg_start_backup() call too?

If so, as I understand it, the newly created WAL segment file should start with
the checkpoint and should not contain any WAL records for events that
happened before the pg_start_backup() call?
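
To check this, I was planning to simply compare the current WAL segment name
around the call, roughly like this (function names as in recent releases; the
label is arbitrary):

SELECT pg_walfile_name(pg_current_wal_lsn());  -- note the segment name
SELECT pg_start_backup('test_label');
SELECT pg_walfile_name(pg_current_wal_lsn());  -- has the segment changed?
SELECT pg_stop_backup();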

Thanks,
Daniil Zakhlystov

Re: new heapcheck contrib module

2020-07-21 Thread Amul Sul
On Tue, Jul 21, 2020 at 10:58 AM Amul Sul  wrote:
>
> Hi Mark,
>
> I think the new structures should be listed in src/tools/pgindent/typedefs.list;
> otherwise, pgindent might disturb their indentation.
>
> Regards,
> Amul
>
>
> On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
>  wrote:
> >
> >
> >
> > > On Jul 16, 2020, at 12:38 PM, Robert Haas  wrote:
> > >
> > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger  
> > > wrote:
> > >> The v10 patch without these ideas is here:
> > >
> > > Along the lines of what Alvaro was saying before, I think this
> > > definitely needs to be split up into a series of patches. The commit
> > > message for v10 describes it doing three pretty separate things, and I
> > > think that argues for splitting it into a series of three patches. I'd
> > > argue for this ordering:
> > >
> > > 0001 Refactoring existing amcheck btree checking functions to optionally
> > > return corruption information rather than ereport'ing it.  This is
> > > used by the new pg_amcheck command line tool for reporting back to
> > > the caller.
> > >
> > > 0002 Adding new function verify_heapam for checking a heap relation and
> > > associated toast relation, if any, to contrib/amcheck.
> > >
> > > 0003 Adding new contrib module pg_amcheck, which is a command line
> > > interface for running amcheck's verifications against tables and
> > > indexes.
> > >
> > > It's too hard to review things like this when it's all mixed together.
> >
> > The v11 patch series is broken up as you suggest.
> >
> > > +++ b/contrib/amcheck/t/skipping.pl
> > >
> > > The name of this file is inconsistent with the tree's usual
> > > convention, which is all stuff like 001_whatever.pl, except for
> > > src/test/modules/brin, which randomly decided to use two digits
> > > instead of three. There's no precedent for a test file with no leading
> > > numeric digits. Also, what does "skipping" even have to do with what
> > > the test is checking? Maybe it's intended to refer to the new error
> > > handling "skipping" the actual error in favor of just reporting it
> > > without stopping, but that's not really what the word "skipping"
> > > normally means. Finally, it seems a bit over-engineered: do we really
> > > need 183 test cases to check that detecting a problem doesn't lead to
> > > an abort? Like, if that's the purpose of the test, I'd expect it to
> > > check one corrupt relation and one non-corrupt relation, each with and
> > > without the no-error behavior. And that's about it. Or maybe it's
> > > talking about skipping pages during the checks, because those pages
> > > are all-visible or all-frozen? It's not very clear to me what's going
> > > on here.
> >
> > The "skipping" did originally refer to testing verify_heapam()'s option to 
> > skip all-visible or all-frozen blocks.  I have renamed it 
> > 001_verify_heapam.pl, since it tests that function.
> >
> > >
> > > + TransactionId nextKnownValidXid;
> > > + TransactionId oldestValidXid;
> > >
> > > Please add explanatory comments indicating what these are intended to
> > > mean.
> >
> > Done.
> >
> > > For most of the the structure members, the brief comments
> > > already present seem sufficient; but here, more explanation looks
> > > necessary and less is provided. The "Values for returning tuples"
> > > could possibly also use some more detail.
> >
> > Ok, I've expanded the comments for these.
> >
> > > +#define HEAPCHECK_RELATION_COLS 8
> > >
> > > I think this should really be at the top of the file someplace.
> > > Sometimes people have adopted this style when the #define is only used
> > > within the function that contains it, but that's not the case here.
> >
> > Done.
> >
> > >
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > + errmsg("unrecognized parameter for 'skip': %s", skip),
> > > + errhint("please choose from 'all visible', 'all frozen', "
> > > + "or NULL")));
> > >
> > > I think it would be better if we had three string values selecting the
> > > different behaviors, and made the parameter NOT NULL but with a
> > > default. It seems like that would be easier to understand. Right now,
> > > I can tell that my options for what to skip are "all visible", "all
> > > frozen", and, uh, some other thing that I don't know what it is. I'm
> > > gonna guess the third option is to skip nothing, but it seems best to
> > > make that explicit. Also, should we maybe consider spelling this
> > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> > > Spaces in an option value seems a little icky to me somehow.
> >
> > I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults 
> > to 'none'.  I did not mark the function as strict, as I think NULL is a 
> > reasonable value (and the default) for startblock and endblock.
> >
> > > + int64 startblock = -1;
> > > + int64 endblock = -1;
> > > ...
> > > + if (!PG_ARGISNULL(3))
> > > + startblock = PG_GETARG_INT64(3);
> > > + if (!PG_ARGISNULL(4))
> > > + endblock =