Re: SyncRepLock acquired exclusively in default configuration

2020-08-26 Thread Asim Praveen

> On 26-Aug-2020, at 11:10 PM, Fujii Masao  wrote:
> 
> I added the following comments based on the suggestion by Sawada-san 
> upthread. Thought?
> 
> +  * Since this routine gets called every commit time, it's important to
> +  * exit quickly if sync replication is not requested. So we check
> +  * WalSndCtl->sync_standbys_define without the lock and exit
> +  * immediately if it's false. If it's true, we check it again later
> +  * while holding the lock, to avoid the race condition described
> +  * in SyncRepUpdateSyncStandbysDefined().
> 

+1.  May I suggest the following addition to the above comment (feel free to
rephrase / reject)?

"If sync_standbys_defined was being set from false to true and we observe it as
false, it ok to skip the wait.  Replication was async and it is in the process
of being changed to sync, due to user request.  Subsequent commits will observe
the change and start waiting.”

> 
> Attached is the updated version of the patch. I didn't change how to
> fix the issue. But I changed the check for fast exit so that it's called
> before setting the "mode", to avoid a few cycle.
> 


Looks good to me.  There is a typo in the comment:

s/sync_standbys_define/sync_standbys_defined/

Asim
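
For illustration only (not the patch under discussion): a minimal, self-contained
sketch of the "check without the lock, then recheck under the lock" pattern the
comment describes, using a plain pthread mutex and stand-in names for SyncRepLock
and WalSndCtl->sync_standbys_defined.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t sync_rep_lock = PTHREAD_MUTEX_INITIALIZER;
static bool sync_standbys_defined = false;	/* stand-in for the shared flag */

static void
wait_for_sync_rep(void)
{
	/*
	 * Fast exit without the lock: a stale "false" only means this commit
	 * skips the wait; later commits will observe the new value.
	 */
	if (!sync_standbys_defined)
		return;

	pthread_mutex_lock(&sync_rep_lock);
	if (!sync_standbys_defined)		/* recheck under the lock */
	{
		pthread_mutex_unlock(&sync_rep_lock);
		return;
	}
	/* ... the real code would queue the backend and wait here ... */
	pthread_mutex_unlock(&sync_rep_lock);
}

int
main(void)
{
	wait_for_sync_rep();
	puts("skipped wait: sync replication not requested");
	return 0;
}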

Re: list of extended statistics on psql

2020-08-26 Thread Tatsuro Yamada

Hi Julien and Pavel!


> How about using \dX rather than \dz?

Thanks for your suggestion!
I'll replace it if I get consensus. :-D



I re-read the help message for the \d* commands and realized it's better to
use "\dX".
There are already cases where commands differ only in letter case,
so I did the same. Please find the attached patch. :-D
 
For example:

==
  \da[S]  [PATTERN]  list aggregates
  \dA[+]  [PATTERN]  list access methods
==

Attached patch uses "\dX" instead of "\dz":
==
  \dx[+]  [PATTERN]  list extensions
  \dX [PATTERN]  list extended statistics
==

Results of regress test of the feature are the following:
==
-- check printing info about extended statistics
create table t1 (a int, b int);
create statistics stts_1 (dependencies) on a, b from t1;
create statistics stts_2 (dependencies, ndistinct) on a, b from t1;
create statistics stts_3 (dependencies, ndistinct, mcv) on a, b from t1;
create table t2 (a int, b int, c int);
create statistics stts_4 on b, c from t2;
create table hoge (col1 int, col2 int, col3 int);
create statistics stts_hoge on col1, col2, col3 from hoge;

\dX
                            List of extended statistics
 Schema | Table |   Name    |     Columns      | Ndistinct | Dependencies | MCV
--------+-------+-----------+------------------+-----------+--------------+-----
 public | hoge  | stts_hoge | col1, col2, col3 | t         | t            | t
 public | t1    | stts_1    | a, b             | f         | t            | f
 public | t1    | stts_2    | a, b             | t         | t            | f
 public | t1    | stts_3    | a, b             | t         | t            | t
 public | t2    | stts_4    | b, c             | t         | t            | t
(5 rows)

\dX stts_?
                   List of extended statistics
 Schema | Table |  Name  | Columns | Ndistinct | Dependencies | MCV
--------+-------+--------+---------+-----------+--------------+-----
 public | t1    | stts_1 | a, b    | f         | t            | f
 public | t1    | stts_2 | a, b    | t         | t            | f
 public | t1    | stts_3 | a, b    | t         | t            | t
 public | t2    | stts_4 | b, c    | t         | t            | t
(4 rows)

\dX *hoge
                            List of extended statistics
 Schema | Table |   Name    |     Columns      | Ndistinct | Dependencies | MCV
--------+-------+-----------+------------------+-----------+--------------+-----
 public | hoge  | stts_hoge | col1, col2, col3 | t         | t            | t
(1 row)
==


Thanks,
Tatsuro Yamada


diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index fc16e6c..ace8e5f 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1893,6 +1893,18 @@ testdb=>
 
 
   
+  <varlistentry>
+    <term><literal>\dX [ <replaceable class="parameter">pattern</replaceable> ]</literal></term>
+    <listitem>
+      <para>
+        Lists extended statistics.
+        If <replaceable class="parameter">pattern</replaceable>
+        is specified, only those extended statistics whose names match the pattern
+        are listed.
+      </para>
+    </listitem>
+  </varlistentry>
+
 
   
 \dy[+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 9902a4a..dcc9fba 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -929,6 +929,9 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 			else
 				success = listExtensions(pattern);
 			break;
+		case 'X':			/* Extended Statistics */
+			success = listExtendedStats(pattern);
+			break;
 		case 'y':			/* Event Triggers */
 			success = listEventTriggers(pattern, show_verbose);
 			break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d81f157..77b3074 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4368,6 +4368,74 @@ listEventTriggers(const char *pattern, bool verbose)
 }
 
 /*
+ * \dX
+ *
+ * Briefly describes extended statistics.
+ */
+bool
+listExtendedStats(const char *pattern)
+{
+   PQExpBufferData buf;
+   PGresult   *res;
+   printQueryOpt myopt = pset.popt;
+
+	if (pset.sversion < 10)
+	{
+		char		sverbuf[32];
+
+		pg_log_error("The server (version %s) does not support extended statistics.",
+					 formatPGVersionNumber(pset.sversion, false,
+										   sverbuf, sizeof(sverbuf)));
+		return true;
+	}
+
+   initPQExpBuffer(&buf);
+   printfPQExpBuffer(&buf,
+  

Re: [POC]Enable tuple change partition caused by BEFORE TRIGGER

2020-08-26 Thread Ashutosh Bapat
On Wed, 26 Aug 2020 at 22:47, Alvaro Herrera 
wrote:

>
> What this is saying to me is that we'd need to make sure to run the
> final target partition's AFTER triggers, not the original target
> partition.


Agreed.


>   But I'm not 100% about running the BEFORE triggers.  Maybe
> one way to address this is to check whether the BEFORE triggers in the
> new target partition are clones; if so then they would have run in the
> original target partition and so must not be run, but otherwise they
> have to.
>

This will work as long as the two BEFORE ROW triggers have the same effect.
Consider two situations that result in inserting identical rows: 1. a row that
the BEFORE ROW trigger has redirected to a new partition, say part2; 2. a row
inserted directly into part2. If both these rows are identical before the
BEFORE ROW triggers have been applied, they should remain identical when
inserted into part2. Any divergence might be problematic for the application.

-- 
Best Wishes,
Ashutosh


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-26 Thread Amit Kapila
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes  wrote:
>
>
> On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila  wrote:
>
>>
>>  I am planning
>> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
>> this series tomorrow unless you have any comments on the same.
>
>
>
> I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c 
> line 288 needs to be:
>
> bool        found PG_USED_FOR_ASSERTS_ONLY = false;
>

Thanks for the report. Tom Lane has already fixed this [1].

[1] - 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

-- 
With Regards,
Amit Kapila.




Re: Improvements in Copy From

2020-08-26 Thread Peter Smith
Hello.

FYI - that patch has conflicts when applied.

Kind Regards
Peter Smith
Fujitsu Australia.

On Thu, Aug 27, 2020 at 3:11 PM vignesh C  wrote:
>
> On Tue, Jul 14, 2020 at 12:17 PM vignesh C  wrote:
> >
> > On Tue, Jul 14, 2020 at 11:13 AM David Rowley  wrote:
> > >
> > > On Tue, 14 Jul 2020 at 17:22, David Rowley  wrote:
> > > >
> > > > On Thu, 2 Jul 2020 at 00:46, vignesh C  wrote:
> > > > > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > > > > that is not being used, it can be removed.
> > > >
> > > > This was raised in [1]. We decided not to remove it.
> > >
> > > I just added a comment to the function to mention why we want to keep
> > > the parameter. I hope that will save any wasted time proposing its
> > > removal in the future.
> > >
> > > FWIW, the function is inlined.  Removing it will gain us nothing
> > > performance-wise anyway.
> > >
> > > David
> > >
> > > > [1] 
> > > > https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef
> >
> > Thanks David for pointing it out; as this has already been discussed and
> > concluded, there is no point in discussing the same thing again. This patch
> > has a couple of other improvements which can still be taken forward. I
> > will remove this change and post a new patch that retains the other
> > fixes.
> >
>
> I have removed the changes that David had pointed out and retained the
> remaining changes. Attaching the patch for the same.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com




Clang UndefinedBehaviorSanitize (Postgres14) Detected undefined-behavior

2020-08-26 Thread Ranier Vilela
Hi,

Per Clang UBSan
Clang 10 (64 bits)
Postgres 14 (latest)

2020-08-27 01:02:14.930 -03 client backend[42432] pg_regress/create_table
STATEMENT:  create table defcheck_0 partition of defcheck for values in (0);
indexcmds.c:1162:22: runtime error: null pointer passed as argument 2,
which is declared to never be null
/usr/include/string.h:44:28: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior indexcmds.c:1162:22
in
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
/usr/include/string.h:65:33: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior clog.c:299:10 in

indexcmds.c (1162):
memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);

clog.c (299):
memcmp(subxids, MyProc->subxids.xids,
  nsubxids * sizeof(TransactionId)) == 0)

xact.c (5285)
memcpy(&workspace[i], s->childXids,
  s->nChildXids * sizeof(TransactionId));

snapmgr.c (590)
memcpy(CurrentSnapshot->xip, sourcesnap->xip,
  sourcesnap->xcnt * sizeof(TransactionId));
snapmgr.c (594)
memcpy(CurrentSnapshot->subxip, sourcesnap->subxip,
  sourcesnap->subxcnt * sizeof(TransactionId));

copyfuncs.c:1190
COPY_POINTER_FIELD(uniqColIdx, from->uniqNumCols * sizeof(AttrNumber));

1.STATEMENT:  CREATE TABLESPACE regress_tblspacewith LOCATION
'/usr/src/postgres/src/test/regress/testtablespace' WITH
(some_nonexistent_parameter = true);
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
2.STATEMENT:  CREATE TABLE testschema.dflt (a int PRIMARY KEY USING INDEX
TABLESPACE regress_tblspace) PARTITION BY LIST (a);
indexcmds.c:1162:22: runtime error: null pointer passed as argument 2,
which is declared to never be null
3.STATEMENT:  SELECT bool 'nay' AS error;
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
4.STATEMENT:  SELECT U&'wrong: +0061' UESCAPE '+';
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
5. STATEMENT:  ALTER TABLE circles ADD EXCLUDE USING gist
 (c1 WITH &&, (c2::circle) WITH &&);
xact.c:5285:25: runtime error: null pointer passed as argument 2, which is
declared to never be null
6.STATEMENT:  COMMENT ON CONSTRAINT the_constraint ON DOMAIN
no_comments_dom IS 'another bad comment';
snapmgr.c:590:31: runtime error: null pointer passed as argument 2, which
is declared to never be null
7.STATEMENT:  create trigger my_table_col_update_trig
 after update of b on my_table referencing new table as new_table
 for each statement execute procedure dump_insert();
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
/usr/include/string.h:65:33: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior clog.c:299:10 in
xact.c:5285:25: runtime error: null pointer passed as argument 2, which is
declared to never be null
/usr/include/string.h:44:28: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior xact.c:5285:25 in
snapmgr.c:590:31: runtime error: null pointer passed as argument 2, which
is declared to never be null
/usr/include/string.h:44:28: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior snapmgr.c:590:31 in
snapmgr.c:594:34: runtime error: null pointer passed as argument 2, which
is declared to never be null
/usr/include/string.h:44:28: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior snapmgr.c:594:34 in
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
/usr/include/string.h:65:33: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior clog.c:299:10 in
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
/usr/include/string.h:65:33: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior clog.c:299:10 in
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
/usr/include/string.h:65:33: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior clog.c:299:10 in
clog.c:299:10: runtime error: null pointer passed as argument 1, which is
declared to never be null
/usr/include/string.h:65:33: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior clog.c:299:10 in
8.STATEMENT:  select array_fill(1, array[[1,2],[3,4]]);
copyfuncs.c:1190:2: runtime error: null pointer passed as argument 2, which
is declared to never be null

I stopped counting the clog.c:299 occurrences.
If anyone wants the full report, it is about 2 MB.

Ranier Vilela
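
For illustration only (not the PostgreSQL fix): all of the call sites above pass
a NULL pointer together with a zero length, and the glibc headers declare those
memcpy()/memcmp() arguments as nonnull, so UBSan reports undefined behavior even
though no bytes are actually copied. A minimal, self-contained guard of the kind
typically applied is sketched below; copy_oids() and its types are hypothetical
stand-ins, not PostgreSQL code.

#include <string.h>

/* Hypothetical stand-in for a call like indexcmds.c:1162: skip the memcpy
 * entirely when there is nothing to copy, so a NULL source pointer is never
 * handed to a nonnull-declared argument. */
static void
copy_oids(unsigned int *dst, const unsigned int *src, size_t nparts)
{
	if (nparts > 0)
		memcpy(dst, src, nparts * sizeof(unsigned int));
}

int
main(void)
{
	unsigned int buf[4];

	copy_oids(buf, NULL, 0);	/* fine: memcpy is never called with NULL */
	return 0;
}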


RE: Implement UNLOGGED clause for COPY FROM

2020-08-26 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Sure, but on a daily basis, one requires only incremental WAL to
> complete the backup but in this case, it would require the entire
> database back up unless we have some form of block-level incremental
> backup method. 

Regarding the backup time, I think users can shorten it by using the storage
device's snapshotting (or split mirroring?) or the filesystem's snapshot feature.


> OTOH, I don't deny that there is some use case with
> wal_level = 'none' for initial data loading as MySQL provides but I
> think that is a separate feature than what is proposed here (Copy
> Unlogged). It might be better if you can start a separate thread for
> that with some more details on the implementation side as well.

Yeah, the feature doesn't match the title of this thread and could confuse 
other readers who join later.  Nevertheless, I think MySQL's feature could be 
used for additional data loading as well if the user understands what he/she is 
doing.  So, we can discuss it in another thread just in case the ongoing 
discussion gets stuck.


Regards
Takayuki Tsunakawa





Re: Implement UNLOGGED clause for COPY FROM

2020-08-26 Thread Amit Kapila
On Thu, Aug 27, 2020 at 7:04 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > So you want your users to shutdown and restart the server before Copy
> > because that would be required if you want to change the wal_level.
>
> Yes.  They seem to be fine with it, as far as I heard from a person who is 
> involved in the system design.
>
>
> > However, even if we do that, users who are running the server
> > previously with wal_level as 'replica' won't be happy after doing this
> > change. Because if they change the wal_level to 'none' for certain
> > operations like bulk load and then again change back the mode to
> > 'replica' they need to back up the database again to setup 'replica'
> > as they can't continue replication from the previous point (consider
> > update on a page for which previously WAL was not written).
>
> Yes, it requires the database backup.  The database backup should be a daily
> task anyway, so I expect it wouldn't impose an extra maintenance burden on the
> user.
>

Sure, but on a daily basis, one requires only incremental WAL to
complete the backup but in this case, it would require the entire
database back up unless we have some form of block-level incremental
backup method. OTOH, I don't deny that there is some use case with
wal_level = 'none' for initial data loading as MySQL provides but I
think that is a separate feature than what is proposed here (Copy
Unlogged). It might be better if you can start a separate thread for
that with some more details on the implementation side as well.

-- 
With Regards,
Amit Kapila.




Clang Address Sanitizer (Postgres14) Detected Memory Leaks

2020-08-26 Thread Ranier Vilela
Hi,

Is this something to worry about, or is it another problem with the
analysis tool that nobody cares about?
clang 10 (64 bits)
postgres 14 (latest)

==31422==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4560 byte(s) in 1 object(s) allocated from:
#0 0x50e33d in malloc
(/usr/src/postgres/tmp_install/usr/local/pgsql/bin/postgres+0x50e33d)
#1 0x186d52f in ConvertTimeZoneAbbrevs
/usr/src/postgres/src/backend/utils/adt/datetime.c:4511:8
#2 0x1d9b5e9 in load_tzoffsets
/usr/src/postgres/src/backend/utils/misc/tzparser.c:465:12
#3 0x1d8ca3f in check_timezone_abbreviations
/usr/src/postgres/src/backend/utils/misc/guc.c:11389:11
#4 0x1d6a398 in call_string_check_hook
/usr/src/postgres/src/backend/utils/misc/guc.c:11056:7
#5 0x1d68f29 in parse_and_validate_value
/usr/src/postgres/src/backend/utils/misc/guc.c:6870:10
#6 0x1d6567d in set_config_option
/usr/src/postgres/src/backend/utils/misc/guc.c:7473:11
#7 0x1d7f8f4 in ProcessGUCArray
/usr/src/postgres/src/backend/utils/misc/guc.c:10608:10
#8 0x9d0c8d in ApplySetting
/usr/src/postgres/src/backend/catalog/pg_db_role_setting.c:256:4
#9 0x1d4ad93 in process_settings
/usr/src/postgres/src/backend/utils/init/postinit.c:1174:2
#10 0x1d48e39 in InitPostgres
/usr/src/postgres/src/backend/utils/init/postinit.c:1059:2
#11 0x14a2c1a in BackgroundWorkerInitializeConnectionByOid
/usr/src/postgres/src/backend/postmaster/postmaster.c:5758:2
#12 0x853feb in ParallelWorkerMain
/usr/src/postgres/src/backend/access/transam/parallel.c:1373:2
#13 0x146e5fb in StartBackgroundWorker
/usr/src/postgres/src/backend/postmaster/bgworker.c:813:2
#14 0x14af69b in do_start_bgworker
/usr/src/postgres/src/backend/postmaster/postmaster.c:5879:4
#15 0x14a1487 in maybe_start_bgworkers
/usr/src/postgres/src/backend/postmaster/postmaster.c:6104:9
#16 0x149e5aa in sigusr1_handler
/usr/src/postgres/src/backend/postmaster/postmaster.c:5269:3
#17 0x7fcffa75a3bf  (/lib/x86_64-linux-gnu/libpthread.so.0+0x153bf)
#18 0x149d655 in PostmasterMain
/usr/src/postgres/src/backend/postmaster/postmaster.c:1414:11
#19 0x108402e in main /usr/src/postgres/src/backend/main/main.c:209:3
#20 0x7fcffa54e0b2 in __libc_start_main
/build/glibc-YYA7BZ/glibc-2.31/csu/../csu/libc-start.c:308:16

Direct leak of 1020 byte(s) in 15 object(s) allocated from:
#0 0x4fa6e4 in strdup
(/usr/src/postgres/tmp_install/usr/local/pgsql/bin/postgres+0x4fa6e4)
#1 0x1d6a1c7 in guc_strdup
/usr/src/postgres/src/backend/utils/misc/guc.c:4889:9
#2 0x1d7efc7 in set_config_sourcefile
/usr/src/postgres/src/backend/utils/misc/guc.c:7696:15
#3 0x1d7c95e in ProcessConfigFileInternal
/usr/src/postgres/src/backend/utils/misc/guc-file.l:478:4
#4 0x1d5b33f in ProcessConfigFile
/usr/src/postgres/src/backend/utils/misc/guc-file.l:156:9
#5 0x1d5ae7d in SelectConfigFiles
/usr/src/postgres/src/backend/utils/misc/guc.c:5674:2
#6 0x149b6ce in PostmasterMain
/usr/src/postgres/src/backend/postmaster/postmaster.c:884:7

Ranier Vilela


pg_index.indisreplident and invalid indexes

2020-08-26 Thread Michael Paquier
Hi all,

While digging into a different patch involving DROP INDEX CONCURRENTLY
and replica indexes, I have found that the handling of indisreplident
is inconsistent for invalid indexes:
https://www.postgresql.org/message-id/20200827022835.gm2...@paquier.xyz

In short, imagine the following sequence:
CREATE TABLE test_replica_identity_4 (id int NOT NULL);
CREATE UNIQUE INDEX test_replica_index_4 ON
  test_replica_identity_4(id);
ALTER TABLE test_replica_identity_4 REPLICA IDENTITY
  USING INDEX test_replica_index_4;
-- imagine that this fails in the second transaction used in
-- index_drop().
DROP INDEX CONCURRENTLY test_replica_index_4;
-- here the index still exists, is invalid, marked with
-- indisreplident.
CREATE UNIQUE INDEX test_replica_index_4_2 ON
  test_replica_identity_4(id);
ALTER TABLE test_replica_identity_4 REPLICA IDENTITY
  USING INDEX test_replica_index_4_2;
-- set back the index to a valid state.
REINDEX INDEX test_replica_index_4;
-- And here we have two valid indexes usable as replica identities.
SELECT indexrelid::regclass, indisvalid, indisreplident FROM pg_index
  WHERE indexrelid IN ('test_replica_index_4'::regclass,
   'test_replica_index_4_2'::regclass);
       indexrelid       | indisvalid | indisreplident
------------------------+------------+----------------
 test_replica_index_4_2 | t          | t
 test_replica_index_4   | t          | t
(2 rows)
 
You can just use the following trick to emulate a failure in DIC:
@@ -2195,6 +2195,9 @@ index_drop(Oid indexId, bool concurrent, bool
concurrent_lock_mode)
if (userIndexRelation->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
RelationDropStorage(userIndexRelation);
+   if (concurrent)
+   elog(ERROR, "burp");

This results in some problems for ALTER TABLE in tablecmds.c, as it is
possible to reach a state in the catalogs where we have *multiple*
indexes marked with indisreplident for REPLICA_IDENTITY_INDEX set on
the parent table.  Even worse, this messes up RelationGetIndexList(),
as it would set rd_replidindex in the relcache to the last index found
marked with indisreplident; depending on the order in which the indexes
are scanned, you may get a different replica index loaded.

I think that this problem is similar to indisclustered, and that we
had better set indisreplident to false when clearing indisvalid for an
index concurrently dropped.  This would prevent problems with ALTER
TABLE of course, but also the relcache.

Any objections to the attached?  I am not sure that this is worth a
backpatch as it is unlikely to be a problem in the field, so I'd like
to fix this issue only on HEAD.  This has existed since 9.4 and the
introduction of replica identities.
--
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1be27eec52..141632b748 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3349,10 +3349,13 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			 * CONCURRENTLY that failed partway through.)
 			 *
 			 * Note: the CLUSTER logic assumes that indisclustered cannot be
-			 * set on any invalid index, so clear that flag too.
+			 * set on any invalid index, so clear that flag too.  Similarly,
+			 * ALTER TABLE assumes that indisreplident cannot be set for
+			 * invalid indexes.
 			 */
 			indexForm->indisvalid = false;
 			indexForm->indisclustered = false;
+			indexForm->indisreplident = false;
 			break;
 		case INDEX_DROP_SET_DEAD:
 
@@ -3364,6 +3367,8 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			 * the index at all.
 			 */
 			Assert(!indexForm->indisvalid);
+			Assert(!indexForm->indisclustered);
+			Assert(!indexForm->indisreplident);
 			indexForm->indisready = false;
 			indexForm->indislive = false;
 			break;




Re: Parallel copy

2020-08-26 Thread Amit Kapila
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow  wrote:
>
> > I have attached new set of patches with the fixes.
> > Thoughts?
>
> Hi Vignesh,
>
> I don't really have any further comments on the code, but would like
> to share some results of some Parallel Copy performance tests I ran
> (attached).
>
> The tests loaded a 5GB CSV data file into a 100 column table (of
> different data types). The following were varied as part of the test:
> - Number of workers (1 – 10)
> - No indexes / 4-indexes
> - Default settings / increased resources (shared_buffers,work_mem, etc.)
>
> (I did not do any partition-related tests as I believe those type of
> tests were previously performed)
>
> I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
> The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.
>
>
> I observed the following trends:
> - For the data file size used, Parallel Copy achieved best performance
> using about 9 – 10 workers. Larger data files may benefit from using
> more workers. However, I couldn’t really see any better performance,
> for example, from using 16 workers on a 10GB CSV data file compared to
> using 8 workers. Results may also vary depending on machine
> characteristics.
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

I think the reason is that for 1 worker case there is not much
parallelization as a leader doesn't perform the actual load work.
Vignesh, can you please once see if the results are reproducible at
your end, if so, we can once compare the perf profiles to see why in
some cases we get improvement and in other cases not. Based on that we
can decide whether to allow the 1 worker case or not.

> - Typical load time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.
>

Nice improvement and I think you are right that with larger load data
we will get even better improvement.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-08-26 Thread Greg Nancarrow
> I have attached new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like
to share some results of some Parallel Copy performance tests I ran
(attached).

The tests loaded a 5GB CSV data file into a 100 column table (of
different data types). The following were varied as part of the test:
- Number of workers (1 – 10)
- No indexes / 4-indexes
- Default settings / increased resources (shared_buffers,work_mem, etc.)

(I did not do any partition-related tests as I believe those type of
tests were previously performed)

I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.


I observed the following trends:
- For the data file size used, Parallel Copy achieved best performance
using about 9 – 10 workers. Larger data files may benefit from using
more workers. However, I couldn’t really see any better performance,
for example, from using 16 workers on a 10GB CSV data file compared to
using 8 workers. Results may also vary depending on machine
characteristics.
- Parallel Copy with 1 worker ran slower than normal Copy in a couple
of cases (I did question if allowing 1 worker was useful in my patch
review).
- Typical load time improvement (load factor) for Parallel Copy was
between 2x and 3x. Better load factors can be obtained by using larger
data files and/or more indexes.
- Increasing Postgres resources made little or no difference to
Parallel Copy performance when the target table had no indexes.
Increasing Postgres resources improved Parallel Copy performance when
the target table had indexes.

Regards,
Greg Nancarrow
Fujitsu Australia
(1) Postgres default settings, 5GB CSV (510 rows), no indexes on table:

Copy Type       Duration (s)    Load factor
===========================================
Normal Copy 132.838 -

Parallel Copy
(#workers)
1   97.537  1.36
2   61.700  2.15
3   52.788  2.52
4   46.607  2.85
5   45.524  2.92
6   43.799  3.03
7   42.970  3.09
8   42.974  3.09
9   43.698  3.04
10  43.362  3.06


(2) Postgres default settings, 5GB CSV (510 rows), 4 indexes on table:

Copy Type       Duration (s)    Load factor
===========================================
Normal Copy 221.111 -

Parallel Copy
(#workers)
1   331.609 0.66
2   99.085  2.23
3   89.751  2.46
4   81.137  2.73
5   79.138  2.79
6   77.155  2.87
7   75.813  2.92
8   74.961  2.95
9   77.803  2.84
10  75.399  2.93


(3) Postgres increased resources, 5GB CSV (510 rows), no indexes on table:

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16 
checkpoint_timeout = 30min
max_wal_size=2GB


Copy Type       Duration (s)    Load factor
===========================================
Normal Copy 78.138  -

Parallel Copy
(#workers)
1   95.203  0.82
2   62.596  1.24
3   52.318  1.49
4   48.246  1.62
5   42.832  1.82
6   42.921  1.82
7   43.146  1.81
8   41.557  1.88
9   43.489  1.80
10  43.362  1.80


(4) Postgres increased resources, 5GB CSV (510 rows), 4 indexes on table:

Copy Type       Duration (s)    Load factor

Re: More tests with USING INDEX replident and dropped indexes

2020-08-26 Thread Michael Paquier
On Wed, Aug 26, 2020 at 09:08:51PM +0900, Michael Paquier wrote:
> I'll go through this part again tomorrow, it is late here.

There is actually a huge gotcha here that exists since replica
identities exist: relation_mark_replica_identity() only considers
valid indexes!  So, on HEAD, if DROP INDEX CONCURRENTLY fails in the
second transaction doing the physical drop, then we would finish with
a catalog entry that has indisreplident and !indisvalid.  If you
reindex it afterwards and switch the index back to be valid, there
can be more than one valid index marked with indisreplident, which
messes up with the logic in tablecmds.c because we use
RelationGetIndexList(), that discards invalid indexes.  This case is
actually rather similar to the handling of indisclustered.

So we have a set of two issues here:
1) indisreplident should be switched to false when we clear the valid
flag of an index for INDEX_DROP_CLEAR_VALID.  This issue exists since
9.4.
2) To never finish in a state where we have REPLICA IDENTITY INDEX
without an index marked as indisreplident, we need to reset the
replident of the parent table in the same transaction as the one
clearing indisvalid for the concurrent case.  That's a problem of the
patch proposed upthread.

Attached is a patch for 1) and 2) grouped together, to ease review for
now.  I think that we had better fix 1) separately though, so I am
going to start a new thread about that with a separate patch as the
current thread is misleading.
--
Michael
From 8893acae8b4d14b1cccb3bdcf8c23e46215d2060 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 27 Aug 2020 11:27:04 +0900
Subject: [PATCH v4] Reset pg_class.relreplident when associated replica index
 is dropped

When an index associated to REPLICA IDENTITY USING INDEX gets dropped,
relreplident gets set to 'i', which can be confusing as the catalogs
have a state inconsistent with pg_index that lacks an associated entry
marked with indisreplident.  This changes the logic so as such an index
dropped leads to 'n' to be set for the parent table, equivalent to
REPLICA IDENTITY NOTHING.

Author: Michael Paquier
Reviewed-by: Masahiko Sawada, Rahila Syed, Euler Taveira
Discussion: https://postgr.es/m/20200522035028.go2...@paquier.xyz
---
 src/include/commands/tablecmds.h  |  2 +
 src/backend/catalog/index.c   | 43 +-
 src/backend/commands/tablecmds.c  | 58 ---
 .../regress/expected/replica_identity.out | 30 ++
 src/test/regress/sql/replica_identity.sql | 22 +++
 doc/src/sgml/catalogs.sgml|  3 +-
 6 files changed, 132 insertions(+), 26 deletions(-)

diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index c1581ad178..b163a82c52 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -61,6 +61,8 @@ extern void ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_
 
 extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass);
 
+extern void SetRelationReplIdent(Oid relationId, char ri_type);
+
 extern ObjectAddress renameatt(RenameStmt *stmt);
 
 extern ObjectAddress RenameConstraint(RenameStmt *stmt);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1be27eec52..ccdda7561a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2005,6 +2005,7 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)
 	Relation	indexRelation;
 	HeapTuple	tuple;
 	bool		hasexprs;
+	bool		resetreplident;
 	LockRelId	heaprelid,
 indexrelid;
 	LOCKTAG		heaplocktag;
@@ -2048,6 +2049,15 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)
 	 */
 	CheckTableNotInUse(userIndexRelation, "DROP INDEX");
 
+	/*
+	 * Check if the REPLICA IDENTITY of the parent relation needs to be
+	 * reset.  This should be done only if the index to drop here is
+	 * marked as used in a replica identity.  We cannot rely on the
+	 * contents of pg_index for the index either as a concurrent drop
+	 * may have changed indisreplident already.
+	 */
+	resetreplident = userIndexRelation->rd_index->indisreplident;
+
 	/*
 	 * Drop Index Concurrently is more or less the reverse process of Create
 	 * Index Concurrently.
@@ -2159,10 +2169,32 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)
 
 		/* Finish invalidation of index and mark it as dead */
 		index_concurrently_set_dead(heapId, indexId);
+	}
 
+	/*
+	 * If the index to be dropped was marked as indisreplident (it could
+	 * have been updated when clearing indisvalid previously), reset
+	 * pg_class.relreplident for the parent table.  Note that this part
+	 * needs to be done in the same transaction as the one marking the
+	 * index as invalid, so as we never finish with the case where the
+	 * parent's replica identity is based on an index, without any index
+	 * marked with indisreplident.
+	 */
+	if (resetreplident)
+	{
+		SetRel

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-26 Thread Ashutosh Sharma
On Wed, Aug 26, 2020 at 9:19 PM Robert Haas  wrote:
>
> On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma  wrote:
> > Removed this note from the documentation and added a note saying: "The
> > user needs to ensure that they do not operate pg_force_freeze function
> > on a deleted tuple because it may revive the deleted tuple."
>
> I do not agree with that note, either. I believe that trying to tell
> people what things specifically they should do or avoid doing with the
> tool is the wrong approach. Instead, the thrust of the message should
> be to tell people that if you use this, it may corrupt your database,
> and that's your problem. The difficulty with telling people what
> specifically they ought to avoid doing is that experts will be annoyed
> to be told that something is not safe when they know that it is fine,
> and non-experts will think that some uses are safer than they really
> are.
>

Okay, point noted.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-26 Thread Thomas Munro
On Thu, Aug 27, 2020 at 6:15 AM Alvaro Herrera  wrote:
> > --4.90%--smgropen
> >   |--2.86%--ReadBufferWithoutRelcache
>
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Right, my hypothesis requires an uncacheably large buffer mapping
table, and I think smgropen() needs a different explanation because
it's not expected to be as large or as random, at least not with a
pgbench workload.  I think the reasons for a profile with a smgropen()
showing up so high, and in particular higher than BufTableLookup(),
must be:

1.  We call smgropen() twice for every call to BufTableLookup().  Once
in XLogReadBufferExtended(), and then again in
ReadBufferWithoutRelcache().
2.  We also call it for every block forced out of the buffer pool, and
in recovery that has to be done by the recovery loop.
3.  We also call it for every block in the buffer pool during the
end-of-recovery checkpoint.

Not sure but the last two might perform worse due to proximity to
interleaving pwrite() system calls (just a thought, not investigated).
In any case, I'm going to propose we move those things out of the
recovery loop, in a new thread.




RE: Implement UNLOGGED clause for COPY FROM

2020-08-26 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> So you want your users to shutdown and restart the server before Copy
> because that would be required if you want to change the wal_level.

Yes.  They seem to be fine with it, as far as I heard from a person who is 
involved in the system design.


> However, even if we do that, users who are running the server
> previously with wal_level as 'replica' won't be happy after doing this
> change. Because if they change the wal_level to 'none' for certain
> operations like bulk load and then again change back the mode to
> 'replica' they need to back up the database again to setup 'replica'
> as they can't continue replication from the previous point (consider
> update on a page for which previously WAL was not written).

Yes, it requires the database backup.  The database backup should be a daily 
task anyway, so I expect it wouldn't impose an extra maintenance burden on the 
user.  Plus, not all users use streaming replication for HA.  I think it's 
helpful for the maturing Postgres to provide some kind of solution for some 
contexts so that users can scratch their itches.


Regards
Takayuki Tsunakawa



Re: How is bushy plans generated in join_search_one_lev

2020-08-26 Thread Andy Fan
On Thu, Aug 27, 2020 at 8:05 AM Tom Lane  wrote:

> Andy Fan  writes:
> > I do see the README says we support bushy plans and I also see bushy
> > plans in real life (for example tpc-h Q20) like below. However I don't
> know
> > how it is generated with the algorithm in join_search_one_lev since it
> > always
> > make_rels_by_clause_join with joinrel[1] which is initial_rels which is
> > baserel.
>
> Hmm?  Bushy plans are created by the second loop in join_search_one_level,
> starting about line 150 in joinrels.c.
>
> regards, tom lane
>

Yes.. I missed the second loop:(:(:(

-- 
Best Regards
Andy Fan


Re: How is bushy plans generated in join_search_one_lev

2020-08-26 Thread Tom Lane
Andy Fan  writes:
> I do see the README says we support bushy plans and I also see bushy
> plans in real life (for example tpc-h Q20) like below. However I don't know
> how it is generated with the algorithm in join_search_one_lev since it
> always
> make_rels_by_clause_join with joinrel[1] which is initial_rels which is
> baserel.

Hmm?  Bushy plans are created by the second loop in join_search_one_level,
starting about line 150 in joinrels.c.

regards, tom lane
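
For illustration only (a toy, not planner code): the small standalone program
below prints which sub-level pairings a dynamic-programming join search
considers at each join level. The "(level-1 + 1)" pairing corresponds to the
first loop in join_search_one_level (an (N-1)-way rel joined to a single base
rel); the pairings where both sides are at least 2 correspond to the second
loop mentioned above, which is what produces bushy shapes.

#include <stdio.h>

int
main(void)
{
	int		levels = 6;

	for (int level = 2; level <= levels; level++)
	{
		/* first loop: join a (level-1)-way rel to a single base rel */
		printf("level %d: left/right-deep pairing (%d + 1)", level, level - 1);

		/* second loop: pair k-way with (level-k)-way rels, k >= 2 */
		for (int k = 2; k <= level - k; k++)
			printf(", bushy pairing (%d + %d)", k, level - k);

		printf("\n");
	}
	return 0;
}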




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-04, Robert Haas wrote:

> On Mon, Aug 3, 2020 at 7:49 PM Alvaro Herrera  
> wrote:
> > Why two transactions?  The reason is that in order for this to work, we
> > make a catalog change (mark it detached), and commit so that all
> > concurrent transactions can see the change.  A second transaction waits
> > for anybody who holds any lock on the partitioned table and grabs Access
> > Exclusive on the partition (which now no one cares about, if they're
> > looking at the partitioned table), where the DDL action on the partition
> > can be completed.
> 
> Is there a more detailed theory of operation of this patch somewhere?
> What exactly do you mean by marking it detached? Committing the change
> makes it possible for all concurrent transactions to see the change,
> but does not guarantee that they will. If you can't guarantee that,
> then I'm not sure how you can guarantee that they will behave sanely.

Sorry for the long delay.  I haven't written up the theory of operation.
I suppose it is complicated enough that it should be documented
somewhere.

To mark it detached means to set pg_inherits.inhdetached=true.  That
column name is a bit of a misnomer, since that column really means "is
in the process of being detached"; the pg_inherits row goes away
entirely once the detach process completes.  This mark guarantees that
everyone will see that row because the detaching session waits long
enough after committing the first transaction and before deleting the
pg_inherits row, until everyone else has moved on.

The other point is that the partition directory code can be asked to
include detached partitions, or not to.  The executor does the former,
and the planner does the latter.  This allows a transient period during
which the partition descriptor returned to planner and executor is
different; this makes the situation equivalent to what would have
happened if the partition was attached during the operation: in executor
we would detect that there is an additional partition that was not seen
by the planner, and we already know how to deal with that situation by
your handling of the ATTACH code.

> Even if you can, it's not clear what the sane behavior is: what
> happens when a tuple gets routed to an ex-partition? What happens when
> an ex-partition needs to be scanned? 

During the transient period, any transaction that obtained a partition
descriptor before the inhdetached mark is committed should be able to
get the tuple routing done successfully, but transactions using later
snapshots should get their insertions rejected.  Reads should behave in
the same way.

> What prevents problems if a partition is detached, possibly modified,
> and then reattached, possibly with different partition bounds?

This should not be a problem, because the partition needs to be fully
detached before it can be attached again.  And if the partition bounds
are different, that won't be a problem, because the previous partition
bounds won't be present in the pg_class row.  Of course, the new
partition bounds will be checked against the existing contents.

There is one fly in the ointment though, which is that if you cancel the
wait and then immediately do the ALTER TABLE DETACH FINALIZE without
waiting for as long as the original execution would have waited, you
might end up killing the partition ahead of time.  One solution to this
would be to cause the FINALIZE action to wait again at start.  This
would cause it to take even longer, but it would be safer.  (If you
don't want it to take longer, you can just not cancel it in the first
place.)  This is not a problem if the server crashes in between (which
is the scenario I had in mind when doing the FINALIZE thing), because of
course no transaction can continue to run across a crash.


I'm going to see if I can get the new delay_execution module to help
test this, to verify whether my claims are true.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Issue with past commit: Allow fractional input values for integer GUCs ...

2020-08-26 Thread Tom Lane
Robert Haas  writes:
> While the new behavior seems fine -- and indeed convenient -- for GUCs
> that are numeric with a unit, it does not seem very nice at all for
> GUCs that are unitless integers.

I find that distinction to be entirely without merit; not least because
we also have unitless float GUCs.  I think the fact that we have some
float and some integer GUCs is an implementation detail more than a
fundamental property --- especially since SQL considers integers
to be just the scale-zero subset of numerics.  I recognize that your
opinion is different, but to me it seems fine as-is.

regards, tom lane




How is bushy plans generated in join_search_one_lev

2020-08-26 Thread Andy Fan
Greeting.

I do see the README says we support bushy plans, and I also see bushy
plans in real life (for example tpc-h Q20) like below. However, I don't know
how they are generated by the algorithm in join_search_one_lev, since it always
calls make_rels_by_clause_join with joinrel[1], which is initial_rels, i.e.
the base rels.
Am I missing something?

===
 Sort
   Sort Key: supplier.s_name
   ->  Nested Loop Semi Join
 ->  Nested Loop
   Join Filter: (supplier.s_nationkey = nation.n_nationkey)
   ->  Index Scan using supplier_pkey on supplier
   ->  Materialize
 ->  Seq Scan on nation
   Filter: (n_name = 'KENYA'::bpchar)
 ->  Nested Loop
   ->  Index Scan using idx_partsupp_suppkey on partsupp
 Index Cond: (ps_suppkey = supplier.s_suppkey)
 Filter: ((ps_availqty)::numeric > (SubPlan 1))
 SubPlan 1
   ->  Aggregate
 ->  Index Scan using idx_lineitem_part_supp on
lineitem
   Index Cond: ((l_partkey =
partsupp.ps_partkey) AND (l_suppkey = partsupp.ps_suppkey))
   Filter: ((l_shipdate >= '01-JAN-97
00:00:00'::timestamp without time zone) AND (l_shipdate < '01-JAN-98
00:00:00'::timestamp without time zone))
   ->  Index Scan using part_pkey on part
 Index Cond: (p_partkey = partsupp.ps_partkey)
 Filter: ((p_name)::text ~~ 'lavender%'::text)
(21 rows)


-- 
Best Regards
Andy Fan


Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-08-26 Thread Bruce Momjian
On Wed, Aug 26, 2020 at 06:13:23PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 25 Aug 2020 22:52:44 -0400, Bruce Momjian  wrote in 
> > > Because we think we need any named value for every alternatives
> > > including the default value?
> > 
> > Well, not putting clientcert at all gives the default behavior, so why
> > have clientcert=no-verify?
> 
> clientcert=verify-ca or verify-full don't allow absence of client
> certificate. We need an option to allow the absence.

Isn't the option not specifying clientcert?  Here are some valid
pg_hba.conf lines:

	hostssl    all    all    127.0.0.1/32    trust    clientcert=verify-full
	hostssl    all    all    127.0.0.1/32    trust    clientcert=verify-ca
	hostssl    all    all    127.0.0.1/32    trust    clientcert=no-verify
	hostssl    all    all    127.0.0.1/32    trust

It is my understanding that the last two lines are the same.  Why isn't
it sufficient to just tell users not to specify clientcert if they want
the default behavior?  You can do:

	host    all    all    192.168.0.0/16    ident    map=omicron

but there is no way to specify the default map value of 'no map', so why
have one for clientcert?

> > Well, sslmode=prefer gives encryption without identification. 
> > clientcert=no-verify has no value because it is just an optional CA
> > check that has no value because optional authentication is useless.  It
> 
> The point of the option is not to do optional CA check if possible,
> but to allow absence of client cert. We need to have that mode
> regardless of named or not named, and I believe we usually provide a
> name for default mode.

Uh, see above --- not really.  The absence of the option is the default
action.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Issue with past commit: Allow fractional input values for integer GUCs ...

2020-08-26 Thread Robert Haas
On Wed, Aug 26, 2020 at 4:47 PM Tom Lane  wrote:
> regression=# create table itable (f1 int);
> CREATE TABLE
> regression=# insert into itable values (384.234);
> INSERT 0 1
> regression=# table itable;
>  f1
> -
>  384
> (1 row)
>
> It's always worked like that, and nobody's complained about it.
> I suspect, in fact, that one could find chapter and verse in the
> SQL spec that requires it, just like "numeric" values should get
> rounded if you write too many digits.

That is a bit different from what I had in mind, because it does not
involve a call to int4in(). Instead, it involves a cast. I was
imagining that you would put quotation marks around 384.234, which
does indeed fail:

ERROR:  invalid input syntax for type integer: "384.234"

So the question is whether users who supply values for integer-valued
reloptions, or integer-valued GUCs, expect that they will be parsed as
integers, or whether they expect that they will be parsed as float
values and then cast to integers.

While the new behavior seems fine -- and indeed convenient -- for GUCs
that are numeric with a unit, it does not seem very nice at all for
GUCs that are unitless integers. Why do you think that anyone would be
pleased to discover that they can set port = 543.2? We must decide
whether it is more likely that such a setting is the result of a user
error about which we should issue some complaint, or on the other hand
whether the user is hoping that we will be good enough to round the
value off so as to spare them the trouble. My own view is that the
former is vastly more probable than the latter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
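
For illustration only (a toy, not the GUC code): the contrast being discussed is
between strict integer parsing, which rejects a trailing fraction much as a cast
from a quoted string does, and parsing the value as a float and then rounding,
which is roughly what the relaxed behavior allows. The program and its names are
purely illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int
main(void)
{
	const char *val = "384.234";
	char	   *end;

	long		ival = strtol(val, &end, 10);	/* strict integer parse */

	if (*end != '\0')
		printf("strict integer parse rejects \"%s\" (stops at \"%s\")\n", val, end);
	else
		printf("strict integer parse: %ld\n", ival);

	double		dval = strtod(val, &end);		/* parse as float, then round */

	if (*end == '\0')
		printf("parse as float, then round: %.0f\n", rint(dval));
	return 0;
}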




Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-08-26 Thread Ibrar Ahmed
On Thu, Aug 27, 2020 at 2:14 AM Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> On 21.08.2020 19:43, Ibrar Ahmed wrote:
>
>
>
> On Wed, Aug 19, 2020 at 6:15 PM Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> On 18.08.2020 02:54, Alvaro Herrera wrote:
>> > On 2020-Aug-14, Ibrar Ahmed wrote:
>> >
>> >> The table used for the test contains three columns (integer, text,
>> >> varchar).
>> >> The total number of rows is 1000 in total.
>> >>
>> >> Unpatched (Master: 92c12e46d5f1e25fc85608a6d6a19b8f5ea02600)
>> >> COPY: 9069.432 ms vacuum; 2567.961ms
>> >> COPY: 9004.533 ms vacuum: 2553.075ms
>> >> COPY: 8832.422 ms vacuum: 2540.742ms
>> >>
>> >> Patched (Master: 92c12e46d5f1e25fc85608a6d6a19b8f5ea02600)
>> >> COPY: 10031.723 ms vacuum: 127.524 ms
>> >> COPY: 9985.109  ms vacuum: 39.953 ms
>> >> COPY: 9283.373  ms vacuum: 37.137 ms
>> >>
>> >> Time to take the copy slightly increased but the vacuum time
>> significantly
>> >> decrease.
>> > "Slightly"?  It seems quite a large performance drop to me -- more than
>> > 10%.  Where is that time being spent?  Andres said in [1] that he
>> > thought the performance shouldn't be affected noticeably, but this
>> > doesn't seem to hold true.  As I understand, the idea was that there
>> > would be little or no additional WAL records .. only flags in the
>> > existing record.  So what is happening?
>> >
>> > [1]
>> https://postgr.es/m/20190408010427.4l63qr7h2fjcy...@alap3.anarazel.de
>>
>> I agree that 10% performance drop is not what we expect with this patch.
>> Ibrar, can you share more info about your tests? I'd like to reproduce
>> this slowdown and fix it, if necessary.
>>
>>
> Here is my test;
>
>
> postgres=# BEGIN;
>
> BEGIN
>
>
> postgres=*# TRUNCATE foo;
>
> TRUNCATE TABLE
>
>
> postgres=*# COPY foo(id, name, address) FROM '/home/ibrar/bar.csv'
> DELIMITER ',' FREEZE;
>
> COPY 1000
>
>
>
> --
> Ibrar Ahmed
>
>
> I've repeated the test and didn't notice any slowdown for COPY FREEZE.
> Test data is here [1].
>
> The numbers do fluctuate a bit, but there is no dramatic difference
> between master and patched version. So I assume that the performance drop
> in your test has something to do with measurement error, unless you
> have some non-default configuration that could affect it.
>
> patched:
>
> COPY: 12327,090 ms vacuum: 37,555 ms
> COPY: 12939,540 ms vacuum: 35,703 ms
> COPY: 12245,819 ms vacuum: 36,273 ms
>
> master:
> COPY
> COPY: 13253,605 ms vacuum: 3592,849 ms
> COPY: 12619,428 ms vacuum: 4253,836 ms
> COPY: 12512,940 ms vacuum: 4009,847 ms
>
> I also slightly cleaned up comments, so the new version of the patch is
> attached. As this is just a performance optimization, documentation is not
> needed. It would be great if other reviewers could run some independent
> performance tests, as I believe that this patch is ready for committer.
>
> [1] https://drive.google.com/file/d/11r19NX6yyPjvxdDub8Ce-kmApRurp4Nx/view
>
> --
> Anastasia Lubennikova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
I gave it another try with the latest v3 patch on the latest master branch
(ff60394a8c9a7af8b32de420ccb54a20a0f019c1) with all default settings.
11824.495 ms is the median with master and 11884.089 ms is the median with the patch.


Note: there are two changes from the previous run: (1) the v3 patch is used, and
(2) the test is now done on the latest master (ff60394a8c9a7af8b32de420ccb54a20a0f019c1).


Master (ff60394a8c9a7af8b32de420ccb54a20a0f019c1)

postgres=# \timing

postgres=# BEGIN;

postgres=*# TRUNCATE  foo;

postgres=*# COPY foo(id, name, address) FROM '/Users/ibrar/bar.csv'
DELIMITER ',' FREEZE;

Time: 11824.495 ms (00:11.824)

postgres=*# COMMIT;


Restart


postgres=# \timing

postgres=# BEGIN;

postgres=*# TRUNCATE  foo;

postgres=*# COPY foo(id, name, address) FROM '/Users/ibrar/bar.csv'
DELIMITER ',' FREEZE;

Time: 14096.987 ms (00:14.097)

postgres=*# commit;


Restart


postgres=# \timing

postgres=# BEGIN;

postgres=*# TRUNCATE  foo;

postgres=*# COPY foo(id, name, address) FROM '/Users/ibrar/bar.csv'
DELIMITER ',' FREEZE;

Time: 11108.289 ms (00:11.108)

postgres=*# commit;



Patched (ff60394a8c9a7af8b32de420ccb54a20a0f019c1)

postgres=# \timing

postgres=# BEGIN;

postgres=*# TRUNCATE  foo;

postgres=*# COPY foo(id, name, address) FROM '/Users/ibrar/bar.csv'
DELIMITER ',' FREEZE;

Time: 10749.945 ms (00:10.750)

postgres=*# commit;


Restart


postgres=# \timing

postgres=# BEGIN;

postgres=*# TRUNCATE  foo;

postgres=*# COPY foo(id, name, address) FROM '/Users/ibrar/bar.csv'
DELIMITER ',' FREEZE;

Time: 14274.361 ms (00:14.274)

postgres=*# commit;


Restart


postgres=# \timing

postgres=# BEGIN;

postgres=*# TRUNCATE  foo;

postgres=*# COPY foo(id, name, address) FROM '/Users/ibrar/bar.csv'
DELIMITER ',' FREEZE;

Time: 11884.089 ms (00:11.884)

postgres=*# commit;


-- 
Ibrar Ahmed


Re: [PATCH] Covering SPGiST index

2020-08-26 Thread Anastasia Lubennikova

On 24.08.2020 16:19, Pavel Borisov wrote:


I added some extra comments and mentions in manual to make all the 
things clear (see v7 patch)


The patch implements the proposed functionality, passes tests, and in 
general looks good to me.
I don't mind using a macro to differentiate tuples with and without 
included attributes. Any approach will require code changes; I don't 
have a strong opinion about that, though.


A bit of nitpicking:

1) You mention backward compatibility in some comments. But after this 
patch is committed, it will be hard to tell what "new" and "old" refer to, 
so I suggest rephrasing them.  You can either refer to a specific version 
or just call it "compatibility with indexes without included attributes".


2) The SpgLeafSize() function name seems misleading, as it actually refers 
to a tuple's size, not a leaf page. I suggest renaming it to 
SpgLeafTupleSize().


3) I didn't quite get the meaning of the assertion that is added in a 
few places:

 Assert(so->state.includeTupdesc->natts);
Should it be Assert(so->state.includeTupdesc->natts > 1) ?

4) There are a few typos in comments and docs:
s/colums/columns
s/include attribute/included attribute

and so on.

5) This comment in index_including.sql is outdated:
  * 7. Check various AMs. All but btree and gist must fail.

6) New test lacks SET enable_seqscan TO off;
in addition to SET enable_bitmapscan TO off;

I also wonder why both index_including_spgist.sql and 
index_including.sql tests are stable without running 'vacuum analyze' 
before the EXPLAIN that shows an Index Only Scan plan. Is autovacuum just 
always fast enough to fill the visibility map, or am I missing something?



--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: CREATE RULE may generate duplicate entries in pg_depend

2020-08-26 Thread Tom Lane
Ashwin Agrawal  writes:
> If action and qual reference the same object in CREATE RULE, it results in
> creating duplicate entries in pg_depend for it. Doesn't pose any harm, just
> unnecessarily bloats pg_depend.

Yeah, we generally don't try that hard to prevent duplicate pg_depend
entries.  It's relatively easy to get rid of them in limited contexts
like a single expression, but over a wider scope, I doubt it's worth
the trouble.

regards, tom lane




CREATE RULE may generate duplicate entries in pg_depend

2020-08-26 Thread Ashwin Agrawal
If action and qual reference the same object in CREATE RULE, it results in
creating duplicate entries in pg_depend for it. This doesn't pose any harm, it
just unnecessarily bloats pg_depend. See InsertRule(). I think we should be
able to avoid adding the duplicate entries.

I don't know if this behaviour was discussed earlier; I didn't find it in a
search.
We accidentally encountered it while enhancing a catalog check tool for
Greenplum Database.

For example (from rules test):
create table rtest_t5 (a int4, b text);
create table rtest_t7 (a int4, b text);

create rule rtest_t5_ins as on insert to rtest_t5
where new.a > 15 do
insert into rtest_t7 values (new.a, new.b);

# select classid::regclass, refobjid::regclass,* from pg_depend where
refobjid='rtest_t5'::regclass and deptype = 'n';
  classid   | refobjid | classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
------------+----------+---------+-------+----------+------------+----------+-------------+---------
 pg_rewrite | rtest_t5 |    2618 | 16457 |        0 |       1259 |    16445 |           1 | n
 pg_rewrite | rtest_t5 |    2618 | 16457 |        0 |       1259 |    16445 |           1 | n
 pg_rewrite | rtest_t5 |    2618 | 16457 |        0 |       1259 |    16445 |           2 | n
(3 rows)


-- 
*Ashwin Agrawal (VMware)*


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-08-26 Thread Anastasia Lubennikova

On 21.08.2020 19:43, Ibrar Ahmed wrote:



On Wed, Aug 19, 2020 at 6:15 PM Anastasia Lubennikova 
mailto:a.lubennik...@postgrespro.ru>> 
wrote:


On 18.08.2020 02:54, Alvaro Herrera wrote:
> On 2020-Aug-14, Ibrar Ahmed wrote:
>
>> The table used for the test contains three columns (integer, text,
>> varchar).
>> The total number of rows is 1000 in total.
>>
>> Unpatched (Master: 92c12e46d5f1e25fc85608a6d6a19b8f5ea02600)
>> COPY: 9069.432 ms vacuum; 2567.961ms
>> COPY: 9004.533 ms vacuum: 2553.075ms
>> COPY: 8832.422 ms vacuum: 2540.742ms
>>
>> Patched (Master: 92c12e46d5f1e25fc85608a6d6a19b8f5ea02600)
>> COPY: 10031.723 ms vacuum: 127.524 ms
>> COPY: 9985.109  ms vacuum: 39.953 ms
>> COPY: 9283.373  ms vacuum: 37.137 ms
>>
>> Time to take the copy slightly increased but the vacuum time
significantly
>> decrease.
> "Slightly"?  It seems quite a large performance drop to me --
more than
> 10%.  Where is that time being spent?  Andres said in [1] that he
> thought the performance shouldn't be affected noticeably, but this
> doesn't seem to hold true.  As I understand, the idea was that there
> would be little or no additional WAL records .. only flags in the
> existing record.  So what is happening?
>
> [1]
https://postgr.es/m/20190408010427.4l63qr7h2fjcy...@alap3.anarazel.de

I agree that 10% performance drop is not what we expect with this
patch.
Ibrar, can you share more info about your tests? I'd like to
reproduce
this slowdown and fix it, if necessary.


Here is my test;

postgres=# BEGIN;

BEGIN


postgres=*# TRUNCATE foo;

TRUNCATE TABLE


postgres=*# COPY foo(id, name, address) FROM '/home/ibrar/bar.csv' 
DELIMITER ',' FREEZE;


COPY 1000



--
Ibrar Ahmed



I've repeated the test and didn't notice any slowdown for COPY FREEZE.
Test data is here [1].

The numbers do fluctuate a bit, but there is no dramatic difference 
between master and the patched version. So I assume that the performance 
drop in your test has something to do with measurement error, 
unless you have some non-default configuration that could affect it.


patched:

COPY: 12327,090 ms vacuum: 37,555 ms
COPY: 12939,540 ms vacuum: 35,703 ms
COPY: 12245,819 ms vacuum: 36,273 ms

master:
COPY
COPY: 13253,605 ms vacuum: 3592,849 ms
COPY: 12619,428 ms vacuum: 4253,836 ms
COPY: 12512,940 ms vacuum: 4009,847 ms

I also slightly cleaned up the comments, so a new version of the patch is 
attached. As this is just a performance optimization, no documentation 
changes are needed. It would be great if other reviewers could run some 
independent performance tests, as I believe that this patch is ready for 
committer.


[1] https://drive.google.com/file/d/11r19NX6yyPjvxdDub8Ce-kmApRurp4Nx/view

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index ca4b6e186b..0017e3415c 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -179,6 +179,69 @@ select pg_truncate_visibility_map('test_partition');
  
 (1 row)
 
+-- test copy freeze
+create table copyfreeze (a int, b char(1500));
+-- load all rows via COPY FREEZE and ensure that all pages are set all-visible
+-- and all-frozen.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | t   | t
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- load half the rows via regular COPY and rest via COPY FREEZE. The pages
+-- which are touched by regular COPY must not be set all-visible/all-frozen. On
+-- the other hand, pages allocated by COPY FREEZE should be marked
+-- all-frozen/all-visible.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | f   | f
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- Try a mix of regular COPY and COPY FREEZE.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
 -- cleanup
 drop table test_partitioned;
 drop view 

Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2020-08-26 Thread Bossart, Nathan
On 8/26/20, 12:16 PM, "Alvaro Herrera"  wrote:
> On 2020-Aug-20, Jeremy Schneider wrote:
>> Alternatively, if we don't want to take this approach, then I'd argue
>> that we should update README.tuplock to explicitly state that
>> XMAX_LOCK_ONLY and XMAX_COMMITTED are incompatible (just as it already
>> states for HEAP_XMAX_IS_MULTI and HEAP_XMAX_COMMITTED) and clean up the
>> code in heapam_visibility.c for consistency.
>
> Yeah, I like this approach better for the master branch; not just clean
> up as in remove the cases that handle it, but also actively elog(ERROR)
> if the condition ever occurs (hopefully with some known way to fix the
> problem; maybe by "WITH tup AS (DELETE FROM tab WHERE .. RETURNING *)
> INSERT * INTO tab FROM tup" or similar.)

+1.  I wouldn't mind picking this up, but it might be some time before
I can get to it.

Nathan



Re: Issue with past commit: Allow fractional input values for integer GUCs ...

2020-08-26 Thread Tom Lane
Robert Haas  writes:
> I don't think any of these cases should be allowed. Surely if we
> allowed 384.234 to be inserted into an integer column, everyone would
> say that we'd lost our minds.

regression=# create table itable (f1 int);
CREATE TABLE
regression=# insert into itable values (384.234);
INSERT 0 1
regression=# table itable;
 f1  
-
 384
(1 row)

It's always worked like that, and nobody's complained about it.
I suspect, in fact, that one could find chapter and verse in the
SQL spec that requires it, just like "numeric" values should get
rounded if you write too many digits.
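
For comparison, a quick illustration of the numeric rounding alluded to
above, using a hypothetical table:

create table ntable (f1 numeric(5,2));
insert into ntable values (384.23456);
table ntable;   -- returns 384.23: the extra digits are rounded away, not rejected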

regards, tom lane




Re: Issue with past commit: Allow fractional input values for integer GUCs ...

2020-08-26 Thread Robert Haas
On Mon, Aug 24, 2020 at 5:32 AM Greg Nancarrow  wrote:
> For example,
>
> log_file_mode = 384.234
> max_connections = 1.0067e2
> port = 5432.123
> CREATE TABLE ... WITH (fillfactor = 23.45);
> CREATE TABLE ... WITH (parallel_workers = 5.4);

I don't think any of these cases should be allowed. Surely if we
allowed 384.234 to be inserted into an integer column, everyone would
say that we'd lost our minds. These cases seem no different. The
discussion to which the commit links is mainly about allowing 0.2s to
work like 200ms, or something of that sort, when the value is
specified as a fraction but works out to an integer when converted to
the base unit. That is a completely different thing from letting
people configure 5.4 parallel workers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-20, Jeremy Schneider wrote:

> While working with Nathan Bossart on an extension, we came across an
> interesting quirk and possible inconsistency in the PostgreSQL code
> around infomask flags.  I'd like to know if there's something I'm
> misunderstanding here or if this really is a correctness/robustness gap
> in the code.
> 
> At the root of it is the relationship between the XMAX_LOCK_ONLY and
> XMAX_COMMITTED infomask bits.

I spent a lot of time wondering about XMAX_COMMITTED.  It seemed to me
that it would make no sense to have xacts that were lock-only yet have
XMAX_COMMITTED set; so I looked hard to make sure no place would ever
cause such a situation to arise.  However, I still made my best effort
to make the code cope with such a combination correctly if it ever
showed up.

And I may have missed assumptions such as this one, even if they seem
obvious in retrospect, such as in compute_new_xmax_infomask (which, as
you'll notice, changed considerably from what was initially committed):

> But then there's one place in heapam.c where an assumption appears that
> this state will never happen - the compute_new_xmax_infomask() function:
> 
> https://github.com/postgres/postgres/blob/9168793d7275b4b318c153d607fba55d14098c19/src/backend/access/heap/heapam.c#L4800
> 
> else if (old_infomask & HEAP_XMAX_COMMITTED)
> {
> ...
> if (old_infomask2 & HEAP_KEYS_UPDATED)
> status = MultiXactStatusUpdate;
> else
> status = MultiXactStatusNoKeyUpdate;
> new_status = get_mxact_status_for_lock(mode, is_update);
> ...
> new_xmax = MultiXactIdCreate(xmax, status, add_to_xmax, new_status);
> 
> When that code sees XMAX_COMMITTED, it assumes the xmax can't possibly
> be LOCK_ONLY and it sets the status to something sufficiently high
> enough to guarantee that ISUPDATE_from_mxstatus() returns true. That
> means that when you try to update this tuple and
> compute_new_xmax_infomask() is called with an "update" status, two
> "update" statuses are sent to MultiXactIdCreate() which then fails
> (thank goodness) with the error "new multixact has more than one
> updating member".
> 
> https://github.com/postgres/postgres/blob/cd5e82256de5895595cdd99ecb03aea15b346f71/src/backend/access/transam/multixact.c#L784

Have you ever observed this error case hit?  I've never seen a report of
that.

> The UpdateXmaxHintBits() code to always set the INVALID bit wasn't in
> any patches on the mailing list but it was committed and it seems to
> have worked very well. As a result it seems nearly impossible to get
> into the state where you have both XMAX_LOCK_ONLY and XMAX_COMMITTED
> bits set; thus it seems we've avoided problems in
> compute_new_xmax_infomask().

Phew.

(I guess I should fully expect that somebody, given sufficient time,
would carefully inspect the committed code against submitted patches ...
especially given that I do such inspections myself.)

> But nonetheless it seems worth making the code more robust by having the
> compute_new_xmax_infomask() function handle this state correctly just as
> the visibility code does. It should only require a simple patch like
> this (credit to Nathan Bossart):
> 
> diff --git a/src/backend/access/heap/heapam.c
> b/src/backend/access/heap/heapam.c
> index d881f4cd46..371e3e4f61 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -4695,7 +4695,9 @@ compute_new_xmax_infomask(TransactionId xmax,
> uint16 old_infomask,
>  l5:
> new_infomask = 0;
> new_infomask2 = 0;
> -   if (old_infomask & HEAP_XMAX_INVALID)
> +   if (old_infomask & HEAP_XMAX_INVALID ||
> +   (old_infomask & HEAP_XMAX_COMMITTED &&
> +HEAP_XMAX_IS_LOCKED_ONLY(old_infomask)))
> {
> /*
>  * No previous locker; we just insert our own TransactionId.

We could do this in stable branches, if there were any reports that
this inconsistency is happening in real world databases.

> Alternatively, if we don't want to take this approach, then I'd argue
> that we should update README.tuplock to explicitly state that
> XMAX_LOCK_ONLY and XMAX_COMMITTED are incompatible (just as it already
> states for HEAP_XMAX_IS_MULTI and HEAP_XMAX_COMMITTED) and clean up the
> code in heapam_visibility.c for consistency.

Yeah, I like this approach better for the master branch; not just clean
up as in remove the cases that handle it, but also actively elog(ERROR)
if the condition ever occurs (hopefully with some known way to fix the
problem; maybe by "WITH tup AS (DELETE FROM tab WHERE .. RETURNING *)
INSERT * INTO tab FROM tup" or similar.)

> Might be worth adding a note to README.tuplock either way about
> valid/invalid combinations of infomask flags. Might help avoid some
> confusion as people approach the code base.

Absolutely.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-25, Jakub Wartak wrote:

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make
> any difference, so if anyone is curious the next sets of append-only
> bottlenecks is like below:
> 
> 14.69%  postgres  postgres[.] hash_search_with_hash_value
> ---hash_search_with_hash_value
>|--9.80%--BufTableLookup
>|  ReadBuffer_common
>|  ReadBufferWithoutRelcache
>|  XLogReadBufferExtended
>|  XLogReadBufferForRedoExtended
>|  |--7.76%--btree_xlog_insert
>|  |  btree_redo
>|  |  StartupXLOG
>|   --1.63%--heap_xlog_insert
> --4.90%--smgropen
>   |--2.86%--ReadBufferWithoutRelcache

Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
sure now that that'll pay off as clearly as I had hoped.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-25, Andres Freund wrote:

> Hi,
> 
> On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > > --12.51%--compactify_tuples
> > >   PageRepairFragmentation
> > >   heap2_redo
> > >   StartupXLOG
> > 
> > I wonder if there is something higher level that could be done to
> > reduce the amount of compaction work required in the first place, but
> > in the meantime I'm very happy if we can improve the situation so much
> > with such a microscopic improvement that might eventually benefit
> > other sorting stuff...
> 
> Another approach could be to not perform any sorting during recovery,
> instead including enough information in the WAL record to avoid doing a
> full blown PageRepairFragmentation during recovery.

Hmm, including the sorted ItemId array in the WAL record might make
sense to alleviate the qsort work ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: factorial function/phase out postfix operators?

2020-08-26 Thread Robert Haas
On Wed, Aug 26, 2020 at 11:57 AM Mark Dilger
 wrote:
> I wonder if we can get more comments for or against this patch, at least in 
> principle, in the very near future, to help determine whether the deprecation 
> notices should go into v13?

Speaking of that, has somebody written a specific patch for that?
Like, exactly what are we proposing that this deprecation warning is
going to say?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-26 Thread Jeff Janes
On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila  wrote:


>  I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.
>


I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c
line 288 needs to be:

bool        found PG_USED_FOR_ASSERTS_ONLY = false;

Cheers,

Jeff


Re: SyncRepLock acquired exclusively in default configuration

2020-08-26 Thread Fujii Masao



On 2020/08/20 11:29, Kyotaro Horiguchi wrote:

At Wed, 19 Aug 2020 13:20:46 +, Asim Praveen  wrote in




On 12-Aug-2020, at 12:02 PM, Masahiko Sawada  
wrote:

On Wed, 12 Aug 2020 at 14:06, Asim Praveen  wrote:





On 11-Aug-2020, at 8:57 PM, Robert Haas  wrote:

I think this gets to the root of the issue. If we check the flag
without a lock, we might see a slightly stale value. But, considering
that there's no particular amount of time within which configuration
changes are guaranteed to take effect, maybe that's OK. However, there
is one potential gotcha here: if the walsender declares the standby to
be synchronous, a user can see that, right? So maybe there's this
problem: a user sees that the standby is synchronous and expects a
transaction committing afterward to provoke a wait, but really it
doesn't. Now the user is unhappy, feeling that the system didn't
perform according to expectations.


Yes, pg_stat_replication reports a standby in sync as soon as walsender updates 
priority of the standby to something other than 0.

The potential gotcha referred above doesn’t seem too severe.  What is the 
likelihood of someone setting synchronous_standby_names GUC with either “*” or 
a standby name and then immediately promoting that standby?  If the standby is 
promoted before the checkpointer on master gets a chance to update 
sync_standbys_defined in shared memory, commits made during this interval on 
master may not make it to standby.  Upon promotion, those commits may be lost.


I think that if the standby is quite behind the primary and in case of
the primary crashes, the likelihood of losing commits might get
higher. The user can see the standby became synchronous standby via
pg_stat_replication but commit completes without a wait because the
checkpointer doesn't update sync_standbys_defined yet. If the primary
crashes before standby catching up and the user does failover, the
committed transaction will be lost, even though the user expects that
transaction commit has been replicated to the standby synchronously.
And this can happen even without the patch, right?



It is correct that the issue is orthogonal to the patch upthread and I’ve marked
the commitfest entry as ready-for-committer.


Yes, thanks for the review!



I find the name of SyncStandbysDefined macro is very confusing with
the struct member sync_standbys_defined, but that might be another
issue..

-* Fast exit if user has not requested sync replication.
+* Fast exit if user has not requested sync replication, or there are no
+* sync replication standby names defined.

This comment sounds like we just do that twice. The reason for the
check is to avoid wasteful exclusive locks on SyncRepLock, or to form
double-checked locking on the variable. I think we should explain that
here.


I added the following comments based on the suggestion by Sawada-san upthread. 
Thought?

+* Since this routine gets called every commit time, it's important to
+* exit quickly if sync replication is not requested. So we check
+* WalSndCtl->sync_standbys_define without the lock and exit
+* immediately if it's false. If it's true, we check it again later
+* while holding the lock, to avoid the race condition described
+* in SyncRepUpdateSyncStandbysDefined().


Attached is the updated version of the patch. I didn't change how to
fix the issue. But I changed the check for fast exit so that it's called
before setting the "mode", to avoid a few cycle.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/syncrep.c 
b/src/backend/replication/syncrep.c
index df1e341c76..c30ebaa0a6 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -158,18 +158,27 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 */
Assert(InterruptHoldoffCount > 0);
 
+   /*
+* Fast exit if user has not requested sync replication, or there are no
+* sync replication standby names defined.
+*
+* Since this routine gets called every commit time, it's important to
+* exit quickly if sync replication is not requested. So we check
+* WalSndCtl->sync_standbys_define without the lock and exit
+* immediately if it's false. If it's true, we check it again later
+* while holding the lock, to avoid the race condition described
+* in SyncRepUpdateSyncStandbysDefined().
+*/
+   if (!SyncRepRequested() ||
+   !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+   return;
+
/* Cap the level for anything other than commit to remote flush only. */
if (commit)
mode = SyncRepWaitMode;
else
mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
 
-   /*
-* Fast exit if user has not requested sync r

Re: [POC]Enable tuple change partition caused by BEFORE TRIGGER

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-21, Ashutosh Bapat wrote:

> On Fri, Aug 21, 2020 at 1:28 PM movead...@highgo.ca  
> wrote:

> > In current BEFORE TRIGGER implementation, it reports an error once a
> > trigger result out of current partition, but I think it should check
> > it after finish all triggers call, and you can see the discussion in
> > [1][2]. In the attached patch I have changed this rule, I check the
> > partition constraint only once after all BEFORE TRIGGERS have been
> > called. If you do not agree with this way, I can change the
> > implementation.
> 
> I think this change may be good irrespective of the row movement change.

Yeah, it makes sense to delay the complaint about partition movement
until all triggers have been executed ... although that makes it harder
to report *which* trigger caused the problem.  (It seems pretty bad that
the error message that you're changing is not covered in regression
tests -- mea culpa.)

> > And another point is that when inserting to new partition caused by
> > BEFORE TRIGGER, then it will not trigger the BEFORE TRIGGER on a new
> > partition. I think it's the right way, what's more, I think the
> > UPDATE approach cause partition change should not trigger the BEFORE
> > TRIGGERS too, you can see discussed on [1].
> 
> That looks dubious to me.

Yeah ...

> Before row triggers may be used in several different ways, for
> auditing, for verification of inserted data or to change some other
> data based on this change and so on.

Admittedly, these things should be done by AFTER triggers, not BEFORE
triggers, precisely because you want to do them with the final form of
each row -- not a form of the row that could still be changed by some
hypothetical BEFORE trigger that will fire next.

What this is saying to me is that we'd need to make sure to run the
final target partition's AFTER triggers, not the original target
partition.  But I'm not 100% about running the BEFORE triggers.  Maybe
one way to address this is to check whether the BEFORE triggers in the
new target partition are clones; if so then they would have run in the
original target partition and so must not be run, but otherwise they
have to.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: LWLockAcquire and LockBuffer mode argument

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-26, Robert Haas wrote:

> On Wed, Aug 26, 2020 at 7:47 AM Dagfinn Ilmari Mannsåker
>  wrote:
> > Would it be possible to make the compat versions only available when
> > building extensions, but not to core code?
> 
> I think that would be good if we can do it. We could even have it
> inside #ifdef LWLOCK_API_COMPAT, and extension authors who want the
> compatibility interface can define that.

We had ENABLE_LIST_COMPAT available for 16 years; see commits
d0b4399d81f3, 72b6ad631338, 1cff1b95ab6d.  We could do the same here.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Compiler warning

2020-08-26 Thread Tom Lane
Bruce Momjian  writes:
> I see a compiler warning on git master:
>    sharedfileset.c:288:8: warning: variable ‘found’ set but not used 
> [-Wunused-but-set-variable]

Could get rid of the variable entirely: change the "break" to "return"
and then the final Assert can be "Assert(false)".

regards, tom lane




Compiler warning

2020-08-26 Thread Bruce Momjian
I see a compiler warning on git master:

   sharedfileset.c:288:8: warning: variable ‘found’ set but not used 
[-Wunused-but-set-variable]

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-26 Thread Robert Haas
On Wed, Aug 26, 2020 at 7:36 AM Ashutosh Sharma  wrote:
> Removed this note from the documentation and added a note saying: "The
> user needs to ensure that they do not operate pg_force_freeze function
> on a deleted tuple because it may revive the deleted tuple."

I do not agree with that note, either. I believe that trying to tell
people what things specifically they should do or avoid doing with the
tool is the wrong approach. Instead, the thrust of the message should
be to tell people that if you use this, it may corrupt your database,
and that's your problem. The difficulty with telling people what
specifically they ought to avoid doing is that experts will be annoyed
to be told that something is not safe when they know that it is fine,
and non-experts will think that some uses are safer than they really
are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: LWLockAcquire and LockBuffer mode argument

2020-08-26 Thread Robert Haas
On Wed, Aug 26, 2020 at 7:47 AM Dagfinn Ilmari Mannsåker
 wrote:
> Would it be possible to make the compat versions only available when
> building extensions, but not to core code?

I think that would be good if we can do it. We could even have it
inside #ifdef LWLOCK_API_COMPAT, and extension authors who want the
compatibility interface can define that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: poc - possibility to write window function in PL languages

2020-08-26 Thread Pavel Stehule
Hi

I simplified access to the results of winfuncarg functions by introducing a
proxy type "typedvalue". This type can hold any Datum value, and it allows a
fast cast to basic built-in types, or it can use (slower) generic cast
functions. It is used in cooperation with a plpgsql assignment statement that
can choose the correct cast implicitly. When the winfuncarg function returns a
value of the same type as is expected by the variable on the left side of the
assignment statement, then (for basic types) the value is just copied without
casts. With this proxy type it is not necessary to have a special statement
for assigning the value returned from winfuncarg functions, so the source code
of a window function in plpgsql looks intuitive to me.

Example - implementation of "lag" function in plpgsql

create or replace function pl_lag(numeric)
returns numeric as $$
declare v numeric;
begin
  v := get_input_value_in_partition(windowobject, 1, -1, 'seek_current',
false);
  return v;
end;
$$ language plpgsql window;
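
A hedged usage sketch for the pl_lag example above (it assumes the patch's
plpgsql window-function support is installed; table and column names are made
up):

create table w_demo (v numeric);
insert into w_demo values (1), (2), (3);
-- behaves like the built-in lag(): prev_v is NULL, 1, 2
select v, pl_lag(v) over (order by v) as prev_v from w_demo;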

I think this code is usable, and I have registered this patch in the commitfest.

Regards

Pavel
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ba5a23ac25..fdc364c05e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1418,6 +1418,18 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE FUNCTION
+  get_partition_context_value(windowobjectproxy, anyelement, int4 DEFAULT NULL)
+  RETURNS anyelement
+LANGUAGE internal
+AS 'windowobject_get_partition_context_value';
+
+CREATE OR REPLACE FUNCTION
+  set_partition_context_value(windowobjectproxy, anyelement, int4 DEFAULT NULL)
+  RETURNS void
+LANGUAGE internal
+AS 'windowobject_set_partition_context_value';
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 5d2aca8cfe..1974cfd7e6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -101,6 +101,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	typedvalue.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/pseudotypes.c b/src/backend/utils/adt/pseudotypes.c
index 3d6b2f9093..ebb2a16572 100644
--- a/src/backend/utils/adt/pseudotypes.c
+++ b/src/backend/utils/adt/pseudotypes.c
@@ -334,6 +334,17 @@ pg_node_tree_send(PG_FUNCTION_ARGS)
 PSEUDOTYPE_DUMMY_IO_FUNCS(pg_ddl_command);
 PSEUDOTYPE_DUMMY_BINARY_IO_FUNCS(pg_ddl_command);
 
+/*
+ * windowobjectproxy
+ *
+ * This type is pointer to WindowObjectProxyData. It is communication
+ * mechanism between PL environment and WinFuncArgs functions. Due
+ * performance reason I prefer using indirect result processing against
+ * using function returning polymorphic composite value. The indirect
+ * mechanism is implemented with proxy object represented by type
+ * WindowObjectProxyData.
+ */
+PSEUDOTYPE_DUMMY_IO_FUNCS(windowobjectproxy);
 
 /*
  * Dummy I/O functions for various other pseudotypes.
diff --git a/src/backend/utils/adt/typedvalue.c b/src/backend/utils/adt/typedvalue.c
new file mode 100644
index 00..d2d5dc34c0
--- /dev/null
+++ b/src/backend/utils/adt/typedvalue.c
@@ -0,0 +1,630 @@
+/*-
+ *
+ * typedvalue.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/typedvalue.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "utils/builtins.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/numeric.h"
+#include "utils/typedvalue.h"
+#include "fmgr.h"
+
+
+static Datum
+TypedValueGetDatum(TypedValue tv)
+{
+	if (tv->typbyval)
+		return *((Datum *) tv->data);
+	else
+		return PointerGetDatum(tv->data);
+}
+
+Datum
+typedvalue_in(PG_FUNCTION_ARGS)
+{
+	char	   *str =  PG_GETARG_CSTRING(0);
+	int			len;
+	Size		size;
+	TypedValue	tv;
+	text	   *txt;
+
+	len = strlen(str);
+
+	size = MAXALIGN(offsetof(TypedValueData, data) + len + VARHDRSZ);
+
+	tv = (TypedValue) palloc(size);
+	SET_VARSIZE(tv, size);
+
+	txt = (text *) tv->data;
+
+	SET_VARSIZE(txt, VARHDRSZ + len);
+	memcpy(VARDATA(txt), str, len);
+
+	tv->typid = TEXTOID;
+	tv->typbyval = false;
+	tv->typlen = -1;
+
+	PG_RETURN_POINTER(tv);
+}
+
+Datum
+typedvalue_out(PG_FUNCTION_ARGS)
+{
+	Oid			typOutput;
+	bool		isVarlena;
+	char	   *str;
+	TypedValue	tv;
+	Datum		value;
+
+	tv = (TypedValue) PG_GETARG_POINTER(0);
+
+	getTypeOutputInfo(tv->typid, &typOutput, &isVarlena);
+
+	value = TypedValueGetDatum(tv);
+
+	str = OidOutputFunctionCall(typOutput, value);
+
+	PG_RETURN_CSTRING(str);
+}
+
+Datum
+makeTypedValue(Datum value, Oid typid, int16 typlen, bool typbyval)
+{
+	TypedValue

Re: renaming configure.in to configure.ac

2020-08-26 Thread Peter Eisentraut

On 2020-08-25 18:44, Tom Lane wrote:

Peter Eisentraut  writes:

On 2020-07-24 15:23, Tom Lane wrote:

Sounds good.  Do we want to try Noah's idea of temporarily committing
the remaining changes, to see if the buildfarm is happy?



I think to get value out of this you'd have to compare the config.status
output files (mainly pg_config.h and Makefile.global) before and after.
Otherwise you're just testing that the shell can parse the script.
Perhaps some manual tests on, say, AIX and HP-UX using the native shell
would be of some value.


I already did that on assorted boxes, using the patches you previously
posted [1].  Do you think there's value in re-doing it manually,
rather than just having at it with the buildfarm?


I think right now we don't need any further organized testing.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-26 Thread Sait Talha Nisanci
I have run some benchmarks for this patch. Overall it seems that there is a 
good improvement with the patch on recovery times:

The VMs I used have 32GB RAM; pgbench is initialized with a scale factor of 
3000 (so it doesn't fit in memory, ~45GB).

In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and 
checkpoint_timeout (200 min) are set to high values.

The run is cancelled when there is a reasonable amount of WAL (> 25GB). The 
recovery times are measured from the REDO logs.

I have tried combinations of SSD, HDD, full_page_writes = on/off and 
max_io_concurrency = 10/50; the recovery times are as follows (in seconds):

                               | No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on     | 852         | 301                     | 197
SSD, full_page_writes = off    | 1642        | 1359                    | 1391
HDD, full_page_writes = on     | 6027        | 6345                    | 6390
HDD, full_page_writes = off    | 738         | 275                     | 192

Default prefetch values:
-   Max_recovery_prefetch_distance = 256KB
-   Max_io_concurrency = 10

It probably makes sense to compare each row separately as the size of WAL can 
be different.

Talha.

-Original Message-
From: Thomas Munro  
Sent: Thursday, August 13, 2020 9:57 AM
To: Tomas Vondra 
Cc: Stephen Frost ; Dmitry Dolgov <9erthali...@gmail.com>; 
David Steele ; Andres Freund ; Alvaro 
Herrera ; pgsql-hackers 
Subject: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra  
wrote:
> On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:
> >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
> >> Any luck trying to reproduce thigs? Should I try again and collect 
> >> some additional debug info?
> >
> >No luck.  I'm working on it now, and also trying to reduce the 
> >overheads so that we're not doing extra work when it doesn't help.
>
> OK, I'll see if I can still reproduce it.

Since someone else ask me off-list, here's a rebase, with no functional 
changes.  Soon I'll post a new improved version, but this version just fixes 
the bitrot and hopefully turns cfbot green.


Re: Asymmetric partition-wise JOIN

2020-08-26 Thread Amul Sul
On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai  wrote:
>
> 2019年8月24日(土) 7:02 Thomas Munro :
> >
> > On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai  wrote:
> > > We can consider the table join ptable X t1 above is equivalent to:
> > >   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> > > = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> > > It returns an equivalent result, however, rows are already reduced by 
> > > HashJoin
> > > in the individual leaf of Append, so CPU-cycles consumed by Append node 
> > > can
> > > be cheaper.
> > >
> > > On the other hands, it has a downside because t1 must be read 3 times and
> > > hash table also must be built 3 times. It increases the expected cost,
> > > so planner
> > > may not choose the asymmetric partition-wise join plan.
> >
> > What if you include the partition constraint as a filter on t1?  So you get:
> >
> > ptable X t1 =
> >   (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
> >   (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
> >   (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
> >   (ptable_p3 X (σ hash(dist)%4=3 (t1)))
> >
> > Pros:
> > 1.  The hash tables will not contain unnecessary junk.
> > 2.  You'll get the right answer if t1 is on the outer side of an outer join.
> > 3.  If this runs underneath a Parallel Append and t1 is big enough
> > then workers will hopefully cooperate and do a synchronised scan of
> > t1.
> > 4.  The filter might enable a selective and efficient plan like an index 
> > scan.
> >
> > Cons:
> > 1.  The filter might not enable a selective and efficient plan, and
> > therefore cause extra work.
> >
> > (It's a little weird in this example because don't usually see hash
> > functions in WHERE clauses, but that could just as easily be dist
> > BETWEEN 1 AND 99 or any other partition constraint.)
> >
> It requires the join-key must include the partition key and also must be
> equality-join, doesn't it?
> If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
> t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to
> the partition bounds, indeed.
>
> In case when some of partition leafs are pruned, it is exactly beneficial
> because relevant rows to be referenced by the pruned child relations
> are waste of memory.
>
> On the other hands, it eventually consumes almost equivalent amount
> of memory to load the inner relations, if no leafs are pruned, and if we
> could extend the Hash-node to share the hash-table with sibling join-nodess.
>
> > > One idea I have is, sibling HashJoin shares a hash table that was built 
> > > once
> > > by any of the sibling Hash plan. Right now, it is not implemented yet.
> >
> > Yeah, I've thought a little bit about that in the context of Parallel
> > Repartition.  I'm interested in combining intra-node partitioning
> > (where a single plan node repartitions data among workers on the fly)
> > with inter-node partitioning (like PWJ, where partitions are handled
> > by different parts of the plan, considered at planning time); you
> > finish up needing to have nodes in the plan that 'receive' tuples for
> > each partition, to match up with the PWJ plan structure.  That's not
> > entirely unlike CTE references, and not entirely unlike your idea of
> > somehow sharing the same hash table.  I ran into a number of problems
> > while thinking about that, which I should write about in another
> > thread.
> >
> Hmm. Do you intend the inner-path may have different behavior according
> to the partition bounds definition where the outer-path to be joined?
> Let me investigate its pros & cons.
>
> The reasons why I think the idea of sharing the same hash table is reasonable
> in this scenario are:
> 1. We can easily extend the idea for parallel optimization. A hash table on 
> DSM
> segment, once built, can be shared by all the siblings in all the
> parallel workers.
> 2. We can save the memory consumption regardless of the join-keys and
> partition-keys, even if these are not involved in the query.
>
> On the other hands, below are the downside. Potentially, combined use of
> your idea may help these cases:
> 3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN.
> 4. Hash table contains rows to be referenced by only pruned partition leafs.
>

+ many, for the sharable hash of the inner table of the join. IMHO,
this could be the most interesting and captivating thing about this feature.
But it might be a complicated piece; is that still on the plan?

Regards,
Amul




Re: factorial function/phase out postfix operators?

2020-08-26 Thread John Naylor
On Wed, Aug 26, 2020 at 6:12 AM Mark Dilger
 wrote:
>
> The construction colname AS colalias brings to mind the words "pseudonym" and 
> "alias".  The distinction we're trying to draw here is between implicit 
> pseudoyms and explicit ones, but "alias" is shorter and simpler, so I like 
> that better than "pseudonym".  Both are labels, so adding "label" to the name 
> doesn't really get us anything.  The constructions "implicit alias" vs. 
> "explicit alias" seem to me to be an improvement, along with their other 
> forms like "ImplicitAlias", or "implicit_alias", etc., so I've used those in 
> version 4.

> The word "status" here really means something like "plicity" (implicit vs. 
> explicit), but "plicity" isn't a word, so I used "aliastype" instead.

Seems fine.

> A list of user defined postfix operators is in the file:
> postfix_ops.txt
>
> Failure, exiting
>
>
> With the contents of postfix_ops.txt:
>
> In database: regression
>   (oid=27113) public.#@# (pg_catalog.int8)
>   (oid=27114) public.#%# (pg_catalog.int8)
>
> which should be enough for a user to identify which operator is meant.  I 
> just invented that format.  Let me know if there is a preferred way to lay 
> out that information.

Not sure if there's a precedent here, and seems fine to me.

+ /*
+ * If neither argument is specified, do not mention postfix operators, as
+ * the user is unlikely to have meant to create one.  It is more likely
+ * they simply neglected to mention the args.
+ */
  if (!OidIsValid(typeId1) && !OidIsValid(typeId2))
  ereport(ERROR,
  (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
- errmsg("at least one of leftarg or rightarg must be specified")));
+ errmsg("operator arguments must be specified")));
+
+ /*
+ * But if only the right arg is missing, they probably do intend to create
+ * a postfix operator, so give them a hint about why that does not work.
+ */
+ if (!OidIsValid(typeId2))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
+ errmsg("operator right argument must be specified"),
+ errhint("Postfix operators are not supported.")));

This is just a nitpick -- I think the comments in this section would
flow better if the order of checks were reversed, although the code might
not. I don't feel too strongly about it.

- * between POSTFIXOP and Op.  We can safely assign the same priority to
- * various unreserved keywords as needed to resolve ambiguities (this can't
- * have any bad effects since obviously the keywords will still behave the
- * same as if they weren't keywords).  We need to do this:
+ * greater than Op.  We can safely assign the same priority to various
+ * unreserved keywords as needed to resolve ambiguities (this can't have any
+ * bad effects since obviously the keywords will still behave the same as if
+ * they weren't keywords).  We need to do this:

I believe it's actually "lower than Op", and since POSTFIXOP is gone
it doesn't seem to matter how low it is. In fact, I found that the
lines with INDENT and UNBOUNDED now work as the lowest precedence
declarations. Maybe that's worth something?

Following on Peter E.'s example upthread, GENERATED can be removed
from precedence, and I also found the same is true for PRESERVE and
STRIP_P.

I've attached a patch which applies on top of 0001 to demonstrate
this. There might possibly still be syntax errors for things not
covered in the regression test, but there are no s/r conflicts at
least.

-{ oid => '389', descr => 'deprecated, use ! instead',
+{ oid => '389', descr => 'factorial',

Hmm, no objection, but it could be argued that we should just go ahead
and remove "!!" also, keeping only "factorial()". If we're going to
break a small amount of code using the normal math expression, it
seems silly to use a non-standard one that we deprecated before 2011
(cf. 908ab802864). On the other hand, removing it doesn't buy us
anything.

Some leftovers...

...in catalog/namespace.c:

OpernameGetOprid()
 * Pass oprleft = InvalidOid for a prefix op, oprright = InvalidOid for
 * a postfix op.

OpernameGetCandidates()
 * The returned items always have two args[] entries --- one or the other
 * will be InvalidOid for a prefix or postfix oprkind.  nargs is 2, too.

...in nodes/print.c:

/* we print prefix and postfix ops the same... */


> > 0002:
> >
> > + * All keywords can be used explicitly as a column label in expressions
> > + * like 'SELECT 1234 AS keyword', but only some keywords can be used
> > + * implicitly as column labels in expressions like 'SELECT 1234 keyword'.
> > + * Those that can be used implicitly should be listed here.
> >
> > In my mind, "AS" is the thing that's implied when not present, so we
> > should reword this to use the "bare" designation when talking about
> > the labels. I think there are contexts elsewhere where the implicit
> > column label is "col1, col2, col3...". I can't remember offhand where
> > that is though.
>
> Per my rambling above, I think what's really implied or explicit when "AS" is 
> missing 

Re: some unused parameters cleanup

2020-08-26 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-08-25 18:59, Tom Lane wrote:
>> For some of these, there's an argument for keeping the unused parameter
>> for consistency with sibling functions that do use it.  Not sure how
>> important that is, though.

> I had meant to exclude cases like this from this patch set.  If you see 
> a case like this in *this* patch set, please point it out.

I'd been thinking specifically of the changes in pg_backup_archiver.c.
But now that I look around a bit further, there's already very little
consistency in that file about whether to pass the ArchiveHandle* pointer
everywhere.  So no further objection here.

regards, tom lane




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-08-26 Thread Daniel Gustafsson
> On 26 Aug 2020, at 09:56, Michael Paquier  wrote:
> On Tue, Aug 25, 2020 at 03:52:14PM +0200, Daniel Gustafsson wrote:

>> The attached moves all invocations under the correct guards.  RAND_poll() in
>> fork_process.c needs to happen for both OpenSSL and OpenSSL random, thus the
>> check for both.
> 
> Yeah, it could be possible that somebody still calls RAND_bytes() or
> similar without going through pg_strong_random(), so we still need to
> use USE_OPENSSL after forking.  Per this argument, I am not sure I see
> the point of the change in fork_process.c as it seems to me that 
> USE_OPENSSL_RANDOM should only be tied to pg_strong_random.c, and
> you'd still get a compilation failure if trying to use
> USE_OPENSSL_RANDOM without --with-openssl.

That's certainly true.  The intention, though, is to make the code easier to
follow (more explicit/discoverable) for anyone trying to implement support for
TLS backends. It's a very git-grep-intensive process already as it is.

cheers ./daniel



Re: Strange behavior with polygon and NaN

2020-08-26 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Tue, 25 Aug 2020 19:03:50 -0400, Bruce Momjian  wrote in 
>> I can confirm that this two-month old email report still produces
>> different results with indexes on/off in git master, which I don't think
>> is ever correct behavior.

> I agree to that the behavior is broken. 

Yeah, but ... what is "non broken" in this case?  I'm not convinced
that having point_inside() return zero for any case involving NaN
is going to lead to noticeably saner behavior than today.

regards, tom lane




Re: More tests with USING INDEX replident and dropped indexes

2020-08-26 Thread Michael Paquier
On Tue, Aug 25, 2020 at 08:59:37PM +0530, Rahila wrote:
> Sorry, I didn't apply correctly.  The tests pass for me. In addition, I
> tested with partitioned tables.  It works as expected and makes the REPLICA
> IDENTITY 'n' for the partitions as well when an index on a partitioned
> table is dropped.

Indeed.  I have added a test for that.

While looking at this patch again, I noticed that the new tests for
contrib/test_decoding/ and the improvements for relreplident are
really two different bullet points, and that the new tests should not
be impacted as long as we switch the replica identity to NOTHING once
its index is dropped.  So, for now, I have applied the new decoding
tests in a first commit, and attached is an updated patch which
includes tests in the main regression test suite in
replica_identity.sql, which is more consistent with the rest as that's
the main place where we look at the state of pg_class.relreplident.
I'll go through this part again tomorrow; it is late here.
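
For reference, a minimal sketch (hypothetical names) of the behavior change
the new tests cover:

create table ri_demo (id int not null);
create unique index ri_demo_idx on ri_demo (id);
alter table ri_demo replica identity using index ri_demo_idx;
select relreplident from pg_class where relname = 'ri_demo';   -- 'i'
drop index ri_demo_idx;
-- Unpatched, relreplident stays 'i'; with this patch it is reset to 'n'
-- (REPLICA IDENTITY NOTHING), matching the now-missing indisreplident index.
select relreplident from pg_class where relname = 'ri_demo';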
--
Michael
From 277d9b53e2be9098e72167b5de2a52c40e6e8542 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 26 Aug 2020 20:59:34 +0900
Subject: [PATCH v3] Reset pg_class.relreplident when associated replica index
 is dropped

When an index associated to REPLICA IDENTITY USING INDEX gets dropped,
relreplident gets set to 'i', which can be confusing as the catalogs
have a state inconsistent with pg_index that lacks an associated entry
marked with indisreplident.  This changes the logic so as such an index
dropped leads to 'n' to be set for the parent table, equivalent to
REPLICA IDENTITY NOTHING.

Author: Michael Paquier
Reviewed-by: Masahiko Sawada, Rahila Syed, Euler Taveira
Discussion: https://postgr.es/m/20200522035028.go2...@paquier.xyz
---
 src/include/commands/tablecmds.h  |  2 +
 src/backend/catalog/index.c   | 12 
 src/backend/commands/tablecmds.c  | 58 ---
 .../regress/expected/replica_identity.out | 30 ++
 src/test/regress/sql/replica_identity.sql | 22 +++
 doc/src/sgml/catalogs.sgml|  3 +-
 6 files changed, 104 insertions(+), 23 deletions(-)

diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index c1581ad178..b163a82c52 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -61,6 +61,8 @@ extern void ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_
 
 extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass);
 
+extern void SetRelationReplIdent(Oid relationId, char ri_type);
+
 extern ObjectAddress renameatt(RenameStmt *stmt);
 
 extern ObjectAddress RenameConstraint(RenameStmt *stmt);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1be27eec52..d344a070ce 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2195,6 +2195,18 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)
 	if (userIndexRelation->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
 		RelationDropStorage(userIndexRelation);
 
+	/*
+	 * If the index to be dropped is marked as indisreplident, reset
+	 * pg_class.relreplident for the parent table.
+	 */
+	if (userIndexRelation->rd_index->indisreplident)
+	{
+		SetRelationReplIdent(heapId, REPLICA_IDENTITY_NOTHING);
+
+		/* make the update visible */
+		CommandCounterIncrement();
+	}
+
 	/*
 	 * Close and flush the index's relcache entry, to ensure relcache doesn't
 	 * try to rebuild it while we're deleting catalog entries. We keep the
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d2b15a3387..968cb41f98 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3020,6 +3020,41 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
 	table_close(relationRelation, RowExclusiveLock);
 }
 
+/*
+ * SetRelationReplIdent
+ *		Set the value of the relation's relreplident field in pg_class.
+ *
+ * NOTE: caller must be holding an appropriate lock on the relation.
+ * ShareUpdateExclusiveLock is sufficient.
+ */
+void
+SetRelationReplIdent(Oid relationId, char ri_type)
+{
+	Relation	relationRelation;
+	HeapTuple	tuple;
+	Form_pg_class classtuple;
+
+	/*
+	 * Fetch a modifiable copy of the tuple, modify it, update pg_class.
+	 */
+	relationRelation = table_open(RelationRelationId, RowExclusiveLock);
+	tuple = SearchSysCacheCopy1(RELOID,
+ObjectIdGetDatum(relationId));
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache lookup failed for relation \"%s\"",
+			 get_rel_name(relationId));
+	classtuple = (Form_pg_class) GETSTRUCT(tuple);
+
+	if (classtuple->relreplident != ri_type)
+	{
+		classtuple->relreplident = ri_type;
+		CatalogTupleUpdate(relationRelation, &tuple->t_self, tuple);
+	}
+
+	heap_freetuple(tuple);
+	table_close(relationRelation, RowExclusiveLock);
+}
+
 /*
  *		renameatt_check			- basic sanity checks before attribute rename
  */
@@ -14487,3

Re: LWLockAcquire and LockBuffer mode argument

2020-08-26 Thread Dagfinn Ilmari Mannsåker
Andres Freund  writes:

> Hi,
>
> On 2020-08-25 13:59:35 -0400, Robert Haas wrote:
>
>> However, if we do it all in a backward-compatible way as you propose,
>> then we're likely to keep reintroducing code that does it the old way
>> for a really long time. I'm not sure that actually makes a lot of
>> sense. It might be better to just bite the bullet and make a hard
>> break.
>
> It seems easy enough to slap a compiler "enforced" deprecation warning
> on the new compat version, in master only. Seems unnecessary to make
> life immediately harder for extensions authors desiring cross-version
> compatibility.

Would it be possible to make the compat versions only available when
building extensions, but not to core code?

In Perl we do that a lot, using #ifndef PERL_CORE.

- ilmari
-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-26 Thread Ashutosh Sharma
Thanks for the review. Please find my comments inline below:

On Tue, Aug 25, 2020 at 5:47 PM Masahiko Sawada
 wrote:
>
> Let me share other comments on the latest version patch:
>
> Some words need to be tagged. For instance, I found the following words:
>
> VACUUM
> DISABLE_PAGE_SKIPPING
> HEAP_XMIN_FROZEN
> HEAP_XMAX_INVALID
>

Okay, done.

> ---
> +test=# select ctid from t1 where xmin = 507;
> + ctid
> +---
> + (0,3)
> +(1 row)
> +
> +test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
> +-[ RECORD 1 ]-+-
> +heap_force_freeze |
>
> I think it's better to use a consistent output format. The former uses
> the normal format whereas the latter uses the expanded format.
>

Yep, makes sense, done.

> ---
> + 
> + 
> +  While performing surgery on a damaged relation, we must not be doing 
> anything
> +  else on that relation in parallel. This is to ensure that when we are
> +  operating on a damaged tuple there is no other transaction trying to modify
> +  that tuple.
> + 
> + 
>
> If we prefer to avoid concurrent operations on the target relation why
> don't we use AccessExclusiveLock?
>

Removed this note from the documentation and added a note saying: "The
user needs to ensure that they do not operate pg_force_freeze function
on a deleted tuple because it may revive the deleted tuple."

> ---
> +CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'heap_force_kill'
> +LANGUAGE C STRICT;
>
> +CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'heap_force_freeze'
> +LANGUAGE C STRICT;
>
> I think these functions should be PARALLEL UNSAFE.
>

By default the functions are marked PARALLEL UNSAFE, so I think there
is nothing to do here.

Attached patch with above changes.

This patch also takes care of all the other review comments from - [1].

[1] - 
https://www.postgresql.org/message-id/CA%2Bfd4k6%2BJWq2MfQt5b7fSJ2wMvCes9TRfbDhVO_fQP9B8JJRAA%40mail.gmail.com


--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
From a8067f03cf5ea900790a96ea0059ae080a164ed6 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Wed, 26 Aug 2020 15:57:59 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
 table.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile   |   1 +
 contrib/pg_surgery/Makefile|  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 161 
 contrib/pg_surgery/heap_surgery.c  | 404 +
 contrib/pg_surgery/pg_surgery--1.0.sql |  18 ++
 contrib/pg_surgery/pg_surgery.control  |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql  |  89 +++
 doc/src/sgml/contrib.sgml  |   1 +
 doc/src/sgml/filelist.sgml |   1 +
 doc/src/sgml/pgsurgery.sgml| 144 ++
 10 files changed, 847 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
 create mode 100644 doc/src/sgml/pgsurgery.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
 		pg_prewarm	\
 		pg_standby	\
 		pg_stat_statements \
+		pg_surgery	\
 		pg_trgm		\
 		pgcrypto	\
 		pgrowlocks	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a 
+---
+(0

Re: use pg_get_functiondef() in pg_dump

2020-08-26 Thread Peter Eisentraut

On 2020-08-15 16:23, Tom Lane wrote:

I wouldn't say that it's *fundamentally* new, but nonetheless it disturbs
me that this proposal has pg_dump assembling CREATE FUNCTION commands in
very different ways depending on the server version.  I'd rather see us
continuing to build the bulk of the command the same as before, and
introduce new behavior only for deparsing the function body.


OK, I'll work on something like that.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: use pg_get_functiondef() in pg_dump

2020-08-26 Thread Peter Eisentraut

On 2020-08-15 16:36, Tom Lane wrote:

I wrote:

I wouldn't say that it's *fundamentally* new, but nonetheless it disturbs
me that this proposal has pg_dump assembling CREATE FUNCTION commands in
very different ways depending on the server version.  I'd rather see us
continuing to build the bulk of the command the same as before, and
introduce new behavior only for deparsing the function body.


BTW, a concrete argument for doing it that way is that if you make a
backend function that does the whole CREATE-FUNCTION-building job in
exactly the way pg_dump wants it, that function is nigh useless for
any other client with slightly different requirements.  A trivial
example here is that I don't think we want to become locked into
the proposition that psql's \ef and \sf must print functions exactly
the same way that pg_dump would.


That's why the patch adds optional arguments to the function to choose 
the behavior that is appropriate for the situation.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Implement UNLOGGED clause for COPY FROM

2020-08-26 Thread Amit Kapila
On Wed, Aug 26, 2020 at 12:54 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
>
> Following this idea, what do you think about adding a new value "none" to 
> wal_level, where no WAL is generated?  The setting of wal_level is recorded 
> in pg_control.  The startup process can see the value and reject recovery 
> after abnormal shutdown, emitting a message similar to MySQL's.
>

So you want your users to shut down and restart the server before COPY,
because that would be required to change the wal_level. However, even
if we do that, users who were previously running the server with
wal_level set to 'replica' won't be happy with this change, because if
they change wal_level to 'none' for certain operations like a bulk load
and then change it back to 'replica', they need to take a new base
backup to set up 'replica' again, as they can't continue replication
from the previous point (consider an update to a page for which no WAL
was previously written).

-- 
With Regards,
Amit Kapila.




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-08-26 Thread Kyotaro Horiguchi
At Tue, 25 Aug 2020 22:52:44 -0400, Bruce Momjian  wrote in 
> On Wed, Aug 26, 2020 at 11:41:39AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 25 Aug 2020 10:04:44 -0400, Bruce Momjian  wrote 
> > in 
> > > I think there are a few problems here.  In the docs, it says "will still
> > > verify", but it doesn't say if it verifies the CA, or the CA _and_ the
> > > CN/username.
> > 
> > It verifies only CA.
> 
> OK, that will need to be clarified.
> 
> > > Second, since it is optional, what value does it have?
> > > 
> > > > > Why is this useful?
> > > > 
> > > > I agree, but there seems to be an implementation reason for the
> > > > behavior. To identify an hba-line, some connection parameters like
> > > > user name and others sent over a connection are required.  Thus the
> > > > clientcert option in the to-be-identified hba-line is unknown prior to
> > > > the time SSL connection is made. So the documentation might need
> > > > amendment. Roughly something like the following?
> > > 
> > > Well, I realize internally we need a way to indicate clientcert is not
> > > used, but why do we bother exposing that to the user as a named option?
> > 
> > Because we think we need a named value for every alternative,
> > including the default value?
> 
> Well, not putting clientcert at all gives the default behavior, so why
> have clientcert=no-verify?

clientcert=verify-ca and verify-full don't allow the client certificate
to be absent. We need an option that allows its absence.

> > > And you are right that the option name 'no-verify' is wrong since it
> > > will verify the CA if it exists, so it more like 'optionally-verify',
> > > which seems useless from a user interface perspective.
> > > 
> > > I guess the behavior of no-verify matches our client-side
> > > sslmode=prefer, but at least that has the value of using SSL if
> > > available, which prevents user-visible network traffic, but doesn't
> > > force it, but I am not sure what the value of optional certificate
> > > verification is, since verification is all it does.  I guess it should
> > > be called "prefer-verify".
> > 
> > The point of no-verify is to allow the absence of client
> > certificate. It is similar to "prefer" in a sense that it allows the
> > absence of availability of an SSL connection. (In a similar way to
> > "prefer", we could "fall back" to "no client cert" SSL connection
> > after verification failure but I think it's not worth doing.)
> 
> Well, sslmode=prefer gives encryption without identification. 
> clientcert=no-verify has no value because it is just an optional CA
> check that has no value because optional authentication is useless.  It

The point of the option is not to do an optional CA check if possible,
but to allow the absence of a client certificate. We need to have that
mode whether it is named or not, and I believe we usually provide a
name for the default mode.

> is like saying you can type in the password if you want, and we will
> check it, or you can just not type in the password.

Yes, since the point is that I'm allowed to skip typing a password.
And the reason for the strange-looking behavior is that I can't avoid
entering a password if I have one, and the server has no choice but to
check the password that I provided.

To put it more precisely, the server cannot ignore the certificate if
the client sent it. But the client cannot tell whether the server needs
the certificate before sending it.

> > "prefer-verify" seems right in that sense. But I'm not sure we may
> > break backward compatibility for the reason.
> 
> True, but right now it is inaccurate so I think it just needs to be fixed
> or removed and documented in the PG 14 release notes.

I'm fine with that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Problems with the FSM, heap fillfactor, and temporal locality

2020-08-26 Thread John Naylor
On Tue, Aug 25, 2020 at 5:17 AM Peter Geoghegan  wrote:
>
> I think that the sloppy approach to locking for the
> fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e.
> not using real atomic ops, even though we could) is one source of
> problems here. That might end up necessitating fixing the on-disk
> format, just to get the FSM to behave sensibly -- assuming that the
> value won't be too stale in practice is extremely dubious.
>
> This fp_next_slot business interacts poorly with the "extend a
> relation by multiple blocks" logic added by commit 719c84c1be5 --
> concurrently inserting backends are liable to get the same heap block
> from the FSM, causing "collisions". That almost seems like a bug IMV.
> We really shouldn't be giving out the same block twice, but that's what
> my own custom instrumentation shows happens here. With atomic ops, it
> isn't a big deal to restart using a compare-and-swap at the end (when
> we set/reset fp_next_slot for other backends).

The fact that that logic extends by 20 * numwaiters to get optimal
performance is a red flag that resources aren't being allocated
efficiently. I have an idea to ignore fp_next_slot entirely if we have
extended by multiple blocks: The backend that does the extension
stores in the FSM root page 1) the number of blocks added and 2) the
end-most block number. Any request for space will look for a valid
value here first before doing the usual search. If there is one, then
the block to try is based on a hash of the xid. Something like:

candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))

To guard against collisions, peek in the FSM at that slot, and if
it's not completely empty, search the FSM using a "look-nearby" API
and increment a counter every time we collide. When the counter gets
to some value, clear the special area in the root page so that future
backends use the usual search.
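
A self-contained toy sketch of that formula (hypothetical code, not
anything in the tree; the type and function names are made up, and the
collision counter / "look nearby" fallback are left out):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
typedef uint32_t TransactionId;

/* candidate-block = prev-end-of-relation + 1 + (xid % num-new-blocks) */
static BlockNumber
candidate_block(BlockNumber prev_end_of_relation,
                uint32_t num_new_blocks,
                TransactionId xid)
{
    return prev_end_of_relation + 1 + (xid % num_new_blocks);
}

int
main(void)
{
    /* e.g. the relation ended at block 99 and was just extended by 20 */
    for (TransactionId xid = 1000; xid < 1005; xid++)
        printf("xid %u -> block %u\n", xid, candidate_block(99, 20, xid));
    return 0;
}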

I think this would work well with your idea to be more picky if the
xid stored with the relcache target block doesn't match the current
one.

Also num-new-blocks above could be scaled down from the actual number
of blocks added, just to make sure writes aren't happening all over
the place.

There might be holes in this idea, but it may be worth trying to be
better in this area without adding stricter locking.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Strange behavior with polygon and NaN

2020-08-26 Thread Kyotaro Horiguchi
At Tue, 25 Aug 2020 19:03:50 -0400, Bruce Momjian  wrote in 
> 
> I can confirm that this two-month old email report still produces
> different results with indexes on/off in git master, which I don't think
> is ever correct behavior.

I agree that the behavior is broken.

> ---
> 
> On Wed, Jun 24, 2020 at 03:11:03PM -0700, Jesse Zhang wrote:
> > Hi hackers,
> > 
> > While working with Chris Hajas on merging Postgres 12 with Greenplum
> > Database we stumbled upon the following strange behavior in the geometry
> > type polygon:
> > 
> > -- >8 
> > 
> > CREATE TEMP TABLE foo (p point);
> > CREATE INDEX ON foo USING gist(p);
> > 
> > INSERT INTO foo VALUES ('0,0'), ('1,1'), ('NaN,NaN');
> > 
> > SELECT $q$
> > SELECT * FROM foo WHERE p <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)'
> > $q$ AS qry \gset
> > 
> > BEGIN;
> > SAVEPOINT yolo;
> > SET LOCAL enable_seqscan TO off;
> > :qry;
> > 
> > ROLLBACK TO SAVEPOINT yolo;
> > SET LOCAL enable_indexscan TO off;
> > SET LOCAL enable_bitmapscan TO off;
> > :qry;
> > 
> > -- 8< 
> > 
> > If you run the above repro SQL in HEAD (and 12, and likely all older
> > versions), you get the following output:
> > 
> > CREATE TABLE
> > CREATE INDEX
> > INSERT 0 3
> > BEGIN
> > SAVEPOINT
> > SET
> >p
> > ---
> >  (0,0)
> >  (1,1)
> > (2 rows)
> > 
> > ROLLBACK
> > SET
> > SET
> >  p
> > ---
> >  (0,0)
> >  (1,1)
> >  (NaN,NaN)
> > (3 rows)
> > 
> > 
> > At first glance, you'd think this is the gist AM's bad, but on a second
> > thought, something else is strange here. The following query returns
> > true:
> > 
> > SELECT point '(NaN, NaN)' <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)'
> > 
> > The above behavior of the "contained in" operator is surprising, and
> > it's probably not what the GiST AM is expecting. I took a look at
> > point_inside() in geo_ops.c, and it doesn't seem well equipped to handle
> > NaN. Similarly ill-equipped is dist_ppoly_internal(), which underlies the
> > distance operator for polygon. It gives the following interesting
> > output:
> > 
> > SELECT *, c <-> polygon '(0,0),(0,100),(100,100),(100,0)' as distance
> > FROM (
> >   SELECT circle(point(100 * i, 'NaN'), 50) AS c
> >   FROM generate_series(-2, 4) i
> > ) t(c)
> > ORDER BY 2;
> > 
> > c| distance
> > -+--
> >  <(-200,NaN),50> |0
> >  <(-100,NaN),50> |0
> >  <(0,NaN),50>|0
> >  <(100,NaN),50>  |0
> >  <(200,NaN),50>  |  NaN
> >  <(300,NaN),50>  |  NaN
> >  <(400,NaN),50>  |  NaN
> > (7 rows)
> > 
> > Should they all be NaN? Am I alone in thinking the index is right but
> > the operators are wrong? Or should we call the indexes wrong here?


There may be other places where NaN is not fully considered. For
example, the following (perpendicular operator) returns true.

SELECT lseg '[(0,0),(0,NaN)]' ?-| lseg '[(0,0),(10,0)]';

It is quite hard to fix all of the defects.

For the above cases, it's enough to make sure that point_inside()
doesn't pass NaNs to lseg_crossing(), but it's a lot of work to fill
all the holes in a reasonable and principled way.
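
A tiny standalone illustration (generic C, not code from geo_ops.c) of
why the explicit isnan() guards in the patch below are needed: every
ordered comparison involving NaN is false, so sign/crossing tests
silently take the "did not cross" path rather than reporting anything.

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double x = NAN;

    printf("NaN > 0   : %d\n", x > 0.0);     /* 0 (false) */
    printf("NaN < 0   : %d\n", x < 0.0);     /* 0 (false) */
    printf("NaN == NaN: %d\n", x == x);      /* 0 (false) */
    printf("isnan(x)  : %d\n", isnan(x));    /* nonzero: the reliable test */
    return 0;
}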

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index a7db783958..1875f8d436 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -5350,6 +5350,9 @@ point_inside(Point *p, int npts, Point *plist)
 	x0 = float8_mi(plist[0].x, p->x);
 	y0 = float8_mi(plist[0].y, p->y);
 
+	if (isnan(x0) || isnan(y0) || isnan(p->x) || isnan(p->y))
+		return 0;
+
 	prev_x = x0;
 	prev_y = y0;
 	/* loop over polygon points and aggregate total_cross */
@@ -5359,6 +5362,9 @@ point_inside(Point *p, int npts, Point *plist)
 		x = float8_mi(plist[i].x, p->x);
 		y = float8_mi(plist[i].y, p->y);
 
+		if (isnan(x) || isnan(y))
+			return 0;
+
 		/* compute previous to current point crossing */
 		if ((cross = lseg_crossing(x, y, prev_x, prev_y)) == POINT_ON_POLYGON)
 			return 2;


Re: Improve planner cost estimations for alternative subplans

2020-08-26 Thread Andy Fan
On Mon, Aug 17, 2020 at 10:12 PM Andy Fan  wrote:

>
>
> On Mon, Jun 22, 2020 at 9:39 AM Tom Lane  wrote:
>
>> I wrote:
>> > Nope.  The entire reason why we have that kluge is that we don't know
>> > until much later how many times we expect to execute the subplan.
>> > AlternativeSubPlan allows the decision which subplan form to use to be
>> > postponed till runtime; but when we're doing things like estimating the
>> > cost and selectivity of a where-clause, we don't know that.
>>
>> > Maybe there's some way to recast things to avoid that problem,
>> > but I have little clue what it'd look like.
>>
>> Actually ... maybe it's not that bad.  Clearly there would be a
>> circularity issue for selectivity estimation, but all the alternatives
>> should have the same selectivity.  Cost estimation is a different story:
>> by the time we need to do cost estimation for a subexpression, we do in
>> many cases have an idea how often the subexpression will be executed.
>>
>>
> I read your idea of "ripping out all executor support for
> AlternativeSubPlan and instead having the planner replace an
> AlternativeSubPlan with the desired specific SubPlan somewhere late in
> planning, possibly setrefs.c." in [1].  I was thinking that we could do
> such a replacement sooner, for example once we know the num_calls for
> the subplans; I don't know if that is possible, though.  If we can, then
> we can handle the issue here as well.
>
> The attached is a very PoC version; I'm not sure if it is the right
> direction to go.
>

The idea behind it is: if we have a RelOptInfo which has some
AlternativeSubPlans, assume these subplans have some correlated vars
which can be expressed as deps_relids.  Then we can convert an
AlternativeSubPlan to a SubPlan once bms_is_subset(deps_relids,
rel->relids) holds.  My patch is able to fix the issue reported here,
and it only converts the AlternativeSubPlans in rel->reltarget for demo
purposes.
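
As a self-contained toy sketch of that decision (hypothetical code, not
the planner; RelidSet, is_subset() and choose_hashed_form() are made-up
stand-ins for Relids, bms_is_subset() and the real cost comparison):
once deps_relids is a subset of the rel's relids, the number of calls
is known and the cheaper of the two subplan forms can be picked by
comparing startup cost plus num_calls times the per-call cost.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t RelidSet;          /* one bit per relation, simplified */

/* stand-in for bms_is_subset(): true if every bit of a is also set in b */
static bool
is_subset(RelidSet a, RelidSet b)
{
    return (a & ~b) == 0;
}

typedef struct AltSubPlan
{
    double   hashed_startup_cost;   /* build the hash table once */
    double   hashed_per_call_cost;
    double   plain_startup_cost;
    double   plain_per_call_cost;   /* rescan the subquery on every call */
    RelidSet deps_relids;           /* relations the subplan depends on */
} AltSubPlan;

/* Pick a form once the dependency set is covered; true = hashed wins. */
static bool
choose_hashed_form(const AltSubPlan *sp, RelidSet rel_relids, double num_calls)
{
    if (!is_subset(sp->deps_relids, rel_relids))
        return false;               /* too early to decide */
    return sp->hashed_startup_cost + num_calls * sp->hashed_per_call_cost
         < sp->plain_startup_cost + num_calls * sp->plain_per_call_cost;
}

int
main(void)
{
    AltSubPlan sp = {1000.0, 0.5, 0.0, 200.0, 0x3};

    /* rel covers relations {1,2,3} (bits 0..2); subplan depends on {1,2} */
    printf("hashed form chosen: %d\n",
           (int) choose_hashed_form(&sp, 0x7, 50.0));
    return 0;
}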

-- 
Best Regards
Andy Fan


Re: some unused parameters cleanup

2020-08-26 Thread Michael Paquier
On Wed, Aug 26, 2020 at 06:38:52AM +0200, Peter Eisentraut wrote:
> I had meant to exclude cases like this from this patch set.  If you see a
> case like this in *this* patch set, please point it out.

Last time I looked at that, a lot of parameters were kept around as a
matter of symmetry with sibling functions, as in tablecmds.c.  FWIW:
https://www.postgresql.org/message-id/20190130073317.gp3...@paquier.xyz

That said, I can see that you have been careful here, and I don't see
anything like that in most of the changes you are proposing.  You
could say that for findNamespace() or _moveBefore() perhaps, but there
are also some routines not making use of an Archive.  So this cleanup
looks fine to me.
--
Michael




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-08-26 Thread Andy Fan
On Wed, Aug 26, 2020 at 8:14 AM David Rowley  wrote:

> On Wed, 26 Aug 2020 at 05:18, Andy Fan  wrote:
> >
> >
> > On Tue, Aug 25, 2020 at 11:53 PM Andres Freund 
> wrote:
> >>
> >> On 2020-08-25 20:48:37 +1200, David Rowley wrote:
> >> > Also, just in case anyone is misunderstanding this Andres' argument.
> >> > It's entirely based on the performance impact of having an additional
> >> > node.
> >>
> >> Not entirely, no. It's also just that it doesn't make sense to have two
> >> nodes setting parameters that then half magically picked up by a special
> >> subsidiary node type and used as a cache key. This is pseudo modularity,
> >> not real modularity. And makes it harder to display useful information
> >> in explain etc. And makes it harder to e.g. clear the cache in cases we
> >> know that there's no further use of the current cache. At least without
> >> piercing the abstraction veil.
> >>
> >>
> >> > However, given the correct planner choice, there will never be
> >> > a gross slowdown due to having the extra node.
> >>
> >> There'll be a significant reduction in increase in performance.
> >
> >
> > If this is a key blocking factor for this topic, I'd like to do a simple
> hack
> > to put the cache function into the subplan node, then do some tests to
> > show the real difference.  But it is better to decide how much difference
> > can be thought of as a big difference.  And  for education purposes,
> > I'd like to understand where these differences come from.  For my
> > current knowledge,  my basic idea is it saves some function calls?
>
> If testing this, the cache hit ratio will be pretty key to the
> results. You'd notice the overhead much less with a larger cache hit
> ratio since you're not pulling the tuple from as deeply a nested node.
> I'm unsure how you'd determine what is a good cache hit ratio to
> test it with.


I wanted to test the worst case, where the cache hit ratio is 0, and then
compare the difference between putting the cache in a dedicated
node and in a SubPlan node.  However, we have a better way
to test the difference based on your message below.


>
> The lower the expected cache hit ratio, the higher
> the cost of the Result Cache node will be, so the planner has less
> chance of choosing to use it.
>

IIRC, we add the ResultCache for subplan nodes unconditionally now.
The main reason is that we lack ndistinct estimation during the subquery
planning.  Tom suggested converting the AlternativeSubPlan to SubPlan
in setrefs.c [1], and I also ran into a case that can be resolved if we do
such a conversion even earlier [2]; the basic idea is we can do such a
conversion once we can get the actual values for the subplan.

something like
if (bms_is_subset(subplan->deps_relids, rel->relids))
{
    convert_alternativesubplans_to_subplan(rel);
}
you can see if that can be helpful for ResultCache in this use case.  My
patch in [2] is still in a very PoC stage, so it only takes care of
subplans in rel->reltarget.


> Say you find a case with the hit ratio of 90%.  Going by [1] I found
> pulling a tuple through an additional node to cost about 12
> nanoseconds on an intel 4712HQ CPU.  With a hit ratio of 90% we'll
> only pull 10% of tuples through the additional node, so that's about
> 1.2 nanoseconds per tuple, or 1.2 milliseconds per million tuples. It
> might become hard to measure above the noise. More costly inner scans
> will have the planner choose to Result Cache with lower estimated hit
> ratios, but in that case, pulling the tuple through the additional
> node during a cache miss will be less noticeable due to the more
> costly inner side of the join.
>
> Likely you could test the overhead only in theory without going to the
> trouble of adapting the code to make SubPlan and Nested Loop do the
> caching internally.  If you just modify ExecResultCache() to have it
> simply return its subnode, then measure the performance with and
> without enable_resultcache, you should get an idea of the per-tuple
> overhead of pulling the tuple through the additional node on your CPU.
>

Thanks for the hints.  I think we can test it even more easily with a Limit node.

create table test_pull_tuples(a int);
insert into test_pull_tuples select i from generate_series(1, 10)i;
-- test with pgbench.
select * from test_pull_tuples;   18.850 ms
select * from test_pull_tuples limit 10;   20.500 ms

Basically it is 16 nanoseconds per tuple on my Intel(R) Xeon(R) CPU
E5-2650.  Personally I'd say the performance difference is negligible
unless I see some different numbers.
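
For reference, the arithmetic behind that figure, as a small hedged
sketch: the two timings are the ones above, while the row count is an
assumption chosen to be consistent with the reported 16 ns per tuple.

#include <stdio.h>

int
main(void)
{
    double t_plain_ms = 18.850;  /* select * from test_pull_tuples */
    double t_limit_ms = 20.500;  /* same query pulled through a Limit node */
    double n_rows = 100000.0;    /* assumed number of rows in the table */

    /* extra cost of pulling each tuple through one more executor node */
    double per_tuple_ns = (t_limit_ms - t_plain_ms) * 1e6 / n_rows;

    printf("per-tuple overhead: %.1f ns\n", per_tuple_ns);  /* ~16.5 ns */
    return 0;
}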

[1]
https://www.postgresql.org/message-id/1992952.1592785225%40sss.pgh.pa.us
[2]
https://www.postgresql.org/message-id/CAKU4AWoMRzZKk1vPstKTjS7sYeN43j8WtsAZy2pv73vm_E_6dA%40mail.gmail.com


-- 
Best Regards
Andy Fan


Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-08-26 Thread Michael Paquier
On Tue, Aug 25, 2020 at 03:52:14PM +0200, Daniel Gustafsson wrote:
> The USE_OPENSSL_RANDOM macro is defined when OpenSSL is used as a randomness
> provider, but the implementation of strong randomness is guarded by 
> USE_OPENSSL
> in most places.  This is technically the same thing today, but it seems
> hygienic to use the appropriate macro in case we ever want to allow OS
> randomness together with OpenSSL or something similar (or just make git grep
> easier which is my itch to scratch with this).

@@ -24,7 +24,7 @@
 #include 
 #include 

-#ifdef USE_OPENSSL
+#ifdef USE_OPENSSL_RANDOM
 #include <openssl/rand.h>
 #endif
I agree that this makes the header declarations more consistent with
WIN32.

> The attached moves all invocations under the correct guards.  RAND_poll() in
> fork_process.c needs to happen for both OpenSSL and OpenSSL random, thus the
> check for both.

Yeah, it could be possible that somebody still calls RAND_bytes() or
similar without going through pg_strong_random(), so we still need to
use USE_OPENSSL after forking.  Per this argument, I am not sure I see
the point of the change in fork_process.c as it seems to me that 
USE_OPENSSL_RANDOM should only be tied to pg_strong_random.c, and
you'd still get a compilation failure if trying to use
USE_OPENSSL_RANDOM without --with-openssl.
--
Michael




RE: Implement UNLOGGED clause for COPY FROM

2020-08-26 Thread tsunakawa.ta...@fujitsu.com
Hello,


While I think it's worth thinking about a sophisticated feature like Oracle's
UNRECOVERABLE data loading (SQL Server's BCP load utility also has such a
feature, though only for an empty table), how about an easier approach like
MySQL's?  I expect this won't complicate the Postgres code much.

The customer is using Oracle RAC for high availability of a data warehouse.
So I think they can use traditional shared-disk-based HA clustering, rather
than streaming replication, when they migrate to Postgres.

They load data into the data warehouse with the nightly ETL or ELT.  The 
loading window is limited, so they run multiple concurrent loading sessions, 
with the transaction logging off.  They probably use all resources for the data 
loading during that period.

Then, you might think "How about turning fsync and full_page_writes off?"
But the customer doesn't want to worry about the massive amount of WAL
generated during the loading.


OTOH, the latest MySQL 8.0.21 introduced the following feature.  This is for 
the initial data loading into a new database instance, though.


https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html#innodb-disable-redo-logging
--
Disabling Redo Logging
As of MySQL 8.0.21, you can disable redo logging using the ALTER INSTANCE 
DISABLE INNODB REDO_LOG statement. This functionality is intended for loading 
data into a new MySQL instance. Disabling redo logging speeds up data loading 
by avoiding redo log writes and doublewrite buffering.

Warning
This feature is intended only for loading data into a new MySQL instance. Do 
not disable redo logging on a production system. It is permitted to shutdown 
and restart the server while redo logging is disabled, but an unexpected server 
stoppage while redo logging is disabled can cause data loss and instance 
corruption.

Attempting to restart the server after an unexpected server stoppage while redo 
logging is disabled is refused with the following error:

[ERROR] [MY-013578] [InnoDB] Server was killed when Innodb Redo 
logging was disabled. Data files could be corrupt. You can try 
to restart the database with innodb_force_recovery=6
In this case, initialize a new MySQL instance and start the data loading 
procedure again.
--


Following this idea, what do you think about adding a new value "none" to 
wal_level, where no WAL is generated?  The setting of wal_level is recorded in 
pg_control.  The startup process can see the value and reject recovery after 
abnormal shutdown, emitting a message similar to MySQL's.

Just a quick idea.  I hope no devil will appear in the details.


Regards
Takayuki Tsunakawa