Re: making relfilenodes 56 bits

2023-01-05 Thread Dilip Kumar
On Wed, Jan 4, 2023 at 5:45 PM vignesh C  wrote:
>
> On Fri, 21 Oct 2022 at 11:31, Michael Paquier  wrote:
> >
> > On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > > any actual checks on the sanity of the output?  I realize that the
> > > output's far from static, but still ...
> >
> > Honestly, checking all the fields is not that exciting, but the
> > maximum I can think of that would be portable enough is something like
> > the attached.  No arithmetic operators for xid limits things a bit,
> > but at least that's something.
> >
> > Thoughts?
>
> The patch does not apply on top of HEAD as in [1]; please post a rebased
> patch.
>

Because of the extra WAL overhead, we are not continuing with the
patch, so I will withdraw it.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2023-01-04 Thread vignesh C
On Fri, 21 Oct 2022 at 11:31, Michael Paquier  wrote:
>
> On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> > Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> > any actual checks on the sanity of the output?  I realize that the
> > output's far from static, but still ...
>
> Honestly, checking all the fields is not that exciting, but the
> maximum I can think of that would be portable enough is something like
> the attached.  No arithmetic operators for xid limits things a bit,
> but at least that's something.
>
> Thoughts?

The patch does not apply on top of HEAD as in [1]; please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
33ab0a2a527e3af5beee3a98fc07201e555d6e45 ===
=== applying patch ./controldata-regression-2.patch
patching file src/test/regress/expected/misc_functions.out
Hunk #1 succeeded at 642 with fuzz 2 (offset 48 lines).
patching file src/test/regress/sql/misc_functions.sql
Hunk #1 FAILED at 223.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/sql/misc_functions.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3711.log

Regards,
Vignesh




Re: making relfilenodes 56 bits

2022-10-21 Thread Michael Paquier
On Thu, Sep 29, 2022 at 09:23:38PM -0400, Tom Lane wrote:
> Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
> any actual checks on the sanity of the output?  I realize that the
> output's far from static, but still ...

Honestly, checking all the fields is not that exciting, but the
maximum I can think of that would be portable enough is something like
the attached.  No arithmetic operators for xid limits things a bit,
but at least that's something.

Thoughts?
--
Michael
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..38987e2afc 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,89 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+--
+-- Test functions for control data
+--
+\x
+SELECT checkpoint_lsn > '0/0'::pg_lsn AS checkpoint_lsn,
+redo_lsn > '0/0'::pg_lsn AS redo_lsn,
+redo_wal_file IS NOT NULL AS redo_wal_file,
+timeline_id > 0 AS timeline_id,
+prev_timeline_id > 0 AS prev_timeline_id,
+next_xid IS NOT NULL AS next_xid,
+next_oid > 0 AS next_oid,
+next_multixact_id != '0'::xid AS next_multixact_id,
+next_multi_offset IS NOT NULL AS next_multi_offset,
+oldest_xid != '0'::xid AS oldest_xid,
+oldest_xid_dbid > 0 AS oldest_xid_dbid,
+oldest_active_xid != '0'::xid AS oldest_active_xid,
+oldest_multi_xid != '0'::xid AS oldest_multi_xid,
+oldest_multi_dbid > 0 AS oldest_multi_dbid,
+oldest_commit_ts_xid IS NOT NULL AS oldest_commit_ts_xid,
+newest_commit_ts_xid IS NOT NULL AS newest_commit_ts_xid
+  FROM pg_control_checkpoint();
+-[ RECORD 1 ]+--
+checkpoint_lsn   | t
+redo_lsn | t
+redo_wal_file| t
+timeline_id  | t
+prev_timeline_id | t
+next_xid | t
+next_oid | t
+next_multixact_id| t
+next_multi_offset| t
+oldest_xid   | t
+oldest_xid_dbid  | t
+oldest_active_xid| t
+oldest_multi_xid | t
+oldest_multi_dbid| t
+oldest_commit_ts_xid | t
+newest_commit_ts_xid | t
+
+SELECT max_data_alignment > 0 AS max_data_alignment,
+database_block_size > 0 AS database_block_size,
+blocks_per_segment > 0 AS blocks_per_segment,
+wal_block_size > 0 AS wal_block_size,
+max_identifier_length > 0 AS max_identifier_length,
+max_index_columns > 0 AS max_index_columns,
+max_toast_chunk_size > 0 AS max_toast_chunk_size,
+large_object_chunk_size > 0 AS large_object_chunk_size,
+float8_pass_by_value IS NOT NULL AS float8_pass_by_value,
+data_page_checksum_version >= 0 AS data_page_checksum_version
+  FROM pg_control_init();
+-[ RECORD 1 ]--+--
+max_data_alignment | t
+database_block_size| t
+blocks_per_segment | t
+wal_block_size | t
+max_identifier_length  | t
+max_index_columns  | t
+max_toast_chunk_size   | t
+large_object_chunk_size| t
+float8_pass_by_value   | t
+data_page_checksum_version | t
+
+SELECT min_recovery_end_lsn >= '0/0'::pg_lsn AS min_recovery_end_lsn,
+min_recovery_end_timeline >= 0 AS min_recovery_end_timeline,
+backup_start_lsn >= '0/0'::pg_lsn AS backup_start_lsn,
+backup_end_lsn >= '0/0'::pg_lsn AS backup_end_lsn,
+end_of_backup_record_required IS NOT NULL AS end_of_backup_record_required
+  FROM pg_control_recovery();
+-[ RECORD 1 ]-+--
+min_recovery_end_lsn  | t
+min_recovery_end_timeline | t
+backup_start_lsn  | t
+backup_end_lsn| t
+end_of_backup_record_required | t
+
+SELECT pg_control_version > 0 AS pg_control_version,
+catalog_version_no > 0 AS catalog_version_no,
+system_identifier >= 0 AS system_identifier,
+pg_control_last_modified <= now() AS pg_control_last_modified
+  FROM pg_control_system();
+-[ RECORD 1 ]+--
+pg_control_version   | t
+catalog_version_no   | t
+system_identifier| t
+pg_control_last_modified | t
+
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 639e9b352c..986e07c3a5 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -223,3 +223,47 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,1000) g ON a.unique1 = g;
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
+
+--
+-- Test functions for control data
+--
+\x
+SELECT checkpoint_lsn > '0/0'::pg_lsn AS checkpoint_lsn,
+redo_lsn > '0/0'::pg_lsn AS redo_lsn,
+redo_wal_file IS NOT NULL AS redo_wal_file,
+timeline_id > 0 AS timeline_id,
+prev_timeline_id > 0 AS prev_timeline_id,
+next_xid IS NOT NULL AS next_xid,
+next_oid > 0 AS next_oid,
+next_multixact_id != '0'::xid AS next_multixact_id,
+next_multi_offset IS NOT NULL AS next_multi_offset,
+oldest_xid != 

Re: making relfilenodes 56 bits

2022-09-29 Thread Tom Lane
Michael Paquier  writes:
> While passing by, I have noticed this thread.  We don't really care
> about the contents returned by these functions, and one simple trick
> to check their execution is SELECT FROM.  Like in the attached, for
> example.

Hmmm ... I'd tend to do SELECT COUNT(*) FROM.  But can't we provide
any actual checks on the sanity of the output?  I realize that the
output's far from static, but still ...

regards, tom lane




Re: making relfilenodes 56 bits

2022-09-29 Thread Michael Paquier
On Thu, Sep 29, 2022 at 02:39:44PM -0400, Tom Lane wrote:
> The assertions in TupleDescInitEntry would have caught that,
> if only utils/misc/pg_controldata.c had more than zero test coverage.
> Seems like somebody ought to do something about that.

While passing by, I have noticed this thread.  We don't really care
about the contents returned by these functions, and one simple trick
to check their execution is SELECT FROM.  Like in the attached, for
example.
--
Michael
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 9f106c2a10..93cba8e76d 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -594,3 +594,22 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
  Index Cond: (unique1 = g.g)
 (4 rows)
 
+--
+-- Test functions for control data
+--
+SELECT FROM pg_control_checkpoint();
+--
+(1 row)
+
+SELECT FROM pg_control_init();
+--
+(1 row)
+
+SELECT FROM pg_control_recovery();
+--
+(1 row)
+
+SELECT FROM pg_control_system();
+--
+(1 row)
+
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 639e9b352c..207d5a5292 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -223,3 +223,11 @@ SELECT * FROM tenk1 a JOIN my_gen_series(1,1000) g ON a.unique1 = g;
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1 a JOIN my_gen_series(1,10) g ON a.unique1 = g;
+
+--
+-- Test functions for control data
+--
+SELECT FROM pg_control_checkpoint();
+SELECT FROM pg_control_init();
+SELECT FROM pg_control_recovery();
+SELECT FROM pg_control_system();




Re: making relfilenodes 56 bits

2022-09-29 Thread Tom Lane
Robert Haas  writes:
> On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov  wrote:
>> In other words, we have 19 attributes. But tupdesc here is constructed for 
>> 18 elements:
>> tupdesc = CreateTemplateTupleDesc(18);

> I think that's a mistake. Thanks for the report.

The assertions in TupleDescInitEntry would have caught that,
if only utils/misc/pg_controldata.c had more than zero test coverage.
Seems like somebody ought to do something about that.

regards, tom lane




Re: making relfilenodes 56 bits

2022-09-29 Thread Robert Haas
On Thu, Sep 29, 2022 at 10:50 AM Maxim Orlov  wrote:
> In other words, we have 19 attributes. But tupdesc here is constructed for 18 
> elements:
> tupdesc = CreateTemplateTupleDesc(18);
>
> Is that normal or not? Again, I'm not in this thread and if that is 
> completely ok, I'm sorry about the noise.

I think that's a mistake. Thanks for the report.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-29 Thread Maxim Orlov
Hi!

I'm not in the context of this thread, but I've noticed something strange while
attempting to rebase my patch set from the 64XID thread.
As far as I'm aware, this patch set is adding "relfilenumber". So, in
pg_control_checkpoint, we have the following changes:

diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..d441cd97e2 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -79,8 +79,8 @@ pg_control_system(PG_FUNCTION_ARGS)
 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-   Datum   values[18];
-   boolnulls[18];
+   Datum   values[19];
+   boolnulls[19];
TupleDesc   tupdesc;
HeapTuple   htup;
ControlFileData *ControlFile;
@@ -129,6 +129,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
   XIDOID, -1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time",
   TIMESTAMPTZOID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 19, "next_relfilenumber",
+  INT8OID, -1, 0);
tupdesc = BlessTupleDesc(tupdesc);

/* Read the control file. */

In other words, we have 19 attributes. But tupdesc here is constructed for
18 elements:
tupdesc = CreateTemplateTupleDesc(18);

Is that normal or not? Again, I'm not in this thread and if that is
completely ok, I'm sorry about the noise.

-- 
Best regards,
Maxim Orlov.


Re: making relfilenodes 56 bits

2022-09-28 Thread Thomas Munro
On Wed, Sep 28, 2022 at 9:40 PM Dilip Kumar  wrote:
> Thanks, Thomas, these all look fine to me.  So far we have committed
> the patch to make relfilenode 56 bits wide.  The Tombstone file
> removal patch is still pending to be committed, so when I rebase
> that patch I will accommodate all these comments in it.

I noticed that your new unlinking algorithm goes like this:

stat("x")
stat("x.1")
stat("x.2")
stat("x.3") -> ENOENT /* now we know how many segments there are */
truncate("x.2")
unlink("x.2")
truncate("x.1")
unlink("x.1")
truncate("x")
unlink("x")

Could you say what problem this solves, and, guessing that it's just
that you want the 0 file to be "in the way" until the other files are
gone (at least while we're running; who knows what'll be left if you
power-cycle), could you do it like this instead?

truncate("x")
truncate("x.1")
truncate("x.2")
truncate("x.3") -> ENOENT /* now we know how many segments there are */
unlink("x.2")
unlink("x.1")
unlink("x")




Re: making relfilenodes 56 bits

2022-09-28 Thread Dilip Kumar
On Wed, Sep 28, 2022 at 9:23 AM Thomas Munro  wrote:
>
> Hi Dilip,
>
> I am very happy to see these commits.  Here's some belated review for
> the tombstone-removal patch.
>
> > v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch
>
> More things you can remove:
>
>  * sync_unlinkfiletag in struct SyncOps
>  * the macro UNLINKS_PER_ABSORB
>  * global variable pendingUnlinks
>
> This comment after the question mark is obsolete:
>
> * XXX should we CHECK_FOR_INTERRUPTS in this loop?
> Escaping with an
> * error in the case of SYNC_UNLINK_REQUEST would leave the
> * no-longer-used file still present on disk, which
> would be bad, so
> * I'm inclined to assume that the checkpointer will
> always empty the
> * queue soon.
>
> (I think if the answer to the question is now yes, then we should
> replace the stupid sleep with a condition variable sleep, but there's
> another thread about that somewhere).
>
> In a couple of places in dbcommands.c you could now make this change:
>
> /*
> -* Force a checkpoint before starting the copy. This will
> force all dirty
> -* buffers, including those of unlogged tables, out to disk, to ensure
> -* source database is up-to-date on disk for the copy.
> -* FlushDatabaseBuffers() would suffice for that, but we also want to
> -* process any pending unlink requests. Otherwise, if a checkpoint
> -* happened while we're copying files, a file might be deleted just 
> when
> -* we're about to copy it, causing the lstat() call in copydir() to 
> fail
> -* with ENOENT.
> +* Force all dirty buffers, including those of unlogged tables, out to
> +* disk, to ensure source database is up-to-date on disk for the copy.
>  */
> -   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
> - CHECKPOINT_WAIT |
> CHECKPOINT_FLUSH_ALL);
> +   FlushDatabaseBuffers(src_dboid);
>
> More obsolete comments you could change:
>
>  * If we were copying database at block levels then drop pages for the
>  * destination database that are in the shared buffer cache.  And tell
> -->  * checkpointer to forget any pending fsync and unlink
> requests for files
>
> --> * Tell checkpointer to forget any pending fsync and unlink requests 
> for
> * files in the database; else the fsyncs will fail at next
> checkpoint, or
> * worse, it will delete file
>
> In tablespace.c I think you could now make this change:
>
> if (!destroy_tablespace_directories(tablespaceoid, false))
> {
> -   /*
> -* Not all files deleted?  However, there can be
> lingering empty files
> -* in the directories, left behind by for example DROP
> TABLE, that
> -* have been scheduled for deletion at next checkpoint
> (see comments
> -* in mdunlink() for details).  We could just delete
> them immediately,
> -* but we can't tell them apart from important data
> files that we
> -* mustn't delete.  So instead, we force a checkpoint
> which will clean
> -* out any lingering files, and try again.
> -*/
> -   RequestCheckpoint(CHECKPOINT_IMMEDIATE |
> CHECKPOINT_FORCE | CHECKPOINT_WAIT);
> -
> +#ifdef WIN32
> /*
>  * On Windows, an unlinked file persists in the
> directory listing
>  * until no process retains an open handle for the
> file.  The DDL
> @@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)
>
> /* And now try again. */
> if (!destroy_tablespace_directories(tablespaceoid, false))
> +#endif
> {
> /* Still not empty, the files must be important then 
> */
> ereport(ERROR,

Thanks, Thomas, these all look fine to me.  So far we have committed
the patch to make relfilenode 56 bits wide.  The Tombstone file
removal patch is still pending to be committed, so when I rebase
that patch I will accommodate all these comments in it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-27 Thread Thomas Munro
Hi Dilip,

I am very happy to see these commits.  Here's some belated review for
the tombstone-removal patch.

> v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch

More things you can remove:

 * sync_unlinkfiletag in struct SyncOps
 * the macro UNLINKS_PER_ABSORB
 * global variable pendingUnlinks

This comment after the question mark is obsolete:

* XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
* error in the case of SYNC_UNLINK_REQUEST would leave the
* no-longer-used file still present on disk, which would be bad, so
* I'm inclined to assume that the checkpointer will always empty the
* queue soon.

(I think if the answer to the question is now yes, then we should
replace the stupid sleep with a condition variable sleep, but there's
another thread about that somewhere).

In a couple of places in dbcommands.c you could now make this change:

/*
-* Force a checkpoint before starting the copy. This will force all dirty
-* buffers, including those of unlogged tables, out to disk, to ensure
-* source database is up-to-date on disk for the copy.
-* FlushDatabaseBuffers() would suffice for that, but we also want to
-* process any pending unlink requests. Otherwise, if a checkpoint
-* happened while we're copying files, a file might be deleted just when
-* we're about to copy it, causing the lstat() call in copydir() to fail
-* with ENOENT.
+* Force all dirty buffers, including those of unlogged tables, out to
+* disk, to ensure source database is up-to-date on disk for the copy.
 */
-   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
- CHECKPOINT_WAIT | CHECKPOINT_FLUSH_ALL);
+   FlushDatabaseBuffers(src_dboid);

More obsolete comments you could change:

 * If we were copying database at block levels then drop pages for the
 * destination database that are in the shared buffer cache.  And tell
-->  * checkpointer to forget any pending fsync and unlink requests for files

--> * Tell checkpointer to forget any pending fsync and unlink requests for
* files in the database; else the fsyncs will fail at next checkpoint, or
* worse, it will delete file

In tablespace.c I think you could now make this change:

if (!destroy_tablespace_directories(tablespaceoid, false))
{
-   /*
-* Not all files deleted?  However, there can be lingering empty files
-* in the directories, left behind by for example DROP TABLE, that
-* have been scheduled for deletion at next checkpoint (see comments
-* in mdunlink() for details).  We could just delete them immediately,
-* but we can't tell them apart from important data files that we
-* mustn't delete.  So instead, we force a checkpoint which will clean
-* out any lingering files, and try again.
-*/
-   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-
+#ifdef WIN32
/*
 * On Windows, an unlinked file persists in the directory listing
 * until no process retains an open handle for the file.  The DDL
@@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)

/* And now try again. */
if (!destroy_tablespace_directories(tablespaceoid, false))
+#endif
{
/* Still not empty, the files must be important then */
ereport(ERROR,




Re: making relfilenodes 56 bits

2022-09-27 Thread Robert Haas
On Tue, Sep 27, 2022 at 2:33 AM Dilip Kumar  wrote:
> Looks fine to me.

OK, committed. I also committed the 0002 patch with some wordsmithing,
and I removed a < 0 test an an unsigned value because my compiler
complained about it. 0001 turned out to make headerscheck sad, so I
just pushed a fix for that, too.

I'm not too sure about 0003. I think if we need an is_shared flag
maybe we might as well just pass the tablespace OID. The is_shared
flag seems to just make things a bit complicated for the callers for
no real benefit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-26 Thread Robert Haas
On Wed, Sep 21, 2022 at 6:09 AM Dilip Kumar  wrote:
> Yeah, you are right, we can make it uint64.  With respect to this, we
> cannot use uint64 directly because that is declared in c.h, and that
> cannot be used in postgres_ext.h IIUC.

Ugh.

> Can we move the existing definitions from
> c.h file to some common file (common for client and server)?

Yeah, I think that would be a good idea. Here's a quick patch that
moves them to common/relpath.h, which seems like a possibly-reasonable
choice, though perhaps you or someone else will have a better idea.

> Based on the discussion [1], it seems we cannot use
> INT64_FORMAT/UINT64_FORMAT while using ereport.  But in all other places
> I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?
>
> [1] 
> https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql

Oh, hmm. So you're saying if the string is not translated then use
(U)INT64_FORMAT but if it is translated then cast? I guess that makes
sense. It feels a bit strange to have the style dependent on the
context like that, but maybe it's fine. I'll reread with that idea in
mind.

> > If we're going to bank on that, we could adapt this more
> > heavily, e.g. RelidByRelfilenumber() could lose the reltablespace
> > parameter.
>
> Yeah, we might, although we need a bool to identify whether it is a
> shared relation or not.

Why?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


move-relfilenumber-decls-v1.patch
Description: Binary data


Re: making relfilenodes 56 bits

2022-09-22 Thread Dilip Kumar
On Tue, Sep 20, 2022 at 7:46 PM Amul Sul  wrote:
>

Thanks for the review

> Here are a few minor suggestions I came across while reading this
> patch; they might be useful:
>
> +#ifdef USE_ASSERT_CHECKING
> +
> +   {
>
> Unnecessary space after USE_ASSERT_CHECKING.

Changed

>
> +   return InvalidRelFileNumber;/* placate compiler */
>
> I don't think this is needed after the error on the latest branches.
> --

Changed

> +   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
> +   if (shutdown)
> +   checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
> +   else
> +   checkPoint.nextRelFileNumber = 
> ShmemVariableCache->loggedRelFileNumber;
> +
> +   LWLockRelease(RelFileNumberGenLock);
>
> This is done for a good reason, I think, but it should have a comment
> describing why checkPoint.nextRelFileNumber needs a different assignment
> here, from a crash-recovery perspective.
> --

Done

> +#define SizeOfRelFileLocatorBackend \
> +   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
>
> Can we append empty parentheses "()" to the macro name, so it looks like a
> function call at the call site, or change the macro name to uppercase?
> --

Yeah, we could, but SizeOfXXX macros are a general practice I see used
everywhere in the Postgres code, so I left it as it is.

>  +   if (val < 0 || val > MAX_RELFILENUMBER)
> ..
>  if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
>
> How about adding a macro for this condition as RelFileNumberIsValid()?
> We can replace all the checks referring to MAX_RELFILENUMBER with this.

Actually, RelFileNumberIsValid is used just to check whether the value is
InvalidRelFileNumber, i.e. 0.  Maybe for this we could introduce
RelFileNumberInValidRange(), but I am not sure whether it would be cleaner
than what we have now, so I have left it as it is for now.
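
(For illustration, a hypothetical sketch of the two macros side by side -- the
typedef and the 56-bit limit below are stand-ins for the patch's definitions,
and RelFileNumberInValidRange() is only a possible name, not committed code:)

#include <stdint.h>

typedef int64_t RelFileNumber;	/* stand-in for the patch's typedef */

#define InvalidRelFileNumber	((RelFileNumber) 0)
#define MAX_RELFILENUMBER		((RelFileNumber) ((INT64_C(1) << 56) - 1))

/* true when the value is anything other than the "invalid" sentinel */
#define RelFileNumberIsValid(relfilenumber) \
	((relfilenumber) != InvalidRelFileNumber)

/* hypothetical: true when the value also fits the 56-bit on-disk field */
#define RelFileNumberInValidRange(relfilenumber) \
	((relfilenumber) >= 0 && (relfilenumber) <= MAX_RELFILENUMBER)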


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-21 Thread Dilip Kumar
On Tue, Sep 20, 2022 at 10:44 PM Robert Haas  wrote:

Thanks for the review.  Please see my responses inline for some of the
comments; the rest are all accepted.

> On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar  wrote:
> > [ new patch ]
>
> +typedef pg_int64 RelFileNumber;
>
> This seems really random to me. First, why isn't this an unsigned
> type? OID is unsigned and I don't see a reason to change to a signed
> type. But even if we were going to change to a signed type, why
> pg_int64? That is declared like this:
>
> /* Define a signed 64-bit integer type for use in client API declarations. */
> typedef PG_INT64_TYPE pg_int64;
>
> Surely this is not a client API declaration
>
> Note that if we change this a lot of references to INT64_FORMAT will
> need to become UINT64_FORMAT.
>
> I think we should use int64 at the SQL level, because we don't have an
> unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
> So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
> similar. But internally I think using unsigned is cleaner.

Yeah, you are right, we can make it uint64.  With respect to this, we
cannot use uint64 directly because that is declared in c.h, and that
cannot be used in postgres_ext.h IIUC.  So what are the other options?
Maybe we can typedef RelFileNumber similarly to what c.h does for
uint64, i.e.

#ifdef HAVE_LONG_INT_64
typedef unsigned long int uint64;
#elif defined(HAVE_LONG_LONG_INT_64)
typedef unsigned long long int uint64;
#endif

And maybe same for UINT64CONST ?

I am not keen on duplicating this logic, but is there any better
alternative for doing this?  Can we move the existing definitions from
the c.h file to some common file (common for client and server)?
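
(For illustration only, a rough sketch of what such shared definitions might
look like in a header usable by both client and server code; the header name
and macro names are hypothetical, and the follow-up patch in this thread
instead moves the existing declarations into common/relpath.h:)

#ifndef RELFILENUMBER_H
#define RELFILENUMBER_H

#ifdef HAVE_LONG_INT_64
typedef unsigned long int RelFileNumber;
#define RELFILENUMBER_CONST(x) (x##UL)
#elif defined(HAVE_LONG_LONG_INT_64)
typedef unsigned long long int RelFileNumber;
#define RELFILENUMBER_CONST(x) (x##ULL)
#endif

#endif							/* RELFILENUMBER_H */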

>
> +if (relnumber >= ShmemVariableCache->loggedRelFileNumber)
>
> It probably doesn't make any difference, but to me it seems better to
> test flushedRelFileNumber rather than logRelFileNumber here. What do
> you think?

Actually, based on this condition we are logging more, so it makes more
sense to check w.r.t. loggedRelFileNumber; but OTOH, technically,
without flushing the log we are not supposed to use the relfilenumber, so
it makes more sense to test flushedRelFileNumber.  But since both are the
same, I am fine with flushedRelFileNumber.

>  /*
>   * We set up the lockRelId in case anything tries to lock the dummy
> - * relation.  Note that this is fairly bogus since relNumber may be
> - * different from the relation's OID.  It shouldn't really matter though.
> - * In recovery, we are running by ourselves and can't have any lock
> - * conflicts.  While syncing, we already hold AccessExclusiveLock.
> + * relation.  Note we are setting relId to just FirstNormalObjectId which
> + * is completely bogus.  It shouldn't really matter though. In recovery,
> + * we are running by ourselves and can't have any lock conflicts.  While
> + * syncing, we already hold AccessExclusiveLock.
>   */
>  rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
> -rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
> +rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
>
> Boy, this makes me uncomfortable. The existing logic is pretty bogus,
> and we're replacing it with some other bogus thing. Do we know whether
> anything actually does try to use this for locking?
>
> One notable difference between the existing logic and your change is
> that, with the existing logic, we use a bogus value that will differ
> from one relation to the next, whereas with this change, it will
> always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
> (Oid) rlocator.relNumber would be a more natural adaptation?
>
> +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\
> +do {\
> +if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> +ereport(ERROR,  \
> +errcode(ERRCODE_INVALID_PARAMETER_VALUE),   \
> +errmsg("relfilenumber %lld is out of range",\
> +(long long) (relfilenumber))); \
> +} while (0)
>
> Here, you take the approach of casting the relfilenumber to long long
> and then using %lld. But elsewhere, you use
> INT64_FORMAT/UINT64_FORMAT. If we're going to use this technique, we
> ought to use it everywhere.

Based on the discussion [1], it seems we cannot use
INT64_FORMAT/UINT64_FORMAT while using ereport.  But in all other places
I am using INT64_FORMAT/UINT64_FORMAT.  Does this make sense?

[1] 
https://www.postgresql.org/message-id/20220730113922.qd7qmenwcmzyacje%40alvherre.pgsql
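
(For what it's worth, a standalone illustration of the distinction being
discussed -- not code from the patch: a translatable message needs one fixed
format string on every platform, which the long long cast with %lld provides,
whereas a platform-dependent format macro is fine only in untranslated
strings:)

#include <stdio.h>
#include <inttypes.h>

int
main(void)
{
	int64_t		relfilenumber = 123456789012345;

	/* platform-dependent format macro: fine for untranslated messages */
	printf("relfilenumber %" PRId64 "\n", relfilenumber);

	/* fixed format string plus a cast: what translatable messages need */
	printf("relfilenumber %lld is out of range\n",
		   (long long) relfilenumber);
	return 0;
}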

>  typedef struct
>  {
> -Oid reltablespace;
> -RelFileNumber relfilenumber;
> -} RelfilenumberMapKey;
> -
> -typedef struct
> -{
> -RelfilenumberMapKey key;/* lookup key - must be first */
> +RelFileNumber relfilenumber;/* lookup key - must be first */
>  Oid relid; 

Re: making relfilenodes 56 bits

2022-09-20 Thread Robert Haas
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar  wrote:
> [ new patch ]

+typedef pg_int64 RelFileNumber;

This seems really random to me. First, why isn't this an unsigned
type? OID is unsigned and I don't see a reason to change to a signed
type. But even if we were going to change to a signed type, why
pg_int64? That is declared like this:

/* Define a signed 64-bit integer type for use in client API declarations. */
typedef PG_INT64_TYPE pg_int64;

Surely this is not a client API declaration

Note that if we change this a lot of references to INT64_FORMAT will
need to become UINT64_FORMAT.

I think we should use int64 at the SQL level, because we don't have an
unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits.
So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or
similar. But internally I think using unsigned is cleaner.

+ * RelFileNumber is unique within a cluster.

Not really, because of CREATE DATABASE. Probably just drop this line.
Or else expand it: we never assign the same RelFileNumber twice within
the lifetime of the same cluster, but there can be multiple relations
with the same RelFileNumber e.g. because CREATE DATABASE duplicates
the RelFileNumber values from the template database. But maybe we
don't need this here, as it's already explained in relfilelocator.h.

+ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS);

Why not declare ret as ForkNumber instead of casting twice?

+uint64  relnum;
+
+Assert(relnumber <= MAX_RELFILENUMBER);
+Assert(forknum <= MAX_FORKNUM);
+
+relnum = relnumber;

Perhaps it'd be better to write uint64 relnum = relnumber instead of
initializing on a separate line.
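
(As a side note, both comments above concern the packing of an 8-bit fork
number together with a 56-bit relfilenumber.  A tiny standalone sketch of that
idea -- not the patch's BufferTag code, which splits these bits across the
relForkDetails[] fields, and with an illustrative shift constant -- might look
like this:)

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

#define RELNUM_BITS		56
#define MAX_RELNUM		((UINT64_C(1) << RELNUM_BITS) - 1)

/* Pack fork (8 bits) and relfilenumber (56 bits) into a single uint64. */
static uint64_t
pack_fork_and_relnum(unsigned forknum, uint64_t relnumber)
{
	assert(forknum <= 0xFF);
	assert(relnumber <= MAX_RELNUM);
	return ((uint64_t) forknum << RELNUM_BITS) | relnumber;
}

int
main(void)
{
	uint64_t	packed = pack_fork_and_relnum(3, UINT64_C(123456789012345));

	printf("fork %u, relnumber %" PRIu64 "\n",
		   (unsigned) (packed >> RELNUM_BITS),
		   packed & MAX_RELNUM);
	return 0;
}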

+#define RELNUMBERCHARS  20  /* max chars printed by %llu */

Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if
there's some reason to stick with a signed type).

+elog(ERROR, "relfilenumber is out of bound");

It would have to be "out of bounds", with an "s". But maybe "is too
large" would be better.

+nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber;

Maybe it would be a good idea to assert that next <= flushed and
flushed <= logged?

+#ifdef USE_ASSERT_CHECKING
+
+{
+RelFileLocatorBackend rlocator;
+char   *rpath;

Let's add a comment here, like "Because the RelFileNumber counter only
ever increases and never wraps around, it should be impossible for the
newly-allocated RelFileNumber to already be in use. But, if Asserts
are enabled, double check that there's no main-fork relation file with
the new RelFileNumber already on disk."

+elog(ERROR, "cannot forward RelFileNumber during recovery");

forward -> set (or advance)

+if (relnumber >= ShmemVariableCache->loggedRelFileNumber)

It probably doesn't make any difference, but to me it seems better to
test flushedRelFileNumber rather than logRelFileNumber here. What do
you think?

 /*
  * We set up the lockRelId in case anything tries to lock the dummy
- * relation.  Note that this is fairly bogus since relNumber may be
- * different from the relation's OID.  It shouldn't really matter though.
- * In recovery, we are running by ourselves and can't have any lock
- * conflicts.  While syncing, we already hold AccessExclusiveLock.
+ * relation.  Note we are setting relId to just FirstNormalObjectId which
+ * is completely bogus.  It shouldn't really matter though. In recovery,
+ * we are running by ourselves and can't have any lock conflicts.  While
+ * syncing, we already hold AccessExclusiveLock.
  */
 rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
-rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
+rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;

Boy, this makes me uncomfortable. The existing logic is pretty bogus,
and we're replacing it with some other bogus thing. Do we know whether
anything actually does try to use this for locking?

One notable difference between the existing logic and your change is
that, with the existing logic, we use a bogus value that will differ
from one relation to the next, whereas with this change, it will
always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId =
(Oid) rlocator.relNumber would be a more natural adaptation?

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\
+do {\
+if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+ereport(ERROR,  \
+errcode(ERRCODE_INVALID_PARAMETER_VALUE),   \
+errmsg("relfilenumber %lld is out of range",\
+(long long) (relfilenumber))); \
+} while (0)

Here, you take the approach of casting the relfilenumber to 

Re: making relfilenodes 56 bits

2022-09-20 Thread Amul Sul
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar  wrote:
>
> On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar  wrote:
>
> > On a separate note, while reviewing the latest patch I see there is some 
> > risk of using the unflushed relfilenumber in GetNewRelFileNumber() 
> > function.  Basically, in the current code, the flushing logic is tightly 
> > coupled with the logging new relfilenumber logic and that might not work 
> > with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea 
> > is we need to keep the flushing logic separate from the logging, I am 
> > working on the idea and I will post the patch soon.
>
> I have fixed the issue, so now we will track nextRelFileNumber,
> loggedRelFileNumber and flushedRelFileNumber.  So whenever
> nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
> loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more
> relfilenumbers.  And whenever nextRelFileNumber reaches the
> flushedRelFileNumber then we will do XlogFlush for WAL upto the last
> loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
> VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can
> avoid tracking the flushedRelFileNumber.  But I feel keeping track of
> the flushedRelFileNumber looks cleaner and easier to understand.  For
> more details refer to the code in GetNewRelFileNumber().
>

Here are a few minor suggestions I came across while reading this
patch; they might be useful:

+#ifdef USE_ASSERT_CHECKING
+
+   {

Unnecessary space after USE_ASSERT_CHECKING.
--

+   return InvalidRelFileNumber;/* placate compiler */

I don't think this is needed after the error on the latest branches.
--

+   LWLockAcquire(RelFileNumberGenLock, LW_SHARED);
+   if (shutdown)
+   checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber;
+   else
+   checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber;
+
+   LWLockRelease(RelFileNumberGenLock);

This is done for a good reason, I think, but it should have a comment
describing why checkPoint.nextRelFileNumber needs a different assignment
here, from a crash-recovery perspective.
--

+#define SizeOfRelFileLocatorBackend \
+   (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))

Can we append empty parentheses "()" to the macro name, so it looks like a
function call at the call site, or change the macro name to uppercase?
--

 +   if (val < 0 || val > MAX_RELFILENUMBER)
..
 if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \

How about adding a macro for this condition as RelFileNumberIsValid()?
We can replace all the checks referring to MAX_RELFILENUMBER with this.

Regards,
Amul




Re: making relfilenodes 56 bits

2022-09-09 Thread Dilip Kumar
On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar  wrote:

> On a separate note, while reviewing the latest patch I see there is some risk 
> of using the unflushed relfilenumber in GetNewRelFileNumber() function.  
> Basically, in the current code, the flushing logic is tightly coupled with 
> the logging new relfilenumber logic and that might not work with all the 
> values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea is we need to 
> keep the flushing logic separate from the logging, I am working on the idea 
> and I will post the patch soon.

I have fixed the issue, so now we will track nextRelFileNumber,
loggedRelFileNumber and flushedRelFileNumber.  Whenever
nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the
loggedRelFileNumber, we will log VAR_RELNUMBER_PER_XLOG more
relfilenumbers.  And whenever nextRelFileNumber reaches the
flushedRelFileNumber, we will do an XLogFlush of the WAL up to the last
loggedRelFileNumber.  Ideally flushedRelFileNumber should always be
VAR_RELNUMBER_PER_XLOG numbers behind the loggedRelFileNumber, so we
could avoid tracking flushedRelFileNumber, but I feel keeping track of
flushedRelFileNumber looks cleaner and easier to understand.  For
more details refer to the code in GetNewRelFileNumber().
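
(To make the above concrete, here is a toy standalone model of the three
counters -- not the patch's GetNewRelFileNumber(), with WAL logging and
flushing stubbed out and the batch sizes made up -- showing the ordering:
reserve a batch in WAL ahead of time, and flush before handing out any number
past the flushed point:)

#include <stdint.h>
#include <stdio.h>

#define RELNUMBERS_PER_XLOG	512	/* stand-in for VAR_RELNUMBER_PER_XLOG */
#define NEW_XLOG_THRESHOLD	64	/* stand-in for VAR_RELNUMBER_NEW_XLOG_THRESHOLD */

/* Pretend the first batch was already logged and flushed at startup. */
static uint64_t next_relfilenumber = 1;
static uint64_t logged_relfilenumber = RELNUMBERS_PER_XLOG;
static uint64_t flushed_relfilenumber = RELNUMBERS_PER_XLOG;

static void
log_next_relfilenumber(uint64_t upto)
{
	logged_relfilenumber = upto;	/* would emit a WAL record here */
}

static void
flush_wal(void)
{
	flushed_relfilenumber = logged_relfilenumber;	/* would XLogFlush here */
}

static uint64_t
get_new_relfilenumber(void)
{
	/* Running low on pre-logged numbers?  Reserve another batch in WAL. */
	if (logged_relfilenumber - next_relfilenumber <= NEW_XLOG_THRESHOLD)
		log_next_relfilenumber(logged_relfilenumber + RELNUMBERS_PER_XLOG);

	/* Never hand out a number whose reservation might not be on disk yet. */
	if (next_relfilenumber >= flushed_relfilenumber)
		flush_wal();

	return next_relfilenumber++;
}

int
main(void)
{
	for (int i = 0; i < 600; i++)
		get_new_relfilenumber();
	printf("next=%llu logged=%llu flushed=%llu\n",
		   (unsigned long long) next_relfilenumber,
		   (unsigned long long) logged_relfilenumber,
		   (unsigned long long) flushed_relfilenumber);
	return 0;
}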

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From d04432467d34ebdc95934ab85409b1fad037c6cd Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 26 Aug 2022 10:20:18 +0530
Subject: [PATCH v17] Widen relfilenumber from 32 bits to 56 bits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently relfilenumber is 32 bits wide and that has a risk of wraparound so
the relfilenumber can be reused.  And to guard against the relfilenumber reuse
there is some complicated hack which leaves a 0-length tombstone file around
until the next checkpoint.  And when we allocate a new relfilenumber
we also need to loop to check the on disk conflict.

As part of this patch we are making the relfilenumber 56 bits wide and there will be
no provision for wraparound.  So after this change we will be able to get rid of the
0-length tombstone file and the loop for checking the on-disk conflict of the
relfilenumbers.

The reason behind making it 56 bits wide instead of directly making 64 bits wide is
that if we make it 64 bits wide then the size of the BufferTag will be increased which
will increase the memory usage and that may also impact the performance.  So in order
to avoid that, inside the buffer tag, we will use 8 bits for the fork number and 56 bits
for the relfilenumber.
---
 contrib/pg_buffercache/Makefile|   4 +-
 .../pg_buffercache/pg_buffercache--1.3--1.4.sql|  30 
 contrib/pg_buffercache/pg_buffercache.control  |   2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c  |  39 +++-
 contrib/pg_prewarm/autoprewarm.c   |   4 +-
 contrib/pg_walinspect/expected/pg_walinspect.out   |   4 +-
 contrib/pg_walinspect/sql/pg_walinspect.sql|   4 +-
 contrib/test_decoding/expected/rewrite.out |   2 +-
 contrib/test_decoding/sql/rewrite.sql  |   2 +-
 doc/src/sgml/catalogs.sgml |   2 +-
 doc/src/sgml/pgbuffercache.sgml|   2 +-
 doc/src/sgml/storage.sgml  |   5 +-
 src/backend/access/gin/ginxlog.c   |   2 +-
 src/backend/access/rmgrdesc/gistdesc.c |   2 +-
 src/backend/access/rmgrdesc/heapdesc.c |   2 +-
 src/backend/access/rmgrdesc/nbtdesc.c  |   2 +-
 src/backend/access/rmgrdesc/seqdesc.c  |   2 +-
 src/backend/access/rmgrdesc/xlogdesc.c |  21 ++-
 src/backend/access/transam/README  |   5 +-
 src/backend/access/transam/varsup.c| 199 -
 src/backend/access/transam/xlog.c  |  50 ++
 src/backend/access/transam/xlogprefetcher.c|  14 +-
 src/backend/access/transam/xlogrecovery.c  |   6 +-
 src/backend/access/transam/xlogutils.c |  12 +-
 src/backend/backup/basebackup.c|   2 +-
 src/backend/catalog/catalog.c  |  95 --
 src/backend/catalog/heap.c |  25 +--
 src/backend/catalog/index.c|  11 +-
 src/backend/catalog/storage.c  |   8 +
 src/backend/commands/tablecmds.c   |  12 +-
 src/backend/commands/tablespace.c  |   2 +-
 src/backend/nodes/gen_node_support.pl  |   4 +-
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/reorderbuffer.c|   2 +-
 src/backend/storage/file/reinit.c  |  28 +--
 src/backend/storage/freespace/fsmpage.c|   2 +-
 src/backend/storage/lmgr/lwlocknames.txt   |   1 +
 src/backend/storage/smgr/md.c  |   7 +
 

Re: making relfilenodes 56 bits

2022-09-08 Thread Dilip Kumar
On Tue, Sep 6, 2022 at 11:07 PM Robert Haas  wrote:

> On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar  wrote:
> > I have explored this area more and also tried to come up with a
> > working prototype, so while working on this I realized that we would
> > have almost to execute all the code which is getting generated as part
> > of the dumpDatabase() and dumpACL() which is basically,
> >
> > 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> > 2. DROP DATABASE
> > 3. CREATE DATABASE with all the database properties like ENCODING,
> > LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> > COLLATION_VERSION, TABLESPACE
> > 4. COMMENT ON DATABASE
> > 5. Logic inside dumpACL()
> >
> > I feel duplicating logic like this is really error-prone, but I do not
> > find any clear way to reuse the code as dumpDatabase() has a high
> > dependency on the Archive handle and generating the dump file.
>
> Yeah, I don't think this is the way to go at all. The duplicated logic
> is likely to get broken, and is also likely to annoy the next person
> who has to maintain it.
>

Right


> I suggest that for now we fall back on making the initial
> RelFileNumber for a system table equal to pg_class.oid. I don't really
> love that system and I think maybe we should change it at some point
> in the future, but all the alternatives seem too complicated to cram
> them into the current patch.
>

That makes sense.

On a separate note, while reviewing the latest patch I see there is some
risk of using an unflushed relfilenumber in the GetNewRelFileNumber()
function.  Basically, in the current code, the flushing logic is tightly
coupled with the logic for logging new relfilenumbers, and that might not
work with all the values of VAR_RELNUMBER_NEW_XLOG_THRESHOLD.  So the idea
is that we need to keep the flushing logic separate from the logging; I am
working on the idea and I will post the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: making relfilenodes 56 bits

2022-09-06 Thread Robert Haas
On Tue, Sep 6, 2022 at 4:40 AM Dilip Kumar  wrote:
> I have explored this area more and also tried to come up with a
> working prototype, so while working on this I realized that we would
> have almost to execute all the code which is getting generated as part
> of the dumpDatabase() and dumpACL() which is basically,
>
> 1. UPDATE pg_catalog.pg_database SET datistemplate = false
> 2. DROP DATABASE
> 3. CREATE DATABASE with all the database properties like ENCODING,
> LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
> COLLATION_VERSION, TABLESPACE
> 4. COMMENT ON DATABASE
> 5. Logic inside dumpACL()
>
> I feel duplicating logic like this is really error-prone, but I do not
> find any clear way to reuse the code as dumpDatabase() has a high
> dependency on the Archive handle and generating the dump file.

Yeah, I don't think this is the way to go at all. The duplicated logic
is likely to get broken, and is also likely to annoy the next person
who has to maintain it.

I suggest that for now we fall back on making the initial
RelFileNumber for a system table equal to pg_class.oid. I don't really
love that system and I think maybe we should change it at some point
in the future, but all the alternatives seem too complicated to cram
them into the current patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-06 Thread Dilip Kumar
On Sun, Sep 4, 2022 at 9:27 AM Dilip Kumar  wrote:
>
> On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila  wrote:

> > Isn't this happening because we are passing "--clean
> > --create"/"--create" options to pg_restore in create_new_objects()? If
> > so, then I think one idea to decouple would be to not use those
> > options. Perform drop/create separately via commands (for create, we
> > need to generate the command as we are generating while generating the
> > dump in custom format), then rewrite the conflicting tables, and
> > finally restore the dump.
>
> Hmm, you are right.  So I think something like this is possible to do,
> I will explore this more. Thanks for the idea.

I have explored this area more and also tried to come up with a
working prototype, and while working on this I realized that we would
have to execute almost all the code that is generated as part
of dumpDatabase() and dumpACL(), which is basically:

1. UPDATE pg_catalog.pg_database SET datistemplate = false
2. DROP DATABASE
3. CREATE DATABASE with all the database properties like ENCODING,
LOCALE_PROVIDER, LOCALE, LC_COLLATE, LC_CTYPE, ICU_LOCALE,
COLLATION_VERSION, TABLESPACE
4. COMMENT ON DATABASE
5. Logic inside dumpACL()

I feel duplicating logic like this is really error-prone, but I do not
see any clear way to reuse the code, as dumpDatabase() has a strong
dependency on the Archive handle and on generating the dump file.

So currently I have implemented most of this logic except for a few
pieces, e.g. dumpACL(), comments on the database, etc.  So before we go
too far in this direction I wanted to get the opinions of others.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-03 Thread Dilip Kumar
On Sat, Sep 3, 2022 at 5:11 PM Amit Kapila  wrote:

> > I have found one more issue with this approach of rewriting the
> > conflicting table.  Earlier I thought we could do the conflict
> > checking and rewriting inside create_new_objects() right before the
> > restore command.  But after implementing (while testing) this I
> > realized that we DROP and CREATE the database while restoring the dump
> > that means it will again generate the conflicting system tables.  So
> > theoretically the rewriting should go in between the CREATE DATABASE
> > and restoring the object but as of now both create database and
> > restoring other objects are part of a single dump file.  I haven't yet
> > analyzed how feasible it is to generate the dump in two parts, first
> > part just to create the database and in second part restore the rest
> > of the object.
> >
>
> Isn't this happening because we are passing "--clean
> --create"/"--create" options to pg_restore in create_new_objects()? If
> so, then I think one idea to decouple would be to not use those
> options. Perform drop/create separately via commands (for create, we
> need to generate the command as we are generating while generating the
> dump in custom format), then rewrite the conflicting tables, and
> finally restore the dump.

Hmm, you are right.  So I think something like this is possible to do;
I will explore this more.  Thanks for the idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-09-03 Thread Amit Kapila
On Tue, Aug 30, 2022 at 6:15 PM Dilip Kumar  wrote:
>
> On Fri, Aug 26, 2022 at 9:33 PM Robert Haas  wrote:
> >
> > On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar  wrote:
> > > While working on this solution I noticed one issue. Basically, the
> > > problem is that during binary upgrade when we try to rewrite a heap we
> > > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > > creating a new heap. But we are not preserving anything so we don't
> > > have those values. One option to this problem is that we can first
> > > start the postmaster in non-binary upgrade mode perform all conflict
> > > checking and rewrite and stop the postmaster.  Then start postmaster
> > > again and perform the restore as we are doing now.  Although we will
> > > have to start/stop the postmaster one extra time we have a solution.
> >
> > Yeah, that seems OK. Or we could add a new function, like
> > binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> > Not sure which way is better.
>
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>

Isn't this happening because we are passing "--clean
--create"/"--create" options to pg_restore in create_new_objects()? If
so, then I think one idea to decouple would be to not use those
options. Perform drop/create separately via commands (for create, we
need to generate the command as we are generating while generating the
dump in custom format), then rewrite the conflicting tables, and
finally restore the dump.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-09-03 Thread Dilip Kumar
On Sat, Sep 3, 2022 at 1:50 PM Dilip Kumar  wrote:
>
> On Tue, Aug 30, 2022 at 9:23 PM Robert Haas  wrote:
>
> > Well, that's very awkward. It doesn't seem like it would be very
> > difficult to teach pg_upgrade to call pg_restore without --clean and
> > just do the drop database itself, but that doesn't really help,
> > because pg_restore will in any event be creating the new database.
> > That doesn't seem like something we can practically refactor out,
> > because only pg_dump knows what properties to use when creating the
> > new database. What we could do is have the dump include a command like
> > SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> > but that doesn't really help very much, because passing the whole list
> > of relfilenode values from the old database seems pretty certain to be
> > a bad idea. The whole idea here was that we'd be able to build a hash
> > table on the new database's system table OIDs, and it seems like
> > that's not going to work.
>
> Right.
>
> > We could try to salvage some portion of the idea by making
> > pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> > set of arguments, like the smallest and largest relfilenode values
> > from the old database, and then we'd just need to move things that
> > overlap. But that feels pretty hit-or-miss to me as to whether it
> > actually avoids any work, and
> > pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> > to write. So perhaps we have to go back to the drawing board here.
>
> So as of now, we have two open options 1) the current approach and
> what patch is following to use Oid as relfilenode for the system
> tables when initially created.  2) call
> pg_binary_upgrade_move_things_out_of_the_way() which force rewrite all
> the system tables.
>
> Another idea that I am not very sure how feasible is. Can we change
> the dump such that in binary upgrade mode it will not use template0 as
> a template database (in creating database command) but instead some
> new database as a template e.g. template-XYZ?   And later for conflict
> checking, we will create this template-XYZ database on the new cluster
> and then we will perform all the conflict check (from all the
> databases of the old cluster) and rewrite operations on this database.
> And later all the databases will be created using template-XYZ as the
> template and all the rewriting stuff we have done is still intact.
> The problems I could think of are 1) only for a binary upgrade we will
> have to change the pg_dump.  2) we will have to use another database
> name as the reserved database name but what if that name is already in
> use in the previous cluster?

While we are still thinking about this issue, I have rebased the patch on
the latest HEAD and fixed a couple of minor issues.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From a645d39c6109269bb74ed8f0627e61ecf9b0535a Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 26 Aug 2022 10:20:18 +0530
Subject: [PATCH v16] Widen relfilenumber from 32 bits to 56 bits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently relfilenumber is 32 bits wide and that has a risk of wraparound so
the relfilenumber can be reused.  And to guard against the relfilenumber reuse
there is some complicated hack which leaves a 0-length tombstone file around
until the next checkpoint.  And when we allocate a new relfilenumber
we also need to loop to check the on disk conflict.

As part of this patch we are making the relfilenumber 56 bits wide and there will be
no provision for wraparound.  So after this change we will be able to get rid of the
0-length tombstone file and the loop for checking the on-disk conflict of the
relfilenumbers.

The reason behind making it 56 bits wide instead of directly making 64 bits wide is
that if we make it 64 bits wide then the size of the BufferTag will be increased which
will increase the memory usage and that may also impact the performance.  So in order
to avoid that, inside the buffer tag, we will use 8 bits for the fork number and 56 bits
for the relfilenumber.
---
 contrib/pg_buffercache/Makefile|   4 +-
 .../pg_buffercache/pg_buffercache--1.3--1.4.sql|  30 
 contrib/pg_buffercache/pg_buffercache.control  |   2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c  |  39 -
 contrib/pg_prewarm/autoprewarm.c   |   4 +-
 contrib/pg_walinspect/expected/pg_walinspect.out   |   4 +-
 contrib/pg_walinspect/sql/pg_walinspect.sql|   4 +-
 contrib/test_decoding/expected/rewrite.out |   2 +-
 contrib/test_decoding/sql/rewrite.sql  |   2 +-
 doc/src/sgml/catalogs.sgml |   2 +-
 doc/src/sgml/pgbuffercache.sgml|   2 +-
 doc/src/sgml/storage.sgml  |   5 +-
 src/backend/access/gin/ginxlog.c   |   2 +-
 

Re: making relfilenodes 56 bits

2022-09-03 Thread Dilip Kumar
On Tue, Aug 30, 2022 at 9:23 PM Robert Haas  wrote:

> Well, that's very awkward. It doesn't seem like it would be very
> difficult to teach pg_upgrade to call pg_restore without --clean and
> just do the drop database itself, but that doesn't really help,
> because pg_restore will in any event be creating the new database.
> That doesn't seem like something we can practically refactor out,
> because only pg_dump knows what properties to use when creating the
> new database. What we could do is have the dump include a command like
> SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
> but that doesn't really help very much, because passing the whole list
> of relfilenode values from the old database seems pretty certain to be
> a bad idea. The whole idea here was that we'd be able to build a hash
> table on the new database's system table OIDs, and it seems like
> that's not going to work.

Right.

> We could try to salvage some portion of the idea by making
> pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
> set of arguments, like the smallest and largest relfilenode values
> from the old database, and then we'd just need to move things that
> overlap. But that feels pretty hit-or-miss to me as to whether it
> actually avoids any work, and
> pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
> to write. So perhaps we have to go back to the drawing board here.

So as of now, we have two open options: 1) the current approach, which the
patch follows, of using the Oid as the relfilenode for system tables when
they are initially created; 2) calling
pg_binary_upgrade_move_things_out_of_the_way(), which forces a rewrite of
all the system tables.

Another idea, which I am not very sure is feasible: can we change
the dump such that in binary upgrade mode it will not use template0 as
the template database (in the CREATE DATABASE command) but instead some
new database as a template, e.g. template-XYZ?  Later, for conflict
checking, we would create this template-XYZ database on the new cluster
and then perform all the conflict checks (against all the
databases of the old cluster) and rewrite operations on this database.
All the databases would then be created using template-XYZ as the
template, and all the rewriting we have done would still be intact.
The problems I can think of are: 1) only for a binary upgrade we will
have to change pg_dump; 2) we will have to use another database
name as a reserved database name, but what if that name is already in
use in the old cluster?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-30 Thread Robert Haas
On Tue, Aug 30, 2022 at 8:45 AM Dilip Kumar  wrote:
> I have found one more issue with this approach of rewriting the
> conflicting table.  Earlier I thought we could do the conflict
> checking and rewriting inside create_new_objects() right before the
> restore command.  But after implementing (while testing) this I
> realized that we DROP and CREATE the database while restoring the dump
> that means it will again generate the conflicting system tables.  So
> theoretically the rewriting should go in between the CREATE DATABASE
> and restoring the object but as of now both create database and
> restoring other objects are part of a single dump file.  I haven't yet
> analyzed how feasible it is to generate the dump in two parts, first
> part just to create the database and in second part restore the rest
> of the object.
>
> Thoughts?

Well, that's very awkward. It doesn't seem like it would be very
difficult to teach pg_upgrade to call pg_restore without --clean and
just do the drop database itself, but that doesn't really help,
because pg_restore will in any event be creating the new database.
That doesn't seem like something we can practically refactor out,
because only pg_dump knows what properties to use when creating the
new database. What we could do is have the dump include a command like
SELECT pg_binary_upgrade_move_things_out_of_the_way(some_arguments_here),
but that doesn't really help very much, because passing the whole list
of relfilenode values from the old database seems pretty certain to be
a bad idea. The whole idea here was that we'd be able to build a hash
table on the new database's system table OIDs, and it seems like
that's not going to work.

We could try to salvage some portion of the idea by making
pg_binary_upgrade_move_things_out_of_the_way() take a more restricted
set of arguments, like the smallest and largest relfilenode values
from the old database, and then we'd just need to move things that
overlap. But that feels pretty hit-or-miss to me as to whether it
actually avoids any work, and
pg_binary_upgrade_move_things_out_of_the_way() might also be annoying
to write. So perhaps we have to go back to the drawing board here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-30 Thread Dilip Kumar
On Fri, Aug 26, 2022 at 9:33 PM Robert Haas  wrote:
>
> On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar  wrote:
> > While working on this solution I noticed one issue. Basically, the
> > problem is that during binary upgrade when we try to rewrite a heap we
> > would expect that “binary_upgrade_next_heap_pg_class_oid” and
> > “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> > creating a new heap. But we are not preserving anything so we don't
> > have those values. One option to this problem is that we can first
> > start the postmaster in non-binary upgrade mode perform all conflict
> > checking and rewrite and stop the postmaster.  Then start postmaster
> > again and perform the restore as we are doing now.  Although we will
> > have to start/stop the postmaster one extra time we have a solution.
>
> Yeah, that seems OK. Or we could add a new function, like
> binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
> Not sure which way is better.

I have found one more issue with this approach of rewriting the
conflicting table.  Earlier I thought we could do the conflict
checking and rewriting inside create_new_objects() right before the
restore command.  But after implementing (and testing) this I realized
that we DROP and CREATE the database while restoring the dump, which
means it will again generate the conflicting system tables.  So
theoretically the rewriting should go in between the CREATE DATABASE
and the restore of the objects, but as of now both the create database
and the restore of the other objects are part of a single dump file.
I haven't yet analyzed how feasible it is to generate the dump in two
parts, the first part just to create the database and the second part
to restore the rest of the objects.

Thoughts?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-26 Thread Robert Haas
On Fri, Aug 26, 2022 at 7:01 AM Dilip Kumar  wrote:
> While working on this solution I noticed one issue. Basically, the
> problem is that during binary upgrade when we try to rewrite a heap we
> would expect that “binary_upgrade_next_heap_pg_class_oid” and
> “binary_upgrade_next_heap_pg_class_relfilenumber” are already set for
> creating a new heap. But we are not preserving anything so we don't
> have those values. One option to this problem is that we can first
> start the postmaster in non-binary upgrade mode perform all conflict
> checking and rewrite and stop the postmaster.  Then start postmaster
> again and perform the restore as we are doing now.  Although we will
> have to start/stop the postmaster one extra time we have a solution.

Yeah, that seems OK. Or we could add a new function, like
binary_upgrade_allow_relation_oid_and_relfilenode_assignment(bool).
Not sure which way is better.

> But while thinking about this I started to think that since now we are
> completely decoupling the concept of Oid and relfilenumber then
> logically during REWRITE we should only be allocating new
> relfilenumber but we don’t really need to allocate the Oid at all.
> Yeah, we can do that if inside make_new_heap() if we pass the
> OIDOldHeap to heap_create_with_catalog(), then it will just create new
> storage(relfilenumber) but not a new Oid. But the problem is that the
> ATRewriteTable() and finish_heap_swap() functions are completely based
> on the relation cache.  So now if we only create a new relfilenumber
> but not a new Oid then we will have to change this infrastructure to
> copy at smgr level.

I think it would be a good idea to continue preserving the OIDs. If
nothing else, it makes debugging way easier, but also, there might be
user-defined regclass columns or something. Note the comments in
check_for_reg_data_type_usage().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-26 Thread Dilip Kumar
On Thu, Aug 25, 2022 at 5:26 PM Dilip Kumar  wrote:

> I agree on this that this system is easy to explain that we just
> rewrite anything that conflicts so looks more future-proof.  Okay, I
> will try this solution and post the patch.

While working on this solution I noticed one issue.  Basically, the
problem is that during binary upgrade, when we try to rewrite a heap,
we would expect “binary_upgrade_next_heap_pg_class_oid” and
“binary_upgrade_next_heap_pg_class_relfilenumber” to already be set
before creating a new heap.  But we are not preserving anything, so we
don't have those values.  One option is to first start the postmaster
in non-binary-upgrade mode, perform all the conflict checking and
rewriting, and stop the postmaster.  Then start the postmaster again
and perform the restore as we do now.  Although we would have to
start/stop the postmaster one extra time, we have a solution.

But while thinking about this, it occurred to me that since we are now
completely decoupling the concepts of Oid and relfilenumber, logically
during a REWRITE we should only be allocating a new relfilenumber and
don’t really need to allocate a new Oid at all.  We can do that: if
inside make_new_heap() we pass the OIDOldHeap to
heap_create_with_catalog(), then it will just create new storage (a
new relfilenumber) but not a new Oid.  But the problem is that the
ATRewriteTable() and finish_heap_swap() functions are completely based
on the relation cache.  So if we only create a new relfilenumber but
not a new Oid then we will have to change this infrastructure to copy
at the smgr level.

Thoughts?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-25 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 8:00 PM Robert Haas  wrote:
>
> On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar  wrote:
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.
>
> True. That's the downside. The question is whether it's worth adding
> some complexity to avoid needing separate ranges.

Other than the complexity, we will have to check every user table's
relfilenumber from the old cluster against the hash built on the new
cluster's relfilenumbers; isn't that extra overhead if there are a lot
of user tables?  But we are already restoring all those tables in the
new cluster, so compared to that the cost will be very small.

> Honestly, if we don't care about having separate ranges, we can do
> something even simpler and just make the starting relfilenumber for
> system tables same as the OID. Then we don't have to do anything at
> all, outside of not changing the OID assigned to pg_largeobject in a
> future release. Then as long as pg_upgrade is targeting a new cluster
> with completely fresh databases that have not had any system table
> rewrites so far, there can't be any conflict.

I think the OID-based system and the two-range system are not exactly
the same.  If we use OID-based relfilenumber allocation for system
tables (initially) and later allocate from the nextRelFileNumber
counter, that seems like a mix of the old system (where OID and
relfilenumber really are tightly connected) and the new system, where
nextRelFileNumber is a completely independent counter.  OTOH, having
two ranges means we logically don't depend on the OID at all: we just
allocate from a central counter, but after catalog initialization we
leave a gap and start from a new range.  So I don't think this system
is hard to explain.

> And perhaps that is the best solution after all, but while it is
> simple in terms of code, I feel it's a bit complicated for human
> beings. It's very simple to understand the scheme that Amit proposed:
> if there's anything in the new cluster that would conflict, we move it
> out of the way. We don't have to assume the new cluster hasn't had any
> table rewrites. We don't have to nail down starting relfilenumber
> assignments for system tables. We don't have to worry about
> relfilenumber or OID assignments changing between releases.
> pg_largeobject is not a special case. There are no special ranges of
> OIDs or relfilenumbers required. It just straight up works -- all the
> time, no matter what, end of story.

I agree that this system is easy to explain: we just rewrite anything
that conflicts, so it looks more future-proof.  Okay, I will try this
solution and post the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-24 Thread Robert Haas
On Mon, Aug 1, 2022 at 7:57 AM Dilip Kumar  wrote:
> I have fixed other comments, and also fixed comments from Alvaro to
> use %lld instead of INT64_FORMAT inside the ereport and wherever he
> suggested.

Notwithstanding the ongoing discussion about the exact approach for
the main patch, it seemed OK to push the preparatory patch you posted
here, so I have now done that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-24 Thread Amit Kapila
On Tue, Aug 23, 2022 at 3:28 PM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 3:16 PM Amit Kapila  wrote:
>
> > One more thing we may want to think about is what if there are tables
> > created by extension? For example, I think BDR creates some tables
> > like node_group, conflict_history, etc. Now, I think if such an
> > extension is created by default, both old and new clusters will have
> > those tables. Isn't there a chance of relfilenumber conflict in such
> > cases?
>
> Shouldn't they behave as a normal user table? because before upgrade
> anyway new cluster can not have any table other than system tables and
> those tables created by an extension should also be restored as other
> user table does.
>

Right.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-24 Thread Amit Kapila
On Tue, Aug 23, 2022 at 8:00 PM Robert Haas  wrote:
>
> On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar  wrote:
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.
>
> True. That's the downside. The question is whether it's worth adding
> some complexity to avoid needing separate ranges.
>
> Honestly, if we don't care about having separate ranges, we can do
> something even simpler and just make the starting relfilenumber for
> system tables same as the OID. Then we don't have to do anything at
> all, outside of not changing the OID assigned to pg_largeobject in a
> future release. Then as long as pg_upgrade is targeting a new cluster
> with completely fresh databases that have not had any system table
> rewrites so far, there can't be any conflict.
>
> And perhaps that is the best solution after all, but while it is
> simple in terms of code, I feel it's a bit complicated for human
> beings. It's very simple to understand the scheme that Amit proposed:
> if there's anything in the new cluster that would conflict, we move it
> out of the way. We don't have to assume the new cluster hasn't had any
> table rewrites. We don't have to nail down starting relfilenumber
> assignments for system tables. We don't have to worry about
> relfilenumber or OID assignments changing between releases.
> pg_largeobject is not a special case. There are no special ranges of
> OIDs or relfilenumbers required. It just straight up works -- all the
> time, no matter what, end of story.
>

This sounds simple to understand.  It seems we always create the new
cluster's system tables freshly before the upgrade, so I think it is
safe to assume there won't have been any table rewrite in it.  OTOH,
if the relfilenumber allocation scheme is robust enough to deal with
table rewrites, then we probably don't need to worry about this
assumption changing in the future.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-23 Thread Robert Haas
On Tue, Aug 23, 2022 at 2:06 AM Dilip Kumar  wrote:
> OTOH, if we keep the two separate ranges for the user and system table
> then we don't need all this complex logic of conflict checking.

True. That's the downside. The question is whether it's worth adding
some complexity to avoid needing separate ranges.

Honestly, if we don't care about having separate ranges, we can do
something even simpler and just make the starting relfilenumber for
system tables same as the OID. Then we don't have to do anything at
all, outside of not changing the OID assigned to pg_largeobject in a
future release. Then as long as pg_upgrade is targeting a new cluster
with completely fresh databases that have not had any system table
rewrites so far, there can't be any conflict.

And perhaps that is the best solution after all, but while it is
simple in terms of code, I feel it's a bit complicated for human
beings. It's very simple to understand the scheme that Amit proposed:
if there's anything in the new cluster that would conflict, we move it
out of the way. We don't have to assume the new cluster hasn't had any
table rewrites. We don't have to nail down starting relfilenumber
assignments for system tables. We don't have to worry about
relfilenumber or OID assignments changing between releases.
pg_largeobject is not a special case. There are no special ranges of
OIDs or relfilenumbers required. It just straight up works -- all the
time, no matter what, end of story.

The other schemes we're talking about here all require a bunch of
assumptions about stuff like what I just mentioned. We can certainly
do it that way, and maybe it's even for the best. But I feel like it's
a little bit fragile. Maybe some future change gets blocked because it
would break one of the assumptions that the system relies on, or maybe
someone doesn't even realize there's an issue and changes something
that introduces a bug into this system. Or on the other hand maybe
not. But I think there's at least some value in considering whether
adding a little more code might actually make things simpler to reason
about, and whether that might be a good enough reason to do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-23 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 3:16 PM Amit Kapila  wrote:
>
> > OTOH, if we keep the two separate ranges for the user and system table
> > then we don't need all this complex logic of conflict checking.  From
> > the old cluster, we can just remember the relfilenumbr of the
> > pg_largeobject, and in the new cluster before trying to restore we can
> > just query the new cluster pg_class and find out whether it is used by
> > any system table and if so then we can just rewrite that system table.
> >
>
> Before re-write of that system table, I think we need to set
> NextRelFileNumber to a number greater than the max relfilenumber from
> the old cluster, otherwise, it can later conflict with some user
> table.

Yes we will need to do that.

> > And I think using two ranges might not be that complicated because as
> > soon as we are done with the initdb we can just set NextRelFileNumber
> > to the first user range relfilenumber so I think this could be the
> > simplest solution.  And I think what Amit is suggesting is something
> > on this line?
> >
>
> Yeah, I had thought of checking only pg_largeobject. I think the
> advantage of having separate ranges is that we have a somewhat simpler
> logic in the upgrade but OTOH the other scheme has the advantage of
> having a single allocation scheme. Do we see any other pros/cons of
> one over the other?

I feel having a separate range is not much different from having a
single allocation scheme; after cluster initialization we just have to
set NextRelFileNumber to something called FirstNormalRelFileNumber,
which looks fine to me.
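
As a rough sketch of that idea (the cutover value and everything other
than the FirstNormalRelFileNumber name are assumptions made only for
illustration, not the actual patch):

#include <stdint.h>

typedef uint64_t RelFileNumber;

/* boundary between the catalog range and the user range; the value is
 * arbitrary here, chosen only for illustration */
#define FirstNormalRelFileNumber   ((RelFileNumber) 100000)

static RelFileNumber NextRelFileNumber = 1;   /* catalog range during initdb */

/* assumed to run once, at the end of cluster initialization */
static void
SwitchToUserRelFileNumberRange(void)
{
    if (NextRelFileNumber < FirstNormalRelFileNumber)
        NextRelFileNumber = FirstNormalRelFileNumber;
}

static RelFileNumber
AllocateRelFileNumber(void)
{
    return NextRelFileNumber++;
}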

> One more thing we may want to think about is what if there are tables
> created by extension? For example, I think BDR creates some tables
> like node_group, conflict_history, etc. Now, I think if such an
> extension is created by default, both old and new clusters will have
> those tables. Isn't there a chance of relfilenumber conflict in such
> cases?

Shouldn't they behave like normal user tables?  Before the upgrade the
new cluster cannot have any tables other than system tables anyway,
and tables created by an extension should be restored just like other
user tables are.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-23 Thread Amit Kapila
On Tue, Aug 23, 2022 at 11:36 AM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar  wrote:
> >
> > On Tue, Aug 23, 2022 at 1:46 AM Robert Haas  wrote:
> > >
> > > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  
> > > wrote:
> > > > To solve that problem, how about rewriting the system table in the new
> > > > cluster which has a conflicting relfilenode? I think we can probably
> > > > do this conflict checking before processing the tables from the old
> > > > cluster.
> > >
> > > Thanks for chiming in.
> > >
> > > Right now, there are two parts to the relfilenumber preservation
> > > system, and this scheme doesn't quite fit into either of them. First,
> > > the dump includes commands to set pg_class.relfilenode in the new
> > > cluster to the same value that it had in the old cluster. The dump
> > > can't include any SQL commands that depend on what's happening in the
> > > new cluster because pg_dump(all) only connects to a single cluster,
> > > which in this case is the old cluster. Second, pg_upgrade itself
> > > copies the files from the old cluster to the new cluster. This doesn't
> > > involve a database connection at all. So there's no part of the
> > > current relfilenode preservation mechanism that can look at the old
> > > AND the new database and decide on some SQL to execute against the new
> > > database.
> > >
> > > I thought for a while that we could use the information that's already
> > > gathered by get_rel_infos() to do what you're suggesting here, but it
> > > doesn't quite work, because that function excludes system tables, and
> > > we can't afford to do that here. We'd either need to modify that query
> > > to include system tables - at least for the new cluster - or run a
> > > separate one to gather information about system tables in the new
> > > cluster. Then, we could put all the pg_class.relfilenode values we
> > > found in the new cluster into a hash table, loop over the list of rels
> > > this function found in the old cluster, and for each one, probe into
> > > the hash table. If we find a match, that's a system table that needs
> > > to be moved out of the way before calling create_new_objects(), or
> > > maybe inside that function but before it runs pg_restore.
> > >
> > > That doesn't seem too crazy, I think. It's a little bit of new
> > > mechanism, but it doesn't sound horrific. It's got the advantage of
> > > being significantly cheaper than my proposal of moving everything out
> > > of the way unconditionally, and at the same time it retains one of the
> > > key advantages of that proposal - IMV, anyway - which is that we don't
> > > need separate relfilenode ranges for user and system objects any more.
> > > So I guess on balance I kind of like it, but maybe I'm missing
> > > something.
> >
> > Okay, so this seems exactly the same as your previous proposal but
> > instead of unconditionally rewriting all the system tables we will
> > rewrite only those conflict with the user table or pg_largeobject from
> > the previous cluster.  Although it might have additional
> > implementation complexity on the pg upgrade side, it seems cheaper
> > than rewriting everything.
>
> OTOH, if we keep the two separate ranges for the user and system table
> then we don't need all this complex logic of conflict checking.  From
> the old cluster, we can just remember the relfilenumbr of the
> pg_largeobject, and in the new cluster before trying to restore we can
> just query the new cluster pg_class and find out whether it is used by
> any system table and if so then we can just rewrite that system table.
>

Before the rewrite of that system table, I think we need to set
NextRelFileNumber to a number greater than the max relfilenumber from
the old cluster; otherwise, it can later conflict with some user
table.

> And I think using two ranges might not be that complicated because as
> soon as we are done with the initdb we can just set NextRelFileNumber
> to the first user range relfilenumber so I think this could be the
> simplest solution.  And I think what Amit is suggesting is something
> on this line?
>

Yeah, I had thought of checking only pg_largeobject. I think the
advantage of having separate ranges is that we have a somewhat simpler
logic in the upgrade but OTOH the other scheme has the advantage of
having a single allocation scheme. Do we see any other pros/cons of
one over the other?

One more thing we may want to think about is what if there are tables
created by extension? For example, I think BDR creates some tables
like node_group, conflict_history, etc. Now, I think if such an
extension is created by default, both old and new clusters will have
those tables. Isn't there a chance of relfilenumber conflict in such
cases?

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-23 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 8:33 AM Dilip Kumar  wrote:
>
> On Tue, Aug 23, 2022 at 1:46 AM Robert Haas  wrote:
> >
> > On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  wrote:
> > > To solve that problem, how about rewriting the system table in the new
> > > cluster which has a conflicting relfilenode? I think we can probably
> > > do this conflict checking before processing the tables from the old
> > > cluster.
> >
> > Thanks for chiming in.
> >
> > Right now, there are two parts to the relfilenumber preservation
> > system, and this scheme doesn't quite fit into either of them. First,
> > the dump includes commands to set pg_class.relfilenode in the new
> > cluster to the same value that it had in the old cluster. The dump
> > can't include any SQL commands that depend on what's happening in the
> > new cluster because pg_dump(all) only connects to a single cluster,
> > which in this case is the old cluster. Second, pg_upgrade itself
> > copies the files from the old cluster to the new cluster. This doesn't
> > involve a database connection at all. So there's no part of the
> > current relfilenode preservation mechanism that can look at the old
> > AND the new database and decide on some SQL to execute against the new
> > database.
> >
> > I thought for a while that we could use the information that's already
> > gathered by get_rel_infos() to do what you're suggesting here, but it
> > doesn't quite work, because that function excludes system tables, and
> > we can't afford to do that here. We'd either need to modify that query
> > to include system tables - at least for the new cluster - or run a
> > separate one to gather information about system tables in the new
> > cluster. Then, we could put all the pg_class.relfilenode values we
> > found in the new cluster into a hash table, loop over the list of rels
> > this function found in the old cluster, and for each one, probe into
> > the hash table. If we find a match, that's a system table that needs
> > to be moved out of the way before calling create_new_objects(), or
> > maybe inside that function but before it runs pg_restore.
> >
> > That doesn't seem too crazy, I think. It's a little bit of new
> > mechanism, but it doesn't sound horrific. It's got the advantage of
> > being significantly cheaper than my proposal of moving everything out
> > of the way unconditionally, and at the same time it retains one of the
> > key advantages of that proposal - IMV, anyway - which is that we don't
> > need separate relfilenode ranges for user and system objects any more.
> > So I guess on balance I kind of like it, but maybe I'm missing
> > something.
>
> Okay, so this seems exactly the same as your previous proposal but
> instead of unconditionally rewriting all the system tables we will
> rewrite only those conflict with the user table or pg_largeobject from
> the previous cluster.  Although it might have additional
> implementation complexity on the pg upgrade side, it seems cheaper
> than rewriting everything.

OTOH, if we keep two separate ranges for user and system tables then
we don't need all this complex logic of conflict checking.  From the
old cluster we can just remember the relfilenumber of pg_largeobject,
and in the new cluster, before trying to restore, we can query the new
cluster's pg_class to find out whether that value is used by any
system table; if so, we can just rewrite that system table.  And I
think using two ranges might not be that complicated, because as soon
as we are done with initdb we can set NextRelFileNumber to the first
relfilenumber of the user range, so this could be the simplest
solution.  And I think what Amit is suggesting is something along this
line?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-22 Thread Dilip Kumar
On Tue, Aug 23, 2022 at 1:46 AM Robert Haas  wrote:
>
> On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  wrote:
> > To solve that problem, how about rewriting the system table in the new
> > cluster which has a conflicting relfilenode? I think we can probably
> > do this conflict checking before processing the tables from the old
> > cluster.
>
> Thanks for chiming in.
>
> Right now, there are two parts to the relfilenumber preservation
> system, and this scheme doesn't quite fit into either of them. First,
> the dump includes commands to set pg_class.relfilenode in the new
> cluster to the same value that it had in the old cluster. The dump
> can't include any SQL commands that depend on what's happening in the
> new cluster because pg_dump(all) only connects to a single cluster,
> which in this case is the old cluster. Second, pg_upgrade itself
> copies the files from the old cluster to the new cluster. This doesn't
> involve a database connection at all. So there's no part of the
> current relfilenode preservation mechanism that can look at the old
> AND the new database and decide on some SQL to execute against the new
> database.
>
> I thought for a while that we could use the information that's already
> gathered by get_rel_infos() to do what you're suggesting here, but it
> doesn't quite work, because that function excludes system tables, and
> we can't afford to do that here. We'd either need to modify that query
> to include system tables - at least for the new cluster - or run a
> separate one to gather information about system tables in the new
> cluster. Then, we could put all the pg_class.relfilenode values we
> found in the new cluster into a hash table, loop over the list of rels
> this function found in the old cluster, and for each one, probe into
> the hash table. If we find a match, that's a system table that needs
> to be moved out of the way before calling create_new_objects(), or
> maybe inside that function but before it runs pg_restore.
>
> That doesn't seem too crazy, I think. It's a little bit of new
> mechanism, but it doesn't sound horrific. It's got the advantage of
> being significantly cheaper than my proposal of moving everything out
> of the way unconditionally, and at the same time it retains one of the
> key advantages of that proposal - IMV, anyway - which is that we don't
> need separate relfilenode ranges for user and system objects any more.
> So I guess on balance I kind of like it, but maybe I'm missing
> something.

Okay, so this seems exactly the same as your previous proposal, but
instead of unconditionally rewriting all the system tables we will
rewrite only those that conflict with a user table or pg_largeobject
from the previous cluster.  Although it might add some implementation
complexity on the pg_upgrade side, it seems cheaper than rewriting
everything.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-22 Thread Robert Haas
On Mon, Aug 22, 2022 at 3:55 AM Amit Kapila  wrote:
> To solve that problem, how about rewriting the system table in the new
> cluster which has a conflicting relfilenode? I think we can probably
> do this conflict checking before processing the tables from the old
> cluster.

Thanks for chiming in.

Right now, there are two parts to the relfilenumber preservation
system, and this scheme doesn't quite fit into either of them. First,
the dump includes commands to set pg_class.relfilenode in the new
cluster to the same value that it had in the old cluster. The dump
can't include any SQL commands that depend on what's happening in the
new cluster because pg_dump(all) only connects to a single cluster,
which in this case is the old cluster. Second, pg_upgrade itself
copies the files from the old cluster to the new cluster. This doesn't
involve a database connection at all. So there's no part of the
current relfilenode preservation mechanism that can look at the old
AND the new database and decide on some SQL to execute against the new
database.

I thought for a while that we could use the information that's already
gathered by get_rel_infos() to do what you're suggesting here, but it
doesn't quite work, because that function excludes system tables, and
we can't afford to do that here. We'd either need to modify that query
to include system tables - at least for the new cluster - or run a
separate one to gather information about system tables in the new
cluster. Then, we could put all the pg_class.relfilenode values we
found in the new cluster into a hash table, loop over the list of rels
this function found in the old cluster, and for each one, probe into
the hash table. If we find a match, that's a system table that needs
to be moved out of the way before calling create_new_objects(), or
maybe inside that function but before it runs pg_restore.
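
Something along these lines, as a standalone sketch; the toy
fixed-size hash table and all names here are illustrative stand-ins
for pg_upgrade's real data structures, not actual pg_upgrade code:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t RelFileNumber;

#define HASH_SLOTS 8192                      /* toy fixed-size table */
static RelFileNumber slots[HASH_SLOTS];      /* 0 means "empty" */

static void
hash_insert(RelFileNumber key)
{
    uint64_t i = key % HASH_SLOTS;

    while (slots[i] != 0)                    /* linear probing */
        i = (i + 1) % HASH_SLOTS;
    slots[i] = key;
}

static int
hash_lookup(RelFileNumber key)
{
    uint64_t i = key % HASH_SLOTS;

    while (slots[i] != 0)
    {
        if (slots[i] == key)
            return 1;
        i = (i + 1) % HASH_SLOTS;
    }
    return 0;
}

/*
 * Build a hash over the new cluster's system-table relfilenode values,
 * then probe it with every relation carried over from the old cluster;
 * any hit is a new-cluster system table that has to be moved out of the
 * way (rewritten) before pg_restore runs.
 */
static void
check_relfilenumber_conflicts(const RelFileNumber *new_sys, int n_new,
                              const RelFileNumber *old_rels, int n_old)
{
    for (int i = 0; i < n_new; i++)
        hash_insert(new_sys[i]);

    for (int i = 0; i < n_old; i++)
        if (hash_lookup(old_rels[i]))
            printf("relfilenumber %llu conflicts; rewrite needed\n",
                   (unsigned long long) old_rels[i]);
}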

That doesn't seem too crazy, I think. It's a little bit of new
mechanism, but it doesn't sound horrific. It's got the advantage of
being significantly cheaper than my proposal of moving everything out
of the way unconditionally, and at the same time it retains one of the
key advantages of that proposal - IMV, anyway - which is that we don't
need separate relfilenode ranges for user and system objects any more.
So I guess on balance I kind of like it, but maybe I'm missing
something.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-22 Thread Amit Kapila
On Mon, Aug 22, 2022 at 1:25 PM Amit Kapila  wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:
> >
> > On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar  wrote:
> > > There was also an issue where the user table from the old cluster's
> > > relfilenode could conflict with the system table of the new cluster.
> > > As a solution currently for system table object (while creating
> > > storage first time) we are keeping the low range of relfilenumber,
> > > basically we are using the same relfilenumber as OID so that during
> > > upgrade the normal user table from the old cluster will not conflict
> > > with the system tables in the new cluster.  But with this solution
> > > Robert told me (in off list chat) a problem that in future if we want
> > > to make relfilenumber completely unique within a cluster by
> > > implementing the CREATEDB differently then we can not do that as we
> > > have created fixed relfilenodes for the system tables.
> > >
> > > I am not sure what exactly we can do to avoid that because even if we
> > > do something  to avoid that in the new cluster the old cluster might
> > > be already using the non-unique relfilenode so after upgrading the new
> > > cluster will also get those non-unique relfilenode.
> >
> > I think this aspect of the patch could use some more discussion.
> >
> > To recap, the problem is that pg_upgrade mustn't discover that a
> > relfilenode that is being migrated from the old cluster is being used
> > for some other table in the new cluster. Since the new cluster should
> > only contain system tables that we assume have never been rewritten,
> > they'll all have relfilenodes equal to their OIDs, and thus less than
> > 16384. On the other hand all the user tables from the old cluster will
> > have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> > which also gets migrated, is a special case. Since we don't change OID
> > assignments from version to version, it should have either the same
> > relfilenode value in the old and new clusters, if never rewritten, or
> > else the value in the old cluster will be greater than 16384, in which
> > case no conflict is possible.
> >
> > But if we just assign all relfilenode values from a central counter,
> > then we have got trouble. If the new version has more system catalog
> > tables than the old version, then some value that got used for a user
> > table in the old version might get used for a system table in the new
> > version, which is a problem. One idea for fixing this is to have two
> > RelFileNumber ranges: a system range (small values) and a user range.
> > System tables get values in the system range initially, and in the
> > user range when first rewritten. User tables always get values in the
> > user range. Everything works fine in this scenario except maybe for
> > pg_largeobject: what if it gets one value from the system range in the
> > old cluster, and a different value from the system range in the new
> > cluster, but some other system table in the new cluster gets the value
> > that pg_largeobject had in the old cluster? Then we've got trouble.
> >
>
> To solve that problem, how about rewriting the system table in the new
> cluster which has a conflicting relfilenode? I think we can probably
> do this conflict checking before processing the tables from the old
> cluster.
>

I think that while rewriting a system table during the upgrade, we
need to ensure that it gets its relfilenumber from the system range;
otherwise, if we allocate it from the user range, there is a chance of
conflict with user tables from the old cluster.  Another way could be
to set the next-relfilenumber counter for the new cluster to the value
from the old cluster, as mentioned by Robert in his previous email
[1].

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoYsNiF8JGZ%2BKp7Zgcct67Qk%2B%2BYAp%2B1ybOQ0qomUayn%2B7A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-22 Thread Amit Kapila
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:
>
> On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar  wrote:
> > There was also an issue where the user table from the old cluster's
> > relfilenode could conflict with the system table of the new cluster.
> > As a solution currently for system table object (while creating
> > storage first time) we are keeping the low range of relfilenumber,
> > basically we are using the same relfilenumber as OID so that during
> > upgrade the normal user table from the old cluster will not conflict
> > with the system tables in the new cluster.  But with this solution
> > Robert told me (in off list chat) a problem that in future if we want
> > to make relfilenumber completely unique within a cluster by
> > implementing the CREATEDB differently then we can not do that as we
> > have created fixed relfilenodes for the system tables.
> >
> > I am not sure what exactly we can do to avoid that because even if we
> > do something  to avoid that in the new cluster the old cluster might
> > be already using the non-unique relfilenode so after upgrading the new
> > cluster will also get those non-unique relfilenode.
>
> I think this aspect of the patch could use some more discussion.
>
> To recap, the problem is that pg_upgrade mustn't discover that a
> relfilenode that is being migrated from the old cluster is being used
> for some other table in the new cluster. Since the new cluster should
> only contain system tables that we assume have never been rewritten,
> they'll all have relfilenodes equal to their OIDs, and thus less than
> 16384. On the other hand all the user tables from the old cluster will
> have relfilenodes greater than 16384, so we're fine. pg_largeobject,
> which also gets migrated, is a special case. Since we don't change OID
> assignments from version to version, it should have either the same
> relfilenode value in the old and new clusters, if never rewritten, or
> else the value in the old cluster will be greater than 16384, in which
> case no conflict is possible.
>
> But if we just assign all relfilenode values from a central counter,
> then we have got trouble. If the new version has more system catalog
> tables than the old version, then some value that got used for a user
> table in the old version might get used for a system table in the new
> version, which is a problem. One idea for fixing this is to have two
> RelFileNumber ranges: a system range (small values) and a user range.
> System tables get values in the system range initially, and in the
> user range when first rewritten. User tables always get values in the
> user range. Everything works fine in this scenario except maybe for
> pg_largeobject: what if it gets one value from the system range in the
> old cluster, and a different value from the system range in the new
> cluster, but some other system table in the new cluster gets the value
> that pg_largeobject had in the old cluster? Then we've got trouble.
>

To solve that problem, how about rewriting the system table in the new
cluster which has a conflicting relfilenode? I think we can probably
do this conflict checking before processing the tables from the old
cluster.

-- 
With Regards,
Amit Kapila.




Re: making relfilenodes 56 bits

2022-08-11 Thread Dilip Kumar
On Thu, Aug 11, 2022 at 10:58 AM Dilip Kumar  wrote:
>
> On Tue, Aug 9, 2022 at 8:51 PM Robert Haas  wrote:
> >
> > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> > > I think even if we start the range from the 4 billion we can not avoid
> > > keeping two separate ranges for system and user tables otherwise the
> > > next upgrade where old and new clusters both have 56 bits
> > > relfilenumber will get conflicting files.  And, for the same reason we
> > > still have to call SetNextRelFileNumber() during upgrade.
> >
> > Well, my proposal to move everything from the new cluster up to higher
> > numbers would address this without requiring two ranges.
> >
> > > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > > range will start from 4 billion and user range maybe something around
> > > 4.1 (I think we can keep it very small though, just reserve 50k
> > > relfilenumber for system for future expansion and start user range
> > > from there).
> >
> > A disadvantage of this is that it basically means all the file names
> > in new clusters are going to be 10 characters long. That's not a big
> > disadvantage, but it's not wonderful. File names that are only 5-7
> > characters long are common today, and easier to remember.
>
> That's correct.
>
> > > So now system tables have no issues and also the user tables from the
> > > old cluster have no issues.  But pg_largeobject might get conflict
> > > when both old and new cluster are using 56 bits relfilenumber, because
> > > it is possible that in the new cluster some other system table gets
> > > that relfilenumber which is used by pg_largeobject in the old cluster.
> > >
> > > This could be resolved if we allocate pg_largeobject's relfilenumber
> > > from the user range, that means this relfilenumber will always be the
> > > first value from the user range.  So now if the old and new cluster
> > > both are using 56bits relfilenumber then pg_largeobject in both
> > > cluster would have got the same relfilenumber and if the old cluster
> > > is using the current 32 bits relfilenode system then the whole range
> > > of the new cluster is completely different than that of the old
> > > cluster.
> >
> > I think this can work, but it does rely to some extent on the fact
> > that there are no other tables which need to be treated like
> > pg_largeobject. If there were others, they'd need fixed starting
> > RelFileNumber assignments, or some other trick, like renumbering them
> > twice in the cluster, first two a known-unused value and then back to
> > the proper value. You'd have trouble if in the other cluster
> > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> > cluster the reverse, without some hackery.
>
> Agree, if it has more catalog like pg_largeobject then it would
> require some hacking.
>
> > I do feel like your idea here has some advantages - my proposal
> > requires rewriting all the catalogs in the new cluster before we do
> > anything else, and that's going to take some time even though they
> > should be small. But I also feel like it has some disadvantages: it
> > seems to rely on complicated reasoning and special cases more than I'd
> > like.
>
> One other advantage with your approach is that since we are starting
> the "nextrelfilenumber" after the old cluster's relfilenumber range.
> So only at the beginning we need to set the "nextrelfilenumber" but
> after that while upgrading each object we don't need to set the
> nextrelfilenumber every time because that is already higher than the
> complete old cluster range.  In other 2 approaches we will have to try
> to set the nextrelfilenumber everytime we preserve the relfilenumber
> during upgrade.

I was also wondering whether we would get the max "relfilenumber"
from the old cluster at the cluster level or per database.  If we want
it per database we can run a simple query on pg_class and get it, but
even there we will need to see how to handle mapped relations if they
have been rewritten.  I don't think we can get the max relfilenumber
from the old cluster at the cluster level.  Maybe in the newer version
we can expose a server function that just returns NextRelFileNumber,
which would be the max relfilenumber, but I'm not sure how to do that
for the old version.
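
For the per-database variant, a rough libpq sketch of the kind of
query pg_upgrade could run against each old-cluster database (the
function name is made up, error handling is omitted, and rewritten
mapped relations would still need separate handling as noted above):

#include <stdlib.h>
#include <libpq-fe.h>

typedef unsigned long long RelFileNumber;

static RelFileNumber
get_max_relfilenumber(PGconn *conn)
{
    PGresult   *res;
    RelFileNumber max;

    /* largest storage-bearing relfilenode recorded in this database */
    res = PQexec(conn,
                 "SELECT max(relfilenode) FROM pg_catalog.pg_class "
                 "WHERE relfilenode <> 0");
    max = strtoull(PQgetvalue(res, 0, 0), NULL, 10);
    PQclear(res);
    return max;
}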

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-10 Thread Dilip Kumar
On Tue, Aug 9, 2022 at 8:51 PM Robert Haas  wrote:
>
> On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> > I think even if we start the range from the 4 billion we can not avoid
> > keeping two separate ranges for system and user tables otherwise the
> > next upgrade where old and new clusters both have 56 bits
> > relfilenumber will get conflicting files.  And, for the same reason we
> > still have to call SetNextRelFileNumber() during upgrade.
>
> Well, my proposal to move everything from the new cluster up to higher
> numbers would address this without requiring two ranges.
>
> > So the idea is, we will be having 2 ranges for relfilenumbers, system
> > range will start from 4 billion and user range maybe something around
> > 4.1 (I think we can keep it very small though, just reserve 50k
> > relfilenumber for system for future expansion and start user range
> > from there).
>
> A disadvantage of this is that it basically means all the file names
> in new clusters are going to be 10 characters long. That's not a big
> disadvantage, but it's not wonderful. File names that are only 5-7
> characters long are common today, and easier to remember.

That's correct.

> > So now system tables have no issues and also the user tables from the
> > old cluster have no issues.  But pg_largeobject might get conflict
> > when both old and new cluster are using 56 bits relfilenumber, because
> > it is possible that in the new cluster some other system table gets
> > that relfilenumber which is used by pg_largeobject in the old cluster.
> >
> > This could be resolved if we allocate pg_largeobject's relfilenumber
> > from the user range, that means this relfilenumber will always be the
> > first value from the user range.  So now if the old and new cluster
> > both are using 56bits relfilenumber then pg_largeobject in both
> > cluster would have got the same relfilenumber and if the old cluster
> > is using the current 32 bits relfilenode system then the whole range
> > of the new cluster is completely different than that of the old
> > cluster.
>
> I think this can work, but it does rely to some extent on the fact
> that there are no other tables which need to be treated like
> pg_largeobject. If there were others, they'd need fixed starting
> RelFileNumber assignments, or some other trick, like renumbering them
> twice in the cluster, first two a known-unused value and then back to
> the proper value. You'd have trouble if in the other cluster
> pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
> cluster the reverse, without some hackery.

Agreed, if there were more catalogs like pg_largeobject then it would
require some hacking.

> I do feel like your idea here has some advantages - my proposal
> requires rewriting all the catalogs in the new cluster before we do
> anything else, and that's going to take some time even though they
> should be small. But I also feel like it has some disadvantages: it
> seems to rely on complicated reasoning and special cases more than I'd
> like.

One other advantage of your approach is that we start the
"nextrelfilenumber" after the old cluster's whole relfilenumber range.
So we only need to set the "nextrelfilenumber" once at the beginning;
after that, while upgrading each object, we don't need to set it every
time because it is already higher than the complete old cluster range.
In the other 2 approaches we would have to set the nextrelfilenumber
every time we preserve a relfilenumber during the upgrade.

Other than these two approaches we have another approach (what the
patch set is already doing), where we keep the system relfilenumber
range the same as the Oid.  I know you have already pointed out that
this might harbor hidden bugs, but one advantage of this approach is
that it is simple compared to the above two, in the sense that it
doesn't need to maintain two ranges and it also doesn't need to
rewrite all the system tables in the new cluster.  So I think it would
be good if we could get others' opinions on all these 3 approaches.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-09 Thread Robert Haas
On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> I think even if we start the range from the 4 billion we can not avoid
> keeping two separate ranges for system and user tables otherwise the
> next upgrade where old and new clusters both have 56 bits
> relfilenumber will get conflicting files.  And, for the same reason we
> still have to call SetNextRelFileNumber() during upgrade.

Well, my proposal to move everything from the new cluster up to higher
numbers would address this without requiring two ranges.

> So the idea is, we will be having 2 ranges for relfilenumbers, system
> range will start from 4 billion and user range maybe something around
> 4.1 (I think we can keep it very small though, just reserve 50k
> relfilenumber for system for future expansion and start user range
> from there).

A disadvantage of this is that it basically means all the file names
in new clusters are going to be 10 characters long. That's not a big
disadvantage, but it's not wonderful. File names that are only 5-7
characters long are common today, and easier to remember.

> So now system tables have no issues and also the user tables from the
> old cluster have no issues.  But pg_largeobject might get conflict
> when both old and new cluster are using 56 bits relfilenumber, because
> it is possible that in the new cluster some other system table gets
> that relfilenumber which is used by pg_largeobject in the old cluster.
>
> This could be resolved if we allocate pg_largeobject's relfilenumber
> from the user range, that means this relfilenumber will always be the
> first value from the user range.  So now if the old and new cluster
> both are using 56bits relfilenumber then pg_largeobject in both
> cluster would have got the same relfilenumber and if the old cluster
> is using the current 32 bits relfilenode system then the whole range
> of the new cluster is completely different than that of the old
> cluster.

I think this can work, but it does rely to some extent on the fact
that there are no other tables which need to be treated like
pg_largeobject. If there were others, they'd need fixed starting
RelFileNumber assignments, or some other trick, like renumbering them
twice in the cluster, first to a known-unused value and then back to
the proper value. You'd have trouble if in the other cluster
pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
cluster the reverse, without some hackery.

I do feel like your idea here has some advantages - my proposal
requires rewriting all the catalogs in the new cluster before we do
anything else, and that's going to take some time even though they
should be small. But I also feel like it has some disadvantages: it
seems to rely on complicated reasoning and special cases more than I'd
like.

What do other people think?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-05 Thread Dilip Kumar
On Thu, Aug 4, 2022 at 5:01 PM Dilip Kumar  wrote:
>
> On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:
>
> > One solution to all this is to do as Dilip proposes here: for system
> > relations, keep assigning the OID as the initial relfilenumber.
> > Actually, we really only need to do this for pg_largeobject; all the
> > other relfilenumber values could be assigned from a counter, as long
> > as they're assigned from a range distinct from what we use for user
> > relations.
> >
> > But I don't really like that, because I feel like the whole thing
> > where we start out with relfilenumber=oid is a recipe for hidden bugs.
> > I believe we'd be better off if we decouple those concepts more
> > thoroughly. So here's another idea: what if we set the
> > next-relfilenumber counter for the new cluster to the value from the
> > old cluster, and then rewrote all the (thus-far-empty) system tables?
>
> You mean in a new cluster start the next-relfilenumber counter from
> the highest relfilenode/Oid value in the old cluster right?.  Yeah, if
> we start next-relfilenumber after the range of the old cluster then we
> can also avoid the logic of SetNextRelFileNumber() during upgrade.
>
> My very initial idea around this was to start the next-relfilenumber
> directly from the 4 billion in the new cluster so there can not be any
> conflict and we don't even need to identify the highest value of used
> relfilenode in the old cluster.  In fact we don't need to rewrite the
> system table before upgrading I think.  So what do we lose with this?
> just 4 billion relfilenode? does that really matter provided the range
> we get with the 56 bits relfilenumber.

I think even if we start the range from 4 billion we cannot avoid
keeping two separate ranges for system and user tables; otherwise the
next upgrade, where the old and new clusters both have 56-bit
relfilenumbers, will get conflicting files.  And, for the same reason,
we still have to call SetNextRelFileNumber() during the upgrade.

So the idea is that we will have 2 ranges for relfilenumbers: the
system range will start from 4 billion and the user range maybe
somewhere around 4.1 billion (I think we can keep the system range
very small though, just reserve 50k relfilenumbers for the system for
future expansion and start the user range from there).

So now the system tables have no issues, and the user tables from the
old cluster have no issues either.  But pg_largeobject might get a
conflict when both the old and new clusters are using 56-bit
relfilenumbers, because it is possible that in the new cluster some
other system table gets the relfilenumber that is used by
pg_largeobject in the old cluster.

This could be resolved by allocating pg_largeobject's relfilenumber
from the user range, meaning that relfilenumber will always be the
first value of the user range.  So if the old and new clusters are
both using 56-bit relfilenumbers then pg_largeobject in both clusters
will have got the same relfilenumber, and if the old cluster is using
the current 32-bit relfilenode system then the whole range of the new
cluster is completely different from that of the old cluster.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-04 Thread Dilip Kumar
On Sat, Jul 30, 2022 at 1:59 AM Robert Haas  wrote:

> One solution to all this is to do as Dilip proposes here: for system
> relations, keep assigning the OID as the initial relfilenumber.
> Actually, we really only need to do this for pg_largeobject; all the
> other relfilenumber values could be assigned from a counter, as long
> as they're assigned from a range distinct from what we use for user
> relations.
>
> But I don't really like that, because I feel like the whole thing
> where we start out with relfilenumber=oid is a recipe for hidden bugs.
> I believe we'd be better off if we decouple those concepts more
> thoroughly. So here's another idea: what if we set the
> next-relfilenumber counter for the new cluster to the value from the
> old cluster, and then rewrote all the (thus-far-empty) system tables?

You mean that in the new cluster we start the next-relfilenumber
counter from the highest relfilenode/Oid value in the old cluster,
right?  Yeah, if we start the next-relfilenumber after the old
cluster's range then we can also avoid the SetNextRelFileNumber()
logic during the upgrade.

My very first idea around this was to start the next-relfilenumber
directly at 4 billion in the new cluster, so there cannot be any
conflict and we don't even need to identify the highest relfilenode
value used in the old cluster.  In fact I think we then don't need to
rewrite the system tables before upgrading either.  So what do we lose
with this?  Just 4 billion relfilenodes?  Does that really matter,
given the range we get with a 56-bit relfilenumber?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-31 Thread Dilip Kumar
On Fri, Jul 29, 2022 at 10:55 PM Robert Haas  wrote:
>
> On Thu, Jul 28, 2022 at 10:29 AM Robert Haas  wrote:
> > > I have done some cleanup in 0002 as well, basically, earlier we were
> > > storing the result of the BufTagGetRelFileLocator() in a separate
> > > variable which is not required everywhere.  So wherever possible I
> > > have avoided using the intermediate variable.
> >
> > I'll have a look at this next.
>
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does. I would have expected it
> to return void and take an argument of type RelFileLocator * into
> which it writes the results. On the other hand, I was also taught that
> one should avoid passing a struct type as an argument, and smgropen()
> has been doing that since Tom Lane committed
> 87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
> maybe this isn't that relevant any more on modern compilers? Or maybe
> for small structs it doesn't matter much? I dunno.
>
> Other than that, I think your 0002 looks fine.

Generally I try to avoid it, but I see that the current code also does
this when the structure is small and returning it directly makes the
calling code simpler[1].  I wanted to do it this way because a) if we
pass it as an argument then I have to use an extra variable, which
makes some code more complicated (not a big issue; in fact I had it
that way in a previous version but simplified it in one of the recent
versions), and b) if I allocate memory and return a pointer then I also
need to store that address and free it later.

[1]
static inline ForEachState
for_each_from_setup(const List *lst, int N)
{
ForEachState r = {lst, N};

Assert(N >= 0);
return r;
}

static inline FullTransactionId
FullTransactionIdFromEpochAndXid(uint32 epoch, TransactionId xid)
{
FullTransactionId result;

result.value = ((uint64) epoch) << 32 | xid;

return result;
}
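
The buffer-tag helper under discussion follows the same pattern.
Roughly (a simplified sketch with the types reduced to plain C, not the
tree's exact definitions):

typedef unsigned int Oid;
typedef unsigned long long RelFileNumber;

typedef struct RelFileLocator
{
    Oid         spcOid;
    Oid         dbOid;
    RelFileNumber relNumber;
} RelFileLocator;

typedef struct BufferTag
{
    Oid         spcOid;
    Oid         dbOid;
    RelFileNumber relNumber;
    int         forkNum;
    unsigned int blockNum;
} BufferTag;

/* Small struct returned by value; cheap, and simpler to call than an
 * out-parameter version would be. */
static inline RelFileLocator
BufTagGetRelFileLocator(const BufferTag *tag)
{
    RelFileLocator rlocator;

    rlocator.spcOid = tag->spcOid;
    rlocator.dbOid = tag->dbOid;
    rlocator.relNumber = tag->relNumber;

    return rlocator;
}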


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-31 Thread Dilip Kumar
On Sat, Jul 30, 2022 at 1:35 AM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2022-Jul-29, Robert Haas wrote:
> >> Yeah, if we think it's OK to pass around structs, then that seems like
> >> the right solution. Otherwise functions that take RelFileLocator
> >> should be changed to take const RelFileLocator * and we should adjust
> >> elsewhere accordingly.
>
> > We do that in other places.  See get_object_address() for another
> > example.  Now, I don't see *why* they do it.
>
> If it's a big struct then avoiding copying it is good; but RelFileLocator
> isn't that big.
>
> While researching that statement I did happen to notice that no one has
> bothered to update the comment immediately above struct RelFileLocator,
> and it is something that absolutely does require attention if there
> are plans to make RelFileNumber something other than 32 bits.

I think we need to update this comment in the patch that makes
RelFileNumber 64 bits wide.  But as such I do not see a problem with
using RelFileLocator directly as a key, because once RelFileNumber is
64 bits the structure is already 8-byte aligned, so there should not be
any padding.  However, if we use some other structure that contains a
RelFileLocator as the key, i.e. RelFileLocatorBackend, then there will
be a problem.  To handle that issue, wherever we have
RelFileLocatorBackend as a key I compute the key size so as to exclude
the padding bytes, by introducing this new macro[1].

[1]
#define SizeOfRelFileLocatorBackend \
(offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))
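
To spell out why the padding matters, here is a self-contained sketch
(types simplified; in the tree Oid is a 32-bit unsigned integer and
BackendId is an int):

#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint64_t RelFileNumber;     /* 64 bits with this patch */
typedef int BackendId;

typedef struct RelFileLocator
{
    Oid         spcOid;
    Oid         dbOid;
    RelFileNumber relNumber;        /* 8-byte aligned: 16 bytes, no padding */
} RelFileLocator;

typedef struct RelFileLocatorBackend
{
    RelFileLocator locator;
    BackendId   backend;            /* 4 bytes, then 4 trailing padding bytes */
} RelFileLocatorBackend;

/*
 * Hash tables hash and compare their keys as raw bytes, so the trailing
 * padding would take part in the comparison even though it is never
 * initialized.  Using this as the key size instead of
 * sizeof(RelFileLocatorBackend) keeps the padding out of the key.
 */
#define SizeOfRelFileLocatorBackend \
    (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId))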

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-30 Thread Alvaro Herrera
On 2022-Jul-30, Dilip Kumar wrote:

> On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera  
> wrote:

> > Please don't do this; rather use %llu and cast to (long long).
> > Otherwise the string becomes mangled for translation.
> 
> Okay, actually I did not understand the clear logic of when to use
> %llu and to use (U)INT64_FORMAT.  They are both used for 64-bit
> integers.  So do you think it is fine to replace all INT64_FORMAT in
> my patch with %llu?

The point here is that there are two users of the source code: one is
the compiler, and the other is gettext, which extracts the string for
the translation catalog.  The compiler is OK with UINT64_FORMAT, of
course (because the preprocessor deals with it).  But gettext is quite
stupid and doesn't understand that UINT64_FORMAT expands to some
specifier, so it truncates the string at the double quote sign just
before; in other words, it just doesn't work.  So whenever you have a
string that ends up in a translation catalog, you must not use
UINT64_FORMAT or any other preprocessor macro; it has to be a straight
specifier in the format string.

We have found that the most convenient notation is to use %llu in the
string and cast the argument to (unsigned long long), so our convention
is to use that.

For strings that do not end up in a translation catalog, there's no
reason to use %llu-and-cast; UINT64_FORMAT is okay.
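
Concretely, the two cases look something like this (a sketch; the
message texts and the relfilenumber variable are just placeholders):

/* Translatable message: gettext must see a literal format string, so put
 * %llu in the string and cast the argument. */
ereport(ERROR,
        (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
         errmsg("relfilenumber %llu is out of range",
                (unsigned long long) relfilenumber)));

/* Untranslated output (elog, rmgrdesc, contrib without a translation
 * catalog): UINT64_FORMAT is fine, since only the compiler sees it. */
elog(DEBUG1, "next relfilenumber is " UINT64_FORMAT, relfilenumber);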

> > > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> > >   if (memcmp(replay_image_masked, primary_image_masked, 
> > > BLCKSZ) != 0)
> > >   {
> > >   elog(FATAL,
> > > -  "inconsistent page found, rel %u/%u/%u, 
> > > forknum %u, blkno %u",
> > > +  "inconsistent page found, rel %u/%u/" 
> > > INT64_FORMAT ", forknum %u, blkno %u",
> > >rlocator.spcOid, rlocator.dbOid, 
> > > rlocator.relNumber,
> > >forknum, blkno);
> >
> > Should this one be an ereport, and thus you do need to change it to that
> > and handle it like that?
> 
> Okay, so you mean irrespective of this patch should this be converted
> to ereport?

Yes, I think this should be an ereport with errcode(ERRCODE_DATA_CORRUPTED).

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: making relfilenodes 56 bits

2022-07-29 Thread Dilip Kumar
On Thu, Jul 28, 2022 at 9:29 PM Alvaro Herrera  wrote:
>
> Not a full review, just a quick skim of 0003.

Thanks for the review

> > + if (!shutdown)
> > + {
> > + if (ShmemVariableCache->loggedRelFileNumber < 
> > checkPoint.nextRelFileNumber)
> > + elog(ERROR, "nextRelFileNumber can not go backward 
> > from " INT64_FORMAT "to" INT64_FORMAT,
> > +  checkPoint.nextRelFileNumber, 
> > ShmemVariableCache->loggedRelFileNumber);
> > +
> > + checkPoint.nextRelFileNumber = 
> > ShmemVariableCache->loggedRelFileNumber;
> > + }
>
> Please don't do this; rather use %llu and cast to (long long).
> Otherwise the string becomes mangled for translation.  I think there are
> many uses of this sort of pattern in strings, but not all of them are
> translatable so maybe we don't care -- for example contrib doesn't have
> translations.  And the rmgrdesc routines don't translate either, so we
> probably don't care about it there; and nothing that uses elog either.
> But this one in particular I think should be an ereport, not an elog.
> There are several other ereports in various places of the patch also.

Okay, actually I did not clearly understand the logic of when to use
%llu and when to use (U)INT64_FORMAT.  They are both used for 64-bit
integers.  So do you think it is fine to replace all INT64_FORMAT in
my patch with %llu?

> > @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
> >   if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) 
> > != 0)
> >   {
> >   elog(FATAL,
> > -  "inconsistent page found, rel %u/%u/%u, 
> > forknum %u, blkno %u",
> > +  "inconsistent page found, rel %u/%u/" 
> > INT64_FORMAT ", forknum %u, blkno %u",
> >rlocator.spcOid, rlocator.dbOid, 
> > rlocator.relNumber,
> >forknum, blkno);
>
> Should this one be an ereport, and thus you do need to change it to that
> and handle it like that?

Okay, so you mean irrespective of this patch should this be converted
to ereport?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Thomas Munro
On Sat, Jul 30, 2022 at 9:11 AM Thomas Munro  wrote:
> on all modern Unix-y systems,

(I meant to write AMD64 there)




Re: making relfilenodes 56 bits

2022-07-29 Thread Thomas Munro
On Sat, Jul 30, 2022 at 8:08 AM Tom Lane  wrote:
> Robert Haas  writes:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
> C99 and later requires it.  But it is pass-by-value and you have
> to think twice about whether you want the struct to be copied.

C89 had that.

As for what it actually does in a non-inlined function: on all modern
Unix-y systems, 128 bit first arguments and return values are
transferred in register pairs[1].  So if you define a struct that
holds uint32_t, uint32_t, uint64_t and compile a function that takes
one and returns it, you see the struct being transferred directly from
input registers to output registers:

   0x <+0>:mov%rdi,%rax
   0x0003 <+3>:mov%rsi,%rdx
   0x0006 <+6>:ret

Similar on ARM64.  There it's an empty function, so it must be using
the same register in and out[2].

The MSVC calling convention is different and doesn't seem to be able
to pass it through registers, so it schleps it out to memory at a
return address[3].  But that's pretty similar to the proposed
alternative anyway, so surely no worse.  *shrug*  And of course those
"constructor"-like functions are inlined anyway.

[1] https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
[2] https://gcc.godbolt.org/z/qfPzhW7YM
[3] https://gcc.godbolt.org/z/WqvYz6xjs




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Wed, Jul 20, 2022 at 7:27 AM Dilip Kumar  wrote:
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

I think this aspect of the patch could use some more discussion.

To recap, the problem is that pg_upgrade mustn't discover that a
relfilenode that is being migrated from the old cluster is being used
for some other table in the new cluster. Since the new cluster should
only contain system tables that we assume have never been rewritten,
they'll all have relfilenodes equal to their OIDs, and thus less than
16384. On the other hand all the user tables from the old cluster will
have relfilenodes greater than 16384, so we're fine. pg_largeobject,
which also gets migrated, is a special case. Since we don't change OID
assignments from version to version, it should have either the same
relfilenode value in the old and new clusters, if never rewritten, or
else the value in the old cluster will be greater than 16384, in which
case no conflict is possible.

But if we just assign all relfilenode values from a central counter,
then we have got trouble. If the new version has more system catalog
tables than the old version, then some value that got used for a user
table in the old version might get used for a system table in the new
version, which is a problem. One idea for fixing this is to have two
RelFileNumber ranges: a system range (small values) and a user range.
System tables get values in the system range initially, and in the
user range when first rewritten. User tables always get values in the
user range. Everything works fine in this scenario except maybe for
pg_largeobject: what if it gets one value from the system range in the
old cluster, and a different value from the system range in the new
cluster, but some other system table in the new cluster gets the value
that pg_largeobject had in the old cluster? Then we've got trouble. It
doesn't help if we assign pg_largeobject a starting relfilenode from
the user range, either: now a relfilenode that needs to end up
containing the some user table from the old cluster might find itself
blocked by pg_largeobject in the new cluster.

One solution to all this is to do as Dilip proposes here: for system
relations, keep assigning the OID as the initial relfilenumber.
Actually, we really only need to do this for pg_largeobject; all the
other relfilenumber values could be assigned from a counter, as long
as they're assigned from a range distinct from what we use for user
relations.

But I don't really like that, because I feel like the whole thing
where we start out with relfilenumber=oid is a recipe for hidden bugs.
I believe we'd be better off if we decouple those concepts more
thoroughly. So here's another idea: what if we set the
next-relfilenumber counter for the new cluster to the value from the
old cluster, and then rewrote all the (thus-far-empty) system tables?
Then every system relation in the new cluster has a relfilenode value
greater than any in use in the old cluster, so we can afterwards
migrate over every relfilenode from the old cluster with no risk of
conflicting with anything. Then all the special cases go away. We
don't need system and user ranges for relfilenodes, and
pg_largeobject's not a special case, either. We can assign relfilenode
values to system relations in exactly the same we do for user
relations: assign a value from the global counter and forget about it.
If this cluster happens to be the "new cluster" for a pg_upgrade
attempt, the procedure described at the beginning of this paragraph
will move everything that might conflict out of the way.

One thing to perhaps not like about this is that it's a little more
expensive: clustering every system table in every database on a new
cluster isn't completely free. Perhaps it's not expensive enough to be
a big problem, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Tom Lane
Robert Haas  writes:
> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

FWIW, I think that was invalid pre-ANSI-C, and maybe even in C89.
C99 and later requires it.  But it is pass-by-value and you have
to think twice about whether you want the struct to be copied.

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2022-Jul-29, Robert Haas wrote:
>> Yeah, if we think it's OK to pass around structs, then that seems like
>> the right solution. Otherwise functions that take RelFileLocator
>> should be changed to take const RelFileLocator * and we should adjust
>> elsewhere accordingly.

> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.

If it's a big struct then avoiding copying it is good; but RelFileLocator
isn't that big.

While researching that statement I did happen to notice that no one has
bothered to update the comment immediately above struct RelFileLocator,
and it is something that absolutely does require attention if there
are plans to make RelFileNumber something other than 32 bits.

 * Note: various places use RelFileLocator in hashtable keys.  Therefore,
 * there *must not* be any unused padding bytes in this struct.  That
 * should be safe as long as all the fields are of type Oid.
 */
typedef struct RelFileLocator
{
OidspcOid;/* tablespace */
OiddbOid; /* database */
RelFileNumber  relNumber; /* relation */
} RelFileLocator;

regards, tom lane




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Fri, Jul 29, 2022 at 3:18 PM Alvaro Herrera  wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > Yeah, if we think it's OK to pass around structs, then that seems like
> > the right solution. Otherwise functions that take RelFileLocator
> > should be changed to take const RelFileLocator * and we should adjust
> > elsewhere accordingly.
>
> We do that in other places.  See get_object_address() for another
> example.  Now, I don't see *why* they do it.  I suppose there's
> notational convenience; for get_object_address() I think it'd be uglier
> with another out argument (it already has *relp).  For smgropen() it's
> not clear at all that there is any.
>
> For the new function, there's at least a couple of places that the
> calling convention makes simpler, so I don't see why you wouldn't use it
> that way.

All right, perhaps it's fine as Dilip has it, then.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Alvaro Herrera
On 2022-Jul-29, Robert Haas wrote:

> Yeah, if we think it's OK to pass around structs, then that seems like
> the right solution. Otherwise functions that take RelFileLocator
> should be changed to take const RelFileLocator * and we should adjust
> elsewhere accordingly.

We do that in other places.  See get_object_address() for another
example.  Now, I don't see *why* they do it.  I suppose there's
notational convenience; for get_object_address() I think it'd be uglier
with another out argument (it already has *relp).  For smgropen() it's
not clear at all that there is any.

For the new function, there's at least a couple of places that the
calling convention makes simpler, so I don't see why you wouldn't use it
that way.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Fri, Jul 29, 2022 at 2:12 PM Alvaro Herrera  wrote:
> On 2022-Jul-29, Robert Haas wrote:
> > I was taught that when programming in C one should avoid returning a
> > struct type, as BufTagGetRelFileLocator does.
>
> Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
> RelFileLocator as argument.  With this coding you can call one function
> with the other function as its argument.
>
> However, with the current definition of relpathbackend() and siblings,
> it looks quite disastrous -- BufTagGetRelFileLocator is being called
> three times.  You could argue that a solution would be to turn those
> macros into static inline functions.

Yeah, if we think it's OK to pass around structs, then that seems like
the right solution. Otherwise functions that take RelFileLocator
should be changed to take const RelFileLocator * and we should adjust
elsewhere accordingly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread Alvaro Herrera
On 2022-Jul-29, Robert Haas wrote:

> I was taught that when programming in C one should avoid returning a
> struct type, as BufTagGetRelFileLocator does.

Doing it like that helps RelFileLocatorSkippingWAL, which takes a bare
RelFileLocator as argument.  With this coding you can call one function
with the other function as its argument.

However, with the current definition of relpathbackend() and siblings,
it looks quite disastrous -- BufTagGetRelFileLocator is being called
three times.  You could argue that a solution would be to turn those
macros into static inline functions.
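
To make the multiple-evaluation point concrete, the shape of the
problem is roughly this (paraphrased; not the exact definitions in the
tree):

/*
 * Today's macro shape: the "rlocator" argument is evaluated three times, so
 *
 *     relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag), backend, forknum)
 *
 * expands to three separate BufTagGetRelFileLocator() calls:
 *
 *     #define relpathbackend(rlocator, backend, forknum) \
 *         GetRelationPath((rlocator).dbOid, (rlocator).spcOid, \
 *                         (rlocator).relNumber, backend, forknum)
 *
 * A static inline function evaluates its argument exactly once:
 */
static inline char *
relpathbackend(RelFileLocator rlocator, int backend, ForkNumber forknum)
{
    return GetRelationPath(rlocator.dbOid, rlocator.spcOid, rlocator.relNumber,
                           backend, forknum);
}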

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from 
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
 Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg0.php




Re: making relfilenodes 56 bits

2022-07-29 Thread Robert Haas
On Thu, Jul 28, 2022 at 10:29 AM Robert Haas  wrote:
> > I have done some cleanup in 0002 as well, basically, earlier we were
> > storing the result of the BufTagGetRelFileLocator() in a separate
> > variable which is not required everywhere.  So wherever possible I
> > have avoided using the intermediate variable.
>
> I'll have a look at this next.

I was taught that when programming in C one should avoid returning a
struct type, as BufTagGetRelFileLocator does. I would have expected it
to return void and take an argument of type RelFileLocator * into
which it writes the results. On the other hand, I was also taught that
one should avoid passing a struct type as an argument, and smgropen()
has been doing that since Tom Lane committed
87bd95638552b8fc1f5f787ce5b862bb6fc2eb80 all the way back in 2004. So
maybe this isn't that relevant any more on modern compilers? Or maybe
for small structs it doesn't matter much? I dunno.

Other than that, I think your 0002 looks fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-29 Thread vignesh C
On Wed, Jul 27, 2022 at 6:02 PM Dilip Kumar  wrote:
>
> On Wed, Jul 27, 2022 at 3:27 PM vignesh C  wrote:
> >
>
> > Thanks for the updated patch, Few comments:
> > 1) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_load_buffers
> > ...
> > if (fscanf(file, "%u,%u,%u,%u,%u\n", [i].database,
> >[i].tablespace, [i].filenumber,
> >, [i].blocknum) != 5)
> > ...
> >
> > 2) The format specifier should be changed from %u to INT64_FORMAT
> > autoprewarm.c -> apw_dump_now
> > ...
> > ret = fprintf(file, "%u,%u,%u,%u,%u\n",
> >   block_info_array[i].database,
> >   block_info_array[i].tablespace,
> >   block_info_array[i].filenumber,
> >   (uint32) block_info_array[i].forknum,
> >   block_info_array[i].blocknum);
> > ...
> >
> > 3) should the comment "entry point for old extension version" be on
> > top of pg_buffercache_pages, as the current version will use
> > pg_buffercache_pages_v1_4
> > +
> > +Datum
> > +pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +{
> > +   return pg_buffercache_pages_internal(fcinfo, OIDOID);
> > +}
> > +
> > +/* entry point for old extension version */
> > +Datum
> > +pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
> > +{
> > +   return pg_buffercache_pages_internal(fcinfo, INT8OID);
> > +}
> >
> > 4) we could use the new style or ereport by removing the brackets
> > around errcode:
> > +   if (fctx->record[i].relfilenumber > OID_MAX)
> > +   ereport(ERROR,
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> > +
> >  fctx->record[i].relfilenumber),
> > +
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE")));
> >
> > like:
> > ereport(ERROR,
> >
> > errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >
> > errmsg("relfilenode" INT64_FORMAT " is too large to be represented as
> > an OID",
> >
> > fctx->record[i].relfilenumber),
> >
> > errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache
> > UPDATE"));
> >
> > 5) Similarly in the below code too:
> > +   /* check whether the relfilenumber is within a valid range */
> > +   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +   (relfilenumber;
> >
> >
> > 6) Similarly in the below code too:
> > +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)
> >  \
> > +do {
> >  \
> > +   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
> > +   ereport(ERROR,
> >  \
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
> > +errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",   \
> > +   (relfilenumber; \
> > +} while (0)
> > +
> >
> >
> > 7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
> > this macro be used here too:
> > pg_filenode_relation(PG_FUNCTION_ARGS)
> >  {
> > Oid reltablespace = PG_GETARG_OID(0);
> > -   RelFileNumber relfilenumber = PG_GETARG_OID(1);
> > +   RelFileNumber relfilenumber = PG_GETARG_INT64(1);
> > Oid heaprel;
> >
> > +   /* check whether the relfilenumber is within a valid range */
> > +   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
> > +   ereport(ERROR,
> > +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +errmsg("relfilenumber " INT64_FORMAT
> > " is out of range",
> > +   (relfilenumber;
> >
> >
> > 8) I felt this include is not required:
> > diff --git a/src/backend/access/transam/varsup.c
> > b/src/backend/access/transam/varsup.c
> > index 849a7ce..a2f0d35 100644
> > --- a/src/backend/access/transam/varsup.c
> > +++ b/src/backend/access/transam/varsup.c
> > @@ -13,12 +13,16 @@
> >
> >  #include "postgres.h"
> >
> > +#include 
> > +
> >  #include "access/clog.h"
> >  #include "access/commit_ts.h"
> >
> > 9) should we change elog to ereport to use the New-style error reporting API
> > +   /* safety check, we should never get this far in a HS standby */
> > +   if (RecoveryInProgress())
> > +   elog(ERROR, "cannot assign RelFileNumber during recovery");
> > +
> > +   if (IsBinaryUpgrade)
> > +   elog(ERROR, "cannot assign RelFileNumber during binary
> > upgrade");
> >
> > 10) Here nextRelFileNumber is protected by RelFileNumberGenLock, 

Re: making relfilenodes 56 bits

2022-07-29 Thread Ashutosh Sharma
On Fri, Jul 29, 2022 at 6:26 PM Ashutosh Sharma 
wrote:

> On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar  wrote:
>
> +/* --
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
>
> Above comment says that RelFileNumber zero is invalid which is technically
> correct because we don't have any relation file in disk with zero number.
> But the point is that if someone reads below definition of
> CHECK_RELFILENUMBER_RANGE he/she might get confused because as per this
> definition relfilenumber zero is valid.
>

Please ignore the above comment shared in my previous email.  It is a
little over-thinking on my part that generated this comment in my mind.
Sorry for that. Here are the other comments I have:

+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_buffercache DROP VIEW pg_buffercache;
+ALTER EXTENSION pg_buffercache DROP FUNCTION pg_buffercache_pages();
+
+/* Then we can drop them */
+DROP VIEW pg_buffercache;
+DROP FUNCTION pg_buffercache_pages();
+
+/* Now redefine */
+CREATE OR REPLACE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages_v1_4'
+LANGUAGE C PARALLEL SAFE;
+
+CREATE OR REPLACE VIEW pg_buffercache AS
+   SELECT P.* FROM pg_buffercache_pages() AS P
+   (bufferid integer, relfilenode int8, reltablespace oid, reldatabase oid,
+relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+pinning_backends int4);

As we are dropping the function and view I think it would be good if we
*don't* use the "OR REPLACE" keyword when re-defining them.

==

+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relfilenode" INT64_FORMAT " is too
large to be represented as an OID",
+   fctx->record[i].relfilenumber),
+errhint("Upgrade the extension using ALTER
EXTENSION pg_buffercache UPDATE")));

I think it would be good to recommend users to upgrade to the latest
version instead of just saying upgrade the pg_buffercache using ALTER
EXTENSION 

==

--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -39,10 +39,10 @@ SELECT COUNT(*) >= 0 AS ok FROM
pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
 -- Test for filtering out WAL records of a particular table
 -- ===

-SELECT oid AS sample_tbl_oid FROM pg_class WHERE relname = 'sample_tbl'
\gset
+SELECT relfilenode AS sample_tbl_relfilenode FROM pg_class WHERE relname =
'sample_tbl' \gset

Is this change required? The original query is just trying to fetch table
oid not relfilenode and AFAIK we haven't changed anything in table oid.

==

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)   \
+do {   \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR,  \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
+errmsg("relfilenumber " INT64_FORMAT " is out of range",
\
+   (relfilenumber; \
+} while (0)
+

I think we can shift this macro to some header file and reuse it at several
places.

==


+* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+* because of the possibility that that relation will be moved back to
the

that that relation -> that relation

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-29 Thread Ashutosh Sharma
On Thu, Jul 28, 2022 at 5:02 PM Dilip Kumar  wrote:

+/* --
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage

The above comment says that RelFileNumber zero is invalid, which is
technically correct because we don't have any relation file on disk
with number zero.  But the point is that if someone reads the
definition of CHECK_RELFILENUMBER_RANGE below, he/she might get
confused, because as per that definition relfilenumber zero is valid.

+#define CHECK_RELFILENUMBER_RANGE(relfilenumber)   \
+do {   \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR,  \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),  \
+errmsg("relfilenumber " INT64_FORMAT " is out of range",
\
+   (relfilenumber; \
+} while (0)
+

+   RelFileNumber relfilenumber = PG_GETARG_INT64(0);
+   CHECK_RELFILENUMBER_RANGE(relfilenumber);

It seems like the relfilenumber in the above definition represents the
relfilenode value in pg_class, which can hold a zero value, which
actually means it's a mapped relation.  I think it would be good to
provide some clarity here.

--
With Regards,
Ashutosh Sharma.


Re: making relfilenodes 56 bits

2022-07-29 Thread Alvaro Herrera
On 2022-Jul-28, Robert Haas wrote:

> On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera  
> wrote:
> > I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> > not use hex digits?  Then we know the limit is 14 chars, as in
> > 0x00FF in the MAX_RELFILENUMBER definition.
> 
> Hmm, but surely we want the error messages to be printed using the
> same format that we use for the actual filenames.

Of course.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
 sources, so let's move on."   (Nathaniel Smith)




Re: making relfilenodes 56 bits

2022-07-28 Thread Joshua Drake
On Thu, Jul 28, 2022 at 9:52 AM Robert Haas  wrote:

> On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera 
> wrote:
> > I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> > not use hex digits?  Then we know the limit is 14 chars, as in
> > 0x00FF in the MAX_RELFILENUMBER definition.
>
> Hmm, but surely we want the error messages to be printed using the
> same format that we use for the actual filenames. We could make the
> filenames use hex characters too, but I'm not wild about changing
> user-visible details like that.
>

From a DBA perspective this would be a regression in usability.

JD

-- 

   - Founder - https://commandprompt.com/ - 24x7x365 Postgres since 1997
   - Founder and Co-Chair - https://postgresconf.org/
   - Founder - https://postgresql.us - United States PostgreSQL
   - Public speaker, published author, postgresql expert, and people
   believer.
   - Host - More than a refresh: A podcast about data and the people
   who wrangle it.


Re: making relfilenodes 56 bits

2022-07-28 Thread Robert Haas
On Thu, Jul 28, 2022 at 11:59 AM Alvaro Herrera  wrote:
> I do wonder why do we keep relfilenodes limited to decimal digits.  Why
> not use hex digits?  Then we know the limit is 14 chars, as in
> 0x00FF in the MAX_RELFILENUMBER definition.

Hmm, but surely we want the error messages to be printed using the
same format that we use for the actual filenames. We could make the
filenames use hex characters too, but I'm not wild about changing
user-visible details like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-28 Thread Alvaro Herrera
Not a full review, just a quick skim of 0003.

On 2022-Jul-28, Dilip Kumar wrote:

> + if (!shutdown)
> + {
> + if (ShmemVariableCache->loggedRelFileNumber < 
> checkPoint.nextRelFileNumber)
> + elog(ERROR, "nextRelFileNumber can not go backward from 
> " INT64_FORMAT "to" INT64_FORMAT,
> +  checkPoint.nextRelFileNumber, 
> ShmemVariableCache->loggedRelFileNumber);
> +
> + checkPoint.nextRelFileNumber = 
> ShmemVariableCache->loggedRelFileNumber;
> + }

Please don't do this; rather use %llu and cast to (long long).
Otherwise the string becomes mangled for translation.  I think there are
many uses of this sort of pattern in strings, but not all of them are
translatable so maybe we don't care -- for example contrib doesn't have
translations.  And the rmgrdesc routines don't translate either, so we
probably don't care about it there; and nothing that uses elog either.
But this one in particular I think should be an ereport, not an elog.
There are several other ereports in various places of the patch also.

> @@ -2378,7 +2378,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
>   if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) 
> != 0)
>   {
>   elog(FATAL,
> -  "inconsistent page found, rel %u/%u/%u, 
> forknum %u, blkno %u",
> +  "inconsistent page found, rel %u/%u/" 
> INT64_FORMAT ", forknum %u, blkno %u",
>rlocator.spcOid, rlocator.dbOid, 
> rlocator.relNumber,
>forknum, blkno);

Should this one be an ereport, and thus you do need to change it to that
and handle it like that?


> + if (xlrec->rlocator.relNumber > 
> ShmemVariableCache->nextRelFileNumber)
> + elog(ERROR, "unexpected relnumber " INT64_FORMAT "that 
> is bigger than nextRelFileNumber " INT64_FORMAT,
> +  xlrec->rlocator.relNumber, 
> ShmemVariableCache->nextRelFileNumber);

You missed one whitespace here after the INT64_FORMAT.

> diff --git a/src/bin/pg_controldata/pg_controldata.c 
> b/src/bin/pg_controldata/pg_controldata.c
> index c390ec5..f727078 100644
> --- a/src/bin/pg_controldata/pg_controldata.c
> +++ b/src/bin/pg_controldata/pg_controldata.c
> @@ -250,6 +250,8 @@ main(int argc, char *argv[])
>   printf(_("Latest checkpoint's NextXID:  %u:%u\n"),
>  
> EpochFromFullTransactionId(ControlFile->checkPointCopy.nextXid),
>  
> XidFromFullTransactionId(ControlFile->checkPointCopy.nextXid));
> + printf(_("Latest checkpoint's NextRelFileNumber:  " INT64_FORMAT "\n"),
> +ControlFile->checkPointCopy.nextRelFileNumber);

This one must definitely be translatable.

>  /* Characters to allow for an RelFileNumber in a relation path */
> -#define RELNUMBERCHARS   OIDCHARS/* same as OIDCHARS */
> +#define RELNUMBERCHARS   20  /* max chars printed by %lu */

Maybe say %llu here instead.


I do wonder why do we keep relfilenodes limited to decimal digits.  Why
not use hex digits?  Then we know the limit is 14 chars, as in
0x00FF in the MAX_RELFILENUMBER definition.
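
For what it's worth, the width difference is easy to see in a
standalone illustration:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t    max = UINT64_C(0x00FFFFFFFFFFFFFF); /* 2^56 - 1 */

    /* decimal: up to 17 digits for a 56-bit value (20 for a full uint64) */
    printf("%llu\n", (unsigned long long) max);     /* 72057594037927935 */
    /* hex: at most 14 characters */
    printf("%llX\n", (unsigned long long) max);     /* FFFFFFFFFFFFFF */
    return 0;
}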

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)




Re: making relfilenodes 56 bits

2022-07-28 Thread Robert Haas
On Thu, Jul 28, 2022 at 7:32 AM Dilip Kumar  wrote:
> Thanks, I have rebased other patches,  actually, there is a new 0001
> patch now.  It seems during renaming relnode related Oid to
> RelFileNumber, some of the references were missed and in the last
> patch set I kept it as part of main patch 0003, but I think it's
> better to keep it separate.  So took out those changes and created
> 0001, but you think this can be committed as part of 0003 only then
> also it's fine with me.

I committed this in part. I took out the introduction of
RELNUMBERCHARS as I think that should probably be a separate commit,
but added in a comment change that you seem to have overlooked.

> I have done some cleanup in 0002 as well, basically, earlier we were
> storing the result of the BufTagGetRelFileLocator() in a separate
> variable which is not required everywhere.  So wherever possible I
> have avoided using the intermediate variable.

I'll have a look at this next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-27 Thread Robert Haas
On Wed, Jul 27, 2022 at 12:37 PM Dilip Kumar  wrote:
> Just realised that this should have been BufferTagsEqual instead of 
> BufferTagEqual
>
> I will modify this and send an updated patch tomorrow.

I changed it and committed.

What was formerly 0002 will need minor rebasing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-27 Thread Dilip Kumar
On Wed, 27 Jul 2022 at 9:49 PM, Dilip Kumar  wrote:

> On Wed, Jul 27, 2022 at 12:07 AM Robert Haas 
> wrote:
> >
> > On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar 
> wrote:
> > > I have thought about it while doing so but I am not sure whether it is
> > > a good idea or not, because before my change these all were macros
> > > with 2 naming conventions so I just changed to inline function so why
> > > to change the name.
> >
> > Well, the reason to change the name would be for consistency. It feels
> > weird to have some NAMES_LIKETHIS() and other NamesLikeThis().
> >
> > Now, an argument against that is that it will make back-patching more
> > annoying, if any code using these functions/macros is touched. But
> > since the calling sequence is changing anyway (you now have to pass a
> > pointer rather than the object itself) that argument doesn't really
> > carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
> > etc.
>
> Okay, so I have renamed these 2 functions and BUFFERTAGS_EQUAL as well
> to BufferTagEqual().


Just realised that this should have been BufferTagsEqual instead of
BufferTagEqual

I will modify this and send an updated patch tomorrow.

—
Dilip

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: making relfilenodes 56 bits

2022-07-27 Thread vignesh C
On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar  wrote:
>
> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro  wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar  wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
Here is the patch to fix the issue.  Basically, while asserting the
existence of the file we were not setting the relfilenumber in the
relfilelocator before generating the path, so the assert was checking
the existence of a random path and therefore failing randomly.

Thanks for the updated patch. A few comments:
1) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_load_buffers
...
if (fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
   &blkinfo[i].tablespace, &blkinfo[i].filenumber,
   &forknum, &blkinfo[i].blocknum) != 5)
...

2) The format specifier should be changed from %u to INT64_FORMAT
autoprewarm.c -> apw_dump_now
...
ret = fprintf(file, "%u,%u,%u,%u,%u\n",
  block_info_array[i].database,
  block_info_array[i].tablespace,
  block_info_array[i].filenumber,
  (uint32) block_info_array[i].forknum,
  block_info_array[i].blocknum);
...

3) should the comment "entry point for old extension version" be on
top of pg_buffercache_pages, as the current version will use
pg_buffercache_pages_v1_4
+
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+   return pg_buffercache_pages_internal(fcinfo, OIDOID);
+}
+
+/* entry point for old extension version */
+Datum
+pg_buffercache_pages_v1_4(PG_FUNCTION_ARGS)
+{
+   return pg_buffercache_pages_internal(fcinfo, INT8OID);
+}

4) we could use the new style of ereport by removing the brackets
around errcode:
+   if (fctx->record[i].relfilenumber > OID_MAX)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
+   fctx->record[i].relfilenumber),
+errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE")));

like:
ereport(ERROR,
        errcode(ERRCODE_INVALID_PARAMETER_VALUE),
        errmsg("relfilenode" INT64_FORMAT " is too large to be represented as an OID",
               fctx->record[i].relfilenumber),
        errhint("Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE"));

5) Similarly in the below code too:
+   /* check whether the relfilenumber is within a valid range */
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relfilenumber " INT64_FORMAT
" is out of range",
+   (relfilenumber;


6) Similarly in the below code too:
+#define CHECK_RELFILENUMBER_RANGE(relfilenumber) \
+do { \
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \
+   ereport(ERROR, \
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
+errmsg("relfilenumber " INT64_FORMAT " is out of range", \
+   (relfilenumber)))); \
+} while (0)
+


7) This error code looks similar to CHECK_RELFILENUMBER_RANGE, can
this macro be used here too:
pg_filenode_relation(PG_FUNCTION_ARGS)
 {
Oid reltablespace = PG_GETARG_OID(0);
-   RelFileNumber relfilenumber = PG_GETARG_OID(1);
+   RelFileNumber relfilenumber = PG_GETARG_INT64(1);
Oid heaprel;

+   /* check whether the relfilenumber is within a valid range */
+   if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("relfilenumber " INT64_FORMAT
" is out of range",
+   (relfilenumber;


8) I felt this include is not required:
diff --git a/src/backend/access/transam/varsup.c
b/src/backend/access/transam/varsup.c
index 849a7ce..a2f0d35 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -13,12 +13,16 @@

 #include "postgres.h"

+#include 
+
 #include "access/clog.h"
 #include "access/commit_ts.h"

9) should we change elog to ereport to use the New-style error reporting API
+   /* safety check, we should never get this far in a HS standby */
+   if (RecoveryInProgress())
+   elog(ERROR, "cannot assign RelFileNumber 

Re: making relfilenodes 56 bits

2022-07-27 Thread Dilip Kumar
On Wed, Jul 27, 2022 at 1:24 PM Ashutosh Sharma  wrote:
>
> Some more comments:

Note: Please don't top post.

> ==
>
> Shouldn't we retry for the new relfilenumber if 
> "ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER". There can be a 
> cases where some of the tables are dropped by the user and relfilenumber of 
> those tables can be reused for which we would need to find the relfilenumber 
> that can be resued. For e.g. consider below example:
>
> postgres=# create table t1(a int);
> CREATE TABLE
>
> postgres=# create table t2(a int);
> CREATE TABLE
>
> postgres=# create table t3(a int);
> ERROR:  relfilenumber is out of bound
>
> postgres=# drop table t1, t2;
> DROP TABLE
>
> postgres=# checkpoint;
> CHECKPOINT
>
> postgres=# vacuum;
> VACUUM
>
> Now if I try to recreate table t3, it should succeed, shouldn't it? But it 
> doesn't because we simply error out by seeing the nextRelFileNumber saved in 
> the shared memory.
>
> postgres=# create table t1(a int);
> ERROR:  relfilenumber is out of bound
>
> I think, above should have worked.

No, it should not; the whole point of this design is to never reuse a
relfilenumber within the lifetime of a cluster.  You might want to read
this mail[1], which argues that by the time we have used 2^56
relfilenumbers the cluster will have reached the end of its life for
other reasons anyway.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BZrDms7gSjckme8YV2tzxgZ0KVfGcsjaFoKyzQX_f_Mw%40mail.gmail.com

> ==
>
> 
> 
> Note that while a table's filenode often matches its OID, this is
> not necessarily the case; some operations, like
> TRUNCATE, REINDEX, 
> CLUSTER and some forms
> of ALTER TABLE, can change the filenode while preserving 
> the OID.
>
> I think this note needs some improvement in storage.sgml. It says the table's 
> relfilenode mostly matches its OID, but it doesn't. This will happen only in 
> case of system table and maybe never in case of user table.

Yes, this should be changed.

> postgres=# create table t2(a int);
> ERROR:  relfilenumber is out of bound
>
> Since this is a user-visible error, I think it would be good to mention 
> relfilenode instead of relfilenumber. Elsewhere (including the user manual) 
> we refer to this as a relfilenode.

No, this is expected to be an internal error, because ideally we should
never reach this number during a cluster's lifetime.  We are adding
this check so that we do not reach it because of some other
computational or programming mistake.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-27 Thread Ashutosh Sharma
Some more comments:

==

Shouldn't we retry for a new relfilenumber if
"ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER"?  There can
be cases where some tables have been dropped by the user and the
relfilenumbers of those tables could be reused, for which we would need
to find a relfilenumber that can be reused.  For example, consider the
following:

postgres=# create table t1(a int);
CREATE TABLE

postgres=# create table t2(a int);
CREATE TABLE

postgres=# create table t3(a int);
ERROR:  relfilenumber is out of bound

postgres=# drop table t1, t2;
DROP TABLE

postgres=# checkpoint;
CHECKPOINT

postgres=# vacuum;
VACUUM

Now if I try to recreate table t3, it should succeed, shouldn't it? But it
doesn't because we simply error out by seeing the nextRelFileNumber saved
in the shared memory.

postgres=# create table t1(a int);
ERROR:  relfilenumber is out of bound

I think, above should have worked.

==



Note that while a table's filenode often matches its OID, this is
not necessarily the case; some operations, like
TRUNCATE, REINDEX,
CLUSTER and some forms
of ALTER TABLE, can change the filenode while preserving
the OID.

I think this note needs some improvement in storage.sgml.  It says the
table's relfilenode mostly matches its OID, but it doesn't.  That will
be true only for system tables and perhaps never for user tables.

==

postgres=# create table t2(a int);
ERROR:  relfilenumber is out of bound

Since this is a user-visible error, I think it would be good to mention
relfilenode instead of relfilenumber. Elsewhere (including the user manual)
we refer to this as a relfilenode.

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 10:36 PM Ashutosh Sharma 
wrote:

> Thanks Dilip. Here are few comments that  could find upon quickly
> reviewing the v11 patch:
>
>  /*
> + * Similar to the XLogPutNextOid but instead of writing NEXTOID log
> record it
> + * writes a NEXT_RELFILENUMBER log record.  If '*prevrecptr' is a valid
> + * XLogRecPtrthen flush the wal upto this record pointer otherwise flush
> upto
>
> XLogRecPtrthen -> XLogRecPtr then
>
> ==
>
> +   switch (relpersistence)
> +   {
> +   case RELPERSISTENCE_TEMP:
> +   backend = BackendIdForTempRelations();
> +   break;
> +   case RELPERSISTENCE_UNLOGGED:
> +   case RELPERSISTENCE_PERMANENT:
> +   backend = InvalidBackendId;
> +   break;
> +   default:
> +   elog(ERROR, "invalid relpersistence: %c", relpersistence);
> +   return InvalidRelFileNumber;/* placate compiler */
> +   }
>
>
> I think the above check should be added at the beginning of the function
> for the reason that if we come to the default switch case we won't be
> acquiring the lwlock and do other stuff to get a new relfilenumber.
>
> ==
>
> -   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
> +* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
> +* because of the possibility that that relation will be moved back to
> the
>
> that that relation -> that relation.
>
> ==
>
> + * option_parse_relfilenumber
> + *
> + * Parse relfilenumber value for an option.  If the parsing is successful,
> + * returns; if parsing fails, returns false.
> + */
>
> If parsing is successful, returns true;
>
> --
> With Regards,
> Ashutosh Sharma.
>
> On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar  wrote:
>
>> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma 
>> wrote:
>>
>> Hi,
>> Note: please avoid top posting.
>>
>> > /*
>> >  * If relfilenumber is unspecified by the caller then create
>> storage
>> > -* with oid same as relid.
>> > +* with relfilenumber same as relid if it is a system table
>> otherwise
>> > +* allocate a new relfilenumber.  For more details read
>> comments atop
>> > +* FirstNormalRelFileNumber declaration.
>> >  */
>> > if (!RelFileNumberIsValid(relfilenumber))
>> > -   relfilenumber = relid;
>> > +   {
>> > +   relfilenumber = relid < FirstNormalObjectId ?
>> > +   relid : GetNewRelFileNumber();
>> >
>> > Above code says that in the case of system table we want relfilenode to
>> be the same as object id. This technically means that the relfilenode or
>> oid for the system tables would not be exceeding 16383. However in the
>> below lines of code added in the patch, it says there is some chance for
>> the storage path of the user tables from the old cluster conflicting with
>> the storage path of the system tables in the new cluster. Assuming that the
>> OIDs for the user tables on the old cluster would start with 16384 (the
>> first object ID), I see no reason why there would be a conflict.
>>
>>
>> Basically, the above comment says that the initial system table
>> storage will be created with the same relfilenumber as Oid so you are
>> right that will not exceed 16383.  And below code is explaining the
>> reason that 

Re: making relfilenodes 56 bits

2022-07-26 Thread Robert Haas
On Tue, Jul 12, 2022 at 4:35 PM Robert Haas  wrote:
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.

Committed that way, also with the fix for the typo Dilip found.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-26 Thread Robert Haas
On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar  wrote:
> I have thought about it while doing so but I am not sure whether it is
> a good idea or not, because before my change these all were macros
> with 2 naming conventions so I just changed to inline function so why
> to change the name.

Well, the reason to change the name would be for consistency. It feels
weird to have some NAMES_LIKETHIS() and other NamesLikeThis().

Now, an argument against that is that it will make back-patching more
annoying, if any code using these functions/macros is touched. But
since the calling sequence is changing anyway (you now have to pass a
pointer rather than the object itself) that argument doesn't really
carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-26 Thread Ashutosh Sharma
Thanks Dilip. Here are a few comments that I could find upon quickly reviewing
the v11 patch:

 /*
+ * Similar to the XLogPutNextOid but instead of writing NEXTOID log record
it
+ * writes a NEXT_RELFILENUMBER log record.  If '*prevrecptr' is a valid
+ * XLogRecPtrthen flush the wal upto this record pointer otherwise flush
upto

XLogRecPtrthen -> XLogRecPtr then

==

+   switch (relpersistence)
+   {
+   case RELPERSISTENCE_TEMP:
+   backend = BackendIdForTempRelations();
+   break;
+   case RELPERSISTENCE_UNLOGGED:
+   case RELPERSISTENCE_PERMANENT:
+   backend = InvalidBackendId;
+   break;
+   default:
+   elog(ERROR, "invalid relpersistence: %c", relpersistence);
+   return InvalidRelFileNumber;/* placate compiler */
+   }


I think the above check should be added at the beginning of the
function, so that if we hit the default switch case we don't first
acquire the lwlock and do the other work needed to get a new
relfilenumber.

==

-   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
+* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+* because of the possibility that that relation will be moved back to
the

that that relation -> that relation.

==

+ * option_parse_relfilenumber
+ *
+ * Parse relfilenumber value for an option.  If the parsing is successful,
+ * returns; if parsing fails, returns false.
+ */

If parsing is successful, returns true;

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar  wrote:

> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma 
> wrote:
>
> Hi,
> Note: please avoid top posting.
>
> > /*
> >  * If relfilenumber is unspecified by the caller then create
> storage
> > -* with oid same as relid.
> > +* with relfilenumber same as relid if it is a system table
> otherwise
> > +* allocate a new relfilenumber.  For more details read comments
> atop
> > +* FirstNormalRelFileNumber declaration.
> >  */
> > if (!RelFileNumberIsValid(relfilenumber))
> > -   relfilenumber = relid;
> > +   {
> > +   relfilenumber = relid < FirstNormalObjectId ?
> > +   relid : GetNewRelFileNumber();
> >
> > Above code says that in the case of system table we want relfilenode to
> be the same as object id. This technically means that the relfilenode or
> oid for the system tables would not be exceeding 16383. However in the
> below lines of code added in the patch, it says there is some chance for
> the storage path of the user tables from the old cluster conflicting with
> the storage path of the system tables in the new cluster. Assuming that the
> OIDs for the user tables on the old cluster would start with 16384 (the
> first object ID), I see no reason why there would be a conflict.
>
>
> Basically, the above comment says that the initial system table
> storage will be created with the same relfilenumber as Oid so you are
> right that will not exceed 16383.  And below code is explaining the
> reason that in order to avoid the conflict with the user table from
> the older cluster we do it this way.  Otherwise, in the new design, we
> have no intention to keep the relfilenode same as Oid.  But during an
> upgrade from the older cluster which is not following this new design
> might have user table relfilenode which can conflict with the system
> table in the new cluster so we have to ensure that with the new design
> also when creating the initial cluster we keep the system table
> relfilenode in low range and directly using Oid is the best idea for
> this purpose instead of defining the completely new range and
> maintaining a separate counter for that.
>
> > +/* --
> > + * RelFileNumber zero is InvalidRelFileNumber.
> > + *
> > + * For the system tables (OID < FirstNormalObjectId) the initial storage
> > + * will be created with the relfilenumber same as their oid.  And,
> later for
> > + * any storage the relfilenumber allocated by GetNewRelFileNumber()
> will start
> > + * at 10.  Thus, when upgrading from an older cluster, the relation
> storage
> > + * path for the user table from the old cluster will not conflict with
> the
> > + * relation storage path for the system table from the new cluster.
> Anyway,
> > + * the new cluster must not have any user tables while upgrading, so we
> needn't
> > + * worry about them.
> > + * --
> > + */
> > +#define FirstNormalRelFileNumber   ((RelFileNumber) 10)
> >
> > ==
> >
> > When WAL logging the next object id we have the chosen the xlog
> threshold value as 8192 whereas for relfilenode it is 512. Any reason for
> choosing this low arbitrary value in case of relfilenumber?
>
> For Oid when we cross the max value we will wraparound, whereas for
> relfilenumber we can not expect the wraparound for cluster lifetime.
> So it is better not to log forward a 

Re: making relfilenodes 56 bits

2022-07-26 Thread Dilip Kumar
On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma  wrote:

Hi,
Note: please avoid top posting.

> /*
>  * If relfilenumber is unspecified by the caller then create storage
> -* with oid same as relid.
> +* with relfilenumber same as relid if it is a system table otherwise
> +* allocate a new relfilenumber.  For more details read comments atop
> +* FirstNormalRelFileNumber declaration.
>  */
> if (!RelFileNumberIsValid(relfilenumber))
> -   relfilenumber = relid;
> +   {
> +   relfilenumber = relid < FirstNormalObjectId ?
> +   relid : GetNewRelFileNumber();
>
> Above code says that in the case of system table we want relfilenode to be 
> the same as object id. This technically means that the relfilenode or oid for 
> the system tables would not be exceeding 16383. However in the below lines of 
> code added in the patch, it says there is some chance for the storage path of 
> the user tables from the old cluster conflicting with the storage path of the 
> system tables in the new cluster. Assuming that the OIDs for the user tables 
> on the old cluster would start with 16384 (the first object ID), I see no 
> reason why there would be a conflict.


Basically, the above comment says that the initial system table
storage will be created with the same relfilenumber as the Oid, so you
are right that it will not exceed 16383.  And the code below explains
the reason: we do it this way in order to avoid conflicts with user
tables from the older cluster.  Otherwise, in the new design, we have
no intention of keeping the relfilenode the same as the Oid.  But an
older cluster, which does not follow this new design, might have user
table relfilenodes that conflict with the system tables in the new
cluster, so with the new design we also have to ensure that when
creating the initial cluster we keep the system table relfilenodes in
the low range, and directly using the Oid is the best idea for this
purpose instead of defining a completely new range and maintaining a
separate counter for it.

> +/* --
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
> + * will be created with the relfilenumber same as their oid.  And, later for
> + * any storage the relfilenumber allocated by GetNewRelFileNumber() will 
> start
> + * at 10.  Thus, when upgrading from an older cluster, the relation 
> storage
> + * path for the user table from the old cluster will not conflict with the
> + * relation storage path for the system table from the new cluster.  Anyway,
> + * the new cluster must not have any user tables while upgrading, so we 
> needn't
> + * worry about them.
> + * --
> + */
> +#define FirstNormalRelFileNumber   ((RelFileNumber) 10)
>
> ==
>
> When WAL logging the next object id we have the chosen the xlog threshold 
> value as 8192 whereas for relfilenode it is 512. Any reason for choosing this 
> low arbitrary value in case of relfilenumber?

For the Oid, when we cross the max value we will wrap around, whereas
for the relfilenumber we do not expect a wraparound within the
cluster's lifetime.  So it is better not to log forward a really large
number of relfilenumbers as we do for the Oid.  OTOH, if we make it
really low, like 64, then we can see RelFileNumberGenLock in the wait
events under very high concurrency, where 32 backends are continuously
creating/dropping tables.  So we thought of choosing the number 512 so
that it is not so low that it creates lock contention, and not so high
that we need to worry about wasting that many relfilenumbers on a
crash.
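
To make the scheme easier to follow, here is a rough sketch of the
idea in C (hypothetical names and state; RelFileNumberGenLock,
LogNextRelFileNumber() and the struct below are made up for
illustration and are not the actual patch code):

#define RELNUMBER_PER_XLOG              512
#define RELNUMBER_NEW_XLOG_THRESHOLD    (RELNUMBER_PER_XLOG / 2)

typedef struct RelNumberGenState
{
    RelFileNumber nextRelFileNumber;    /* next value to hand out */
    RelFileNumber loggedUpTo;           /* covered by some logged WAL record */
    RelFileNumber flushedUpTo;          /* covered by a *flushed* WAL record */
    XLogRecPtr    lastLogRecPtr;        /* LSN of the latest such record */
} RelNumberGenState;

static RelFileNumber
sketch_GetNewRelFileNumber(RelNumberGenState *s)
{
    RelFileNumber result;

    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

    /* Log the next batch once half of the current one has been used. */
    if (s->loggedUpTo - s->nextRelFileNumber <= RELNUMBER_NEW_XLOG_THRESHOLD)
    {
        s->loggedUpTo += RELNUMBER_PER_XLOG;
        s->lastLogRecPtr = LogNextRelFileNumber(s->loggedUpTo);
    }

    /* Never hand out a value that is not yet covered by flushed WAL. */
    if (s->nextRelFileNumber >= s->flushedUpTo)
    {
        XLogFlush(s->lastLogRecPtr);
        s->flushedUpTo = s->loggedUpTo;
    }

    result = s->nextRelFileNumber++;

    LWLockRelease(RelFileNumberGenLock);

    return result;
}

With 512-value batches, the XLogFlush() of the pending record can
usually piggyback on WAL that other backends have already flushed, so
the lock is rarely held while waiting for I/O.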

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-26 Thread Ashutosh Sharma
/*
 * If relfilenumber is unspecified by the caller then create storage
-* with oid same as relid.
+* with relfilenumber same as relid if it is a system table
otherwise
+* allocate a new relfilenumber.  For more details read comments
atop
+* FirstNormalRelFileNumber declaration.
 */
if (!RelFileNumberIsValid(relfilenumber))
-   relfilenumber = relid;
+   {
+   relfilenumber = relid < FirstNormalObjectId ?
+   relid : GetNewRelFileNumber();

Above code says that in the case of system table we want relfilenode to be
the same as object id. This technically means that the relfilenode or oid
for the system tables would not be exceeding 16383. However in the below
lines of code added in the patch, it says there is some chance for the
storage path of the user tables from the old cluster conflicting with the
storage path of the system tables in the new cluster. Assuming that the
OIDs for the user tables on the old cluster would start with 16384 (the
first object ID), I see no reason why there would be a conflict.

+/* --
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage
+ * will be created with the relfilenumber same as their oid.  And, later
for
+ * any storage the relfilenumber allocated by GetNewRelFileNumber() will
start
+ * at 10.  Thus, when upgrading from an older cluster, the relation
storage
+ * path for the user table from the old cluster will not conflict with the
+ * relation storage path for the system table from the new cluster.
Anyway,
+ * the new cluster must not have any user tables while upgrading, so we
needn't
+ * worry about them.
+ * --
+ */
+#define FirstNormalRelFileNumber   ((RelFileNumber) 10)

==

When WAL logging the next object id we have chosen the xlog threshold
value as 8192, whereas for relfilenode it is 512. Any reason for
choosing this low arbitrary value in the case of relfilenumber?

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar  wrote:

> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro 
> wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar 
> wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
> Here is the patch to fix the issue, basically, while asserting for the
> file existence it was not setting the relfilenumber in the
> relfilelocator before generating the path so it was just checking for
> the existence of the random path so it was asserting randomly.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: making relfilenodes 56 bits

2022-07-26 Thread Dilip Kumar
On Tue, Jul 26, 2022 at 10:05 AM Amul Sul  wrote:
>
> Few more typos in 0004 patch as well:
>
> the a value
> interger
> previosly
> currenly
>

Thanks for the review, I will fix it in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-26 Thread Dilip Kumar
On Fri, Jul 22, 2022 at 4:21 PM vignesh C  wrote:
>
> On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar  wrote:
> >

> Thanks for the patch, my comments from the initial review:
> 1) Since we have changed the macros to inline functions, should we
> change the function names similar to the other inline functions in the
> same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:

I thought about it while doing so, but I am not sure whether it is a
good idea or not, because before my change these were all macros with
two naming conventions; I just changed them to inline functions, so
why change the names?

> -#define BUFFERTAGS_EQUAL(a,b) \
> -( \
> -   RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
> -   (a).blockNum == (b).blockNum && \
> -   (a).forkNum == (b).forkNum \
> -)
> +static inline void
> +CLEAR_BUFFERTAG(BufferTag *tag)
> +{
> +   tag->rlocator.spcOid = InvalidOid;
> +   tag->rlocator.dbOid = InvalidOid;
> +   tag->rlocator.relNumber = InvalidRelFileNumber;
> +   tag->forkNum = InvalidForkNumber;
> +   tag->blockNum = InvalidBlockNumber;
> +}
>
> 2) We could move this macros along with the other macros at the top of the 
> file:
> +/*
> + * The freeNext field is either the index of the next freelist entry,
> + * or one of these special values:
> + */
> +#define FREENEXT_END_OF_LIST   (-1)
> +#define FREENEXT_NOT_IN_LIST   (-2)

Yeah we can do that.

> 3) typo thn should be then:
> + * can raise it as necessary if we end up with more mapped relations. For
> + * now, we just pick a round number that is modestly larger thn the expected
> + * number of mappings.
> + */
>
> 4) There is one whitespace issue:
> git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
> Applying: Widen relfilenumber from 32 bits to 56 bits
> .git/rebase-apply/patch:1500: space before tab in indent.
> (relfilenumber; \
> warning: 1 line adds whitespace errors.

Okay, I will fix it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-26 Thread Dilip Kumar
On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro  wrote:
>
> On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar  wrote:
> > [v10 patch set]
>
> Hi Dilip, I'm experimenting with these patches and will hopefully have
> more to say soon, but I just wanted to point out that this builds with
> warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> there is the good kind of uninitialised data on Linux, and the bad
> kind of uninitialised data on those other pesky systems?

Thanks, I have figured out the issue, I will post the patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-25 Thread Dilip Kumar
On Mon, Jul 25, 2022 at 9:51 PM Ashutosh Sharma  wrote:
>
> Hi,
>
> As oid and relfilenumber are linked with each other, I still see that if the 
> oid value reaches the threshold limit, we are unable to create a table with 
> storage. For example I set FirstNormalObjectId to 4294967294 (one value less 
> than the range limit of 2^32 -1 = 4294967295). Now when I try to create a 
> table, the CREATE TABLE command gets stuck because it is unable to find the 
> OID for the comp type although it can find a new relfilenumber.
>

First of all, if the OID value reaches the max OID then it should wrap
around to FirstNormalObjectId and find a new non-conflicting OID.
Since in your case the first normal OID is 4294967294 and the max OID
is 4294967295, there is no scope for wraparound, because in this case
you can create at most one object, and once you have created it there
are no more unused OIDs left; with the current patch we are not trying
to do anything about this.

Now, coming to the problem we are trying to solve with the 56-bit
relfilenode: here we are not trying to extend the limit of the system
to create more than 4294967294 objects.  What we are trying to solve
is not reusing the same disk filenames for different objects.  And
also notice that relfilenodes can get consumed much faster than OIDs,
so the chance of wraparound is higher; I mean you can truncate/rewrite
the same relation multiple times, so that relation will keep the same
OID but will consume multiple relfilenodes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-25 Thread Amul Sul
On Fri, Jul 22, 2022 at 4:21 PM vignesh C  wrote:
>
> On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar  wrote:
> >
> > On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar  wrote:
> > >
> > > I was doing some more testing by setting the FirstNormalRelFileNumber
> > > to a high value(more than 32 bits) I have noticed a couple of problems
> > > there e.g. relpath is still using OIDCHARS macro which says max
> > > relfilenumber file name can be only 10 character long which is no
> > > longer true.  So there we need to change this value to 20 and also
> > > need to carefully rename the macros and other variable names used for
> > > this purpose.
> > >
> > > Similarly there was some issue in macro in buf_internal.h while
> > > fetching the relfilenumber.  So I will relook into all those issues
> > > and repost the patch soon.
> >
> > I have fixed these existing issues and there was also some issue in
> > pg_dump.c which was creating problems in upgrading to the same version
> > while using a higher range of the relfilenumber.
> >
> > There was also an issue where the user table from the old cluster's
> > relfilenode could conflict with the system table of the new cluster.
> > As a solution currently for system table object (while creating
> > storage first time) we are keeping the low range of relfilenumber,
> > basically we are using the same relfilenumber as OID so that during
> > upgrade the normal user table from the old cluster will not conflict
> > with the system tables in the new cluster.  But with this solution
> > Robert told me (in off list chat) a problem that in future if we want
> > to make relfilenumber completely unique within a cluster by
> > implementing the CREATEDB differently then we can not do that as we
> > have created fixed relfilenodes for the system tables.
> >
> > I am not sure what exactly we can do to avoid that because even if we
> > do something  to avoid that in the new cluster the old cluster might
> > be already using the non-unique relfilenode so after upgrading the new
> > cluster will also get those non-unique relfilenode.
>
> Thanks for the patch, my comments from the initial review:
> 1) Since we have changed the macros to inline functions, should we
> change the function names similar to the other inline functions in the
> same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:
> -#define BUFFERTAGS_EQUAL(a,b) \
> -( \
> -   RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
> -   (a).blockNum == (b).blockNum && \
> -   (a).forkNum == (b).forkNum \
> -)
> +static inline void
> +CLEAR_BUFFERTAG(BufferTag *tag)
> +{
> +   tag->rlocator.spcOid = InvalidOid;
> +   tag->rlocator.dbOid = InvalidOid;
> +   tag->rlocator.relNumber = InvalidRelFileNumber;
> +   tag->forkNum = InvalidForkNumber;
> +   tag->blockNum = InvalidBlockNumber;
> +}
>
> 2) We could move this macros along with the other macros at the top of the 
> file:
> +/*
> + * The freeNext field is either the index of the next freelist entry,
> + * or one of these special values:
> + */
> +#define FREENEXT_END_OF_LIST   (-1)
> +#define FREENEXT_NOT_IN_LIST   (-2)
>
> 3) typo thn should be then:
> + * can raise it as necessary if we end up with more mapped relations. For
> + * now, we just pick a round number that is modestly larger thn the expected
> + * number of mappings.
> + */
>

Few more typos in 0004 patch as well:

the a value
interger
previosly
currenly

> 4) There is one whitespace issue:
> git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
> Applying: Widen relfilenumber from 32 bits to 56 bits
> .git/rebase-apply/patch:1500: space before tab in indent.
> (relfilenumber; \
> warning: 1 line adds whitespace errors.
>
> Regards,
> Vignesh
>

Regards,
Amul




Re: making relfilenodes 56 bits

2022-07-25 Thread Ashutosh Sharma
Hi,

As oid and relfilenumber are linked with each other, I still see that if
the oid value reaches the threshold limit, we are unable to create a table
with storage. For example, I set FirstNormalObjectId to 4294967294 (one
value less than the range limit of 2^32 - 1 = 4294967295). Now when I try to
create a table, the CREATE TABLE command gets stuck because it is unable to
find an OID for the composite type, although it can find a new relfilenumber.

postgres=# create table t1(a int);
CREATE TABLE

postgres=# select oid, reltype, relfilenode from pg_class where relname =
't1';
    oid     |  reltype   | relfilenode
------------+------------+-------------
 4294967295 | 4294967294 |          10
(1 row)

postgres=# create table t2(a int);
^CCancel request sent
ERROR:  canceling statement due to user request

Creation of the t2 table gets stuck as it is unable to find a new OID.
Basically, the point that I am trying to make here is that even though
we will be able to find a new relfilenumber by increasing the
relfilenumber size, commands like the above will still not execute if
the (32-bit) OID value has reached the threshold limit.

--
With Regards,
Ashutosh Sharma.



On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar  wrote:

> On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar  wrote:
> >
> > I was doing some more testing by setting the FirstNormalRelFileNumber
> > to a high value(more than 32 bits) I have noticed a couple of problems
> > there e.g. relpath is still using OIDCHARS macro which says max
> > relfilenumber file name can be only 10 character long which is no
> > longer true.  So there we need to change this value to 20 and also
> > need to carefully rename the macros and other variable names used for
> > this purpose.
> >
> > Similarly there was some issue in macro in buf_internal.h while
> > fetching the relfilenumber.  So I will relook into all those issues
> > and repost the patch soon.
>
> I have fixed these existing issues and there was also some issue in
> pg_dump.c which was creating problems in upgrading to the same version
> while using a higher range of the relfilenumber.
>
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: making relfilenodes 56 bits

2022-07-22 Thread vignesh C
On Wed, Jul 20, 2022 at 4:57 PM Dilip Kumar  wrote:
>
> On Mon, Jul 18, 2022 at 4:51 PM Dilip Kumar  wrote:
> >
> > I was doing some more testing by setting the FirstNormalRelFileNumber
> > to a high value(more than 32 bits) I have noticed a couple of problems
> > there e.g. relpath is still using OIDCHARS macro which says max
> > relfilenumber file name can be only 10 character long which is no
> > longer true.  So there we need to change this value to 20 and also
> > need to carefully rename the macros and other variable names used for
> > this purpose.
> >
> > Similarly there was some issue in macro in buf_internal.h while
> > fetching the relfilenumber.  So I will relook into all those issues
> > and repost the patch soon.
>
> I have fixed these existing issues and there was also some issue in
> pg_dump.c which was creating problems in upgrading to the same version
> while using a higher range of the relfilenumber.
>
> There was also an issue where the user table from the old cluster's
> relfilenode could conflict with the system table of the new cluster.
> As a solution currently for system table object (while creating
> storage first time) we are keeping the low range of relfilenumber,
> basically we are using the same relfilenumber as OID so that during
> upgrade the normal user table from the old cluster will not conflict
> with the system tables in the new cluster.  But with this solution
> Robert told me (in off list chat) a problem that in future if we want
> to make relfilenumber completely unique within a cluster by
> implementing the CREATEDB differently then we can not do that as we
> have created fixed relfilenodes for the system tables.
>
> I am not sure what exactly we can do to avoid that because even if we
> do something  to avoid that in the new cluster the old cluster might
> be already using the non-unique relfilenode so after upgrading the new
> cluster will also get those non-unique relfilenode.

Thanks for the patch, my comments from the initial review:
1) Since we have changed the macros to inline functions, should we
change the function names similar to the other inline functions in the
same file like: ClearBufferTag, InitBufferTag & BufferTagsEqual:
-#define BUFFERTAGS_EQUAL(a,b) \
-( \
-   RelFileLocatorEquals((a).rlocator, (b).rlocator) && \
-   (a).blockNum == (b).blockNum && \
-   (a).forkNum == (b).forkNum \
-)
+static inline void
+CLEAR_BUFFERTAG(BufferTag *tag)
+{
+   tag->rlocator.spcOid = InvalidOid;
+   tag->rlocator.dbOid = InvalidOid;
+   tag->rlocator.relNumber = InvalidRelFileNumber;
+   tag->forkNum = InvalidForkNumber;
+   tag->blockNum = InvalidBlockNumber;
+}

2) We could move these macros along with the other macros to the top of the file:
+/*
+ * The freeNext field is either the index of the next freelist entry,
+ * or one of these special values:
+ */
+#define FREENEXT_END_OF_LIST   (-1)
+#define FREENEXT_NOT_IN_LIST   (-2)

3) typo thn should be then:
+ * can raise it as necessary if we end up with more mapped relations. For
+ * now, we just pick a round number that is modestly larger thn the expected
+ * number of mappings.
+ */

4) There is one whitespace issue:
git am v10-0004-Widen-relfilenumber-from-32-bits-to-56-bits.patch
Applying: Widen relfilenumber from 32 bits to 56 bits
.git/rebase-apply/patch:1500: space before tab in indent.
(relfilenumber; \
warning: 1 line adds whitespace errors.

Regards,
Vignesh




Re: making relfilenodes 56 bits

2022-07-20 Thread Thomas Munro
On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar  wrote:
> [v10 patch set]

Hi Dilip, I'm experimenting with these patches and will hopefully have
more to say soon, but I just wanted to point out that this builds with
warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
there is the good kind of uninitialised data on Linux, and the bad
kind of uninitialised data on those other pesky systems?




Re: making relfilenodes 56 bits

2022-07-18 Thread Dilip Kumar
On Thu, Jul 14, 2022 at 5:18 PM Dilip Kumar  wrote:

> Apart from this I have fixed all the pending issues that includes
>
> - Change existing macros to inline functions done in 0001.
> - Change pg_class index from (tbspc, relfilenode) to relfilenode and
> also change RelidByRelfilenumber().  In RelidByRelfilenumber I have
> changed the hash to maintain based on just the relfilenumber but we
> still need to pass the tablespace to identify whether it is a shared
> relation or not.  If we want we can make it bool but I don't think
> that is really needed here.
> - Changed logic of GetNewRelFileNumber() based on what Robert
> described, and instead of tracking the pending logged relnumbercount
> now I am tracking last loggedRelNumber, which help little bit in
> SetNextRelFileNumber in making code cleaner, but otherwise it doesn't
> make much difference.
> - Some new asserts in buf_internal inline function to validate value
> of computed/input relfilenumber.

While doing some more testing by setting FirstNormalRelFileNumber to a
high value (more than 32 bits), I noticed a couple of problems there,
e.g. relpath is still using the OIDCHARS macro, which says the max
relfilenumber file name can be only 10 characters long, which is no
longer true.  So there we need to change this value to 20 and also
need to carefully rename the macros and other variable names used for
this purpose.

Similarly, there was an issue in a macro in buf_internals.h while
fetching the relfilenumber.  So I will relook into all those issues
and repost the patch soon.
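
Just to illustrate the sizing (the macro name below is made up, not
necessarily what the patch uses): a relfilenumber stored in a 64-bit
integer can need up to 20 decimal digits when printed, versus 10 for a
32-bit OID, so buffers sized with OIDCHARS become too small.

#define RELNUMBERCHARS  20      /* max digits of a uint64 in decimal */

    char        numbuf[RELNUMBERCHARS + 1];

    /* relNumber is assumed to be some RelFileNumber value */
    snprintf(numbuf, sizeof(numbuf), UINT64_FORMAT, (uint64) relNumber);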

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-12 Thread Dilip Kumar
On Tue, Jul 12, 2022 at 7:21 PM Robert Haas  wrote:
>

> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates. I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.
> It would require substantially more changes to the code in this file,
> and that means there's more risk of introducing bugs, and I don't see
> that there's much value anyway, because if we ever do hit the current
> limit, we can just raise the limit.
>
> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

The patch looks good except one minor comment

+ * corruption.  Since the file might be more tha none standard-size disk
+ * sector in size, we cannot rely on overwrite-in-place. Instead, we generate

typo "more tha none" -> "more than one"

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-12 Thread Robert Haas
On Tue, Jul 12, 2022 at 6:02 PM Andres Freund  wrote:
> MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
> restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
> filesystem access each time), it's not feasible to make that number large.

Yeah. TBH, what I'd really like to do is kill the entire fork system
with fire and replace it with something more scalable, which would
maybe permit the sort of thing Hannu suggests here. With the current
system, forget it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-12 Thread Andres Freund
Hi,

Please don't top quote - as mentioned a couple times recently.

On 2022-07-12 23:00:22 +0200, Hannu Krosing wrote:
> Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);
> 
> Have you really thought through making the ForkNum 8-bit ?

MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new
restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with
filesystem access each time), it's not feasible to make that number large.

Greetings,

Andres Freund




Re: making relfilenodes 56 bits

2022-07-12 Thread Hannu Krosing
Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX);

Have you really thought through making the ForkNum 8-bit ?

For example this would limit a columnar storage with each column
stored in its own fork (which I'd say is not entirely unreasonable)
to having just about ~250 columns.

And there can easily be other use cases where we do not want to limit
the number of forks so much.

Cheers
Hannu

On Tue, Jul 12, 2022 at 10:36 PM Robert Haas  wrote:
>
> On Tue, Jul 12, 2022 at 1:09 PM Andres Freund  wrote:
> > What does currently happen if we exceed that?
>
> elog
>
> > > diff --git a/src/include/utils/wait_event.h 
> > > b/src/include/utils/wait_event.h
> > > index b578e2ec75..5d3775ccde 100644
> > > --- a/src/include/utils/wait_event.h
> > > +++ b/src/include/utils/wait_event.h
> > > @@ -193,7 +193,7 @@ typedef enum
> > >   WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> > >   WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> > >   WAIT_EVENT_RELATION_MAP_READ,
> > > - WAIT_EVENT_RELATION_MAP_SYNC,
> > > + WAIT_EVENT_RELATION_MAP_RENAME,
> >
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>




Re: making relfilenodes 56 bits

2022-07-12 Thread Robert Haas
On Tue, Jul 12, 2022 at 1:09 PM Andres Freund  wrote:
> What does currently happen if we exceed that?

elog

> > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> > index b578e2ec75..5d3775ccde 100644
> > --- a/src/include/utils/wait_event.h
> > +++ b/src/include/utils/wait_event.h
> > @@ -193,7 +193,7 @@ typedef enum
> >   WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
> >   WAIT_EVENT_LOGICAL_REWRITE_WRITE,
> >   WAIT_EVENT_RELATION_MAP_READ,
> > - WAIT_EVENT_RELATION_MAP_SYNC,
> > + WAIT_EVENT_RELATION_MAP_RENAME,
>
> Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> since it includes fsync etc?

Sure, I had it that way for a while and changed it at the last minute.
I can change it back.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-12 Thread Andres Freund
Hi,

On 2022-07-12 09:51:12 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 7:22 PM Andres Freund  wrote:
> > I guess I'm not enthused in duplicating the necessary knowledge in evermore
> > places. We've forgotten one of the magic incantations in the past, and 
> > needing
> > to find all the places that need to be patched is a bit bothersome.
> >
> > Perhaps we could add extract helpers out of durable_rename()?
> >
> > OTOH, I don't really see what we gain by keeping things out of the critical
> > section? It does seem good to have the temp-file creation/truncation and 
> > write
> > separately, but after that I don't think it's worth much to avoid a
> > PANIC. What legitimate issue does it avoid?
> 
> OK, so then I think we should just use durable_rename(). Here's a
> patch that does it that way. I briefly considered the idea of
> extracting helpers, but it doesn't seem worthwhile to me. There's not
> that much code in durable_rename() in the first place.

Cool.


> In this version, I also removed the struct padding, changed the limit
> on the number of entries to a nice round 64, and made some comment
> updates.

What does currently happen if we exceed that?

I wonder if we should just reference a new define generated by genbki.pl
documenting the number of relations that need to be tracked. Then we don't
need to maintain this manually going forward.


> I considered trying to go further and actually make the file
> variable-size, so that we never again need to worry about the limit on
> the number of entries, but I don't actually think that's a good idea.

Yea, I don't really see what we'd gain. For this stuff to change we need to
recompile anyway.


> If we were going to split up durable_rename(), the only intelligible
> split I can see would be to have a second version of the function, or
> a flag to the existing function, that caters to the situation where
> the old file is already known to have been fsync()'d.

I was thinking of something like durable_rename_prep() that'd fsync the
file/directories under their old names, and then durable_rename_exec() that
actually renames and then fsyncs.  But without a clear usecase...
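
Just to illustrate the split being described, purely hypothetical
declarations (neither function exists today):

/* fsync the old file (and the target, if it already exists) in place */
extern int  durable_rename_prep(const char *oldfile, const char *newfile,
                                int elevel);

/* rename() the file into place, then fsync the containing directory */
extern int  durable_rename_exec(const char *oldfile, const char *newfile,
                                int elevel);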


> + /* Write new data to the file. */
> + pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
> + if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
...
> + pgstat_report_wait_end();
> +

Not for this patch, but we eventually should move this sequence into a
wrapper. Perhaps combined with retry handling for short writes, the ENOSPC
stuff and an error message when the write fails. It's a bit insane how many
copies of this we have.
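
For illustration, the kind of wrapper meant here might look roughly
like this (hypothetical name and signature, not an existing function;
retry handling for short writes is omitted):

static void
write_whole_file(int fd, const char *path, const void *buf, int len,
                 uint32 wait_event_info)
{
    pgstat_report_wait_start(wait_event_info);
    errno = 0;
    if (write(fd, buf, len) != len)
    {
        /* if write didn't set errno, assume the problem is disk space */
        if (errno == 0)
            errno = ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    }
    pgstat_report_wait_end();
}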


> diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
> index b578e2ec75..5d3775ccde 100644
> --- a/src/include/utils/wait_event.h
> +++ b/src/include/utils/wait_event.h
> @@ -193,7 +193,7 @@ typedef enum
>   WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
>   WAIT_EVENT_LOGICAL_REWRITE_WRITE,
>   WAIT_EVENT_RELATION_MAP_READ,
> - WAIT_EVENT_RELATION_MAP_SYNC,
> + WAIT_EVENT_RELATION_MAP_RENAME,

Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
since it includes fsync etc?

Greetings,

Andres Freund




Re: making relfilenodes 56 bits

2022-07-12 Thread Dilip Kumar
On Mon, Jul 11, 2022 at 9:49 PM Robert Haas  wrote:
>

> It also makes me wonder why we're using macros rather than static
> inline functions in buf_internals.h. I wonder whether we could do
> something like this, for example, and keep InvalidForkNumber as -1:
>
> static inline ForkNumber
> BufTagGetForkNum(BufferTag *tagPtr)
> {
> int8 ret;
>
> StaticAssertStmt(MAX_FORKNUM <= INT8_MAX);
> ret = (int8) (tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS);
> return (ForkNumber) ret;
> }
>
> Even if we don't use that particular trick, I think we've generally
> been moving toward using static inline functions rather than macros,
> because it provides better type-safety and the code is often easier to
> read. Maybe we should also approach it that way here. Or even commit a
> preparatory patch replacing the existing macros with inline functions.
> Or maybe it's best to leave it alone, not sure.

I think it makes sense to convert the existing macros as well; I have
attached a patch for the same.
>
> > I had those changes in v7-0003, now I have merged with 0002.  This has
> > assert check while replaying the WAL for smgr create and smgr
> > truncate, and while during normal path when allocating the new
> > relfilenumber we are asserting for any existing file.
>
> I think a test-and-elog might be better. Most users won't be running
> assert-enabled builds, but this seems worth checking regardless.

IMHO the recovery-time asserts can be converted to elog, but the one
we are doing after each GetNewRelFileNumber() is better kept as an
assert, since it involves file access and so can be costly?

> > I have done some performance tests, with very small values I can see a
> > lot of wait events for RelFileNumberGen but with bigger numbers like
> > 256 or 512 it is not really bad.  See results at the end of the
> > mail[1]
>
> It's a little hard to interpret these results because you don't say
> how often you were checking the wait events, or how often the
> operation took to complete. I suppose we can guess the relative time
> scale from the number of Activity events: if there were 190
> WalWriterMain events observed, then the time to complete the operation
> is probably 190 times how often you were checking the wait events, but
> was that every second or every half second or every tenth of a second?

I am executing it every 0.5 sec using the below script in psql:
\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

And running test for 60 sec
./pgbench -c 32 -j 32 -T 60 -f create_script.sql -p 54321  postgres

$ cat create_script.sql
select create_table(100);

// function body 'create_table'
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
EXECUTE format('CREATE TABLE %s(a int)', relname);

EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;
$$ LANGUAGE plpgsql;



> > I have done these changes during GetNewRelFileNumber() this required
> > to track the last logged record pointer as well but I think this looks
> > clean.  With this I can see some reduction in RelFileNumberGen wait
> > event[1]
>
> I find the code you wrote here a little bit magical. I believe it
> depends heavily on choosing to issue the new WAL record when we've
> exhausted exactly 50% of the available space. I suggest having two
> constants, one of which is the number of relfilenumber values per WAL
> record, and the other of which is the threshold for issuing a new WAL
> record. Maybe something like RFN_VALUES_PER_XLOG and
> RFN_NEW_XLOG_THRESHOLD, or something. And then work code that works
> correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log
> new RFNs until old allocation is completely exhausted) and
> RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the
> previous allocation). That way, if in the future someone decides to
> change the constant values, they can do that and the code still works.

ok



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 553a6b185ee7270bf206387421e50b48c7a8b97b Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 12 Jul 2022 17:10:04 +0530
Subject: [PATCH v1] Convert buf_internal.h macros to static inline functions

Readability-wise, inline functions are better than macros, and this will
also help to write cleaner and more readable code for the 64-bit
relfilenode, because as part of that patch we will have to do some complex
bitwise operations, and doing that inside inline functions will be cleaner.
---
 src/backend/storage/buffer/buf_init.c |   2 +-
 src/backend/storage/buffer/bufmgr.c   |  16 ++--
 src/backend/storage/buffer/localbuf.c |  12 +--
 src/include/storage/buf_internals.h   | 156 +-
 4 files changed, 111 insertions(+), 75 deletions(-)

diff --git 

Re: making relfilenodes 56 bits

2022-07-12 Thread Robert Haas
On Mon, Jul 11, 2022 at 7:22 PM Andres Freund  wrote:
> I guess I'm not enthused in duplicating the necessary knowledge in evermore
> places. We've forgotten one of the magic incantations in the past, and needing
> to find all the places that need to be patched is a bit bothersome.
>
> Perhaps we could add extract helpers out of durable_rename()?
>
> OTOH, I don't really see what we gain by keeping things out of the critical
> section? It does seem good to have the temp-file creation/truncation and write
> separately, but after that I don't think it's worth much to avoid a
> PANIC. What legitimate issue does it avoid?

OK, so then I think we should just use durable_rename(). Here's a
patch that does it that way. I briefly considered the idea of
extracting helpers, but it doesn't seem worthwhile to me. There's not
that much code in durable_rename() in the first place.

In this version, I also removed the struct padding, changed the limit
on the number of entries to a nice round 64, and made some comment
updates. I considered trying to go further and actually make the file
variable-size, so that we never again need to worry about the limit on
the number of entries, but I don't actually think that's a good idea.
It would require substantially more changes to the code in this file,
and that means there's more risk of introducing bugs, and I don't see
that there's much value anyway, because if we ever do hit the current
limit, we can just raise the limit.

If we were going to split up durable_rename(), the only intelligible
split I can see would be to have a second version of the function, or
a flag to the existing function, that caters to the situation where
the old file is already known to have been fsync()'d.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Remove-the-restriction-that-the-relmap-must-be-51.patch
Description: Binary data


Re: making relfilenodes 56 bits

2022-07-11 Thread Andres Freund
On 2022-07-11 16:11:53 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 3:34 PM Andres Freund  wrote:
> > Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
> > first (likely adding O_TRUNC to flags), use durable_rename() to rename it 
> > into
> > place.  The tempfile should probably be written out before the XLogInsert(),
> > the durable_rename() after, although I think it'd also be correct to more
> > closely approximate the current sequence.
> 
> Something like this?

Yea. I've not looked carefully, but on a quick skim it looks good.


> I chose not to use durable_rename() here, because that allowed me to
> do more of the work before starting the critical section, and it's
> probably slightly more efficient this way, too. That could be changed,
> though, if you really want to stick with durable_rename().

I guess I'm not enthused in duplicating the necessary knowledge in evermore
places. We've forgotten one of the magic incantations in the past, and needing
to find all the places that need to be patched is a bit bothersome.

Perhaps we could add extract helpers out of durable_rename()?

OTOH, I don't really see what we gain by keeping things out of the critical
section? It does seem good to have the temp-file creation/truncation and write
separately, but after that I don't think it's worth much to avoid a
PANIC. What legitimate issue does it avoid?


Greetings,

Andres Freund




Re: making relfilenodes 56 bits

2022-07-11 Thread Robert Haas
On Mon, Jul 11, 2022 at 3:34 PM Andres Freund  wrote:
> Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
> first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
> place.  The tempfile should probably be written out before the XLogInsert(),
> the durable_rename() after, although I think it'd also be correct to more
> closely approximate the current sequence.

Something like this?

I chose not to use durable_rename() here, because that allowed me to
do more of the work before starting the critical section, and it's
probably slightly more efficient this way, too. That could be changed,
though, if you really want to stick with durable_rename().

I haven't done anything about actually making the file variable-length
here, either, which I think is what we would want to do. If this seems
more or less all right, I can work on that next.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-rough-draft-of-removing-relmap-size-restriction.patch
Description: Binary data


Re: making relfilenodes 56 bits

2022-07-11 Thread Andres Freund
Hi,

On 2022-07-11 15:08:57 -0400, Robert Haas wrote:
> On Mon, Jul 11, 2022 at 2:57 PM Andres Freund  wrote:
> > I don't know where we could fit a sanity check that connects to all 
> > databases
> > and detects duplicates across all the pg_class instances. Perhaps 
> > pg_amcheck?
> 
> Unless we're going to change the way CREATE DATABASE works, uniqueness
> across databases is not guaranteed.

You could likely address that by not flagging conflicts iff oid also matches?
Not sure if worth it, but ...


> > Maybe the easiest fix here would be to replace the file atomically. Then we
> > don't need this <= 512 byte stuff. These are done rarely enough that I don't
> > think the overhead of creating a separate file, fsyncing that, renaming,
> > fsyncing, would be a problem?
> 
> Anything we can reasonably do to reduce the number of places where
> we're relying on things being <= 512 bytes seems like a step in the
> right direction to me. It's very difficult to know whether such code
> is correct, or what the probability is that crossing the 512-byte
> boundary would break anything.

Seems pretty simple to do. Have write_relmapper_file() write to a .tmp file
first (likely adding O_TRUNC to flags), use durable_rename() to rename it into
place.  The tempfile should probably be written out before the XLogInsert(),
the durable_rename() after, although I think it'd also be correct to more
closely approximate the current sequence.
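
A sketch of that sequence, assuming a write_relmapper_file()-style
caller (mapfilename and newmap come from the caller; error handling is
abbreviated and the .tmp naming is only for illustration):

    char        temppath[MAXPGPATH];
    int         fd;

    snprintf(temppath, sizeof(temppath), "%s.tmp", mapfilename);

    fd = OpenTransientFile(temppath, O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not open file \"%s\": %m", temppath)));

    errno = 0;
    if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
    {
        /* if write didn't set errno, assume the problem is disk space */
        if (errno == 0)
            errno = ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", temppath)));
    }

    if (pg_fsync(fd) != 0)
        ereport(data_sync_elevel(ERROR),
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", temppath)));

    if (CloseTransientFile(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", temppath)));

    /* ... XLogInsert() the new map contents here ... */

    /* rename into place; durable_rename() does the remaining fsyncs */
    durable_rename(temppath, mapfilename, ERROR);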

It's a lot more problematic to do this for the control file, because we can
end up updating that at a high frequency on standbys, due to minRecoveryPoint.

I have wondered about maintaining that in a dedicated file instead, and
perhaps even doing so on a primary.

Greetings,

Andres Freund




Re: making relfilenodes 56 bits

2022-07-11 Thread Robert Haas
On Mon, Jul 11, 2022 at 2:57 PM Andres Freund  wrote:
> ISTM that we should redefine pg_class_tblspc_relfilenode_index to only cover
> relfilenode - afaics there's no real connection to the tablespace
> anymore. That'd a) reduce the size of the index b) guarantee uniqueness across
> tablespaces.

Sounds like a good idea.

> I don't know where we could fit a sanity check that connects to all databases
> and detects duplicates across all the pg_class instances. Perhaps pg_amcheck?

Unless we're going to change the way CREATE DATABASE works, uniqueness
across databases is not guaranteed.

> I think that's a very good idea. My concern around doing an XLogFlush() is
> that it could lead to a lot of tiny f[data]syncs(), because not much else
> needs to be written out. But the scheme you describe would likely lead the
> XLogFlush() flushing plenty other WAL writes out, addressing that.

Oh, interesting. I hadn't considered that angle.

> Maybe the easiest fix here would be to replace the file atomically. Then we
> don't need this <= 512 byte stuff. These are done rarely enough that I don't
> think the overhead of creating a separate file, fsyncing that, renaming,
> fsyncing, would be a problem?

Anything we can reasonably do to reduce the number of places where
we're relying on things being <= 512 bytes seems like a step in the
right direction to me. It's very difficult to know whether such code
is correct, or what the probability is that crossing the 512-byte
boundary would break anything.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-11 Thread Andres Freund
Hi,

On 2022-07-07 13:26:29 -0400, Robert Haas wrote:
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

ISTM that we should redefine pg_class_tblspc_relfilenode_index to only cover
relfilenode - afaics there's no real connection to the tablespace
anymore. That'd a) reduce the size of the index b) guarantee uniqueness across
tablespaces.

I don't know where we could fit a sanity check that connects to all databases
and detects duplicates across all the pg_class instances. Perhaps pg_amcheck?


It may be worth changing RelidByRelfilenumber() / its infrastructure to not
use reltablespace anymore.


> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it.

I think that's a very good idea. My concern around doing an XLogFlush() is
that it could lead to a lot of tiny f[data]syncs(), because not much else
needs to be written out. But the scheme you describe would likely lead the
XLogFlush() flushing plenty other WAL writes out, addressing that.


> If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed into the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

Even if the record isn't yet flushed out by the time we need to, the
deferred-ness means that there's a good chance more useful records can also be
flushed out at the same time...


> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
> 
> #define MAX_MAPPINGS62  /* 62 * 8 + 16 = 512 */
> 
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one.

+1

Perhaps MAX_MAPPINGS should be at least partially computed instead of doing
the math in a comment? sizeof(RelMapping) could directly be used, and we could
define SIZEOF_RELMAPFILE_START with a StaticAssert() enforcing it to be equal
to offsetof(RelMapFile, mappings), if we move crc & pad to *before* mappings -
afaics that should be entirely doable.
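
For instance, a file-scope variant of the check Robert suggests could
be as simple as this (a sketch; the message text is made up):

/* the on-disk map must stay small enough for an atomic sector write */
StaticAssertDecl(sizeof(RelMapFile) <= 512,
                 "pg_filenode.map must fit within one disk sector");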


> If we do nothing fancy here, the maximum number of mappings will have to be
> reduced from 62 to 31, which is a problem because global/pg_filenode.map
> currently has 48 entries. We could try to arrange to squeeze padding out of
> the RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're using
> already, never mind leaving room for future growth.

Ugh.


> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something.

I'm not so sure that's a good idea - while the hardware sector size likely
isn't 512 on much storage anymore, it's still the size that most storage
protocols use. Which then means you need to be confident not just that you
can rely on storage atomicity, but also that nothing in the
  filesystem <-> block layer <-> driver
stack somehow causes a single larger write to be split up into two.

And if you use a filesystem with a smaller filesystem block size, there might
not even be a choice for the write to be split into two writes. E.g. XFS still
supports 512 byte blocks (although I think it's deprecating block size < 1024).


Maybe the easiest fix here would be to replace the file atomically. Then we
don't need this <= 512 byte stuff. These are done rarely enough that I don't
think the overhead of creating a separate file, fsyncing that, renaming,
fsyncing, would be a problem?
