Re: meson and check-tests

2024-09-26 Thread Nazir Bilal Yavuz
Hi,

On Thu, 26 Sept 2024 at 08:45, Ashutosh Bapat
 wrote:
>
> On Wed, Sep 25, 2024 at 8:24 PM Nazir Bilal Yavuz  wrote:
> >
> > Additionally, the patch I shared earlier was only for regress/regress
> > tests. From what I understand from here [1], only regress/regress
> > tests support 'make check-tests', so the patch seems correct. I
> > experimented with how we can implement something similar for other
> > types of tests, including other regression, isolation, and ECPG tests.
> > The attached patch works for all types of tests but only for the Meson
> > builds. For example you can run:
> >
> > $ meson test --suite setup
> > $ TESTS='check check_btree' meson test amcheck/regress
> > $ TESTS='ddl stream' meson test test_decoding/regress
> > $ TESTS='oldest_xmin skip_snapshot_restore' meson test test_decoding/isolation
> > $ TESTS='sql/prepareas compat_informix/dec_test' meson test ecpg/ecpg
> >
> > What do you think about that behaviour? It is different from 'make
> > check-tests' but it looks useful to me.
>
> I think that would be a good enhancement, especially when a particular
> regression set takes longer to run, e.g. the one in test_decoding takes
> a few seconds. When we worked on PG_TEST_EXTRA, it was advised to keep
> feature parity between meson and make. I guess similar advice applies
> here as well, and we will have to change make to support these options.
> But that will be more work.
>
> Let's split the patch into two: 1. supporting TESTS in meson only for
> regress/regress, 2. extending that support to other suites. The first
> patch will bring meson in line with make as far as running a subset of
> regression tests is concerned and can be committed separately. We will
> seek opinions on the second patch and commit it separately if it takes
> time. It will be good to see support for running a subset of the
> regression tests in meson ASAP, so that developers can save time by not
> running the entire regression suite when it is not needed. The second
> one is an additional feature that can wait if it takes more time to add
> it to both meson and make.

I agree with you. I split the patch in two as you suggested.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From b782aef4d805436e36db26021cc25274d965726a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 26 Sep 2024 11:55:23 +0300
Subject: [PATCH v3 1/2] Add 'make check-tests' behavior to the meson based
 builds

There was no way to run specific regression tests within regress/regress
in the meson-based builds. Add that capability.
---
 meson.build| 12 ++--
 src/tools/testwrap | 14 ++
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 7150f85e0fb..8efc6c09da4 100644
--- a/meson.build
+++ b/meson.build
@@ -3434,11 +3434,9 @@ foreach test_dir : tests
 '--dbname', dbname,
   ] + t.get('regress_args', [])
 
-  test_selection = []
-  if t.has_key('schedule')
-test_selection += ['--schedule', t['schedule'],]
-  endif
+  test_schedule = t.get('schedule', [])
 
+  test_selection = []
   if kind == 'isolation'
 test_selection += t.get('specs', [])
   else
@@ -3462,12 +3460,13 @@ foreach test_dir : tests
   testwrap_base,
   '--testgroup', test_group,
   '--testname', kind,
+  '--schedule', test_schedule,
+  '--tests', test_selection,
   '--',
   test_command_base,
   '--outputdir', test_output,
   '--temp-instance', test_output / 'tmp_check',
   '--port', testport.to_string(),
-  test_selection,
 ],
 suite: test_group,
 kwargs: test_kwargs,
@@ -3482,10 +3481,11 @@ foreach test_dir : tests
 testwrap_base,
 '--testgroup', test_group_running,
 '--testname', kind,
+'--schedule', test_schedule,
+'--tests', test_selection,
 '--',
 test_command_base,
 '--outputdir', test_output_running,
-test_selection,
   ],
   is_parallel: t.get('runningcheck-parallel', true),
   suite: test_group_running,
diff --git a/src/tools/testwrap b/src/tools/testwrap
index 9a270beb72d..d78deb529a8 100755
--- a/src/tools/testwrap
+++ b/src/tools/testwrap
@@ -12,6 +12,8 @@ parser.add_argument('--srcdir', help='source directory of test', type=str)
 parser.add_argument('--basedir', help='base directory of test', type=str)
 pars

Re: meson and check-tests

2024-09-25 Thread Nazir Bilal Yavuz
Hi,

Thanks for looking into this!

On Wed, 25 Sept 2024 at 13:27, Ashutosh Bapat
 wrote:
>
> On Mon, Sep 23, 2024 at 2:16 PM Nazir Bilal Yavuz  wrote:
> >
> > On Sat, 21 Sept 2024 at 09:01, jian he  wrote:
> > >
> > > so
> > > TESTS="copy jsonb stats_ext" meson test --suite regress
> > > will fail.
> > >
> > > to make it work we need change it to
> > > TESTS="test_setup copy jsonb stats_ext" meson test --suite regress
> > >
> > > Many tests depend on test_setup.sql, maybe we can implicitly prepend it.
> > > Another dependency issue. alter_table depending on create_index.
> > >
> > > TESTS="test_setup alter_table" meson test --suite regress
> > > will fail.
> > > TESTS="test_setup create_index alter_table" meson test --suite regress
> > > will work.
> >
Yes, I realized that, but since that is how it is done in the make
builds, I didn't want to change the behaviour. Also, I think it makes
sense to leave it to the tester; it is more flexible that way.
>
> Since meson has a setup suite already, it might have been a good idea
> to do as Jian suggested. But a. setup is also another suite and not a
> setup step per se. b. the dependencies between regression tests are
> not documented well, or rather we don't have a way to specify which
> test depends upon which. So we can't infer the .sql files that need to
> be run as a setup step. Somebody could add a dependency without meson
> or make being aware of it, and tests would fail again. So I think we
> have to leave it as is. If we get to that point we should fix both
> make as well as meson, but not as part of this exercise.
>
> It's a bit inconvenient that we don't see whether an individual test
> failed or succeeded on the screen; we need to open testlog.txt for the
> same. But that's true with the regress suite generally so no
> complaints there.

Thanks for sharing your thoughts.

> Individual TAP tests are run using `meson test -C <builddir>
> <suite>:<test>` syntax. If we can run individual SQL tests using the
> same syntax, e.g. `meson test regress:partition_join`, that would make
> it consistent with the way TAP tests are run. But we need to make sure
> that the test later in the syntax would see the objects left behind by
> prior tests. E.g. `meson test regress:test_setup
> regress:partition_join` should see both tests passing. partition_join
> uses some tables created by test_setup, so those need to be run
> sequentially. Is that doable?

I think that makes sense, but it is not easily achievable right now.
The difference between TAP tests and regress/regress tests is that TAP
tests are registered individually, whereas regress/regress tests are
registered as one (with the --schedule option). This means we would need
to register these tests one by one (instead of passing them with the
--schedule option) to the Meson build system in order to run them as
'meson test <suite>:<test>'.

Additionally, the patch I shared earlier was only for regress/regress
tests. From what I understand from here [1], only regress/regress
tests support 'make check-tests', so the patch seems correct. I
experimented with how we can implement something similar for other
types of tests, including other regression, isolation, and ECPG tests.
The attached patch works for all types of tests but only for the Meson
builds. For example you can run:

$ meson test --suite setup
$ TESTS='check check_btree' meson test amcheck/regress
$ TESTS='ddl stream' meson test test_decoding/regress
$ TESTS='oldest_xmin skip_snapshot_restore' meson test test_decoding/isolation
$ TESTS='sql/prepareas compat_informix/dec_test' meson test ecpg/ecpg

What do you think about that behaviour? It is different from 'make
check-tests' but it looks useful to me.

[1] https://www.postgresql.org/message-id/1364.1717305911%40sss.pgh.pa.us

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 002920d8c9bf10ce1fe2f3cd4f115452a92664a8 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 25 Sep 2024 17:31:08 +0300
Subject: [PATCH v2] Use TESTS="" in all type of tests in the Meson builds

---
 meson.build| 12 ++--
 src/tools/testwrap | 13 +
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 7150f85e0fb..8efc6c09da4 100644
--- a/meson.build
+++ b/meson.build
@@ -3434,11 +3434,9 @@ foreach test_dir : tests
 '--dbname', dbname,
   ] + t.get('regress_args', [])
 
-  test_selection = []
-  if t.has_key('schedule')
-test_selection += ['--schedule', t['schedule'],]
-  endif
+  test_schedule = t.get('schedule', [])

Re: meson and check-tests

2024-09-23 Thread Nazir Bilal Yavuz
Hi,

On Mon, 23 Sept 2024 at 11:46, Nazir Bilal Yavuz  wrote:
> On Sat, 21 Sept 2024 at 09:01, jian he  wrote:
> > in [1] you mentioned "setup", but that "setup" is more or less like
> > "meson test  --suite setup --suite regress"
> > but originally, I thought was about "src/test/regress/sql/test_setup.sql".
> > for example, now you cannot run src/test/regress/sql/stats_ext.sql
> > without first running test_setup.sql, because some functions (like fipshash)
> > live in test_setup.sql.
>
Postgres binaries are created at the build step in the make builds, so
those binaries can be used for the tests. But in the meson builds, they
are created by 'meson test --suite setup' for testing purposes (the
'meson install' command creates binaries as well, but those are for
installing to the OS, not for testing). So, 'meson test --suite setup'
should be run before running regression tests.

The above sentence lacks some information. It appears that if the
binaries are not created beforehand (only running configure, not make),
they are generated during the tests in the make builds. This also
applies to meson builds when the 'meson test' command is run, as meson
executes the setup suite tests first, which creates the binaries.
However, if we specify a different test suite, like regress in this
case, the setup suite tests are not executed and the binaries are not
created, preventing the tests from running. I am not sure how to
configure meson builds to run the setup suite tests if they have not
been executed beforehand.
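
To make the required ordering concrete, here is a minimal sketch of
driving it from Python (the 'build' directory name is an assumption, and
the TESTS selection relies on the patch discussed in this thread):

import os
import subprocess

# Run the setup suite first so the binaries (tmp_install and the initdb
# cache) needed by the regression tests exist.
subprocess.run(['meson', 'test', '-C', 'build', '--suite', 'setup'],
               check=True)

# Only then can the regression tests (or a subset, via TESTS) be run.
subprocess.run(['meson', 'test', '-C', 'build', '--suite', 'regress'],
               env={**os.environ, 'TESTS': 'copy jsonb'}, check=True)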

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: meson and check-tests

2024-09-23 Thread Nazir Bilal Yavuz
Hi,

On Sat, 21 Sept 2024 at 09:01, jian he  wrote:
>
> hi. Thanks for your work!

Thank you for looking into this!

> I do find some issues.
>
>
> TESTS="copy jsonb jsonb" meson test --suite regress
> one will fail. not sure this is expected?

Yes, that is expected.

> in [1] you mentioned "setup", but that "setup" is more or less like
> "meson test  --suite setup --suite regress"
> but originally, I thought was about "src/test/regress/sql/test_setup.sql".
> for example, now you cannot run src/test/regress/sql/stats_ext.sql
> without first running test_setup.sql, because some functions (like fipshash)
> live in test_setup.sql.

Postgres binaries are created at the build step in the make builds, so
those binaries can be used for the tests. But in the meson builds, they
are created by 'meson test --suite setup' for testing purposes (the
'meson install' command creates binaries as well, but those are for
installing to the OS, not for testing). So, 'meson test --suite setup'
should be run before running regression tests.

> so
> TESTS="copy jsonb stats_ext" meson test --suite regress
> will fail.
>
> to make it work we need change it to
> TESTS="test_setup copy jsonb stats_ext" meson test --suite regress
>
> Many tests depend on test_setup.sql, maybe we can implicitly prepend it.
> Another dependency issue. alter_table depending on create_index.
>
> TESTS="test_setup alter_table" meson test --suite regress
> will fail.
> TESTS="test_setup create_index alter_table" meson test --suite regress
> will work.

Yes, I realized that, but since that is how it is done in the make
builds, I didn't want to change the behaviour. Also, I think it makes
sense to leave it to the tester; it is more flexible that way.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: meson and check-tests

2024-09-20 Thread Nazir Bilal Yavuz
Hi,

I’ve been working on a patch and wanted to share my approach, which
might be helpful for others. The patch removes the '--schedule' and
'${schedule_file}' options from the regress/regress test command when
the TESTS environment variable is set. Instead, it appends the TESTS
variable to the end of the command.

Please note that setup suite tests (at least tmp_install and
initdb_cache) should be executed before running these tests. One
drawback is that while the Meson logs will still show the test command
with the '--schedule' and '${schedule_file}' options, the actual
command used will be changed.
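
Conceptually, the rewrite amounts to the following (a minimal sketch
only; the command below is a stand-in for the one meson assembles, and
the real implementation is in the attached testwrap patch):

import os

# Stand-in for the pg_regress invocation that meson assembles.
test_command = ['pg_regress',
                '--schedule', 'src/test/regress/parallel_schedule',
                '--outputdir', 'testrun/regress/regress']

tests = os.environ.get('TESTS')
if tests:
    # Drop '--schedule <file>' so the full schedule is not run...
    if '--schedule' in test_command:
        i = test_command.index('--schedule')
        del test_command[i:i + 2]
    # ...and append the requested test names instead.
    test_command.extend(tests.split())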

Some examples after the patched build:

$ meson test --suite regress -> fails
$ TESTS="create_table copy jsonb" meson test --suite regress -> fails
### run required setup suite tests
$ meson test tmp_install
$ meson test initdb_cache
###
$ meson test --suite regress -> passes (12s)
$ TESTS="copy" meson test --suite regress -> passes (0.35s)
$ TESTS="copy jsonb" meson test --suite regress -> passes (0.52s)
$ TESTS='select_into' meson test --suite regress -> fails
$ TESTS='test_setup select_into' meson test --suite regress -> passes (0.52s)
$ TESTS='rangetypes multirangetypes' meson test --suite regress -> fails
$ TESTS='test_setup multirangetypes rangetypes' meson test --suite regress -> fails
$ TESTS='test_setup rangetypes multirangetypes' meson test --suite regress -> passes (0.91s)

Any feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 7c94889b553ffc294ddf9eba7c595ea629d24e91 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 20 Sep 2024 11:39:20 +0300
Subject: [PATCH v1] Add 'make check-tests' approach to the meson based builds

---
 src/tools/testwrap | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/tools/testwrap b/src/tools/testwrap
index 9a270beb72d..9180727b6ff 100755
--- a/src/tools/testwrap
+++ b/src/tools/testwrap
@@ -41,6 +41,17 @@ env_dict = {**os.environ,
 'TESTDATADIR': os.path.join(testdir, 'data'),
 'TESTLOGDIR': os.path.join(testdir, 'log')}
 
+# Symmetric behaviour with make check-tests. If TESTS environment variable is
+# set, only run these regression tests in regress/regress test. Note that setup
+# suite tests (at least tmp_install and initdb_cache tests) need to be run
+# before running these tests.
+if "TESTS" in env_dict and args.testgroup == 'regress' and args.testname == 'regress':
+    elem = '--schedule'
+    schedule_index = args.test_command.index(elem) if elem in args.test_command else -1
+    if schedule_index >= 0:
+        del args.test_command[schedule_index : schedule_index + 2]
+    args.test_command.extend(env_dict["TESTS"].split(' '))
+
 sp = subprocess.Popen(args.test_command, env=env_dict, stdout=subprocess.PIPE)
 # Meson categorizes a passing TODO test point as bad
 # (https://github.com/mesonbuild/meson/issues/13183).  Remove the TODO
-- 
2.45.2



Re: per backend I/O statistics

2024-09-17 Thread Nazir Bilal Yavuz
Hi,

On Tue, 17 Sept 2024 at 16:07, Bertrand Drouvot
 wrote:
> On Tue, Sep 17, 2024 at 02:52:01PM +0300, Nazir Bilal Yavuz wrote:
> > Could we remove pg_stat_get_my_io() completely and use
> > pg_stat_get_backend_io() with the current backend's pid to get the
> > current backend's stats?
>
> The reason why I keep pg_stat_get_my_io() is that (as mentioned in [1])
> the statistics snapshot is built for "my backend stats" (meaning it
> depends on the stats_fetch_consistency value). I can see use cases for
> that.
>
> OTOH, pg_stat_get_backend_io() behaves as if stats_fetch_consistency is
> set to none (each execution re-fetches counters from shared memory)
> (see [2]). Indeed, the snapshot is not built in each backend to copy all
> the other backends' stats, as 1/ I think that there is no use case
> (there is no need to get other backends' I/O statistics while taking
> care of stats_fetch_consistency) and 2/ that could be memory expensive
> depending on the number of max connections.
>
> So I think it's better to keep both functions as they behave differently.
>
> Thoughts?

Yes, that is correct. Sorry, you had already explained it and I had
agreed with it, but I forgot.

> > If you meant the same thing, please don't
> > mind it.
>
> What I meant to say is to try to remove the duplicate code that we can
> find in those 3 functions (say, creating a new function that would
> contain the duplicate code and making use of this new function in the 3
> others). I did not look at it in detail yet.

I got it, thanks for the explanation.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: per backend I/O statistics

2024-09-17 Thread Nazir Bilal Yavuz
Hi,

On Fri, 13 Sept 2024 at 19:09, Bertrand Drouvot
 wrote:
> On Fri, Sep 13, 2024 at 04:45:08PM +0300, Nazir Bilal Yavuz wrote:
> > - The pgstat_reset_io_counter_internal() is called in the
> > pgstat_shutdown_hook(). This causes the stats_reset column to show the
> > termination time of the old backend when its proc num is reassigned to
> > a new backend.
>
> doh! Nice catch, thanks!
>
> And also new backends that are not re-using a previous "existing"
> process slot are getting the last reset time (if any). So the right
> place to fix this is in pgstat_io_init_backend_cb(): done in v4
> attached. v4 also sets the reset time to 0 in pgstat_shutdown_hook()
> (that's not strictly necessary, it's just to be right "conceptually"
> speaking).

Thanks, I confirm that it is fixed.

You mentioned some refactoring upthread:

On Fri, 6 Sept 2024 at 18:03, Bertrand Drouvot
 wrote:
>
> - If we agree on the current proposal then I'll do  some refactoring in
> pg_stat_get_backend_io(), pg_stat_get_my_io() and pg_stat_get_io() to avoid
> duplicated code (it's not done yet to ease the review).

Could we remove pg_stat_get_my_io() completely and use
pg_stat_get_backend_io() with the current backend's pid to get the
current backend's stats? If you meant the same thing, please don't
mind it.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: per backend I/O statistics

2024-09-13 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

Your patch applies and builds cleanly.

On Fri, 6 Sept 2024 at 18:03, Bertrand Drouvot
 wrote:
>
> - As stated up-thread, the pg_stat_get_backend_io() behaves as if
> stats_fetch_consistency is set to none (each execution re-fetches counters
> from shared memory). Indeed, the snapshot is not build in each backend to copy
> all the others backends stats, as 1/ there is no use case (there is no need to
> get others backends I/O statistics while taking care of the 
> stats_fetch_consistency)
> and 2/ that could be memory expensive depending of the number of max 
> connections.

I believe this is the correct approach.

I manually tested your patches, and they work as expected. Here is
some feedback:

- The stats_reset column is NULL in both pg_my_stat_io and
pg_stat_get_backend_io() until the first call to reset io statistics.
While I'm not sure if it's necessary, it might be useful to display
the more recent of the two times in the stats_reset column: the
statistics reset time or the backend creation time.

- The pgstat_reset_io_counter_internal() is called in the
pgstat_shutdown_hook(). This causes the stats_reset column to show the
termination time of the old backend when its proc num is reassigned to
a new backend.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-09-12 Thread Nazir Bilal Yavuz
Hi,

On Thu, 12 Sept 2024 at 12:35, Ashutosh Bapat
 wrote:
>
> Here's what I understand, please correct me: The code in meson.build
> is only called at setup time, not during meson test. Hence we
> cannot check the existence of a runtime environment variable in that
> file. The things in test_env override those set at run time. So we
> save it as an argument to --pg_test_extra and then use it if
> PG_TEST_EXTRA is not set at run time. I tried to find if there's some
> other place to store "environment variables that can be overridden at
> runtime" but I could not find one. So it looks like this is the best we
> can do for now.

Yes, that is exactly what is happening.
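
For reference, the heart of that fallback in testwrap is only a couple
of lines (a sketch mirroring the patch under discussion; env_dict is the
test environment testwrap builds, and args.pg_test_extra carries the
configure-time value):

# If PG_TEST_EXTRA is not already part of the test environment, fall
# back to the configure-time value passed via --pg_test_extra.
if "PG_TEST_EXTRA" not in env_dict and args.pg_test_extra:
    env_dict["PG_TEST_EXTRA"] = args.pg_test_extra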

> If it comes to a point where there are more such environment variables
> that need to be passed, probably we will pass a key-value string of
> those to testwrap. For now, for a single variable, this looks ok.

Yes, that would be better.

> > > +# If PG_TEST_EXTRA is not set in the environment, then look for its Meson
> > > +# configure option.
> > > +if "PG_TEST_EXTRA" not in os.environ and args.pg_test_extra:
> > > + env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
> > > +
> > >
> > > If somebody looks at just these lines, it's not clear how
> > > PG_TEST_EXTRA is passed to the test environment if it's available in
> > > os.environ. So one has to understand that env_dict is the test
> > > environment. If that's so, the code and comment rewritten as below
> > > makes more sense to me. What do you think?
> > >
> > > # If PG_TEST_EXTRA is not already part of the test environment, check if 
> > > it's
> > > # passed via program argument --pg_test_extra. Thus we also override
> > > # configuration time value by run time value of PG_TEST_EXTRA.
> > > if "PG_TEST_EXTRA" not in env_dict and args.pg_test_extra:
> > > env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
> >
> > I think your suggestion is better, done.
>
> I didn't see the expected change. I was talking about something like attached.
>
> Also
> 1. I have also made changes to the comment,
> 2. renamed the argument --pg_test_extra to --pg-test-extra using
> convention similar to other arguments.
> 3. few other cosmetic changes.
>
> Please review and incorporate those in the respective patches and
> tests. Sorry for a single diff.
>
> Once this is done, I think we can mark this CF entry as RFC.

Thanks for the changes. I applied all of them in respective patches.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 035fe94107fe2c03647a4bb3ec4f0b3d1673e004 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 11 Sep 2024 15:41:27 +0300
Subject: [PATCH v5 1/2] Make PG_TEST_EXTRA env variable override its Meson
 configure option

Since the PG_TEST_EXTRA configure option is set while configuring Meson,
each change to PG_TEST_EXTRA requires a reconfigure, which is costly. So,
first try to use PG_TEST_EXTRA from the environment. If it is not set,
then fall back to its configure option in Meson builds.

After the above change, Meson builds are able to get PG_TEST_EXTRA from
the environment, and this overrides the configure option. The PG_TEST_EXTRA
environment variable is set at the top of the .cirrus.tasks.yml file. So,
PG_TEST_EXTRA in the configure scripts became useless and was removed
because of that.
---
 doc/src/sgml/installation.sgml |  6 --
 .cirrus.tasks.yml  |  6 +-
 meson.build| 11 ++-
 meson_options.txt  |  2 +-
 src/tools/testwrap | 10 ++
 5 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index ff9abd4649d..7c481e07e98 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3073,8 +3073,10 @@ ninja install
   

 Enable test suites which require special software to run.  This option
-accepts arguments via a whitespace-separated list.  See  for details.
+accepts arguments via a whitespace-separated list.  Please note that
+this configure option is overridden by PG_TEST_EXTRA environment
+variable if it is set.  See  for
+details.

   
  
diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 90cb95c8681..fc413eb11ef 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -175,7 +175,6 @@ task:
 --buildtype=debug \
 -Dcassert=true -Dinjection_points=true \
 -Duuid=bsd -Dtcl_version=tcl86 -Ddtrace=auto \
--DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 -Dextra_lib_dirs=/usr/local/lib -Dextra_include_dirs=/usr/local/include/ \
 build
 EOF
@@ -364,

Re: Avoid dead code (contrib/pg_visibility/pg_visibility.c)

2024-09-11 Thread Nazir Bilal Yavuz
Hi,

On Thu, 12 Sept 2024 at 08:19, Michael Paquier  wrote:
>
> On Wed, Sep 04, 2024 at 01:50:24PM -0300, Ranier Vilela wrote:
> > I think that commit c582b75
> > <https://github.com/postgres/postgres/commit/c582b75851c2d096ce050d753494505a957cee75>,
> > left an oversight.
> >
> > The report is:
> > CID 1559993: (#1 of 1): Logically dead code (DEADCODE)

Thanks for the report!

> I am not sure to understand what you mean here and if this is still
> relevant as of Noah's latest commit in 65c310b310a6.

This should be fixed in 65c310b310a6.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Allow CI to only run the compiler warnings task

2024-09-11 Thread Nazir Bilal Yavuz
Hi,

On Wed, 11 Sept 2024 at 15:36, Bertrand Drouvot
 wrote:
>
> Hi hackers,
>
> While working on a new pg_logicalinspect module ([1]), I reached a point where
> all the CI tests were green except the compiler warnings one. Then, to save 
> time
> addressing the issue, I modified the .cirrus.tasks.yml file to $SUBJECT.
>
> I think this could be useful for others too, so please find attached this tiny
> patch.

+1 for this. I encountered the same issue before.

> Note that the patch does not add an extra "ci-task-only"; for simplicity
> it renames ci-os-only to ci-task-only.

I think this change makes sense. I gave a quick try to your patch with
ci-task-only: ["", "linux", "compilerwarnings"] and it worked as
expected.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-09-11 Thread Nazir Bilal Yavuz
Hi,

On Wed, 11 Sept 2024 at 13:04, Ashutosh Bapat
 wrote:
>
> Thanks Bilal for testing the patch. Can you or Jacob please create one
> patchset including both meson and make fixes? Please keep the meson
> and make changes in separate patches though. I think the meson patches
> come from [1] (they probably need a rebase, git am failed) and make
> patch comes from [2].The one fixing make needs a good commit message.

I created and attached a patchset and wrote a commit message for the
'make' fix. Please feel free to edit them.

> some comments on code
>
> 1. comments on changes to meson
>
> + variable if it exists. See  for
>
> s/exists/set/

Done.

> -# Test suites that are not safe by default but can be run if selected
> -# by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
> -# Export PG_TEST_EXTRA so it can be checked in individual tap tests.
> -test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
>
> A naive question. What happens if we add PG_TEST_EXTRA in meson.build
> itself rather than passing it via testwrap? E.g. like
> if "PG_TEST_EXTRA" not in os.environ
> test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))

Then this configure-time option will be passed to the test environment,
and there is no way to change it without reconfiguring unless we touch
the testwrap file.

> I am worried that we might have to add an extra argument to testwrap
> for every environment variable that influences the tests. Avoiding it
> would be better.

If we want to have both configure time and run time variables, I
believe that this is the only way for now.

> option('PG_TEST_EXTRA', type: 'string', value: '',
> - description: 'Enable selected extra tests')
> + description: 'Enable selected extra tests, please note that this
> configure option is overridden by PG_TEST_EXTRA environment variable
> if it exists')
>
> All the descriptions look much shorter than this one. I suggest we
> shorten this one too as
> "Enable selected extra tests. Overridden by PG_TEST_EXTRA environment 
> variable."
> not as short as other descriptions but shorter than before and yet
> serves its intended purpose. Or just make it the same as the one in
> configure.ac. Either way the descriptions in configure.ac and
> meson_options.txt should be in sync.

I liked your suggestion, done in both meson_options.txt and configure.ac.

> +# If PG_TEST_EXTRA is not set in the environment, then look for its Meson
> +# configure option.
> +if "PG_TEST_EXTRA" not in os.environ and args.pg_test_extra:
> + env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
> +
>
> If somebody looks at just these lines, it's not clear how
> PG_TEST_EXTRA is passed to the test environment if it's available in
> os.environ. So one has to understand that env_dict is the test
> environment. If that's so, the code and comment rewritten as below
> makes more sense to me. What do you think?
>
> # If PG_TEST_EXTRA is not already part of the test environment, check if it's
> # passed via program argument --pg_test_extra. Thus we also override
> # configuration time value by run time value of PG_TEST_EXTRA.
> if "PG_TEST_EXTRA" not in env_dict and args.pg_test_extra:
> env_dict["PG_TEST_EXTRA"] = args.pg_test_extra

I think your suggestion is better, done.

> But in case we decide to fix meson.build as suggested in one of the
> commentsabove, this change will be unnecessary.
>
> Note that PG_TEST_EXTRA is used in only TAP tests right now. The way
> the value passed to --pg_test_extra is handled in testwrap, it will
> available to every test, not just TAP tests. But this looks fine to me
> since the documentation of PG_TEST_EXTRA or its name itself does not
> show any intention to limit it only to TAP tests.

I agree, it looks fine to me as well.

> 2. comments on make changes
> Since Makefile.global.in is included in src/test/Makefile I was
> expecting that the PG_TEST_EXTRA picked from configuration would be
> available in src/test/Makefile from which it would be exported. But
> that doesn't seem to be the case. In fact, I am getting doubtful about
> the role of the current "export PG_TEST_EXTRA" in /src/test/Makefile.
> Even if I remove it, it doesn't affect anything. Commands a.
> PG_TEST_EXTRA=xid_wraparound make check, b.
> PG_TEST_EXTRA=xid_wraparound make -C $XID_MODULE_DIR check run the
> tests (and don't skip them).

Yes, it looks like it is useless. If we export PG_TEST_EXTRA, then it
should already be available in the environment, right?

> Anyway with the proposed change PG_TEST_EXTRA passed at configuration

Make pg_stat_io view count IOs as bytes instead of blocks

2024-09-11 Thread Nazir Bilal Yavuz
Hi,

Currently, in the pg_stat_io view, IOs are counted as blocks. However,
there are two issues with this approach:

1- The actual number of IO requests to the kernel is lower because IO
requests can be merged before sending the final request. Additionally, it
appears that all IOs are counted in block size.
2- Some IOs may not align with block size. For example, WAL read IOs are
done in variable bytes and it is not possible to correctly show these IOs
in the pg_stat_io view [1].

To address this, I propose showing the total number of IO requests to the
kernel (as smgr function calls) and the total number of bytes in the IO. To
implement this change, the op_bytes column will be removed from the
pg_stat_io view. Instead, the [reads | writes | extends] columns will track
the total number of IO requests, and newly added [read | write |
extend]_bytes columns will track the total number of bytes in the IO.

Example benefit of this change:

Running query [2], the result is:

╔═══╦══╦══╦═══╗
║backend_type   ║  object  ║  context ║ avg_io_blocks ║
╠═══╬══╬══╬═══╣
║   client backend  ║ relation ║ bulkread ║ 15.99 ║
╠═══╬══╬══╬═══╣
║ background worker ║ relation ║ bulkread ║ 15.99 ║
╚═══╩══╩══╩═══╝

You can rerun the same query [2] after setting io_combine_limit to 32 [3].
The result is:

╔═══╦══╦══╦═══╗
║backend_type   ║  object  ║  context ║ avg_io_blocks ║
╠═══╬══╬══╬═══╣
║   client backend  ║ relation ║ bulkread ║ 31.70 ║
╠═══╬══╬══╬═══╣
║ background worker ║ relation ║ bulkread ║ 31.60 ║
╚═══╩══╩══╩═══╝

I believe that having visibility into avg_io_[bytes | blocks] is valuable
information that could help optimize Postgres.

Any feedback would be appreciated.

[1]
https://www.postgresql.org/message-id/CAN55FZ1ny%2B3kpdm5X3nGZ2Jp3wxZO-744eFgxktS6YQ3%3DOKR-A%40mail.gmail.com

[2]
CREATE TABLE t as select i, repeat('a', 600) as filler from
generate_series(1, 1000) as i;
SELECT pg_stat_reset_shared('io');
SELECT * FROM t WHERE i = 0;
SELECT backend_type, object, context, TRUNC((read_bytes / reads / (SELECT
current_setting('block_size')::numeric)), 2) as avg_io_blocks FROM
pg_stat_io WHERE reads > 0;

[3] SET io_combine_limit TO 32;

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From f02b0d261880aa3f933a9350b6b1557f6b14f292 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 11 Sep 2024 11:04:18 +0300
Subject: [PATCH v1] Make pg_stat_io count IOs as bytes instead of blocks

Currently in pg_stat_io view, IOs are counted as blocks. There are two
problems with this approach:

1- The actual number of I/O requests sent to the kernel is lower because
I/O requests may be merged before being sent. Additionally, it gives the
impression that all I/Os are done in block size, which shadows the
benefits of merging I/O requests.

2- There may be some IOs which are not done in block size in the future.
For example, WAL read IOs are done in variable bytes and it is not
possible to correctly show these IOs in pg_stat_io view.

Because of these problems, now show the total number of IO requests to
the kernel (as smgr function calls) and total number of bytes in the IO.
Also, remove op_bytes column from the pg_stat_io view.
---
 src/include/catalog/pg_proc.dat|  6 +-
 src/include/pgstat.h   |  9 ++-
 src/backend/catalog/system_views.sql   |  4 +-
 src/backend/storage/buffer/bufmgr.c| 14 ++---
 src/backend/storage/buffer/localbuf.c  |  6 +-
 src/backend/storage/smgr/md.c  |  4 +-
 src/backend/utils/activity/pgstat_io.c | 63 ---
 src/backend/utils/adt/pgstatfuncs.c| 87 +++---
 src/test/regress/expected/rules.out|  6 +-
 doc/src/sgml/monitoring.sgml   | 61 +++---
 10 files changed, 184 insertions(+), 76 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ff5436acacf..b0dab15bfd4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5826,9 +5826,9 @@
   proname => 'pg_stat_get_io', prorows => '30', proretset => 't',
   provolatile => 'v', proparallel => 'r', prorettype => 'record',
   proargtypes => '',
-  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,writebacks,writeback_time,extends,extend_

Re: Use read streams in pg_visibility

2024-09-10 Thread Nazir Bilal Yavuz
Hi,

On Wed, 11 Sept 2024 at 01:38, Noah Misch  wrote:
>
> On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:
> > Your patch is correct. I wrongly assumed it would catch blockno bug,
> > the attached version catches it. I made blockno = 0 invisible and not
> > frozen before copying the vm file. So, in the blockno buggy version;
> > callback will skip that block but the main loop in the
> > collect_corrupt_items() will not skip it. I tested it with your patch
> > and there is exactly 1 blockno difference between expected and result
> > output.
>
> Pushed.  I added autovacuum=off so auto-analyze of a system catalog can't take
> a snapshot that blocks VACUUM updating the vismap.  I doubt that could happen
> under default settings, but this lets us disregard the possibility entirely.

Thanks!

> I also fixed the mix of tabs and spaces inside test file string literals.

I ran both pgindent and pgperltidy but they didn't catch them. Is
there an automated way to catch them?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use read streams in pg_visibility

2024-09-10 Thread Nazir Bilal Yavuz
Hi,

On Tue, 10 Sept 2024 at 00:32, Noah Misch  wrote:
>
> Copying the vm file is a good technique.  In my test runs, this does catch the
> "never detect" bug, but it doesn't catch the blkno bug.  Can you make it catch
> both?  It's possible my hand-patching to recreate the blkno bug is what went
> wrong, so I'm attaching what I used.  It consists of
> v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
> fixes for the "never detect" bug from your v6-0002:

Your patch is correct. I wrongly assumed it would catch the blockno bug;
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno-buggy version, the
callback will skip that block but the main loop in
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly one blockno of difference between the expected and
the actual output.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 342bd8ede69942424b6d0568574db52cb574c479 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 10 Sep 2024 13:15:11 +0300
Subject: [PATCH v7 1/2] Add tests that pg_check_[visible|frozen] report
 corrupted tuples

Currently, there are no tests in which pg_check_[visible|frozen] needs
to report corrupted tuples. Add that kind of test.

To add these tests, the visibility map needs to be corrupted. This is
done by overwriting the visibility map with its older state.
---
 contrib/pg_visibility/meson.build |   1 +
 contrib/pg_visibility/t/002_corrupt_vm.pl | 107 ++
 2 files changed, 108 insertions(+)
 create mode 100644 contrib/pg_visibility/t/002_corrupt_vm.pl

diff --git a/contrib/pg_visibility/meson.build b/contrib/pg_visibility/meson.build
index f3c1263313a..609fc5f9f3e 100644
--- a/contrib/pg_visibility/meson.build
+++ b/contrib/pg_visibility/meson.build
@@ -36,6 +36,7 @@ tests += {
   'tap': {
 'tests': [
   't/001_concurrent_transaction.pl',
+  't/002_corrupt_vm.pl',
 ],
   },
 }
diff --git a/contrib/pg_visibility/t/002_corrupt_vm.pl b/contrib/pg_visibility/t/002_corrupt_vm.pl
new file mode 100644
index 000..e57f365576f
--- /dev/null
+++ b/contrib/pg_visibility/t/002_corrupt_vm.pl
@@ -0,0 +1,107 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Check that pg_check_visible() and pg_check_frozen() reports
+# correct TIDs when there is a corruption.
+use strict;
+use warnings FATAL => 'all';
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $blck_size = $node->safe_psql("postgres", "SHOW block_size;");
+
+# Create a sample table with at least 10 pages and then run VACUUM. 10 is
+# selected manually as it is big enough to select 5 random tuples from the
+# relation.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE EXTENSION pg_visibility;
+		CREATE TABLE corruption_test
+			WITH (autovacuum_enabled = false) AS
+			SELECT
+				i,
+				repeat('a', 10) AS data
+			FROM
+				generate_series(1, $blck_size) i;
+		VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) corruption_test;
+)
+);
+
+# VACUUM is run, it is safe to get the number of pages from the relpages.
+my $npages = $node->safe_psql(
+	"postgres",
+	"SELECT relpages FROM pg_class
+		WHERE  relname = 'corruption_test';"
+);
+ok($npages >= 10, "There should be at least 10 pages in the table");
+
+my $file = $node->safe_psql("postgres",
+	"SELECT pg_relation_filepath('corruption_test');");
+
+# Delete the first block to make sure that it will be skipped as it is
+# not visible nor frozen.
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE (ctid::text::point)[0] = 0;"
+);
+
+# Copy visibility map.
+$node->stop;
+my $vm_file = $node->data_dir . '/' . $file . '_vm';
+copy("$vm_file", "${vm_file}_temp");
+$node->start;
+
+# Select 5 random tuples that are starting from the second block of the
+# relation. The first block is skipped because it is deleted above.
+my $tuples = $node->safe_psql(
+	"postgres",
+	"SELECT ctid FROM (
+SELECT ctid FROM corruption_test
+			WHERE (ctid::text::point)[0] != 0
+ORDER BY random() LIMIT 5)
+ORDER BY ctid ASC;"
+);
+
+# Do the changes below to use tuples in the query.
+# "\n" -> ","
+# "(" -> "'("
+# ")" -> ")'"
+(my $tuples_query = $tuples) =~ s/\n/,/g;
+$tuples_query =~ s/\(/\'\(/g;
+$tuples_query =~ s/\)/\)\'/g;
+
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE ctid i

Re: Use read streams in pg_visibility

2024-09-09 Thread Nazir Bilal Yavuz
Hi,

On Thu, 5 Sept 2024 at 18:54, Noah Misch  wrote:
>
> On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:
> > On Wed, 4 Sept 2024 at 21:43, Noah Misch  wrote:
> > > https://postgr.es/m/caeudqaozv3wty5tv2t29jcwpydbmkbiwqkzd42s2ogzdixp...@mail.gmail.com
> > > then observed that collect_corrupt_items() was now guaranteed to never
> > > detect corruption.  I have pushed revert ddfc556 of the pg_visibility.c
> > > changes.  For the next try, could you add that testing we discussed?
> >
> > Do you think that corrupting the visibility map and then seeing if
> > pg_check_visible() and pg_check_frozen() report something is enough?
>
> I think so.  Please check that it would have caught both the blkno bug and the
> above bug.

The test and updated patch files are attached. In that test I
overwrite the visibility map file with its older state. Something like
this:

1- Create the table, then run VACUUM FREEZE on the table.
2- Shutdown the server, create a copy of the vm file, restart the server.
3- Run the DELETE command on some $random_tuples.
4- Shutdown the server, overwrite the vm file with the #2 vm file,
restart the server.
5- pg_check_visible and pg_check_frozen must report $random_tuples as corrupted.

Do you think this test makes sense and is sufficient?

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 1f7b2cfe0d2247eb4768ce79d14982701437d473 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 9 Sep 2024 15:27:53 +0300
Subject: [PATCH v6 1/2] Add tests that pg_check_[visible|frozen] report
 corrupted tuples

Currently, there are no tests in which pg_check_[visible|frozen] needs
to report corrupted tuples. Add that kind of test.

To add these tests, the visibility map needs to be corrupted. This is
done by overwriting the visibility map with its older state.
---
 contrib/pg_visibility/meson.build |   1 +
 contrib/pg_visibility/t/002_corrupt_vm.pl | 100 ++
 2 files changed, 101 insertions(+)
 create mode 100644 contrib/pg_visibility/t/002_corrupt_vm.pl

diff --git a/contrib/pg_visibility/meson.build b/contrib/pg_visibility/meson.build
index f3c1263313a..609fc5f9f3e 100644
--- a/contrib/pg_visibility/meson.build
+++ b/contrib/pg_visibility/meson.build
@@ -36,6 +36,7 @@ tests += {
   'tap': {
 'tests': [
   't/001_concurrent_transaction.pl',
+  't/002_corrupt_vm.pl',
 ],
   },
 }
diff --git a/contrib/pg_visibility/t/002_corrupt_vm.pl b/contrib/pg_visibility/t/002_corrupt_vm.pl
new file mode 100644
index 000..cc6d043b8b7
--- /dev/null
+++ b/contrib/pg_visibility/t/002_corrupt_vm.pl
@@ -0,0 +1,100 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Check that pg_check_visible() and pg_check_frozen() reports
+# correct TIDs when there is a corruption.
+use strict;
+use warnings FATAL => 'all';
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $blck_size = $node->safe_psql("postgres", "SHOW block_size;");
+
+# Create a sample table with at least 10 pages and then run VACUUM. 10 is
+# selected manually as it is big enough to select 5 random tuples from the
+# relation.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE EXTENSION pg_visibility;
+		CREATE TABLE corruption_test
+			WITH (autovacuum_enabled = false) AS
+			SELECT
+				i,
+				repeat('a', 10) AS data
+			FROM
+				generate_series(1, $blck_size) i;
+		VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) corruption_test;
+)
+);
+
+# VACUUM is run, it is safe to get the number of pages from the relpages.
+my $npages = $node->safe_psql(
+	"postgres",
+	"SELECT relpages FROM pg_class
+		WHERE  relname = 'corruption_test';"
+);
+
+ok($npages >= 10, "There should be at least 10 pages in the table");
+
+my $file = $node->safe_psql("postgres",
+	"SELECT pg_relation_filepath('corruption_test');");
+
+# Copy visibility map.
+$node->stop;
+my $vm_file = $node->data_dir . '/' . $file . '_vm';
+copy("$vm_file", "${vm_file}_temp");
+$node->start;
+
+# Select 5 random tuples from the relation.
+my $tuples = $node->safe_psql(
+	"postgres",
+	"SELECT ctid FROM (
+SELECT ctid FROM corruption_test
+ORDER BY random() LIMIT 5)
+ORDER BY ctid ASC;"
+);
+
+# Do the changes below to use tuples in the query.
+# "\n" -> ","
+# "(" -> "'("
+# ")" -> ")'"
+(my $tuples_query = $tuples) =~ s/\n/,/g;
+$tuples_query =~ s/\(/\'\(/g;
+$tuples_query =~ s/\)/\)\'/g;
+
+$node->safe_psql

Re: Use read streams in pg_visibility

2024-09-05 Thread Nazir Bilal Yavuz
Hi,

On Wed, 4 Sept 2024 at 21:43, Noah Misch  wrote:
>
> https://postgr.es/m/caeudqaozv3wty5tv2t29jcwpydbmkbiwqkzd42s2ogzdixp...@mail.gmail.com
> then observed that collect_corrupt_items() was now guaranteed to never detect
> corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
> the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: meson vs. llvm bitcode files

2024-09-05 Thread Nazir Bilal Yavuz
Hi,

On Thu, 5 Sept 2024 at 11:56, Peter Eisentraut  wrote:
>
> The meson build currently does not produce llvm bitcode (.bc) files.
> AFAIK, this is the last major regression for using meson for production
> builds.
>
> Is anyone working on that?  I vaguely recall that some in-progress code
> was shared a couple of years ago, but I haven't seen anything since.  It
> would be great if we could collect any existing code and notes to maybe
> get this moving again.

I found that Andres shared a patch
(v17-0021-meson-Add-LLVM-bitcode-emission.patch) a while ago [1].

[1] 
https://www.postgresql.org/message-id/20220927011951.j3h4o7n6bhf7dwau%40awork3.anarazel.de

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use read streams in pg_visibility

2024-09-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 3 Sept 2024 at 22:20, Noah Misch  wrote:
>
> On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
> > On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:
> > > Thanks, updated patches are attached.
> >
> > > +/*
> > > + * Ask the callback which block it would like us to read next, with a 
> > > small
> > > + * buffer in front to allow read_stream_unget_block() to work and to 
> > > allow the
> > > + * fast path to skip this function and work directly from the array.
> > >   */
> > >  static inline BlockNumber
> > >  read_stream_get_block(ReadStream *stream, void *per_buffer_data)
> >
> > v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
> > this update to the read_stream_get_block() header comment, but we're not
> > changing read_stream_get_block() here.  I reverted this.

Sorry, it must have been left over from a rebase. Thanks for reverting it.

> > Pushed with some other cosmetic changes.

Thanks!

> I see I pushed something unacceptable under ASAN.  I will look into that:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20

I think it is related to the scope of BlockRangeReadStreamPrivate in
the collect_visibility_data() function. Attached is a small fix for that.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 5393dfe6e2a0888cb33ad15f31def1691637564f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 3 Sep 2024 22:35:00 +0300
Subject: [PATCH v5] Fix ASAN error introduced in ed1b1ee59f

---
 contrib/pg_visibility/pg_visibility.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9975e8876e6..db796e35cb2 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -490,6 +490,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
 	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
@@ -505,8 +506,6 @@ collect_visibility_data(Oid relid, bool include_pd)
 	/* Create a stream if reading main fork. */
 	if (include_pd)
 	{
-		BlockRangeReadStreamPrivate p;
-
 		p.current_blocknum = 0;
 		p.last_exclusive = nblocks;
 		stream = read_stream_begin_relation(READ_STREAM_FULL,
-- 
2.45.2



Re: PG_TEST_EXTRA and meson

2024-09-02 Thread Nazir Bilal Yavuz
Hi,

On Fri, 30 Aug 2024 at 21:36, Jacob Champion
 wrote:
>
> On Wed, Aug 28, 2024 at 8:21 AM Nazir Bilal Yavuz  wrote:
> > I do not exactly remember the reason but I think I copied the same
> > behavior as before, PG_TEST_EXTRA variable was checked in the
> > src/test/Makefile so I exported it there.
>
> Okay, give v3 a try then. This exports directly from Makefile.global.
> Since that gets pulled into a bunch of places, the scope is a lot
> wider than it used to be; I've disabled it for PGXS so it doesn't end
> up poisoning other extensions.

Patch looks good and it passes all the test cases in Ashutosh's test script.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Virtual generated columns

2024-09-02 Thread Nazir Bilal Yavuz
Hi,

On Thu, 29 Aug 2024 at 15:16, Peter Eisentraut  wrote:
>
> I also committed the two patches that renamed the existing test files,
> so those are not included here anymore.
>
> The new patch does some rebasing and contains various fixes to the
> issues you presented.  As I mentioned, I'll look into improving the
> rewriting.

The xid_wraparound test started to fail after edee0c621d, which updated
the error message that xid_wraparound/002_limits checks for. The patch
that applies the same update to the test file is attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 748721898e8171d35d54ffe2b6edb38b9f5b020d Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 2 Sep 2024 16:18:57 +0300
Subject: [PATCH v1] Fix xid_wraparound/002_limits test

Error message used in xid_wraparound/002_limits is updated in edee0c621d.
Apply the same update in the test file as well.
---
 src/test/modules/xid_wraparound/t/002_limits.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/modules/xid_wraparound/t/002_limits.pl b/src/test/modules/xid_wraparound/t/002_limits.pl
index aca3fa15149..889689d3bde 100644
--- a/src/test/modules/xid_wraparound/t/002_limits.pl
+++ b/src/test/modules/xid_wraparound/t/002_limits.pl
@@ -103,7 +103,7 @@ $ret = $node->psql(
 	stderr => \$stderr);
 like(
 	$stderr,
-	qr/ERROR:  database is not accepting commands that assign new XIDs to avoid wraparound data loss in database "postgres"/,
+	qr/ERROR:  database is not accepting commands that assign new transaction IDs to avoid wraparound data loss in database "postgres"/,
 	"stop-limit");
 
 # Finish the old transaction, to allow vacuum freezing to advance
-- 
2.45.2



Re: Use read streams in pg_visibility

2024-09-02 Thread Nazir Bilal Yavuz
Hi,

On Sat, 31 Aug 2024 at 02:51, Noah Misch  wrote:
>
> To read blocks 10 and 11, I would expect to initialize the struct with one of:
>
> { .first=10, .nblocks=2 }
> { .first=10, .last_inclusive=11 }
> { .first=10, .last_exclusive=12 }
>
> With the patch's API, I would need {.first=10,.nblocks=12}.  The struct field
> named "nblocks" behaves like a last_block_exclusive.  Please either make the
> behavior an "nblocks" behavior or change the field name to replace the term
> "nblocks" with something matching the behavior.  (I used longer field names in
> my examples here, to disambiguate those examples.  It's okay if the final
> field names aren't those, as long as the field names and the behavior align.)

I decided to use 'current_blocknum' and 'last_exclusive'. I think
these are easier to understand and use.

> > Thanks for the information, I will check these. What I still do not
> > understand is how to make sure that only the second block is processed
> > and the first one is skipped. pg_check_visible() and pg_check_frozen()
> > returns TIDs that cause corruption in the visibility map, there is no
> > information about block numbers.
>
> I see what you're saying.  collect_corrupt_items() needs a corrupt table to
> report anything; all corruption-free tables get the same output.  Testing this
> would need extra C code or techniques like corrupt_page_checksum() to create
> the corrupt state.  That wouldn't be a bad thing to have, but it's big enough
> that I'll consider it out of scope for $SUBJECT.  With the callback change
> above, I'll be ready to push all this.

Thanks, updated patches are attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 1dbdeacc54c3575a2cb82e95aab5b335b0fa1d2f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 26 Aug 2024 12:12:52 +0300
Subject: [PATCH v4 1/5] Add general-use struct and callback to read stream

The number of callbacks that just iterate over a block range in the read
stream code is increasing. So, add a general-use struct and callback to
the read stream code.
---
 src/include/storage/read_stream.h | 13 +
 src/backend/storage/aio/read_stream.c | 21 +++--
 src/tools/pgindent/typedefs.list  |  1 +
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f26..0a2398fd7df 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -45,11 +45,24 @@
 struct ReadStream;
 typedef struct ReadStream ReadStream;
 
+/*
+ * General-use struct to use in callback functions for block range scans.
+ * Callback loops between current_blocknum (inclusive) and last_exclusive.
+ */
+typedef struct BlockRangeReadStreamPrivate
+{
+	BlockNumber current_blocknum;
+	BlockNumber last_exclusive;
+} BlockRangeReadStreamPrivate;
+
 /* Callback that returns the next block number to read. */
 typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 void *callback_private_data,
 void *per_buffer_data);
 
+extern BlockNumber block_range_read_stream_cb(ReadStream *stream,
+			  void *callback_private_data,
+			  void *per_buffer_data);
 extern ReadStream *read_stream_begin_relation(int flags,
 			  BufferAccessStrategy strategy,
 			  Relation rel,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..8a449bab8a0 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -164,8 +164,25 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 }
 
 /*
- * Ask the callback which block it would like us to read next, with a one block
- * buffer in front to allow read_stream_unget_block() to work.
+ * General-use callback function for block range scans.
+ */
+BlockNumber
+block_range_read_stream_cb(ReadStream *stream,
+		   void *callback_private_data,
+		   void *per_buffer_data)
+{
+	BlockRangeReadStreamPrivate *p = callback_private_data;
+
+	if (p->current_blocknum < p->last_exclusive)
+		return p->current_blocknum++;
+
+	return InvalidBlockNumber;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..df3f336bec0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -275,6 +275,7 @@ BlockId
 BlockIdData
 BlockInfoRecord
 BlockNumber
+BlockRangeReadStreamPrivate
 BlockRefTable
 BlockRefT

Re: PG_TEST_EXTRA and meson

2024-08-28 Thread Nazir Bilal Yavuz
Hi,

On Wed, 28 Aug 2024 at 18:11, Jacob Champion
 wrote:
>
> On Wed, Aug 28, 2024 at 6:41 AM Ashutosh Bapat
>  wrote:
> > Nazir, since you authored c3382a3c3cc, can you please provide input
> > that Jacob needs?
>
> Specifically, why the PG_TEST_EXTRA variable was being exported at the
> src/test level only. If there's no longer a use case being targeted,
> we can always change it in this patch, but I didn't want to do that
> without understanding why it was like that to begin with.

I do not exactly remember the reason, but I think I copied the same
behavior as before: the PG_TEST_EXTRA variable was checked in
src/test/Makefile, so I exported it there.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-08-28 Thread Nazir Bilal Yavuz
Hi,

On Wed, 28 Aug 2024 at 16:41, Ashutosh Bapat
 wrote:
>
> On Wed, Aug 28, 2024 at 5:26 PM Ashutosh Bapat
>  wrote:
> >
> > On Fri, Aug 23, 2024 at 9:55 PM Jacob Champion
> >  wrote:
> > >
> > > On Fri, Aug 23, 2024 at 5:59 AM Ashutosh Bapat
> > >  wrote:
> > > > If I run
> > > > export PG_TEST_EXTRA=xid_wraparound; ./configure --prefix=$BuildDir
> > > > --enable-tap-tests && make -j4 && make -j4 install; unset
> > > > PG_TEST_EXTRA
> > > > followed by
> > > > make -C $XID_MODULE_DIR check where
> > > > XID_MODULE_DIR=src/test/modules/xid_wraparound - it skips the tests.
> > >
> > > Right.
> > >
> > > > I thought this was working before.
> > >
> > > No, that goes back to my note in [1] -- as of c3382a3c3cc, the
> > > variable was exported at only the src/test level, and I wanted to get
> > > input on that so we could decide on the next approach if needed.
>
> I had an offline discussion with Jacob. Because of c3382a3c3cc, `make
> -C src/test/modules/xid_wraparound check` will skip xid_wraparound
> tests. It should be noted that when that commit was added well before
> the xid_wraparound tests were added. But I don't know whether the
> tests controlled by PG_TEST_EXTRA could be run using make -C that
> time.

I did not test but I think they could not be run because of the same
reason as below.

> Anyway, I think, supporting PG_TEST_EXTRA at configure time without
> being able to run tests using `make -C  ` is not that
> useful. It only helps make check-world but not when running individual
> tests.
>
> Nazir, since you authored c3382a3c3cc, can you please provide input
> that Jacob needs?

I think the problem is that we define PG_TEST_EXTRA in Makefile.global
but do not export it there; instead we export it in src/test/Makefile.
When you run tests from their own Makefiles, those Makefiles do not
include src/test/Makefile (they include Makefile.global), so they
cannot see the PG_TEST_EXTRA environment variable. Exporting
PG_TEST_EXTRA in Makefile.global should solve the problem.
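
As a sketch of that fix (assuming the variable is already assigned in
Makefile.global.in, as described above; the exact surroundings are a
guess):

    # src/Makefile.global.in
    PG_TEST_EXTRA = @PG_TEST_EXTRA@
    export PG_TEST_EXTRA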

Also, tests 110 and 170 do not look correct to me: as currently
written, we do not pass PG_TEST_EXTRA to the make command.

110 should be:
'cd $XID_MODULE_DIR && PG_TEST_EXTRA=xid_wraparound make check'
instead of 'PG_TEST_EXTRA=xid_wraparound cd $XID_MODULE_DIR && make
check'

170 should be:
'cd $XID_MODULE_DIR && PG_TEST_EXTRA="" make check && cd $PGDir'
instead of 'PG_TEST_EXTRA="" cd $XID_MODULE_DIR && make check && cd
$PGDir'

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use read streams in pg_visibility

2024-08-27 Thread Nazir Bilal Yavuz
Hi,

On Fri, 23 Aug 2024 at 22:01, Noah Misch  wrote:
>
> On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 20 Aug 2024 at 21:47, Noah Misch  wrote:
> > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > > The downside of this approach is there are too many "vmbuffer is valid
> > > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > > vmbuffer needs to be read again" cases in the read stream version (700
> > > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > > the read stream reading a new vmbuf while getting next block numbers.
> > > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > > have changed while we were acquiring the page lock.
> > >
> > > Did the test that found 700 "read again" use a different procedure than 
> > > the
> > > one shared as setup.sh down-thread?  Using that script alone, none of the 
> > > vm
> > > bits are set, so I'd not expect any heap page reads.
> >
> > Yes, it is basically the same script but the query is 'SELECT
> > pg_check_visible('${TABLE}'::regclass);'.
>
> I gather the scripts deal exclusively in tables with no vm bits set, so
> pg_visibility did no heap page reads under those scripts.  But the '700 "read
> again"' result suggests either I'm guessing wrong, or that result came from a
> different test procedure.  Thoughts?

Sorry, yes. You need to run VACUUM on the test table before running
the query. The script is attached; you can run it with a [corrupt |
visibility] argument to test the related patches.
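
In other words, something like this ('test_table' stands in for the
${TABLE} the script creates):

    VACUUM test_table;                                -- set the vm bits first
    SELECT pg_check_visible('test_table'::regclass);  -- then run the check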

> > > The "vmbuffer needs to be read again" regression fits what I would expect 
> > > the
> > > v1 patch to do with a table having many vm bits set.  In general, I think 
> > > the
> > > fix is to keep two heap Buffer vars, so the callback can work on one 
> > > vmbuffer
> > > while collect_corrupt_items() works on another vmbuffer.  Much of the 
> > > time,
> > > they'll be the same buffer.  It could be as simple as that, but you could
> > > consider further optimizations like these:
> >
> > Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> > "read-again" cases to ~40. I ran the test again and the timing
> > improvement is ~5% now.
>
> > I think the number of "read-again" cases are low enough now.
>
> Agreed.  The increase from 20 to 40 is probably an increase in buffer mapping
> lookups, not an increase in I/O.
>
> > Does creating something like
> > pg_general_read_stream_next_block() in read_stream code and exporting
> > it makes sense?
>
> To me, that name isn't conveying the function's behavior, and the words "pg_"
> and "general_" aren't adding information there.  How about one of these names
> or similar:
>
> seq_read_stream_cb
> sequential_read_stream_cb
> block_range_read_stream_cb

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

> > > The callback doesn't return blocks having zero vm bits, so the blkno 
> > > variable
> > > is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> > > returning spurious results." looks at the bit for the wrong block.  If 
> > > that's
> > > what v1 does, could you expand the test file to catch that?  For example, 
> > > make
> > > a two-block table with only the second block all-visible.
> >
> > Yes, it was not accurate. I am getting blockno from the buffer now. I
> > checked and confirmed it is working as expected by manually logging
> > blocknos returned from the read stream. I am not sure how to add a
> > test case for this.
>
> VACUUM FREEZE makes an all-visible, all-frozen table.  DELETE of a particular
> TID, even if rolled back, clears both vm bits for the TID's page.  Past tests
> like that had instability problems.  One cause is a concurrent session's XID
> or snapshot, which can prevent VACUUM setting vm bits.  Using a TEMP table may
> have been one of the countermeasures, but I don't remember clearly.  Hence,
> please search the archives or the existing pg_visibility tests for how we
> dealt with that.  It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is
all-visible.
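
For the record, a rough sketch of the suggested recipe (untested; the
row count is a guess aimed at filling two 8kB heap pages with int-only
tuples):

    CREATE TEMP TABLE two_blocks AS
        SELECT g AS id FROM generate_series(1, 400) g;
    VACUUM (FREEZE) two_blocks;  -- both blocks become all-visible/all-frozen
    BEGIN;
    DELETE FROM two_blocks WHERE ctid = '(0,1)';  -- clears vm bits of block 0
    ROLLBACK;
    -- only the second block (block 1) should remain all-visible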

Re: Use read streams in pg_visibility

2024-08-23 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review and feedback!

On Tue, 20 Aug 2024 at 21:47, Noah Misch  wrote:
>
> On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > 2- collect_corrupt_items()
> >
> > This one is more complicated. The read stream callback function loops
> > until it finds a suitable block to read. So, if the callback returns
> > an InvalidBlockNumber; it means that the stream processed all possible
> > blocks and the stream should be finished. There is ~3% timing
> > improvement with this change. I started the server with the default
> > settings and created a 6 GB table. Then run 100 times
> > pg_check_visible() by clearing the OS cache between each run.
> >
> > The downside of this approach is there are too many "vmbuffer is valid
> > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > vmbuffer needs to be read again" cases in the read stream version (700
> > vs 20 for the 6 GB table). This is caused by the callback function of
> > the read stream reading a new vmbuf while getting next block numbers.
> > So, vmbuf is wrong when we are checking visibility map bits that might
> > have changed while we were acquiring the page lock.
>
> Did the test that found 700 "read again" use a different procedure than the
> one shared as setup.sh down-thread?  Using that script alone, none of the vm
> bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

> The "vmbuffer needs to be read again" regression fits what I would expect the
> v1 patch to do with a table having many vm bits set.  In general, I think the
> fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> while collect_corrupt_items() works on another vmbuffer.  Much of the time,
> they'll be the same buffer.  It could be as simple as that, but you could
> consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

> - Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
>   lookup when the other Buffer var already has the block you want.
>
> - [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
>   to stop calling the ReadStreamBlockNumberCB for awhile.  This could help if
>   nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
>   block.  For example, if one block out of every 33000 is all-visible, every
>   heap block we visit has a different vmbuffer.  It's likely not optimal for
>   the callback to pin and unpin 20 vmbuffers, then have
>   collect_corrupt_items() pin and unpin the same 20 vmbuffers.  pg_visibility
>   could notice this trend and request a stop of the callbacks until more of
>   the heap block work completes.  If pg_visibility is going to be the only
>   place in the code with this use case, it's probably not worth carrying the
>   extra API just for pg_visibility.  However, if we get a stronger use case
>   later, pg_visibility could be another beneficiary.

I think the number of "read-again" cases is low enough now. I am
working on using ReadRecentBuffer() or a similar technique, but that
may take more time, so I attached the patches without these
optimizations.
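
For reference, the shape of the two-vmbuffer fix is roughly this (a
sketch, not the attached patch; the struct and function names are
invented, and the real code checks specific vm bits rather than any
nonzero status):

	struct corrupt_items_read_stream_private
	{
		BlockNumber blocknum;
		BlockNumber nblocks;
		Relation	rel;
		Buffer		vmbuffer;	/* the callback's own vm buffer */
	};

	static BlockNumber
	corrupt_items_read_stream_next_block(ReadStream *stream,
										 void *callback_private_data,
										 void *per_buffer_data)
	{
		struct corrupt_items_read_stream_private *p = callback_private_data;

		/* Skip blocks whose vm bits say there is nothing to check. */
		for (; p->blocknum < p->nblocks; p->blocknum++)
		{
			/*
			 * This may replace p->vmbuffer with a different vm page; that
			 * is fine, because collect_corrupt_items() holds a separate
			 * vmbuffer for its own rechecks.
			 */
			if (visibilitymap_get_status(p->rel, p->blocknum, &p->vmbuffer) != 0)
				return p->blocknum++;
		}
		return InvalidBlockNumber;
	}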

> > +/*
> > + * Callback function to get next block for read stream object used in
> > + * collect_visibility_data() function.
> > + */
> > +static BlockNumber
> > +collect_visibility_data_read_stream_next_block(ReadStream *stream,
> > +   
> >  void *callback_private_data,
> > +   
> >  void *per_buffer_data)
> > +{
> > + struct collect_visibility_data_read_stream_private *p = 
> > callback_private_data;
> > +
> > + if (p->blocknum < p->nblocks)
> > + return p->blocknum++;
> > +
> > + return InvalidBlockNumber;
>
> This is the third callback that just iterates over a block range, after
> pg_prewarm_read_stream_next_block() and
> copy_storage_using_buffer_read_stream_next_block().  While not a big problem,
> I think it's time to have a general-use callback for block range scans.  The
> quantity of duplicate code is low, but the existing function names are long
> and less informative than a behavior-based name.

I agree. Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

Re: Use read streams in pg_visibility

2024-08-19 Thread Nazir Bilal Yavuz
Hi,

On Mon, 19 Aug 2024 at 09:30, Michael Paquier  wrote:
>
> On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > I am working on using the read stream in pg_visibility. There are two
> > places to use it:
> >
> > 1- collect_visibility_data()
> >
> > This one is straightforward. I created a read stream object if
> > 'include_pd' is true because read I/O is done when 'include_pd' is
> > true. There is ~4% timing improvement with this change. I started the
> > server with the default settings and created a 6 GB table. Then run
> > 100 times pg_visibility() by clearing the OS cache between each run.
> > --
>
> Mind sharing a script for reproducibility?  Except for the drop_caches
> part, of course..

Sure, the script is attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft


setup.sh
Description: application/shellscript


Use read streams in pg_visibility

2024-08-13 Thread Nazir Bilal Yavuz
Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table, then ran
pg_visibility() 100 times, clearing the OS cache between runs.
--

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read, so if the callback returns
InvalidBlockNumber, it means the stream has processed all possible
blocks and should be finished. There is ~3% timing improvement with
this change. I started the server with the default settings and
created a 6 GB table, then ran pg_check_visible() 100 times, clearing
the OS cache between runs.

The downside of this approach is that there are too many "vmbuffer is
valid but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs. 20 for the 6 GB table). This is caused by the read stream's
callback function reading a new vmbuf while getting the next block
numbers, so vmbuf is wrong when we check visibility map bits that
might have changed while we were acquiring the page lock.
--

Both patches are attached.

Any feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From c178e010759f042dc089fd4c5b2283a04b95f017 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 9 Aug 2024 11:41:52 +0300
Subject: [PATCH v1 1/2] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use a read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 53 ++-
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 1a1a4ff7be7..10f961c58f0 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -41,6 +42,16 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_visibility_data() function.
+ */
+struct collect_visibility_data_read_stream_private
+{
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -463,6 +474,23 @@ pg_visibility_tupdesc(bool include_blkno, bool include_pd)
 	return BlessTupleDesc(tupdesc);
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+			   void *callback_private_data,
+			   void *per_buffer_data)
+{
+	struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Collect visibility data about a relation.
  *
@@ -478,6 +506,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	struct collect_visibility_data_read_stream_private p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +519,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Create a read stream if a read will be done. */
+	if (include_pd)
+	{
+		p.blocknum = 0;
+		p.nblocks = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+			bstrategy,
+			rel,
+			MAIN_FORKNUM,
+			collect_visibility_data_read_stream_next_block,
+			&p,
+			0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +557,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-		bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +568,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);

Re: CREATE DATABASE with filesystem cloning

2024-08-08 Thread Nazir Bilal Yavuz
Hi,

Rebased version of the patch is attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From a419004a26410c9ad9348e2f1420695db8ca35b6 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 8 Aug 2024 15:01:48 +0300
Subject: [PATCH v9] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if the system
supports it).

If the CLONE method is chosen, it behaves like COPY but attempts to use
efficient file-copying system calls.  The kernel then has the
opportunity to share block ranges in copy-on-write file systems, or
perhaps push the copy down to network file systems and storage devices.

This currently works on Linux, FreeBSD and macOS.  More systems could
be supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Reviewed-by: Robert Haas 
Reviewed-by: Ranier Vilela 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 86 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 19 
 src/backend/utils/misc/postgresql.conf.sample |  4 +
 doc/src/sgml/config.sgml  | 46 ++
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..0a9c6fb62c0 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,30 @@
 #include 
 #include 
 
+#ifdef HAVE_COPYFILE_H
+#include 
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+static void clone_file(const char *fromfile, const char *tofile);
+
 /*
  * copydir: copy a directory
  *
  * If recurse is false, subdirectories are ignored.  Anything that's not
  * a directory or a regular file is ignored.
+ *
+ * This function uses the file_copy_method GUC to determine the copy method.
+ * Uses of this function must be documented in the list of places
+ * affected by this GUC.
  */
 void
 copydir(const char *fromdir, const char *todir, bool recurse)
@@ -71,7 +84,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+		{
+			if (file_copy_method == FILE_COPY_METHOD_CLONE)
+clone_file(fromfile, tofile);
+			else
+copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -214,3 +232,69 @@ copy_file(const char *fromfile, const char *tofile)
 
 	pfree(buffer);
 }
+
+/*
+ * clone one file
+ */
+static void
+clone_file(const char *fromfile, const char *tofile)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(fromfile, tofile, NULL, COPYFILE_CLONE_FORCE) < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not clone file \"%s\" to \"%s\": %m",
+		fromfile, tofile)));
+#elif defined(HAVE_COPY_FILE_RANGE)
+	int			srcfd;
+	int			dstfd;
+	ssize_t		nbytes;
+
+	srcfd = OpenTransientFile(fromfile, O_RDONLY | PG_BINARY);
+	if (srcfd < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", fromfile)));
+
+	dstfd = OpenTransientFile(tofile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+	if (dstfd < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", tofile)));
+
+	do
+	{
+		/* If we got a cancel signal during the copy of the file, quit */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Don't copy too much at once, so we can check for interrupts from
+		 * time to time if this falls back to a slow copy.
+		 */
+		pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_COPY);
+		nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, 1024 * 1024, 0);
+		if (nbytes < 0 && errno != EINTR)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg("could not clone file \"%s\" to \"%s\": %m",
+			fromfile, tofile)));
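
To make the commit message above concrete, hypothetical usage once the
GUC exists (assuming the usual lowercase labels for enum GUC values,
and the FILE_COPY strategy path that goes through copydir()):

    SET file_copy_method = 'clone';
    CREATE DATABASE clonedb TEMPLATE template0 STRATEGY = file_copy;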

Using read stream in autoprewarm

2024-08-08 Thread Nazir Bilal Yavuz
Hi,

I am working on using the read stream in autoprewarm. I observed ~10%
performance gain with this change. The patch is attached.

The downside of the read stream approach is that a new read stream
object needs to be created for each database, relation and fork. I was
wondering if this would cause a regression, but it did not (at least
according to my testing). Another downside could be the code getting
more complicated.

For the testing:
- 50 databases, each with 50 tables of 520 kB:
  - patched: 51157 ms
  - master: 56769 ms
- 5 databases, each with 1 table of 3 GB:
  - patched: 32679 ms
  - master: 36706 ms

I put a debugging message with timing information in the
autoprewarm_database_main() function, then ran autoprewarm 100 times
(by restarting the server), clearing the OS cache before each restart.
Also, I ensured that the block number of the buffer returned from the
read stream API is correct. I am not sure whether that much testing is
enough for this kind of change.

Any feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From c5e286612912ba6840d967812171162a948153e4 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 7 Aug 2024 17:27:50 +0300
Subject: [PATCH v1] Use read stream in autoprewarm

Instead of reading blocks with ReadBufferExtended(), create a read
stream object for each possible case and use it.

This change provides about 10% performance improvement.
---
 contrib/pg_prewarm/autoprewarm.c | 102 +--
 1 file changed, 97 insertions(+), 5 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d061731706a..96e93c46f85 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -44,6 +44,7 @@
 #include "storage/lwlock.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
+#include "storage/read_stream.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
@@ -429,6 +430,58 @@ apw_load_buffers(void)
 		apw_state->prewarmed_blocks, num_elements)));
 }
 
+struct apw_read_stream_private
+{
+	bool		first_block;
+	int			max_pos;
+	int			pos;
+	BlockInfoRecord *block_info;
+	BlockNumber nblocks_in_fork;
+
+};
+
+static BlockNumber
+apw_read_stream_next_block(ReadStream *stream,
+		   void *callback_private_data,
+		   void *per_buffer_data)
+{
+	struct apw_read_stream_private *p = callback_private_data;
+	bool	   *rs_have_free_buffer = per_buffer_data;
+	BlockInfoRecord *old_blk;
+	BlockInfoRecord *cur_blk;
+
+	*rs_have_free_buffer = true;
+
+	if (!have_free_buffer())
+	{
+		*rs_have_free_buffer = false;
+		return InvalidBlockNumber;
+	}
+
+	if (p->pos == p->max_pos)
+		return InvalidBlockNumber;
+
+	if (p->first_block)
+	{
+		p->first_block = false;
+		return p->block_info[p->pos++].blocknum;
+	}
+
+	old_blk = &(p->block_info[p->pos - 1]);
+	cur_blk = &(p->block_info[p->pos]);
+
+	if (old_blk->database == cur_blk->database &&
+		old_blk->forknum == cur_blk->forknum &&
+		old_blk->filenumber == cur_blk->filenumber &&
+		cur_blk->blocknum < p->nblocks_in_fork)
+	{
+		p->pos++;
+		return cur_blk->blocknum;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Prewarm all blocks for one database (and possibly also global objects, if
  * those got grouped with this database).
@@ -442,6 +495,9 @@ autoprewarm_database_main(Datum main_arg)
 	BlockNumber nblocks = 0;
 	BlockInfoRecord *old_blk = NULL;
 	dsm_segment *seg;
+	ReadStream *stream = NULL;
+	struct apw_read_stream_private p;
+	bool	   *rs_have_free_buffer;
 
 	/* Establish signal handlers; once that's done, unblock signals. */
 	pqsignal(SIGTERM, die);
@@ -458,13 +514,16 @@ autoprewarm_database_main(Datum main_arg)
 	block_info = (BlockInfoRecord *) dsm_segment_address(seg);
 	pos = apw_state->prewarm_start_idx;
 
+	p.block_info = block_info;
+	p.max_pos = apw_state->prewarm_stop_idx;
+
 	/*
 	 * Loop until we run out of blocks to prewarm or until we run out of free
 	 * buffers.
 	 */
-	while (pos < apw_state->prewarm_stop_idx && have_free_buffer())
+	for (; pos < apw_state->prewarm_stop_idx; pos++)
 	{
-		BlockInfoRecord *blk = &block_info[pos++];
+		BlockInfoRecord *blk = &block_info[pos];
 		Buffer		buf;
 
 		CHECK_FOR_INTERRUPTS();
@@ -477,6 +536,18 @@ autoprewarm_database_main(Datum main_arg)
 			old_blk->database != 0)
 			break;
 
+		/*
+		 * If stream needs to be created again, end it before closing the old
+		 * relation.
+		 */
+		if (stream && (old_blk == NULL ||
+	   old_blk->filenumber != blk->filenumber ||
+	   old_blk->forknum != blk->forknum))
+		{
+			Assert(read_stream_next_buffer(stream, (void **) &rs_have_free_buffer) == InvalidBuffer);

Re: PG_TEST_EXTRA and meson

2024-07-23 Thread Nazir Bilal Yavuz
Hi,

On Tue, 23 Jul 2024 at 13:40, Ashutosh Bapat
 wrote:
>
> On Tue, Jul 23, 2024 at 4:02 PM Nazir Bilal Yavuz  wrote:
> >
> > > I wonder whether we really require pg_test_extra argument to testwrap.
> > > Why can't we use the logic in testwrap, to set run time PG_TEST_EXTRA,
> > > in meson.build directly? I.e. set test_env['PG_TEST_EXTRA'] to
> > > os.environ[;PG_TEST_EXTRA'] if the latter is set, otherwise set the
> > > first to get_option('PG_TEST_EXTRA').
> >
> > When test_env('PG_TEST_EXTRA') is set, it could not be overridden
> > afterwards. Perhaps there is a way to override test_env() but I do not
> > know how.
> >
>
> I am not suggesting to override test_env['PG_TEST_EXTRA'] but set it
> to the value after overriding if required. meson.build file seems to
> allow some conditional variable setting. So I thought this would be
> possible, haven't tried myself though.

Sorry if I caused a misunderstanding. What I meant was: once
test_env('PG_TEST_EXTRA') is set, Meson will always use the
PG_TEST_EXTRA value from the setup, so the build needs to be
reconfigured to change PG_TEST_EXTRA.

AFAIK Meson does not support accessing environment variables, but
run_command() could be used to test this:

-test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
+pg_test_extra_env = run_command(
+  [python,
+  '-c',
+  'import os; print(os.getenv("PG_TEST_EXTRA", ""))'],
+  check: true).stdout()
+
+test_env.set('PG_TEST_EXTRA', pg_test_extra_env != '' ?
+  pg_test_extra_env :
+  get_option('PG_TEST_EXTRA'))
+

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-07-23 Thread Nazir Bilal Yavuz
Hi,

On Tue, 23 Jul 2024 at 12:26, Ashutosh Bapat
 wrote:
>
> Upthread Tom asked whether we should do a symmetric change to "make".
> This set of patches does not do that. Here are my thoughts:
> 1. Those who use make, are not using configure time PG_TEST_EXTRA
> anyway, so they don't need it.
> 2. Those meson users who use setup time PG_TEST_EXTRA and also want to
> use make may find the need for the feature in make.
> 3. https://www.postgresql.org/docs/devel/install-requirements.html
> says that the meson support is currently experimental and only works
> when building from a Git checkout. So there's a possibility (even if
> theoretical) that make and meson will co-exist. Also that we may
> abandon meson?
>
> Considering those, it seems to me that symmetry is required. I don't
> know how hard it is to introduce PG_TEST_EXTRA as a configure time
> option in "make". If it's simple, we should do that. Otherwise it will
> be better to just remove PG_EXTRA_TEST option support from meson
> support to keep make and meson symmetrical.

I agree that symmetry should be the ultimate goal.

Upthread, Jacob said he could work on a patch introducing the
PG_TEST_EXTRA configure option to make builds. Would you still be
interested in working on this? If not, I would gladly work on it.

> As far as the implementation is concerned the patch seems to be doing
> what's expected. If PG_TEST_EXTRA is specified at setup time, it is
> not needed to be specified as an environment variable at run time. But
> it can also be overridden at runtime. If PG_TEST_EXTRA is not
> specified at the time of setup, but specified at run time, it works. I
> have tested xid_wraparound and wal_consistency_check.

Thanks for testing it!

> I wonder whether we really require pg_test_extra argument to testwrap.
> Why can't we use the logic in testwrap, to set run time PG_TEST_EXTRA,
> in meson.build directly? I.e. set test_env['PG_TEST_EXTRA'] to
> os.environ[;PG_TEST_EXTRA'] if the latter is set, otherwise set the
> first to get_option('PG_TEST_EXTRA').

When test_env('PG_TEST_EXTRA') is set, it cannot be overridden
afterwards. Perhaps there is a way to override test_env(), but I do
not know of one.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-22 Thread Nazir Bilal Yavuz
Hi,

On Sat, 20 Jul 2024 at 21:14, Noah Misch  wrote:
>
> On Sat, Jul 20, 2024 at 03:01:31PM +0300, Nazir Bilal Yavuz wrote:
> >
> > With the separate commit (e00c45f685), does it make sense to rename
> > the smgr_persistence parameter of the ReadBuffer_common() to
> > persistence? Because, ExtendBufferedRelTo() calls ReadBuffer_common()
> > with rel's persistence now, not with smgr's persistence.
>
> BMR_REL() doesn't set relpersistence, so bmr.relpersistence is associated with
> bmr.smgr and is unset if bmr.rel is set.  That is to say, bmr.relpersistence
> is an smgr_persistence.  It could make sense to change ReadBuffer_common() to
> take a BufferManagerRelation instead of the three distinct arguments.

Got it.

>
> On a different naming topic, my review missed that field name
> copy_storage_using_buffer_read_stream_private.last_block doesn't fit how the
> field is used.  Code uses it like an nblocks.  So let's either rename the
> field or change the code to use it as a last_block (e.g. initialize it to
> nblocks-1, not nblocks).

I preferred renaming it to nblocks, since that is how it was used in
RelationCopyStorageUsingBuffer() before. Also, I realized that instead
of setting p.blocknum = 0, it makes sense to initialize blkno to 0 and
use p.blocknum = blkno, because p.blocknum and blkno should always
start at the same block number. The relevant patch is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 1110cfe915dd885c90f91014f8dbece20426efbc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 22 Jul 2024 11:07:50 +0300
Subject: [PATCH v5] Minimal refactorings on RelationCopyStorageUsingBuffer()

- copy_storage_using_buffer_read_stream_private.last_block is renamed as
  nblocks.

- blkno is initialized as 0 and used while setting
  copy_storage_using_buffer_read_stream_private.blocknum.
---
 src/backend/storage/buffer/bufmgr.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a6aeecdf534..29d862ccfde 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -143,7 +143,7 @@ typedef struct SMgrSortArray
 struct copy_storage_using_buffer_read_stream_private
 {
 	BlockNumber blocknum;
-	BlockNumber last_block;
+	BlockNumber nblocks;
 };
 
 /*
@@ -157,7 +157,7 @@ copy_storage_using_buffer_read_stream_next_block(ReadStream *stream,
 {
 	struct copy_storage_using_buffer_read_stream_private *p = callback_private_data;
 
-	if (p->blocknum < p->last_block)
+	if (p->blocknum < p->nblocks)
 		return p->blocknum++;
 
 	return InvalidBlockNumber;
@@ -4709,7 +4709,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	Page		dstPage;
 	bool		use_wal;
 	BlockNumber nblocks;
-	BlockNumber blkno;
+	BlockNumber blkno = 0;
 	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
@@ -4745,8 +4745,8 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
 
 	/* Initialize streaming read */
-	p.blocknum = 0;
-	p.last_block = nblocks;
+	p.blocknum = blkno;
+	p.nblocks = nblocks;
 	src_smgr = smgropen(srclocator, INVALID_PROC_NUMBER);
 	src_stream = read_stream_begin_smgr_relation(READ_STREAM_FULL,
  bstrategy_src,
@@ -4758,7 +4758,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
  0);
 
 	/* Iterate over each block of the source relation file. */
-	for (blkno = 0; blkno < nblocks; blkno++)
+	for (; blkno < nblocks; blkno++)
 	{
 		CHECK_FOR_INTERRUPTS();
 
-- 
2.45.2



Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-20 Thread Nazir Bilal Yavuz
Hi,

On Sat, 20 Jul 2024 at 14:27, Noah Misch  wrote:
>
> On Thu, Jul 18, 2024 at 02:11:13PM +0300, Nazir Bilal Yavuz wrote:
> > v4 is attached.
>
> Removal of the PinBufferForBlock() comment about the "persistence =
> RELPERSISTENCE_PERMANENT" fallback started to feel like a loss.  I looked for
> a way to re-add a comment about the fallback.
> https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html
> shows no test coverage of that fallback, and I think the fallback is
> unreachable.  Hence, I've removed the fallback in a separate commit.  I've
> pushed that and your three patches.  Thanks.

Thanks for the separate commit and push!

With the separate commit (e00c45f685), does it make sense to rename
the smgr_persistence parameter of ReadBuffer_common() to persistence?
ExtendBufferedRelTo() now calls ReadBuffer_common() with the rel's
persistence, not the smgr's persistence.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-07-19 Thread Nazir Bilal Yavuz
Hi,

On Wed, 17 Jul 2024 at 13:13, Ashutosh Bapat
 wrote:
>
> If you are willing to work on this further, please add it to the commitfest.

Since the general consensus leans towards having an environment
variable that overrides the Meson configure option, I converted
solution-3 into a proper patch. I updated the docs, added more
comments, and added this to the commitfest [1].
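
Under the patched behavior, a session would look roughly like this
(paths and test names are illustrative):

    meson setup build -DPG_TEST_EXTRA=xid_wraparound
    meson test -C build ...                               # uses xid_wraparound
    PG_TEST_EXTRA="ldap kerberos" meson test -C build ... # environment wins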

The one downside of this approach is that PG_TEST_EXTRA shown among the
user-defined options in the meson setup output could be misleading now.

Also, with this change, the PG_TEST_EXTRA settings in configure_scripts
in the .cirrus.tasks.yml file should be removed, as they are useless
now. I added that as a second patch.

[1] https://commitfest.postgresql.org/49/5134/

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 7e6c31c73d14f8e50d13ad7ce4aa1aa167193afd Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 19 Jul 2024 09:48:47 +0300
Subject: [PATCH v1 1/2] Make PG_TEST_EXTRA env variable override its Meson
 configure option

Since the PG_TEST_EXTRA configure option is set while configuring
Meson, each change to PG_TEST_EXTRA requires a reconfigure, which is
costly. So, first try to use PG_TEST_EXTRA from the environment; if it
is not set, fall back to its configure option in Meson builds.
---
 doc/src/sgml/installation.sgml |  6 --
 meson.build| 11 ++-
 meson_options.txt  |  2 +-
 src/tools/testwrap |  6 ++
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 4784834ab9f..a78ef8cb78b 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -3132,8 +3132,10 @@ ninja install
   

 Enable test suites which require special software to run.  This option
-accepts arguments via a whitespace-separated list.  See  for details.
+accepts arguments via a whitespace-separated list.  Please note that
+this configure option is overridden by the PG_TEST_EXTRA environment
+variable if it is set.  See  for
+details.

   
  
diff --git a/meson.build b/meson.build
index 5387bb6d5fd..e0b04b01756 100644
--- a/meson.build
+++ b/meson.build
@@ -3224,11 +3224,6 @@ test_env.set('PG_REGRESS', pg_regress.full_path())
 test_env.set('REGRESS_SHLIB', regress_module.full_path())
 test_env.set('INITDB_TEMPLATE', test_initdb_template)
 
-# Test suites that are not safe by default but can be run if selected
-# by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
-# Export PG_TEST_EXTRA so it can be checked in individual tap tests.
-test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
-
 # Add the temporary installation to the library search path on platforms where
 # that works (everything but windows, basically). On windows everything
 # library-like gets installed into bindir, solving that issue.
@@ -3292,6 +3287,12 @@ foreach test_dir : tests
 testwrap,
 '--basedir', meson.build_root(),
 '--srcdir', test_dir['sd'],
+# Test suites that are not safe by default but can be run if selected
+# by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
+# Export PG_TEST_EXTRA so it can be checked in individual tap tests.
+# This configure option is overridden by the PG_TEST_EXTRA environment
+# variable if it is set.
+'--pg_test_extra', get_option('PG_TEST_EXTRA'),
   ]
 
   foreach kind, v : test_dir
diff --git a/meson_options.txt b/meson_options.txt
index 246cecf3827..e1d55311702 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -47,7 +47,7 @@ option('injection_points', type: 'boolean', value: false,
   description: 'Enable injection points')
 
 option('PG_TEST_EXTRA', type: 'string', value: '',
-  description: 'Enable selected extra tests')
+  description: 'Enable selected extra tests; note that this configure option is overridden by the PG_TEST_EXTRA environment variable if it is set')
 
 option('atomics', type: 'boolean', value: true,
   description: 'Use atomic operations')
diff --git a/src/tools/testwrap b/src/tools/testwrap
index 9a270beb72d..0dca4c26f2b 100755
--- a/src/tools/testwrap
+++ b/src/tools/testwrap
@@ -13,6 +13,7 @@ parser.add_argument('--basedir', help='base directory of test', type=str)
 parser.add_argument('--testgroup', help='test group', type=str)
 parser.add_argument('--testname', help='test name', type=str)
 parser.add_argument('--skip', help='skip test (with reason)', type=str)
+parser.add_argument('--pg_test_extra', help='extra tests', type=str)
 parser.add_argument('test_command', nargs='*')
 
args = parser.parse_args()
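
The precedence rule described in the commit message could be sketched
in testwrap like this (hypothetical code; only the --pg_test_extra
argument is taken from the visible part of the patch):

    # The environment variable wins over the configure-time option.
    env_dict = {**os.environ}
    if 'PG_TEST_EXTRA' not in env_dict and args.pg_test_extra:
        env_dict['PG_TEST_EXTRA'] = args.pg_test_extra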

Re: CI, macports, darwin version problems

2024-07-18 Thread Nazir Bilal Yavuz
Hi,

On Thu, 18 Jul 2024 at 17:01, Joe Conway  wrote:
>
> On 7/18/24 08:55, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Thu, 18 Jul 2024 at 15:00, Joe Conway  wrote:
> >>
> >> On 7/18/24 07:55, Joe Conway wrote:
> >> > On 7/18/24 04:12, Nazir Bilal Yavuz wrote:
> >> >> Could it be pulling the ''macos-runner:sonoma' image on every run?
> >> >
> >> > Or perhaps since this was the first run it simply needed to pull the
> >> > image for the first time?
> >
> > It was not the first run, Thomas rerun it a couple of times but all of
> > them were in the same build. So, I thought that CI may set some
> > settings to pull the image while starting the build, so it
> > re-downloads the image for all the tasks in the same build. But that
> > looks wrong because of what you said below.
> >
> >> >
> >> > The scheduling timing (21:24) looks a lot like what I observed when I
> >> > did the test for the time to download. Unfortunately I did not time the
> >> > test though.
> >>
> >> Actually it does look like the image is gone now based on the free space
> >> on the volume, so maybe it is pulling every run and cleaning up rather
> >> than caching for some reason?
> >>
> >> Filesystem   Size   Used  Avail Capacity
> >> /dev/disk3s5228Gi   39Gi  161Gi20%
> >
> > That is interesting. Only one thing comes to my mind. It seems that
> > the 'tart prune' command runs automatically to reclaim space when
> > there is no space left and thinks it can reclaim the space by removing
> > some things [1]. So, it could be that somehow 'tart prune' ran
> > automatically and deleted the sonoma image. I think you can check if
> > this is the case. You can check these locations [2] from ci-user to
> > see when ventura images are created. If they have been created less
> > than 1 day ago, I think the current space is not enough to pull both
> > ventura and sonoma images.
>
> I think you nailed it (this will wrap badly):
> 8<-
> macmini:~ ci-run$ ll ~/.tart/cache/OCIs/ghcr.io/cirruslabs/*
> /Users/ci-run/.tart/cache/OCIs/ghcr.io/cirruslabs/macos-runner:
> total 0
> drwxr-xr-x  2 ci-run  staff   64 Jul 17 23:53 .
> drwxr-xr-x  5 ci-run  staff  160 Jul 17 17:16 ..
>
> /Users/ci-run/.tart/cache/OCIs/ghcr.io/cirruslabs/macos-sonoma-base:
> total 0
> drwxr-xr-x  2 ci-run  staff   64 Jul 17 13:18 .
> drwxr-xr-x  5 ci-run  staff  160 Jul 17 17:16 ..
>
> /Users/ci-run/.tart/cache/OCIs/ghcr.io/cirruslabs/macos-ventura-base:
> total 0
> drwxr-xr-x  4 ci-run  staff  128 Jul 17 23:53 .
> drwxr-xr-x  5 ci-run  staff  160 Jul 17 17:16 ..
> lrwxr-xr-x  1 ci-run  staff  140 Jul 17 23:53 latest ->
> /Users/ci-run/.tart/cache/OCIs/ghcr.io/cirruslabs/macos-ventura-base/sha256:bddfa1e2b6f6ec41b5db844b06a6784a2bffe0b071965470efebd95ea3355b4f
> drwxr-xr-x  5 ci-run  staff  160 Jul 17 23:53
> sha256:bddfa1e2b6f6ec41b5db844b06a6784a2bffe0b071965470efebd95ea3355b4f
> 8<-
>
> So perhaps I am back to needing more storage...

You might not need more storage. Thomas knows better, but AFAIU CFBot
will pull only the sonoma image after the patch in this thread gets
merged, and your storage seems sufficient for that.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CI, macports, darwin version problems

2024-07-18 Thread Nazir Bilal Yavuz
Hi,

On Thu, 18 Jul 2024 at 15:00, Joe Conway  wrote:
>
> On 7/18/24 07:55, Joe Conway wrote:
> > On 7/18/24 04:12, Nazir Bilal Yavuz wrote:
> >> Could it be pulling the ''macos-runner:sonoma' image on every run?
> >
> > Or perhaps since this was the first run it simply needed to pull the
> > image for the first time?

It was not the first run; Thomas reran it a couple of times, but all
of those runs were in the same build. So I thought CI might be
configured to pull the image when starting a build, re-downloading it
for all the tasks in the same build. But that looks wrong given what
you said below.

> >
> > The scheduling timing (21:24) looks a lot like what I observed when I
> > did the test for the time to download. Unfortunately I did not time the
> > test though.
>
> Actually it does look like the image is gone now based on the free space
> on the volume, so maybe it is pulling every run and cleaning up rather
> than caching for some reason?
>
> Filesystem   Size   Used  Avail Capacity
> /dev/disk3s5228Gi   39Gi  161Gi20%

That is interesting. Only one thing comes to mind: it seems that the
'tart prune' command runs automatically to reclaim space when there is
no space left, removing cached images it thinks it can reclaim [1]. So
it could be that 'tart prune' ran automatically and deleted the sonoma
image. You can check whether this is the case: look at these locations
[2] as the ci-user to see when the ventura images were created. If
they were created less than a day ago, I think the current space is
not enough to hold both the ventura and sonoma images.

[1] https://github.com/cirruslabs/tart/issues/33#issuecomment-1134789129
[2] https://tart.run/faq/#vm-location-on-disk

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-18 Thread Nazir Bilal Yavuz
Hi,

On Wed, 17 Jul 2024 at 23:41, Noah Misch  wrote:
>
> On Wed, Jul 17, 2024 at 12:22:49PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 16 Jul 2024 at 15:19, Noah Misch  wrote:
> > > On Tue, Jul 16, 2024 at 02:11:20PM +0300, Nazir Bilal Yavuz wrote:
> > > > On Fri, 12 Jul 2024 at 02:52, Noah Misch  wrote:
> > > > > On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> > > > > > --- a/src/backend/storage/aio/read_stream.c
> > > > > > +++ b/src/backend/storage/aio/read_stream.c
> > > > > > @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
> > > > > >   {
> > > > > >   stream->ios[i].op.rel = rel;
> > > > > >   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> > > > > > - stream->ios[i].op.smgr_persistence = 0;
> > > > > > + stream->ios[i].op.smgr_persistence = 
> > > > > > rel->rd_rel->relpersistence;
> > > > >
> > > > > Does the following comment in ReadBuffersOperation need an update?
> > > > >
> > > > > /*
> > > > >  * The following members should be set by the caller.  If 
> > > > > only smgr is
> > > > >  * provided without rel, then smgr_persistence can be set to 
> > > > > override the
> > > > >  * default assumption of RELPERSISTENCE_PERMANENT.
> > > > >  */
> > > >
> > > > I believe it does not need to be updated but I renamed
> > > > 'ReadBuffersOperation.smgr_persistence' as
> > > > 'ReadBuffersOperation.persistence'. So, this comment is updated as
> > > > well. I think that rename suits better because persistence does not
> > > > need to come from smgr, it could come from relation, too. Do you think
> > > > it is a good idea? If it is, does it need a separate commit?
> > >
> > > The rename is good.  I think the comment implies "persistence" is unused 
> > > when
> > > rel!=NULL.  That implication is true before the patch but false after the
> > > patch.
> >
> > What makes it false after the patch? I think the logic did not change.
> > If there is rel, the value of persistence is obtained from
> > 'rel->rd_rel->relpersistence'. If there is no rel, then smgr is used
> > to obtain its value.
>
> First, the patch removes the "default assumption of RELPERSISTENCE_PERMANENT".
> It's now an assertion failure.
>
> The second point is about "If only smgr is provided without rel".  Before the
> patch, the extern functions that take a ReadBuffersOperation argument examine
> smgr_persistence if and only if rel==NULL.  That's consistent with the
> comment.  After the patch, StartReadBuffersImpl() calling PinBufferForBlock()
> uses the field unconditionally.

I see, thanks for the explanation. I removed that part of the comment.

>
> On that note, does WaitReadBuffers() still have a reason to calculate its
> persistence as follows, or should this patch make it "persistence =
> operation->persistence"?
>
> persistence = operation->rel
> ? operation->rel->rd_rel->relpersistence
> : RELPERSISTENCE_PERMANENT;

Nice catch; I do not think it is needed now. WaitReadBuffers() is
called only from ReadBuffer_common() and read_stream_next_buffer(). For
ReadBuffer_common(), persistence is calculated before calling
WaitReadBuffers(); for read_stream_next_buffer(), it is calculated
while creating the read stream object in read_stream_begin_impl().
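
So the quoted calculation can presumably be reduced to this (sketch):

	/* before (as quoted above) */
	persistence = operation->rel
		? operation->rel->rd_rel->relpersistence
		: RELPERSISTENCE_PERMANENT;

	/* after: the operation's persistence is always initialized up front */
	persistence = operation->persistence;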

v4 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 6d677bf4c1f870c03461532ebb96297c3950db2b Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH v4 1/3] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/include/storage/bufmgr.h  |  6 ++---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 37 ---
 3 files changed, 20 insertions(+), 25 deletions(-)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a1e71013d32..b4e454cf8af 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -115,13 +115,11 @@ 

Re: CI, macports, darwin version problems

2024-07-18 Thread Nazir Bilal Yavuz
Hi,

On Thu, 18 Jul 2024 at 07:40, Thomas Munro  wrote:
>
> On Thu, Jul 18, 2024 at 9:58 AM Joe Conway  wrote:
> > On 7/17/24 16:41, Andres Freund wrote:
> > > Does "tart pull ghcr.io/cirruslabs/macos-runner:sonoma" as the CI user
> > > succeed?
> >
> > Yes, with about 25 GB to spare.
>
> Thanks.  Now it works!  But for some reason it spends several minutes
> in the "scheduling" stage before it starts.  Are there any logs that
> might give a clue what it was doing, for example for this run?
>
> https://cirrus-ci.com/task/5963784852865024

Could it be pulling the 'macos-runner:sonoma' image on every run? I
cross-compared: for every new version of the 'macos-sonoma-base:latest'
image [1], scheduling takes ~4 minutes [2]; then it takes a couple of
seconds [2] for consecutive runs until a new version of the image is
released. Also, according to their manifests, the uncompressed size of
the runner image is 5x that of the sonoma-base image [3]. This is very
close to the ratio of scheduling times between 'macos-runner:sonoma'
and a newly pulled 'macos-sonoma-base:latest' (22 mins / 4 mins).

[1] 
https://github.com/cirruslabs/macos-image-templates/pkgs/container/macos-sonoma-base/versions

[2]
https://cirrus-ci.com/task/5299490515582976 -> 4 minutes, first pull
https://cirrus-ci.com/task/6081946936147968 -> 20 seconds
https://cirrus-ci.com/task/6078712070799360 -> 4 minutes, new version
of the image was released on the same day (6th of July)
https://cirrus-ci.com/task/6539977129984000 -> 40 seconds
https://cirrus-ci.com/task/5839361126694912 -> 40 seconds
https://cirrus-ci.com/task/6708845278396416 -> 4 minutes, new version
of the image was released a day ago

[3]
https://github.com/cirruslabs/macos-image-templates/pkgs/container/macos-sonoma-base/245087497?tag=latest
https://github.com/orgs/cirruslabs/packages/container/macos-runner/242649219?tag=sonoma

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-07-17 Thread Nazir Bilal Yavuz
Hi,

On Wed, 17 Jul 2024 at 13:23, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Wed, 17 Jul 2024 at 13:13, Ashutosh Bapat
>  wrote:
> > xid_wraparound tests are run if PG_TEST_EXTRA contains xid_wraparound
> > or it is not set. Any other setting will not run xid_wraparound test.
> > That's how the patch is coded but it isn't intuitive that changing
> > whether a test is run by default would require configuring the build
> > again. Probably we should just get rid of config time PG_TEST_EXTRA
> > altogether.
> >
> > I am including +Tristan Partin who knows meson better.
> >
> > If you are willing to work on this further, please add it to the commitfest.
>
> I think I know why there is confusion. Could you try to set
> PG_TEST_EXTRA with quotes? Like PG_TEST_EXTRA="ldap mts
> xid_wraparound".

Sorry, the previous reply was wrong; I misunderstood what you said.
Yes, that is how the patch was coded, and I agree that getting rid of
the config-time PG_TEST_EXTRA could be a better alternative.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PG_TEST_EXTRA and meson

2024-07-17 Thread Nazir Bilal Yavuz
Hi,

On Wed, 17 Jul 2024 at 13:13, Ashutosh Bapat
 wrote:
> xid_wraparound tests are run if PG_TEST_EXTRA contains xid_wraparound
> or it is not set. Any other setting will not run xid_wraparound test.
> That's how the patch is coded but it isn't intuitive that changing
> whether a test is run by default would require configuring the build
> again. Probably we should just get rid of config time PG_TEST_EXTRA
> altogether.
>
> I am including +Tristan Partin who knows meson better.
>
> If you are willing to work on this further, please add it to the commitfest.

I think I know why there is confusion. Could you try to set
PG_TEST_EXTRA with quotes? Like PG_TEST_EXTRA="ldap mts
xid_wraparound".

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-17 Thread Nazir Bilal Yavuz
Hi,

On Tue, 16 Jul 2024 at 15:19, Noah Misch  wrote:
>
> On Tue, Jul 16, 2024 at 02:11:20PM +0300, Nazir Bilal Yavuz wrote:
> > On Fri, 12 Jul 2024 at 02:52, Noah Misch  wrote:
> > > On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> > > > --- a/src/backend/storage/aio/read_stream.c
> > > > +++ b/src/backend/storage/aio/read_stream.c
> > > > @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
> > > >   {
> > > >   stream->ios[i].op.rel = rel;
> > > >   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> > > > - stream->ios[i].op.smgr_persistence = 0;
> > > > + stream->ios[i].op.smgr_persistence = 
> > > > rel->rd_rel->relpersistence;
> > >
> > > Does the following comment in ReadBuffersOperation need an update?
> > >
> > > /*
> > >  * The following members should be set by the caller.  If only 
> > > smgr is
> > >  * provided without rel, then smgr_persistence can be set to 
> > > override the
> > >  * default assumption of RELPERSISTENCE_PERMANENT.
> > >  */
> >
> > I believe it does not need to be updated but I renamed
> > 'ReadBuffersOperation.smgr_persistence' as
> > 'ReadBuffersOperation.persistence'. So, this comment is updated as
> > well. I think that rename suits better because persistence does not
> > need to come from smgr, it could come from relation, too. Do you think
> > it is a good idea? If it is, does it need a separate commit?
>
> The rename is good.  I think the comment implies "persistence" is unused when
> rel!=NULL.  That implication is true before the patch but false after the
> patch.

What makes it false after the patch? I think the logic did not change.
If there is rel, the value of persistence is obtained from
'rel->rd_rel->relpersistence'. If there is no rel, then smgr is used
to obtain its value.

> > > > @@ -4667,19 +4698,31 @@ RelationCopyStorageUsingBuffer(RelFileLocator 
> > > > srclocator,
> > >
> > > >   /* Iterate over each block of the source relation file. */
> > > >   for (blkno = 0; blkno < nblocks; blkno++)
> > > >   {
> > > >   CHECK_FOR_INTERRUPTS();
> > > >
> > > >   /* Read block from source relation. */
> > > > - srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, 
> > > > blkno,
> > > > -   
> > > >  RBM_NORMAL, bstrategy_src,
> > > > -   
> > > >  permanent);
> > > > + srcBuf = read_stream_next_buffer(src_stream, NULL);
> > > >   LockBuffer(srcBuf, BUFFER_LOCK_SHARE);
> > >
> > > I think this should check for read_stream_next_buffer() returning
> > > InvalidBuffer.  pg_prewarm doesn't, but the other callers do, and I think 
> > > the
> > > other callers are a better model.  LockBuffer() doesn't check the
> > > InvalidBuffer case, so let's avoid the style of using a
> > > read_stream_next_buffer() return value without checking.
> >
> > There is an assert in the LockBuffer which checks for the
> > InvalidBuffer. If that is not enough, we may add an if check for
> > InvalidBuffer but what should we do in this case? It should not
> > happen, so erroring out may be a good idea.
>
> I like this style from read_stream_reset():
>
> while ((buffer = read_stream_next_buffer(stream, NULL)) != 
> InvalidBuffer)
> {
> ...
> }
>
> That is, don't iterate over block numbers.  Drain the stream until empty.  If
> the stream returns a number of blocks higher or lower than we expected, we
> won't detect that, and that's okay.  It's not a strong preference, so I'm open
> to arguments against that from you or others.  A counterargument could be that
> read_stream_reset() doesn't know the buffer count, so it has no choice.  The
> counterargument could say that callers knowing the block count should use the
> pg_prewarm() style, and others should use the read_stream_reset() style.

I think what you said in the counterargument makes sense. Also, there
is an 'Assert(read_stream_next_buffer(src_stream, NULL) ==
InvalidBuffer);' after the loop, which means all the blocks in the
stream have already been consumed.
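
For reference, the two loop styles under discussion look roughly like
this (a sketch; buffer use and release elided):

    Buffer      buf;
    BlockNumber blkno;

    /* pg_prewarm style: the caller knows the block count */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        buf = read_stream_next_buffer(stream, NULL);
        /* ... use and release buf ... */
    }
    Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);

    /* read_stream_reset() style: drain until the stream is exhausted */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... use and release buf ... */
    }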

Re: PG_TEST_EXTRA and meson

2024-07-16 Thread Nazir Bilal Yavuz
Hi,

On Wed, 17 Jul 2024 at 00:27, Jacob Champion
 wrote:
>
> On Tue, Jul 16, 2024 at 2:12 PM Nazir Bilal Yavuz  wrote:
> >
> > 2- If PG_TEST_EXTRA is set from the setup command, use it from the
> > setup command and discard the environment variable. If PG_TEST_EXTRA
> > is not set from the setup command, then use it from the environment.
>
> Is there a way for the environment to override the Meson setting
> rather than vice-versa? My vote would be to have both available, but
> with the implementation in patch 2 I'd still have to reconfigure if I
> wanted to change my test setup.

I think something like the attached does the trick. I did not test it
extensively, but it passed the couple of tests I tried.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From f03878870cbbac661bd6f7b483fd1dcf36d5ea38 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 17 Jul 2024 00:54:29 +0300
Subject: [PATCH] Solution 3

---
 meson.build| 6 +-
 meson_options.txt  | 2 +-
 src/tools/testwrap | 5 +
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 5387bb6d5fd..3105b3e6724 100644
--- a/meson.build
+++ b/meson.build
@@ -3224,11 +3224,6 @@ test_env.set('PG_REGRESS', pg_regress.full_path())
 test_env.set('REGRESS_SHLIB', regress_module.full_path())
 test_env.set('INITDB_TEMPLATE', test_initdb_template)
 
-# Test suites that are not safe by default but can be run if selected
-# by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
-# Export PG_TEST_EXTRA so it can be checked in individual tap tests.
-test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
-
 # Add the temporary installation to the library search path on platforms where
 # that works (everything but windows, basically). On windows everything
 # library-like gets installed into bindir, solving that issue.
@@ -3292,6 +3287,7 @@ foreach test_dir : tests
 testwrap,
 '--basedir', meson.build_root(),
 '--srcdir', test_dir['sd'],
+'--pg_test_extra', get_option('PG_TEST_EXTRA'),
   ]
 
   foreach kind, v : test_dir
diff --git a/meson_options.txt b/meson_options.txt
index 246cecf3827..042fd22cebe 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -47,7 +47,7 @@ option('injection_points', type: 'boolean', value: false,
   description: 'Enable injection points')
 
 option('PG_TEST_EXTRA', type: 'string', value: '',
-  description: 'Enable selected extra tests')
+  description: 'Enable selected extra tests, please note that this can be overridden by the PG_TEST_EXTRA environment variable')
 
 option('atomics', type: 'boolean', value: true,
   description: 'Use atomic operations')
diff --git a/src/tools/testwrap b/src/tools/testwrap
index 9a270beb72d..389a7f6a113 100755
--- a/src/tools/testwrap
+++ b/src/tools/testwrap
@@ -13,6 +13,7 @@ parser.add_argument('--basedir', help='base directory of test', type=str)
 parser.add_argument('--testgroup', help='test group', type=str)
 parser.add_argument('--testname', help='test name', type=str)
 parser.add_argument('--skip', help='skip test (with reason)', type=str)
+parser.add_argument('--pg_test_extra', help='extra tests', type=str)
 parser.add_argument('test_command', nargs='*')
 
 args = parser.parse_args()
@@ -41,6 +42,10 @@ env_dict = {**os.environ,
 'TESTDATADIR': os.path.join(testdir, 'data'),
 'TESTLOGDIR': os.path.join(testdir, 'log')}
 
+# If PG_TEST_EXTRA is not set in the environment, then use it from meson
+if "PG_TEST_EXTRA" not in os.environ:
+    env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
+
 sp = subprocess.Popen(args.test_command, env=env_dict, stdout=subprocess.PIPE)
 # Meson categorizes a passing TODO test point as bad
 # (https://github.com/mesonbuild/meson/issues/13183).  Remove the TODO
-- 
2.45.2



Re: PG_TEST_EXTRA and meson

2024-07-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jul 2024 at 09:30, Ashutosh Bapat
 wrote:
>
> In order to run these tests, we have to run meson setup again. There
> are couple of problems with this
> 1. It's not clear why the tests were skipped. Also not clear that we
> have to run meson setup again - from the output alone
> 2. Running meson setup again is costly, every time we have to run a
> new test from PG_TEST_EXTRA.
> 3. Always configuring meson with PG_TEST_EXTRA means we will run heavy
> tests every time meson test is run. I didn't find a way to not run
> these tests as part of meson test once configured this way.
>
> We should either allow make like behaviour or provide a way to not run
> these tests even if configured that way.

I have two quick solutions to this:

1- Behave more like make: PG_TEST_EXTRA can currently be set from the
setup command; remove that ability so it can be set only from the
environment, and then use it from the environment.

2- If PG_TEST_EXTRA is set from the setup command, use it from the
setup command and discard the environment variable. If PG_TEST_EXTRA
is not set from the setup command, then use it from the environment.

I hope these patches help.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 5eabb10189b635ddf1969a6f85ba80d70fec3a83 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 17 Jul 2024 00:01:48 +0300
Subject: [PATCH] Solution 1

---
 meson.build   | 5 -
 meson_options.txt | 3 ---
 2 files changed, 8 deletions(-)

diff --git a/meson.build b/meson.build
index 5387bb6d5fd..4fbbac0897f 100644
--- a/meson.build
+++ b/meson.build
@@ -3224,11 +3224,6 @@ test_env.set('PG_REGRESS', pg_regress.full_path())
 test_env.set('REGRESS_SHLIB', regress_module.full_path())
 test_env.set('INITDB_TEMPLATE', test_initdb_template)
 
-# Test suites that are not safe by default but can be run if selected
-# by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
-# Export PG_TEST_EXTRA so it can be checked in individual tap tests.
-test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
-
 # Add the temporary installation to the library search path on platforms where
 # that works (everything but windows, basically). On windows everything
 # library-like gets installed into bindir, solving that issue.
diff --git a/meson_options.txt b/meson_options.txt
index 246cecf3827..0eaf7890521 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -46,9 +46,6 @@ option('tap_tests', type: 'feature', value: 'auto',
 option('injection_points', type: 'boolean', value: false,
   description: 'Enable injection points')
 
-option('PG_TEST_EXTRA', type: 'string', value: '',
-  description: 'Enable selected extra tests')
-
 option('atomics', type: 'boolean', value: true,
   description: 'Use atomic operations')
 
-- 
2.45.2

From 16170a9e48db00eaedc25a6040cc44ec30fffe63 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 16 Jul 2024 23:56:59 +0300
Subject: [PATCH] Solution 2

---
 meson.build | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index 5387bb6d5fd..a3a95372e2c 100644
--- a/meson.build
+++ b/meson.build
@@ -3227,7 +3227,9 @@ test_env.set('INITDB_TEMPLATE', test_initdb_template)
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable PG_TEST_EXTRA.
 # Export PG_TEST_EXTRA so it can be checked in individual tap tests.
-test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
+if get_option('PG_TEST_EXTRA') != ''
+  test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
+endif
 
 # Add the temporary installation to the library search path on platforms where
 # that works (everything but windows, basically). On windows everything
-- 
2.45.2



Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-16 Thread Nazir Bilal Yavuz
Hi,

Thank you for the review!

On Fri, 12 Jul 2024 at 02:52, Noah Misch  wrote:
>
> On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> > I am working on using read streams in the CREATE DATABASE command when the
> > strategy is wal_log. RelationCopyStorageUsingBuffer() function is used in
> > this context. This function reads source buffers then copies them to the
>
> Please rebase.  I applied this to 40126ac for review purposes.

Rebased.

> > Subject: [PATCH v1 1/3] Refactor PinBufferForBlock() to remove if checks 
> > about
> >  persistence
> >
> > There are if checks in PinBufferForBlock() function to set persistence
> > of the relation and this function is called for the each block in the
> > relation. Instead of that, set persistence of the relation before
> > PinBufferForBlock() function.
>
> I tried with the following additional patch to see if PinBufferForBlock() ever
> gets invalid smgr_relpersistence:
>
> 
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1098,6 +1098,11 @@ PinBufferForBlock(Relation rel,
>
> Assert(blockNum != P_NEW);
>
> +   if (!(smgr_persistence == RELPERSISTENCE_TEMP ||
> + smgr_persistence == RELPERSISTENCE_PERMANENT ||
> + smgr_persistence == RELPERSISTENCE_UNLOGGED))
> +   elog(WARNING, "unexpected relpersistence %d", 
> smgr_persistence);
> +
> if (smgr_persistence == RELPERSISTENCE_TEMP)
> {
> io_context = IOCONTEXT_NORMAL;
> 
>
> That still gets relpersistence==0 in various src/test/regress cases.  I think
> the intent was to prevent that.  If not, please add a comment about when
> relpersistence==0 is still allowed.

I fixed it; it was caused by the (mode == RBM_ZERO_AND_CLEANUP_LOCK ||
mode == RBM_ZERO_AND_LOCK) case in ReadBuffer_common(). The persistence
was not updated on this path before. I also added an assert check for
this problem to PinBufferForBlock().
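
The assert is along the lines of the condition in your WARNING patch
above (the committed form may differ slightly):

    Assert(smgr_persistence == RELPERSISTENCE_TEMP ||
           smgr_persistence == RELPERSISTENCE_PERMANENT ||
           smgr_persistence == RELPERSISTENCE_UNLOGGED);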

> > --- a/src/backend/storage/aio/read_stream.c
> > +++ b/src/backend/storage/aio/read_stream.c
> > @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
> >   {
> >   stream->ios[i].op.rel = rel;
> >   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> > - stream->ios[i].op.smgr_persistence = 0;
> > + stream->ios[i].op.smgr_persistence = 
> > rel->rd_rel->relpersistence;
>
> Does the following comment in ReadBuffersOperation need an update?
>
> /*
>  * The following members should be set by the caller.  If only smgr is
>  * provided without rel, then smgr_persistence can be set to override 
> the
>  * default assumption of RELPERSISTENCE_PERMANENT.
>  */
>

I believe it does not need to be updated, but I renamed
'ReadBuffersOperation.smgr_persistence' to
'ReadBuffersOperation.persistence', so this comment is updated as
well. I think that rename suits better because the persistence does
not need to come from smgr; it can come from the relation, too. Do you
think it is a good idea? If it is, does it need a separate commit?

> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
>
> > +/*
> > + * Helper struct for read stream object used in
> > + * RelationCopyStorageUsingBuffer() function.
> > + */
> > +struct copy_storage_using_buffer_read_stream_private
> > +{
> > + BlockNumber blocknum;
> > + int64   last_block;
> > +};
>
> Why is last_block an int64, not a BlockNumber?

You are right, the type of last_block should be BlockNumber; done. I
copied it from the pg_prewarm_read_stream_private struct, and I guess
the same fix should be applied there as well, but that is not the topic
of this thread, so I did not update it yet.
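
So the helper struct now looks like this (a sketch of the corrected
definition):

    struct copy_storage_using_buffer_read_stream_private
    {
        BlockNumber blocknum;
        BlockNumber last_block;     /* was int64 */
    };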

> > @@ -4667,19 +4698,31 @@ RelationCopyStorageUsingBuffer(RelFileLocator 
> > srclocator,
>
> >   /* Iterate over each block of the source relation file. */
> >   for (blkno = 0; blkno < nblocks; blkno++)
> >   {
> >   CHECK_FOR_INTERRUPTS();
> >
> >   /* Read block from source relation. */
> > - srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
> > -   
> >  RBM_NORMAL, bstrategy_src,
> > -   
> >  permanent);
> > + srcBuf = read_stream_next_buffer(src_stream, NULL);
> >   LockBuffer(srcBuf, BUFFER_LOCK_SHARE);

Re: CFbot failed on Windows platform

2024-07-11 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jul 2024 at 11:07, Tatsuo Ishii  wrote:
>
> Hi,
>
> A cfbot test suddenly failed with regression test error.
>
> http://cfbot.cputube.org/highlights/all.html#4460
>
> Following the link "regress" shows:
> https://api.cirrus-ci.com/v1/artifact/task/5428792720621568/testrun/build/testrun/recovery/027_stream_regress/data/regression.diffs
>
> diff --strip-trailing-cr -U3 
> C:/cirrus/src/test/regress/expected/collate.windows.win1252.out 
> C:/cirrus/build/testrun/recovery/027_stream_regress/data/results/collate.windows.win1252.out
> --- C:/cirrus/src/test/regress/expected/collate.windows.win1252.out 
> 2024-07-11 02:44:08.385966100 +
> +++ 
> C:/cirrus/build/testrun/recovery/027_stream_regress/data/results/collate.windows.win1252.out
> 2024-07-11 02:49:51.280212100 +
> @@ -21,10 +21,10 @@
>  );
>  \d collate_test1
>  Table "collate_tests.collate_test1"
> - Column |  Type   | Collation | Nullable | Default
> + Column |  Type   | Collation | Nullable | Default
>  +-+---+--+-
> - a  | integer |   |  |
> - b  | text| en_US | not null |
> + a  | integer |   |  |
> + b  | text| en_US | not null |
>  :
>  :
>
> The differences are that the result has an extra space at the end of
> line.  I am not familiar with Windows and has no idea why this could
> happen (I haven't changed the patch set since May 24. The last
> succeeded test was on July 9 14:58:44 (I am not sure about time zone).
> Also I see exactly the same test failures in some other tests (for
> example http://cfbot.cputube.org/highlights/all.html#4337)
>
> Any idea?

I think it is related to commit 628c1d1f2c, which changed the output of
the regress tests on Windows.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: tests fail on windows with default git settings

2024-07-11 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jul 2024 at 17:04, Andrew Dunstan  wrote:
>
>
> On 2024-07-10 We 9:25 AM, Tom Lane wrote:
> > Dave Page  writes:
> >> On Wed, 10 Jul 2024 at 12:12, Andrew Dunstan  wrote:
> >>> As I was looking at this I wondered if there might be anywhere else that
> >>> needed adjustment. One thing that occurred to me was that that maybe we
> >>> should replace the use of "-w" in pg_regress.c with this rather less
> >>> dangerous flag, so instead of ignoring any white space difference we would
> >>> only ignore line end differences. The use of "-w" apparently dates back to
> >>> 2009.
> >> That seems like a good improvement to me.
> > +1
> >
> >
>
>
> OK, done.

It looks like Postgres CI did not like this change. The 'Windows -
Server 2019, VS 2019 - Meson & ninja' task [1] started to fail after
this commit; there is one extra space at the end of a line in the
regress test's output.

[1] https://cirrus-ci.com/task/6753781205958656

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Streaming I/O, vectored I/O (WIP)

2024-07-10 Thread Nazir Bilal Yavuz
Hi,

It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
[1] was only partially applied; the top comment was not updated. The
attached patch just updates it.

[1] 
https://www.postgresql.org/message-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07%40iki.fi

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 76efb37b03b295070dc5259049825fe56460383f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 10 Jul 2024 19:04:51 +0300
Subject: [PATCH] Fix the top comment at read_stream.c

---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 74b9bae6313..238ff448a91 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -51,7 +51,7 @@
  * I/Os that have been started with StartReadBuffers(), and for which
  * WaitReadBuffers() must be called before returning the buffer.
  *
- * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * For example, if the callback returns block numbers 10, 42, 43, 44, 60 in
  * successive calls, then these data structures might appear as follows:
  *
  *  buffers buf/data   ios
-- 
2.45.2



Re: Doc: fix track_io_timing description to mention pg_stat_io

2024-06-27 Thread Nazir Bilal Yavuz
Hi,

On Thu, 27 Jun 2024 at 14:30, Melanie Plageman
 wrote:
>
> On Thu, Jun 27, 2024 at 5:06 AM  wrote:
> >
> > Hi,
> >
> > pg_stat_io has I/O statistics that are collected when track_io_timing is
> > enabled, but it is not mentioned in the description of track_io_timing.
> > I think it's better to add a description of pg_stat_io for easy reference.
> > What do you think?
>
> Seems quite reasonable to me given that track_wal_io_timing mentions
> pg_stat_wal. I noticed that the sentence about track_io_timing in the
> statistics collection configuration section [1] only mentions reads
> and writes -- perhaps it should also mention extends and fsyncs?

Both suggestions look good to me. If what you said gets implemented,
maybe track_wal_io_timing should mention fsyncs too?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CI and test improvements

2024-06-14 Thread Nazir Bilal Yavuz
Hi,

On Thu, 13 Jun 2024 at 14:56, Justin Pryzby  wrote:
>
> ccache should be installed in the image rather than re-installed on each
> invocation.

ccache is installed in the Windows VM images now [1]. It can be used
as 'set CC=ccache.exe cl.exe' in the Windows CI task.

[1] 
https://github.com/anarazel/pg-vm-images/commit/03a9225ac962fb30b5c0722c702941e2d7c1e81e

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CI and test improvements

2024-06-13 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

On Wed, 12 Jun 2024 at 16:10, Justin Pryzby  wrote:
>
> On Fri, Jun 24, 2022 at 08:38:50AM +1200, Thomas Munro wrote:
> > > I've also sent some patches to Thomas for cfbot to help progress some of 
> > > these
> > > patches (code coverage and documentation upload as artifacts).
> > > https://github.com/justinpryzby/cfbot/commits/master
> >
> > Thanks, sorry for lack of action, will get to these soon.
>
> On Tue, Feb 13, 2024 at 01:10:21PM -0600, Justin Pryzby wrote:
> > I sent various patches to cfbot but haven't heard back.
>
> > https://www.postgresql.org/message-id/flat/20220409021853.gp24...@telsasoft.com
> > https://www.postgresql.org/message-id/flat/20220623193125.gb22...@telsasoft.com
> > https://github.com/justinpryzby/cfbot/commits/master
> > https://github.com/macdice/cfbot/pulls
>
> @Thomas: ping
>
> I reintroduced the patch for ccache/windows -- v4.10 supports PCH, which
> can make the builds 2x faster.

I applied 0001 and 0002 to try out ccache support on Windows, but the
build step failed with 'ccache: error: No stats log has been
configured'. Perhaps you forgot to add 'CCACHE_STATSLOG:
$CCACHE_DIR.stats.log' to 0002? After adding that line, CI finished
successfully. I can confirm that the build step takes ~30 seconds now;
it was ~90 seconds before.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-06-13 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav  wrote:
>
> > If possible, let's have all the I/O stats (even for WAL) in
> > pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> > column and then have read_bytes or something like that to know the
> > amount of data read?
>
> The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
> database. It signifies the count of cache hits, or in other words, the
> instances where data was located in the ‘shared_buffers’. As a result,
> keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
> knowledge about the buffer cache’s efficiency and assist users in
> making educated choices when fine-tuning their database. However, if
> we include the hit count of WAL buffers in this, it may lead to
> misleading interpretations for database tuning. If there’s something
> I’ve overlooked that’s already been discussed, please feel free to
> correct me.

I think counting them as hits makes sense: we read the data from WAL
buffers instead of reading it from disk, and since WAL buffers are
stored in shared memory, I believe they can be counted as hits in the
shared buffers. Could you please explain a bit more how this change can
'lead to misleading interpretations for database tuning'?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-06-13 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this! And, sorry for the late answer.

On Mon, 13 May 2024 at 17:12, Bharath Rupireddy
 wrote:
>
> On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz  wrote:
> >
> > > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > > calls to read variable bytes, only the WAL receiver was reading
> > > variable bytes before.
> >
> > I want to start working on this again if possible. I will try to
> > summarize the current status:
>
> Thanks for working on this.
>
> > * With the 73f0a13266 commit, the WALRead() function started to read
> > variable bytes in every case. Before, only the WAL receiver was
> > reading variable bytes.
> >
> > * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> > discussing what we have to do when this is merged. It is decided that
> > WALReadFromBuffers() does not call pgstat_report_wait_start() because
> > this function does not perform any IO [1]. We may follow the same
> > logic by not including these to pg_stat_io?
>
> Right. WALReadFromBuffers doesn't do any I/O.
>
> Whoever reads WAL from disk (backends, walsenders, recovery process)
> using pg_pread (XLogPageRead, WALRead) needs to be tracked in
> pg_stat_io or some other view. If it were to be in pg_stat_io,
> although we may not be able to distinguish WAL read stats at a backend
> level (like how many times/bytes a walsender or recovery process or a
> backend read WAL from disk), but it can help understand overall impact
> of WAL read I/O at a cluster level. With this approach, the WAL I/O
> stats are divided up - WAL read I/O and write I/O stats are in
> pg_stat_io and pg_stat_wal respectively.
>
> This makes me think if we need to add WAL read I/O stats also to
> pg_stat_wal. Then, we can also add WALReadFromBuffers stats
> hits/misses there. With this approach, pg_stat_wal can be a one-stop
> view for all the WAL related stats. If needed, we can join info from
> pg_stat_wal to pg_stat_io in system_views.sql so that the I/O stats
> are emitted to the end-user via pg_stat_io.

I agree that the ultimate goal is seeing WAL I/O stats from one place.
There is a reply to this from Amit:

On Tue, 28 May 2024 at 03:48, Amit Kapila  wrote:
>
> If possible, let's have all the I/O stats (even for WAL) in
> pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> column and then have read_bytes or something like that to know the
> amount of data read?

I think it is better to have all the I/O stats in pg_stat_io, like Amit
said. It also makes sense to me to show 'WAL data we get from buffers'
in the hits column, since instead of doing I/O from disk we get the
data directly from WAL buffers. I think that fits the explanation of
the hits column in pg_stat_io, which is 'The number of times a desired
block was found in a shared buffer' [1].

> > * With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
> > does not block anything related to putting WAL stats in pg_stat_io.
> >
> > If I am not missing any new changes, the only problem is reading
> > variable bytes now. We have discussed a couple of solutions:
> >
> > 1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
> > that this means some variable byte I/O is happening.
> >
> > I kind of dislike this solution because if the *only* read I/O is
> > happening in variable bytes, it will look like write and extend I/Os
> > are happening in variable bytes as well. As a solution, it could be
> > documented that only read I/Os could happen in variable bytes for now.
>
> Yes, read I/O for relation and WAL can happen in variable bytes. I
> think this idea seems reasonable and simple yet useful to know the
> cluster-wide read I/O.

I agree.

> However, another point here is how the total number of bytes read is
> represented with existing pg_stat_io columns 'reads' and 'op_bytes'.
> It is known now with 'reads' * 'op_bytes', but with variable bytes,
> how is read bytes calculated? Maybe add new columns
> read_bytes/write_bytes?
>
> > 2- Use op_bytes_[read | write | extend] columns instead of one
> > op_bytes column, also use the first solution.
> >
> > This can solve the first solution's weakness but it introduces two
> > more columns. This is more future proof compared to the first solution
> > if there is a chance that some variable I/O could happen in write and
> > extend calls as well in the future.
>
> -1 as more columns impact the readability and usability.

I did not understand the overall difference between what you suggested
(adding read_bytes/write_

Re: CREATE DATABASE with filesystem cloning

2024-06-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 21 May 2024 at 15:08, Ranier Vilela  wrote:
>
> Em ter., 21 de mai. de 2024 às 05:18, Nazir Bilal Yavuz  
> escreveu:
>>
>> Hi,
>>
>> On Thu, 16 May 2024 at 17:47, Robert Haas  wrote:
>> >
>> > On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz  
>> > wrote:
>> > > Actually, the documentation for the file_copy_method was mentioning
>> > > the things it controls; I converted it to an itemized list now. Also,
>> > > changed the comment to: 'Further uses of this function requires
>> > > updates to the list that GUC controls in its documentation.'. v7 is
>> > > attached.
>> >
>> > I think the comments need some wordsmithing.
>>
>> I changed it to 'Uses of this function must be documented in the list
>> of places affected by this GUC.', I am open to any suggestions.
>>
>> > I don't see why this parameter should be PGC_POSTMASTER.
>>
>> I changed it to 'PGC_USERSET' now. My initial point was the database
>> or tablespace to be copied with the same method. I thought copying
>> some portion of the database with the copy and rest with the clone
>> could cause potential problems. After a bit of searching, I could not
>> find any problems related to that.
>>
>> v8 is attached.
>
> Hi,
> I did some research on the subject and despite being an improvement,
> I believe that some terminologies should be changed in this patch.
> Although the new function is called *clone_file*, I'm not sure if it really 
> is "clone".
> Why MacOS added an API called clonefile [1] and Linux exists
> another called FICLONE.[2]
> So perhaps it should be treated here as a copy and not a clone?
> Leaving it open, is the possibility of implementing a true clone api?
> Thoughts?

Thank you for mentioning this.

1- I do not know where to look for macOS function definitions, but I
followed the same links you shared. They say copyfile(...,
COPYFILE_CLONE_FORCE) means 'Clone the file instead. This is a force
flag i.e. if cloning fails, an error is returned' [1]. So it does
clone, but I still do not know whether there is any difference between
the copyfile() function with the COPYFILE_CLONE_FORCE flag and the
clonefile() function.

2- I read a couple of discussions about copy_file_range() and FICLONE.
It seems that copy_file_range() is a slightly upgraded version of
FICLONE, but less widely available since it is newer; it still clones
[2]. Also, FICLONE is already used in pg_upgrade and pg_combinebackup.

Both copyfile(..., COPYFILE_CLONE_FORCE) and copy_file_range() do
cloning, so I think using the clone terminology is correct. But using
FICLONE instead of copy_file_range() could be better, since it is more
widely available.

[1] https://www.unix.com/man-page/mojave/3/copyfile/
[2] https://elixir.bootlin.com/linux/v5.15-rc5/source/fs/read_write.c#L1495
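
For reference, the two cloning paths look roughly like this sketch
(error handling simplified; srcfd/dstfd are assumed already-open file
descriptors, so this is not the exact patch code):

    #if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
        /* macOS: clone the file; fail rather than fall back to copying */
        if (copyfile(fromfile, tofile, NULL, COPYFILE_CLONE_FORCE) < 0)
            elog(ERROR, "could not clone file \"%s\" to \"%s\": %m",
                 fromfile, tofile);
    #elif defined(HAVE_COPY_FILE_RANGE)
        /* Linux/FreeBSD: kernel may share block ranges on CoW file systems */
        ssize_t     nbytes;

        do
        {
            nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, SSIZE_MAX, 0);
            if (nbytes < 0)
                elog(ERROR, "could not clone file \"%s\" to \"%s\": %m",
                     fromfile, tofile);
        } while (nbytes > 0);
    #endif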

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: meson and check-tests

2024-06-02 Thread Nazir Bilal Yavuz
Hi,

On Sun, 2 Jun 2024 at 07:06, jian he  wrote:
>
> On Sun, Jun 2, 2024 at 4:48 AM Tristan Partin  wrote:
> >
> > On Fri May 31, 2024 at 12:02 PM CDT, Ashutosh Bapat wrote:
> > > Hi Tristan,
> > > Using make I can run only selected tests under src/test/regress using
> > > TESTS=... make check-tests. I didn't find any way to do that with meson.
> > > meson test --suite regress runs all the regression tests.
> > >
> > > We talked this off-list at the conference. It seems we have to somehow
> > > avoid passing pg_regress --schedule argument and instead pass the list of
> > > tests. Any idea how to do that?
> >
> > I think there are 2 solutions to this.
> >
> > 1. Avoid passing --schedule by default, which doesn't sound like a great
> >solution.
> >
> > 2. Teach pg_regress to ignore the --schedule option if specific tests
> >are passed instead.
> >
> > 3. Add a --no-schedule option to pg_regress which would override the
> >previously added --schedule option.
> >
> > I personally prefer 2 or 3.
> >
> > 2: meson test -C build regress/regress --test-args my_specific_test
> > 3: meson test -C build regress/regress --test-args "--no-schedule 
> > my_specific_test"
> >
> > Does anyone have an opinion?
> >
>
> if each individual sql file can be tested separately, will
> `test: test_setup`?
> be aslo executed as a prerequisite?

Yes, it is required to run at least two setup tests, because regress
tests require a database to be created:

1- The initdb executable is needed; it is installed by the
'tmp_install' test.

2- Although the initdb executable exists, it is not enough by itself.
Regress tests copy their database contents from the template directory
instead of running initdb for each test [1] (this is the default
behaviour of the regress tests in meson builds). This template
directory is created by the 'initdb_cache' test.

[1] 252dcb3239

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: zlib detection in Meson on Windows broken?

2024-05-22 Thread Nazir Bilal Yavuz
Hi,

On Wed, 22 May 2024 at 17:21, Dave Page  wrote:
>
> Hi
>
> On Wed, 22 May 2024 at 14:11, Nazir Bilal Yavuz  wrote:
>>
>>
>> I tried to install your latest zlib artifact (nmake one) to the
>> Windows CI images (not the official ones) [1]. Then, I used the
>> default meson.build file to build but meson could not find the zlib.
>> After that, I modified it like you suggested before; I used a
>> 'cc.find_library()' to find zlib as a fallback method and it seems it
>> worked [2]. Please see meson setup logs below [3], does something
>> similar to the attached solve your problem?
>
>
> That patch does solve my problem - thank you!

I am glad that it worked!

Do you think we need to have this patch in upstream Postgres? I am not
sure, because:
- There is a case where meson is able to find zlib but the tests fail.
- This might be a band-aid rather than a permanent fix.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: zlib detection in Meson on Windows broken?

2024-05-22 Thread Nazir Bilal Yavuz
Hi,

On Tue, 21 May 2024 at 18:24, Dave Page  wrote:
>
>
>
> On Tue, 21 May 2024 at 16:04, Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2024-05-20 11:58:05 +0100, Dave Page wrote:
>> > I have very little experience with Meson, and even less interpreting it's
>> > logs, but it seems to me that it's not including the extra lib and include
>> > directories when it runs the test compile, given the command line it's
>> > reporting:
>> >
>> > cl C:\Users\dpage\git\postgresql\build\meson-private\tmpg_h4xcue\testfile.c
>> > /nologo /showIncludes /utf-8 /EP /nologo /showIncludes /utf-8 /EP /Od /Oi-
>> >
>> > Bug, or am I doing something silly?
>>
>> It's a buglet. We rely on meson's internal fallback detection of zlib, if 
>> it's
>> not provided via pkg-config or cmake. But it doesn't know about our
>> extra_include_dirs parameter. We should probably fix that...
>
>
> Oh good, then I'm not going bonkers. I'm still curious about how it works for 
> Andrew but not me, however fixing that buglet should solve my issue, and 
> would be sensible behaviour.
>
> Thanks!

I tried to install your latest zlib artifact (the nmake one) on the
Windows CI images (not the official ones) [1]. Then I used the default
meson.build file to build, but meson could not find zlib. After that, I
modified it as you suggested before: I used 'cc.find_library()' to find
zlib as a fallback method, and it seems to have worked [2]. Please see
the meson setup logs below [3]; does something similar to the attached
solve your problem?

Interestingly, I also tried this 'cc.find_library' method with your old
artifact (the cmake one): it was able to find zlib, but all tests
failed [4].

Experimental zlib meson.build diff is attached.

[1] https://cirrus-ci.com/task/6736867247259648
[2] https://cirrus-ci.com/build/5286228755480576
[3]
Run-time dependency zlib found: NO (tried pkgconfig, cmake and system)
Has header "zlib.h" : YES
Library zlib found: YES
...
  External libraries
...
zlib   : YES
...
[4] https://cirrus-ci.com/task/5208433811521536

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/meson.build b/meson.build
index f5ca5cfed49..d89a7a3e277 100644
--- a/meson.build
+++ b/meson.build
@@ -1373,20 +1373,23 @@ endif
 zlibopt = get_option('zlib')
 zlib = not_found_dep
 if not zlibopt.disabled()
-  zlib_t = dependency('zlib', required: zlibopt)
+  zlib = dependency('zlib', required: false)
 
-  if zlib_t.type_name() == 'internal'
-# if fallback was used, we don't need to test if headers are present (they
-# aren't built yet, so we can't test)
-zlib = zlib_t
-  elif not zlib_t.found()
-warning('did not find zlib')
-  elif not cc.has_header('zlib.h',
-  args: test_c_args, include_directories: postgres_inc,
-  dependencies: [zlib_t], required: zlibopt)
-warning('zlib header not found')
-  else
-zlib = zlib_t
+  if zlib.found() and zlib.type_name() != 'internal'
+if not cc.has_header('zlib.h',
+args: test_c_args, include_directories: postgres_inc,
+dependencies: zlib, required: false)
+  zlib = not_found_dep
+endif
+  elif not zlib.found()
+zlib_lib = cc.find_library('zlib',
+  dirs: test_lib_d,
+  header_include_directories: postgres_inc,
+  has_headers: ['zlib.h'],
+  required: zlibopt)
+if zlib_lib.found()
+  zlib = declare_dependency(dependencies: zlib_lib, include_directories: postgres_inc)
+endif
   endif
 
   if zlib.found()


Re: zlib detection in Meson on Windows broken?

2024-05-21 Thread Nazir Bilal Yavuz
Hi,

On Tue, 21 May 2024 at 10:20, Sandeep Thakkar
 wrote:
>
> Hi Dave,
>
> Is the .pc file generated after the successful build of zlib? If yes, then 
> meson should be able to detect the installation ideally

If meson is not able to find the .pc file automatically, using 'meson
setup ... --pkg-config-path $ZLIB_PC_PATH' might help.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-21 Thread Nazir Bilal Yavuz
Hi,

On Thu, 16 May 2024 at 17:47, Robert Haas  wrote:
>
> On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz  wrote:
> > Actually, the documentation for the file_copy_method was mentioning
> > the things it controls; I converted it to an itemized list now. Also,
> > changed the comment to: 'Further uses of this function requires
> > updates to the list that GUC controls in its documentation.'. v7 is
> > attached.
>
> I think the comments need some wordsmithing.

I changed it to 'Uses of this function must be documented in the list
of places affected by this GUC.' I am open to any suggestions.

> I don't see why this parameter should be PGC_POSTMASTER.

I changed it to 'PGC_USERSET' now. My initial intent was for the whole
database or tablespace to be copied with the same method; I thought
copying some portion of the database with copy and the rest with clone
could cause problems. After a bit of searching, I could not find any
problems related to that.

v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 54dac96e493f482581c09ec14e67196e73e371ca Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 21 May 2024 10:19:29 +0300
Subject: [PATCH v8] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if system
supports it).

If the CLONE method is chosen, it behaves like COPY but attempts to use
efficient file copying system calls.  The kernel then has the
opportunity to share block ranges in copy-on-write file systems, or
perhaps push the copy down to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Reviewed-by: Robert Haas 
Reviewed-by: Ranier Vilela 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 98 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  4 +
 doc/src/sgml/config.sgml  | 46 +
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..6d8f05199ca 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,42 @@
 #include <fcntl.h>
 #include <unistd.h>
 
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
+#include "utils/guc.h"
+
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const struct config_enum_entry file_copy_method_options[] = {
+	{"copy", FILE_COPY_METHOD_COPY, false},
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
+	{"clone", FILE_COPY_METHOD_CLONE, false},
+#endif
+	{NULL, 0, false}
+};
+
+static void clone_file(const char *fromfile, const char *tofile);
 
 /*
  * copydir: copy a directory
  *
  * If recurse is false, subdirectories are ignored.  Anything that's not
  * a directory or a regular file is ignored.
+ *
+ * This function uses the file_copy_method GUC to determine the copy method.
+ * Uses of this function must be documented in the list of places
+ * affected by this GUC.
  */
 void
 copydir(const char *fromdir, const char *todir, bool recurse)
@@ -71,7 +96,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+		{
+			if (file_copy_method == FILE_COPY_METHOD_CLONE)
+clone_file(fromfile, tofile);
+			else
+copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -214,3 +244,69 @@ copy_file(const char *fromfile, const char *tofile)
 
 	pfree(buffer);
 }
+
+/*
+ * clone one file
+ */
+static void
+clone_file(const char *fromfile, const char *tofile)
+{
+

Re: Requiring LLVM 14+ in PostgreSQL 18

2024-05-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 16 May 2024 at 05:34, Thomas Munro  wrote:
>
> On Wed, May 15, 2024 at 5:20 PM Peter Eisentraut  wrote:
> > Yes, let's get that v3-0001 patch into PG17.
>
> Done.
>
> Bilal recently created the CI images for Debian Bookworm[1].  You can
> try them with s/bullseye/bookworm/ in .cirrus.tasks.yml, but it looks
> like he is still wrestling with a perl installation problem[2] in the
> 32 bit build, so here is a temporary patch to do that and also delete
> the 32 bit tests for now.  This way cfbot should succeed with the
> remaining patches.  Parked here for v18.

Actually, 32-bit builds are working, but the Perl version needs to be
updated to 'perl5.36-i386-linux-gnu' in .cirrus.tasks.yml. I updated
0001 with the working 32-bit build configuration [1]; the rest is the
same. All tests pass now [2].

[1] 
postgr.es/m/CAN55FZ0fY5EFHXLKCO_=p4pwfmhrovom_qse_7b48gpchfa...@mail.gmail.com
[2] https://cirrus-ci.com/task/4969910856581120

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 976d2c7ad0e470b24875ee27171359f54078a761 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 13 May 2024 10:56:28 +0300
Subject: [PATCH v5 1/3] Upgrade Debian CI images to Bookworm

A new Debian version, Bookworm, has been released. Use the new images
in CI tasks.

The Perl version is upgraded in the Bookworm images, so update the Perl
version in the 'Linux - Debian Bookworm - Meson' task as well.

Upgrading Debian CI images to Bookworm PR: https://github.com/anarazel/pg-vm-images/pull/91
---
 .cirrus.tasks.yml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index a2388cd5036..47a60aa7c6f 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -65,7 +65,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 8
 TEST_JOBS: 8
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir
 # no options enabled, should be small
 CCACHE_MAXSIZE: "150M"
@@ -243,7 +243,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
 DEBUGINFOD_URLS: "https://debuginfod.debian.net";
@@ -314,7 +314,7 @@ task:
 #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
 
   matrix:
-- name: Linux - Debian Bullseye - Autoconf
+- name: Linux - Debian Bookworm - Autoconf
 
   env:
 SANITIZER_FLAGS: -fsanitize=address
@@ -348,7 +348,7 @@ task:
   on_failure:
 <<: *on_failure_ac
 
-- name: Linux - Debian Bullseye - Meson
+- name: Linux - Debian Bookworm - Meson
 
   env:
 CCACHE_MAXSIZE: "400M" # tests two different builds
@@ -375,7 +375,7 @@ task:
 ${LINUX_MESON_FEATURES} \
 -Dllvm=disabled \
 --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
--DPERL=perl5.32-i386-linux-gnu \
+-DPERL=perl5.36-i386-linux-gnu \
 -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 build-32
 EOF
@@ -652,7 +652,7 @@ task:
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.43.0

From a18120ace64bcde41c2ed23a050eb86f9b8772e0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 19 Oct 2023 04:45:46 +1300
Subject: [PATCH v5 2/3] jit: Require at least LLVM 14, if enabled.

Remove support for LLVM versions 10-13.  The default on all non-EOL'd
OSes represented in our build farm will be at least LLVM 14 when
PostgreSQL 18 ships.

Reviewed-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
---
 src/backend/jit/llvm/llvmjit.c  | 101 
 src/backend/jit/llvm/llvmjit_error.cpp  |  25 --
 src/backend/jit/llvm/llvmjit_inline.cpp |  13 ---
 src/backend/jit/llvm/llvmjit_wrap.cpp   |   4 -
 config/llvm.m4  |   4 +-
 doc/src/sgml/installation.sgml  |   4 +-
 configure   |   4 +-
 meson.build |   2 +-
 8 files changed, 7 insertions(+), 150 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 1d439f24554..8f9c77eedc1 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -21,13 +21,9 @@
 #if LLVM_VERSION_MAJOR > 16
 #include <llvm-c/Transforms/PassBuilder.h>
 #endif
-#if LLVM_VERSION_MAJOR > 11
 #include <llvm-c/Orc.h>
 #include <llvm-c/OrcEE.h>
 #include <llvm-c/LLJIT.h>
-#else
-#include <llvm-c/OrcBindings.h>
-#endif
 #include <llvm-c/Support.h>
 #include <llvm-c/Target.h>
 #if LLVM_VERSION_MAJOR < 17
@@ -50,13 +46,8 @@
 /* Handle of a module emitted via ORC JIT */
 typedef struct LLVMJitHandle
 {
-#if LLVM_VERSION_MAJOR > 11
 	LLVMOrcLLJITRef ll

Re: CREATE DATABASE with filesystem cloning

2024-05-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 16 May 2024 at 15:40, Robert Haas  wrote:
>
> On Thu, May 16, 2024 at 8:35 AM Nazir Bilal Yavuz  wrote:
> > I updated the documentation and put a comment on top of the copydir()
> > function to inform that further changes and uses of this function may
> > require documentation updates.
>
> I was assuming that the documentation for the file_copy_method was
> going to list the things that it controlled, and that the comment was
> going to say that you should update that list specifically. Just
> saying that you may need to update some part of the documentation in
> some way is fairly useless, IMHO.

Actually, the documentation for file_copy_method was already mentioning
the things it controls; I converted it to an itemized list now. Also, I
changed the comment to: 'Further uses of this function require updates
to the list that the GUC controls in its documentation.' v7 is
attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 197ce921c20d49b524e306bf082701434b71b7af Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 16 May 2024 16:21:46 +0300
Subject: [PATCH v7] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if system
supports it).

If the CLONE option is chosen, it behaves like COPY but attempts to use
efficient file copying system calls.  The kernel then has the
opportunity to share block ranges in copy-on-write file systems, or
perhaps push the copy down to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 98 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 +
 doc/src/sgml/config.sgml  | 48 +
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..651c74672e0 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,42 @@
 #include <fcntl.h>
 #include <unistd.h>
 
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
+#include "utils/guc.h"
+
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const struct config_enum_entry file_copy_method_options[] = {
+	{"copy", FILE_COPY_METHOD_COPY, false},
+#if (defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)) || defined(HAVE_COPY_FILE_RANGE)
+	{"clone", FILE_COPY_METHOD_CLONE, false},
+#endif
+	{NULL, 0, false}
+};
+
+static void clone_file(const char *fromfile, const char *tofile);
 
 /*
  * copydir: copy a directory
  *
  * If recurse is false, subdirectories are ignored.  Anything that's not
  * a directory or a regular file is ignored.
+ *
+ * This function uses the file_copy_method GUC to determine the copy method.
+ * Further uses of this function require updates to the list that the GUC
+ * controls in its documentation.
  */
 void
 copydir(const char *fromdir, const char *todir, bool recurse)
@@ -71,7 +96,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+		{
+			if (file_copy_method == FILE_COPY_METHOD_CLONE)
+clone_file(fromfile, tofile);
+			else
+copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -214,3 +244,69 @@ copy_file(const char *fromfile, const char *tofile)
 
 	pfree(buffer);
 }
+
+/*
+ * clone one file
+ */
+static void
+clone_file(const char *fromfile, const char *tofile)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(fromfile, tofile, NULL, COPYFILE_CLONE_FORCE) < 0)
+		ereport(ERROR

Re: CREATE DATABASE with filesystem cloning

2024-05-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 9 May 2024 at 19:29, Robert Haas  wrote:
>
> On Wed, May 8, 2024 at 10:34 AM Nazir Bilal Yavuz  wrote:
> > > I'm not so sure about the GUC name. On the one hand, it feels like
> > > createdb should be spelled out as create_database, but on the other
> > > hand, the GUC name is quite long already. Then again, why would we
> > > make this specific to CREATE DATABASE in the first place? Would we
> > > also want alter_tablespace_file_copy_method and
> > > basic_archive.file_copy_method?
> >
> > I agree that it is already quite long, because of that I chose the
> > createdb as a prefix. I did not think that file cloning was planned to
> > be used in other places. If that is the case, does something like
> > 'preferred_copy_method' work? Then, we would mention which places will
> > be affected with this GUC in the docs.
>
> I would go with file_copy_method rather than preferred_copy_method,
> because (1) there's nothing "preferred" about it if we're using it
> unconditionally and (2) it's nice to clarify that we're talking about
> copying files rather than anything else.

Done.

>
> My personal enthusiasm for making platform-specific file copy methods
> usable all over PostgreSQL is quite limited. However, it is my
> observation that other people seem far more enthusiastic about it than
> I am. For example, consider how quickly it got added to
> pg_combinebackup. So, I suspect it's smart to plan on anything we add
> in this area getting used in a bunch of places. And perhaps it is even
> best to think about making it work in all of those places right from
> the start. If we build support into copydir and copy_file, then we
> just get everything that uses those, and all that remains is to
> document was is covered (and add comments so that future patch authors
> know they should further update the documentation).

I updated the documentation and put a comment on top of the copydir()
function to note that further changes and uses of this function may
require documentation updates.

I also changed the O_RDWR flag to O_WRONLY in the clone_file() function
based on Ranier's feedback. Also, does this feature need tests? I
thought about possible test cases, but since this feature requires
specific file systems to run, I could not come up with any.
v6 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 33fcb21730683c628f81451e21a0cf2455fd41be Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 16 May 2024 15:22:46 +0300
Subject: [PATCH v6] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if system
supports it).

If the CLONE option is chosen, it behaves like COPY but attempts to use
efficient file copying system calls.  The kernel then has the
opportunity to share block ranges in copy-on-write file systems, or
perhaps push the copy down to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 98 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 +
 doc/src/sgml/config.sgml  | 35 +++
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 165 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..48e756ce284 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,42 @@
 #include <fcntl.h>
 #include <unistd.h>
 
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
+#include "utils/guc.h"
+
+/* GUCs */
+int			file_copy_method 

Re: Use streaming read API in ANALYZE

2024-05-15 Thread Nazir Bilal Yavuz
Hi,

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
> >
> > Pushed.  Thanks Bilal and reviewers!
>
> I wanted to discuss what will happen to this patch now that
> 27bc1772fc8 is reverted. I am continuing this thread but I can create
> another thread if you prefer so.

041b96802ef is discussed in the 'Table AM Interface Enhancements'
thread [1]. The main problem discussed about this commit is that the
read stream API is not pushed down to the heap-specific code and,
because of that, the other AM implementations need to use read streams.
To push the read stream API down to the heap-specific code, it is
pretty much required to pass BufferAccessStrategy and BlockSamplerData
to initscan().

I am sharing the alternative version of this patch. The first patch
just reverts 041b96802ef and the second patch is the alternative
version.

In this alternative version, the read stream API is not pushed down to
the heap-specific code, but it is controlled by the heap-specific code.
An SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
heap-specific code if the scan type is 'ANALYZE'. This flag decides
whether the streaming API will be used in ANALYZE. If the flag is set,
heap AM and the read stream API are used; if it is not set, the code
falls back to the version before read streams.

Pros of the alternative version:

* The existing AM implementations other than heap can continue to work
without any change.
* AM implementations other than heap do not need to use read streams.
* Upstream code uses the read stream API and benefits from that.

Cons of the alternative version:

* Six 'if' branches are added to the acquire_sample_rows() function,
and three of them are in the while loop.
* Because of these changes, the code looks messy.

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 323f28ff979cde8e4dbde8b4654bded74abf1fbc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 15 May 2024 00:03:56 +0300
Subject: [PATCH v13 1/2] Revert "Use streaming I/O in ANALYZE."

This commit reverts 041b96802ef.

041b96802ef revised the changes on 27bc1772fc8 but 27bc1772fc8 and
dd1f6b0c172 are reverted together in 6377e12a5a5. So, this commit
reverts all 27bc1772fc, 041b96802ef and dd1f6b0c172 together.

Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/include/access/tableam.h | 26 +++
 src/backend/access/heap/heapam_handler.c | 38 +-
 src/backend/commands/analyze.c   | 96 ++--
 3 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd5..e08b9627f30 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,7 +21,6 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
-#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -655,16 +654,6 @@ typedef struct TableAmRoutine
 	struct VacuumParams *params,
 	BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze the next block in the read stream.  Returns false if
-	 * the stream is exhausted and true otherwise. The scan must have been
-	 * started with SO_TYPE_ANALYZE option.
-	 *
-	 * This routine holds a buffer pin and lock on the heap page.  They are
-	 * held until heapam_scan_analyze_next_tuple() returns false.  That is
-	 * until all the items of the heap page are analyzed.
-	 */
-
 	/*
 	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
 	 * with table_beginscan_analyze().  See also
@@ -683,7 +672,8 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-			ReadStream *stream);
+			BlockNumber blockno,
+			BufferAccessStrategy bstrategy);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1721,17 +1711,19 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 }
 
 /*
- * Prepare to analyze the next block in the read stream. The scan needs to
- * have been  started with table_beginscan_analyze().  Note that this routine
- * might acquire resources like locks that are held until
+ * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
+ * started with table_beginscan_analyze().  Note that this routine might
+ * acquire resources like locks that are held until
  * table_scan_analyze_next_tuple() returns false.
  *
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static i

Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs

2024-05-15 Thread Nazir Bilal Yavuz
Hi,

On Sun, 12 May 2024 at 14:53, Peter Eisentraut  wrote:
>
> On 14.12.23 14:40, Nazir Bilal Yavuz wrote:
> > On Fri, 6 Oct 2023 at 17:07, Tom Lane  wrote:
> >>
> >> As a quick cross-check, I searched our commit log to see how many
> >> README-only commits there were so far this year.  I found 11 since
> >> January.  (Several were triggered by the latest round of pgindent
> >> code and process changes, so maybe this is more than typical.)
> >>
> >> Not sure what that tells us about the value of changing the CI
> >> logic, but it does seem like it could be worth the one-liner
> >> change needed to teach buildfarm animals to ignore READMEs.
> >
> > I agree that it could be worth implementing this logic on buildfarm animals.
> >
> > In case we want to implement the same logic on the CI, I added a new
> > version of the patch; it skips CI completely if the changes are only
> > in the README files.
>
> I don't see how this could be applicable widely enough to be useful:
>
> - While there are some patches that touch on README files, very few of
> those end up in a commit fest.
>
> - If someone manually pushes a change to their own CI environment, I
> don't see why we need to second-guess that.
>
> - Buildfarm runs generally batch several commits together, so it is very
> unlikely that this would be hit.
>
> I think unless some concrete reason for this change can be shown, we
> should drop it.

These points make sense. I thought it was useful in light of Tom's
'11 README-only commits since January' analysis (as of 6 Oct 2023),
but that may not be enough on its own. If there are no objections, I
will withdraw this soon.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Fix parallel vacuum buffer usage reporting

2024-05-13 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 19:09, Masahiko Sawada  wrote:
>
> On Fri, May 10, 2024 at 7:26 PM Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > Thank you for working on this!
> >
> > On Wed, 1 May 2024 at 06:37, Masahiko Sawada  wrote:
> > >
> > > Thank you for further testing! I've pushed the patch.
> >
> > I realized a behaviour change while looking at 'Use pgBufferUsage for
> > block reporting in analyze' thread [1]. Since that change applies here
> > as well, I thought it is better to mention it here.
> >
> > Before this commit, VacuumPageMiss did not count the blocks if its
> > read was already completed by other backends [2]. Now,
> > 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> > counts every block attempted to be read; possibly double counting if
> > someone else has already completed the read.
>
> True. IIUC there is such a difference only in HEAD but not in v15 and
> v16. The following comment in WaitReadBuffers() says that it's a
> traditional behavior that we count blocks as read even if someone else
> has already completed its I/O:
>
> /*
>  * We count all these blocks as read by this backend.  This is traditional
>  * behavior, but might turn out to be not true if we find that someone
>  * else has beaten us and completed the read of some of these blocks.  In
>  * that case the system globally double-counts, but we traditionally don't
>  * count this as a "hit", and we don't have a separate counter for "miss,
>  * but another backend completed the read".
>      */
>
> So I think using pgBufferUsage for (parallel) vacuum is a legitimate
> usage and makes it more consistent with other parallel operations.

That sounds logical. Thank you for the clarification.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Upgrade Debian CI images to Bookworm

2024-05-13 Thread Nazir Bilal Yavuz
Hi,

Bookworm versions of the Debian CI images are available now [0]. The
patches to use these images are attached.

'v1-0001-Upgrade-Debian-CI-images-to-Bookworm_REL_16+.patch' patch can
be applied to both upstream and REL_16 and all of the tasks finish
successfully.

'v1-0001-Upgrade-Debian-CI-images-to-Bookworm_REL_15.patch' patch can
be applied to REL_15 but it gives a compiler warning. The fix for this
warning is proposed here [1]. After the fix is applied, all of the
tasks finish successfully.

Any kind of feedback would be appreciated.

[0] https://github.com/anarazel/pg-vm-images/pull/91

[1] 
postgr.es/m/CAN55FZ0o9wqVoMTh_gJCmj_%2B4XbX9VXzQF8OySPZ0R1saxV3bA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 26f4b2425cb04dc7218142f24f151d3698f33191 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 13 May 2024 10:56:28 +0300
Subject: [PATCH v1] Upgrade Debian CI images to Bookworm

A new Debian version, Bookworm, has been released. Use the new images
in CI tasks.

The Perl version is upgraded in the Bookworm images, so update the
Perl version in the 'Linux - Debian Bookworm - Meson' task as well.

Upgrading Debian CI images to Bookworm PR: https://github.com/anarazel/pg-vm-images/pull/91
---
 .cirrus.tasks.yml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 82261eccb85..d2ff833fcad 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -65,7 +65,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 8
 TEST_JOBS: 8
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir
 # no options enabled, should be small
 CCACHE_MAXSIZE: "150M"
@@ -241,7 +241,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
  DEBUGINFOD_URLS: "https://debuginfod.debian.net"
@@ -312,7 +312,7 @@ task:
 #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
 
   matrix:
-- name: Linux - Debian Bullseye - Autoconf
+- name: Linux - Debian Bookworm - Autoconf
 
   env:
 SANITIZER_FLAGS: -fsanitize=address
@@ -346,7 +346,7 @@ task:
   on_failure:
 <<: *on_failure_ac
 
-- name: Linux - Debian Bullseye - Meson
+- name: Linux - Debian Bookworm - Meson
 
   env:
 CCACHE_MAXSIZE: "400M" # tests two different builds
@@ -373,7 +373,7 @@ task:
 ${LINUX_MESON_FEATURES} \
 -Dllvm=disabled \
 --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
--DPERL=perl5.32-i386-linux-gnu \
+-DPERL=perl5.36-i386-linux-gnu \
 -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 build-32
 EOF
@@ -648,7 +648,7 @@ task:
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.43.0

From d8f4b660bdd408fa22c857e24d1b4d05ff41df03 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 13 May 2024 11:55:48 +0300
Subject: [PATCH v1] Upgrade Debian CI images to Bookworm

A new Debian version, Bookworm, has been released. Use the new images
in CI tasks.

Upgrading Debian CI images to Bookworm PR: https://github.com/anarazel/pg-vm-images/pull/91
---
 .cirrus.tasks.yml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index b03d128184d..a2735f8bb60 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -138,13 +138,13 @@ LINUX_CONFIGURE_FEATURES: &LINUX_CONFIGURE_FEATURES >-
 
 
 task:
-  name: Linux - Debian Bullseye
+  name: Linux - Debian Bookworm
 
   env:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
  DEBUGINFOD_URLS: "https://debuginfod.debian.net"
@@ -451,12 +451,12 @@ task:
 
   # To limit unnecessary work only run this once the normal linux test succeeds
   depends_on:
-- Linux - Debian Bullseye
+- Linux - Debian Bookworm
 
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.43.0



Re: CFbot does not recognize patch contents

2024-05-12 Thread Nazir Bilal Yavuz
Hi,

On Sun, 12 May 2024 at 09:50, Tatsuo Ishii  wrote:
>
> After sending out my v18 patches:
> https://www.postgresql.org/message-id/20240511.162307.2246647987352188848.t-ishii%40sranhm.sra.co.jp
>
> CFbot complains that the patch was broken:
> http://cfbot.cputube.org/patch_48_4460.log
>
> === Applying patches on top of PostgreSQL commit ID 
> 31e8f4e619d9b5856fa2bd5713cb1e2e170a9c7d ===
> === applying patch 
> ./v18-0001-Row-pattern-recognition-patch-for-raw-parser.patch
> gpatch:  Only garbage was found in the patch input.
>
> The patch was generated by git-format-patch (same as previous
> patches).  I failed to find any patch format problem in the
> patch. Does anybody know what's wrong here?

I am able to apply all your patches. I found that a similar thing
happened before [0], and I guess your case is similar. Adding Thomas
to CC; he may be able to help more.

Nitpick: there is a trailing whitespace warning while applying one of your patches:
Applying: Row pattern recognition patch (docs).
.git/rebase-apply/patch:81: trailing whitespace.
 company  |   tdate| price | first_value | max | count

[0] 
postgr.es/m/CA%2BhUKGLiY1e%2B1%3DpB7hXJOyGj1dJOfgde%2BHmiSnv3gDKayUFJMA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 16:55, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Fri, 10 May 2024 at 16:21, Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > On Fri, 10 May 2024 at 14:49, Alena Rybakina  
> > wrote:
> > >
> > > Hi! I could try to check it with the test, but I want to ask you about
> > > details, because I'm not sure that I completely understand the test case.
> > >
> > > You mean that we need to have two backends and on one of them we deleted
> > > the tuples before vacuum called the other, do you?
> > >

There should be some other backend(s) trying to read the same buffers
as the ongoing VACUUM operation. I think this works now, but the
reproduction steps are a bit racy. See:

1- Build Postgres with the attached diff; it is the same
see_previous_output.diff that I shared two mails ago.

2- Run Postgres, all settings are default.

3- Use two client backends, let's name them as A and B client backends.

4- On A client backend, run:

CREATE TABLE vacuum_fix (aid int, bid int) with (autovacuum_enabled=false);
INSERT INTO vacuum_fix SELECT *, * FROM generate_series(1, 2000);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET aid = aid + 1, bid = bid + 1;

5- Now comes the racy part: the SQL commands below need to be run at
the same time. The aim is for the VACUUM on client backend A and the
SELECT on client backend B to read the same buffers simultaneously, so
some of the buffers will be double counted.

First, run VACUUM on client backend A; immediately after starting the
VACUUM, run the SELECT on client backend B.

A client backend:
VACUUM VERBOSE vacuum_fix;

B client backend:
SELECT * from vacuum_fix WHERE aid = -1;

This is the output of the VACUUM VERBOSE on my end:

INFO:  vacuuming "test.public.vacuum_fix"
INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 0
pages: 0 removed, 176992 remain, 176992 scanned (100.00% of total)
...
...
buffer usage: 254181 hits, 99030 misses in the previous version, 99865
misses in the patched version, 106830 dirtied
...
VACUUM
Time: 2578.217 ms (00:02.578)

This VACUUM does not run in parallel, so this test case does not
trigger what is fixed in this thread. As can be seen, there is a
difference of ~1000 buffers.

I am not sure if there is an easier way to reproduce this but I hope this helps.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 84cc983b6e6..582973d575b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -309,6 +309,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	PgStat_Counter startreadtime = 0,
 startwritetime = 0;
 	WalUsage	startwalusage = pgWalUsage;
+	int64		StartPageMiss = VacuumPageMiss;
 	BufferUsage startbufferusage = pgBufferUsage;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
@@ -606,6 +607,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			StringInfoData buf;
 			char	   *msgfmt;
 			int32		diff;
+			int64		PageMissOp = VacuumPageMiss - StartPageMiss;
 			double		read_rate = 0,
 		write_rate = 0;
 
@@ -748,8 +750,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
 			appendStringInfo(&buf,
-			 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
+			 _("buffer usage: %lld hits, %lld misses in the previous version, %lld misses in the patched version, %lld dirtied\n"),
 			 (long long) (bufferusage.shared_blks_hit + bufferusage.local_blks_hit),
+			 (long long) (PageMissOp),
 			 (long long) (bufferusage.shared_blks_read + bufferusage.local_blks_read),
 			 (long long) (bufferusage.shared_blks_dirtied + bufferusage.local_blks_dirtied));
 			appendStringInfo(&buf,


Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 16:21, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Fri, 10 May 2024 at 14:49, Alena Rybakina  wrote:
> >
> > Hi! I could try to check it with the test, but I want to ask you about
> > details, because I'm not sure that I completely understand the test case.
> >
> > You mean that we need to have two backends and on one of them we deleted
> > the tuples before vacuum called the other, do you?
> >
>
> I think triggering a parallel vacuum is enough. I am able to see the
> differences with the following:
>
> You can apply the attached diff file to see the differences between
> the previous version and the patched version. Then, run this query:
>
> CREATE TABLE vacuum_fix (aid int, bid int, cid int) with
> (autovacuum_enabled=false);
> INSERT INTO vacuum_fix SELECT *, *, * FROM generate_series(1, 100);
> CREATE INDEX a_idx on vacuum_fix (aid);
> CREATE INDEX b_idx on vacuum_fix (bid);
> CREATE INDEX c_idx on vacuum_fix (cid);
> VACUUM vacuum_fix;
> UPDATE vacuum_fix SET aid = aid + 1;
> VACUUM (VERBOSE, PARALLEL 2) vacuum_fix ;
>
> After that I saw:
>
> INFO:  vacuuming "test.public.vacuum_fix"
> INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
> INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 1
> ...
> ...
> buffer usage: 29343 hits, 9580 misses in the previous version, 14165
> misses in the patched version, 14262 dirtied
>
> Patched version counts 14165 misses but the previous version counts
> 9580 misses in this specific example.

I am sorry, I showed the wrong thing; that is exactly what is already
fixed by this patch. Actually, I do not know how to trigger the
remaining issue; I am currently looking for a way and will share if
anything comes to mind.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..a6f1df11066 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1355,6 +1355,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	IOContext	io_context;
 	IOObject	io_object;
 	char		persistence;
+	static int  double_counts = 0;
 
 	/*
 	 * Currently operations are only allowed to include a read of some range,
@@ -1426,6 +1427,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			  operation->smgr->smgr_rlocator.locator.relNumber,
 			  operation->smgr->smgr_rlocator.backend,
 			  true);
+			double_counts++;
 			continue;
 		}
 
@@ -1523,6 +1525,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 	}
+	elog(LOG, "Double counts = %d", double_counts);
 }
 
 /*


Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 14:49, Alena Rybakina  wrote:
>
> Hi! I could try to check it with the test, but I want to ask you about
> details, because I'm not sure that I completely understand the test case.
>
> You mean that we need to have two backends and on one of them we deleted
> the tuples before vacuum called the other, do you?
>

I think triggering a parallel vacuum is enough. I am able to see the
differences with the following:

You can apply the attached diff file to see the differences between
the previous version and the patched version. Then, run this query:

CREATE TABLE vacuum_fix (aid int, bid int, cid int) with
(autovacuum_enabled=false);
INSERT INTO vacuum_fix SELECT *, *, * FROM generate_series(1, 100);
CREATE INDEX a_idx on vacuum_fix (aid);
CREATE INDEX b_idx on vacuum_fix (bid);
CREATE INDEX c_idx on vacuum_fix (cid);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET aid = aid + 1;
VACUUM (VERBOSE, PARALLEL 2) vacuum_fix ;

After that I saw:

INFO:  vacuuming "test.public.vacuum_fix"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 1
...
...
buffer usage: 29343 hits, 9580 misses in the previous version, 14165
misses in the patched version, 14262 dirtied

Patched version counts 14165 misses but the previous version counts
9580 misses in this specific example.

--
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 84cc983b6e6..582973d575b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -309,6 +309,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	PgStat_Counter startreadtime = 0,
 startwritetime = 0;
 	WalUsage	startwalusage = pgWalUsage;
+	int64		StartPageMiss = VacuumPageMiss;
 	BufferUsage startbufferusage = pgBufferUsage;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
@@ -606,6 +607,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			StringInfoData buf;
 			char	   *msgfmt;
 			int32		diff;
+			int64		PageMissOp = VacuumPageMiss - StartPageMiss;
 			double		read_rate = 0,
 		write_rate = 0;
 
@@ -748,8 +750,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
 			appendStringInfo(&buf,
-			 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
+			 _("buffer usage: %lld hits, %lld misses in the previous version, %lld misses in the patched version, %lld dirtied\n"),
 			 (long long) (bufferusage.shared_blks_hit + bufferusage.local_blks_hit),
+			 (long long) (PageMissOp),
 			 (long long) (bufferusage.shared_blks_read + bufferusage.local_blks_read),
 			 (long long) (bufferusage.shared_blks_dirtied + bufferusage.local_blks_dirtied));
 			appendStringInfo(&buf,


Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

Thank you for working on this!

On Wed, 1 May 2024 at 06:37, Masahiko Sawada  wrote:
>
> Thank you for further testing! I've pushed the patch.

I noticed a behaviour change while looking at the 'Use pgBufferUsage
for block reporting in analyze' thread [1]. Since that change applies
here as well, I thought it better to mention it here.

Before this commit, VacuumPageMiss did not count blocks whose reads
were already completed by other backends [2]. Now,
'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
counts every block attempted to be read, possibly double counting if
someone else has already completed the read. I do not know which
behaviour is correct, but I wanted to mention this.

[1] 
https://postgr.es/m/CAO6_Xqr__kTTCLkftqS0qSCm-J7_xbRG3Ge2rWhucxQJMJhcRA%40mail.gmail.com

[2] In the WaitReadBuffers() function, see comment:
/*
 * Skip this block if someone else has already completed it.  If an
 * I/O is already in progress in another backend, this will wait for
 * the outcome: either done, or something went wrong and we will
 * retry.
     */

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: gcc 12.1.0 warning

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Tue, 23 Apr 2024 at 19:59, Andres Freund  wrote:
>
>
> Which seems entirely legitimate. ISTM that guc_var_compare() ought to only
> cast the pointers to the key type, i.e. char *.  And incidentally that does
> prevent the warning.
>
> The reason it doesn't happen in newer versions of postgres is that we aren't
> using guc_var_compare() in the relevant places anymore...

The fix is attached. It cleanly applies from REL_15_STABLE to
REL_12_STABLE, fixes the warnings and the tests pass.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 588c99f5c402fc41414702133636e5a51f9e3470 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 10 May 2024 10:43:45 +0300
Subject: [PATCH v1] Fix -Warray-bounds warning in guc_var_compare() function

There are a couple of places that guc_var_compare() function takes
'const char ***' type and then casts it to the
'const struct config_generic *' type. This triggers '-Warray-bounds'
warning. So, instead cast them to the 'const char *' type.

Author: Nazir Bilal Yavuz 
Reported-by: Erik Rijkers 
Suggested-by: Andres Freund 
Discussion: https://postgr.es/m/a74a1a0d-0fd2-3649-5224-4f754e8f91aa%40xs4all.nl
---
 src/backend/utils/misc/guc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c410ba532d2..eec97dea659 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5721,10 +5721,10 @@ find_option(const char *name, bool create_placeholders, bool skip_errors,
 static int
 guc_var_compare(const void *a, const void *b)
 {
-	const struct config_generic *confa = *(struct config_generic *const *) a;
-	const struct config_generic *confb = *(struct config_generic *const *) b;
+	const char *confa_name = **(char **const *) a;
+	const char *confb_name = **(char **const *) b;
 
-	return guc_name_compare(confa->name, confb->name);
+	return guc_name_compare(confa_name, confb_name);
 }
 
 /*
-- 
2.43.0



Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 May 2024 at 16:58, Robert Haas  wrote:
>
> On Tue, May 7, 2024 at 8:00 AM Nazir Bilal Yavuz  wrote:
> > We had an off-list talk with Thomas and we thought making this option
> > GUC instead of SQL command level could solve this problem.
> >
> > I am posting a new rebased version of the patch with some important changes:
> >
> > * 'createdb_file_copy_method' GUC is created. Possible values are
> > 'copy' and 'clone'. Copy is the default option. Clone option can be
> > chosen if the system supports it, otherwise it gives error at the
> > startup.
>
> This seems like a smart idea, because the type of file copying that we
> do during redo need not be the same as what was done when the
> operation was originally performed.
>
> I'm not so sure about the GUC name. On the one hand, it feels like
> createdb should be spelled out as create_database, but on the other
> hand, the GUC name is quite long already. Then again, why would we
> make this specific to CREATE DATABASE in the first place? Would we
> also want alter_tablespace_file_copy_method and
> basic_archive.file_copy_method?

I agree that it is already quite long; because of that I chose
'createdb' as the prefix. I did not know that file cloning was planned
to be used in other places. If that is the case, would something like
'preferred_copy_method' work? Then we would mention in the docs which
places are affected by this GUC.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 May 2024 at 15:23, Ranier Vilela  wrote:
>
> Em qua., 8 de mai. de 2024 às 08:42, Nazir Bilal Yavuz  
> escreveu:
>>
>> Hi,
>>
>> On Wed, 8 May 2024 at 14:16, Ranier Vilela  wrote:
>> >
>> >
>> > Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz 
>> >  escreveu:
>> >>
>> >> Hi Ranier,
>> >>
>> >> Thanks for looking into this!
>> >>
>> >> I am not sure why but your reply does not show up in the thread, so I
>> >> copied your reply and answered it in the thread for visibility.
>> >>
>> >> On Tue, 7 May 2024 at 16:28, Ranier Vilela  wrote:
>> >> >
>> >> > I know it's coming from copy-and-paste, but
>> >> > I believe the flags could be:
>> >> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | 
>> >> > PG_BINARY);
>> >> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | 
>> >> > O_EXCL | PG_BINARY);
>> >> >
>> >> > The flags:
>> >> > O_WRONLY | O_TRUNC
>> >> >
>> >> > Allow the OS to make some optimizations, if you deem it possible.
>> >> >
>> >> > The flag O_RDWR indicates that the file can be read, which is not true 
>> >> > in this case.
>> >> > The destination file will just be written.
>> >>
>> >> You may be right about the O_WRONLY flag but I am not sure about the
>> >> O_TRUNC flag.
>> >>
>> >> O_TRUNC from the linux man page [1]: If the file already exists and is
>> >> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> >> O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
>> >> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> >> effect of O_TRUNC is unspecified.
>> >
>> > O_TRUNC is usually used in conjunction with O_WRONLY.
>> > See the example at:
> > https://pubs.opengroup.org/onlinepubs/009696699/functions/open.html
>> >
>> > O_TRUNC signals the OS to forget the current contents of the file, if it 
>> > happens to exist.
>> > In other words, there will be no seeks, only and exclusively writes.
>>
>> You are right; the O_TRUNC flag truncates the file, if it happens to
>> exist. But it should not exist in the first place because the code at
>> [2] should check this before the OpenTransientFile() call. Even if we
>> assume somehow the check at [2] does not work, I do not think the
>> correct operation is to truncate the contents of the existing file.
>
> I don't know if there is a communication problem here.
> But what I'm trying to suggest refers to the destination file,
> which doesn't matter if it exists or not?

I do not think there is a communication problem. It actually matters,
because the destination file should not exist; there is code [2] which
already checks and confirms that it does not exist.

>
> If the destination file does not exist, O_TRUNC is ignored.
> If the destination file exists, O_TRUNC truncates the current contents of the 
> file.
> I don't see why you think it's a problem to truncate the current content if 
> the destination file exists.
> Isn't he going to be replaced anyway?

'If the destination file does not exist' means the code at [2] works
correctly and we do not need the O_TRUNC flag.

'If the destination file already exists' means the code at [2] is
somehow broken, and there is a high chance that we could truncate
something we do not want to. For example: suppose there is a foo db
and we want to create a bar db, and Postgres chooses the foo db's
location as the destination of the bar db (which should not happen,
but let's assume the checks at [2] somehow failed); then we could
wrongly truncate the foo db's contents.

Hence, if Postgres works correctly the O_TRUNC flag is unnecessary,
and if Postgres does not work correctly the O_TRUNC flag could cause
harm.

>
> Unless we want to preserve the current content (destination file), in case 
> the copy/clone fails?

As I said above, Postgres should not choose an existing file as the
destination file.

Also, we use the O_CREAT | O_EXCL flags together; from the link [3]
you shared before: 'If O_CREAT and O_EXCL are set, open() shall fail
if the file exists.' So overwriting an already existing file is
already prevented.

[3] https://pubs.opengroup.org/onlinepubs/009696699/functions/open.html
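
As a standalone illustration of that open() semantics (my own minimal
example, not PostgreSQL code):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
    /* O_CREAT | O_EXCL: create the file, fail if it already exists */
    int fd = open("dst.bin", O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0 && errno == EEXIST)
        printf("destination exists; open() failed, nothing truncated\n");
    else if (fd >= 0)
        printf("destination created\n");
    return 0;
}

Running it twice shows the point: the second run fails with EEXIST, so
an O_TRUNC flag could never take effect on this code path anyway.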

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 May 2024 at 14:16, Ranier Vilela  wrote:
>
>
> Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz  
> escreveu:
>>
>> Hi Ranier,
>>
>> Thanks for looking into this!
>>
>> I am not sure why but your reply does not show up in the thread, so I
>> copied your reply and answered it in the thread for visibility.
>>
>> On Tue, 7 May 2024 at 16:28, Ranier Vilela  wrote:
>> >
>> > I know it's coming from copy-and-paste, but
>> > I believe the flags could be:
>> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
>> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL 
>> > | PG_BINARY);
>> >
>> > The flags:
>> > O_WRONLY | O_TRUNC
>> >
>> > Allow the OS to make some optimizations, if you deem it possible.
>> >
>> > The flag O_RDWR indicates that the file can be read, which is not true in 
>> > this case.
>> > The destination file will just be written.
>>
>> You may be right about the O_WRONLY flag but I am not sure about the
>> O_TRUNC flag.
>>
>> O_TRUNC from the linux man page [1]: If the file already exists and is
>> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
>> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> effect of O_TRUNC is unspecified.
>
> O_TRUNC is usually used in conjunction with O_WRONLY.
> See the example at:
> https://pubs.opengroup.org/onlinepubs/009696699/functions/open.html
>
> O_TRUNC signals the OS to forget the current contents of the file, if it 
> happens to exist.
> In other words, there will be no seeks, only and exclusively writes.

You are right; the O_TRUNC flag truncates the file, if it happens to
exist. But it should not exist in the first place because the code at
[2] should check this before the OpenTransientFile() call. Even if we
assume somehow the check at [2] does not work, I do not think the
correct operation is to truncate the contents of the existing file.

>>
>> But it should be already checked if the destination directory already
>> exists in dbcommands.c file in createdb() function [2], so this should
>> not happen.
>
> I'm not sure what you're referring to here.

Sorry, I meant that the destination directory / file should not
exist: the code at [2] confirms that it does not exist, and otherwise
it errors out.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi Ranier,

Thanks for looking into this!

I am not sure why but your reply does not show up in the thread, so I
copied your reply and answered it in the thread for visibility.

On Tue, 7 May 2024 at 16:28, Ranier Vilela  wrote:
>
> I know it's coming from copy-and-paste, but
> I believe the flags could be:
> - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
> + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | 
> PG_BINARY);
>
> The flags:
> O_WRONLY | O_TRUNC
>
> Allow the OS to make some optimizations, if you deem it possible.
>
> The flag O_RDWR indicates that the file can be read, which is not true in 
> this case.
> The destination file will just be written.

You may be right about the O_WRONLY flag but I am not sure about the
O_TRUNC flag.

O_TRUNC from the linux man page [1]: If the file already exists and is
a regular file and the access mode allows writing (i.e., is O_RDWR or
O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
terminal device file, the O_TRUNC flag is ignored. Otherwise, the
effect of O_TRUNC is unspecified.

But whether the destination directory already exists should already
be checked in the createdb() function in dbcommands.c [2], so this
should not happen.

[1] https://man7.org/linux/man-pages/man2/open.2.html

[2]
/*
 * If database OID is configured, check if the OID is already in use or
 * data directory already exists.
 */
if (OidIsValid(dboid))
{
char   *existing_dbname = get_database_name(dboid);

if (existing_dbname != NULL)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
errmsg("database OID %u is already in use by
database \"%s\"",
   dboid, existing_dbname));

if (check_db_file_conflict(dboid))
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
errmsg("data directory with the specified OID %u
already exists", dboid));
}
else
{
/* Select an OID for the new database if is not explicitly
configured. */
do
{
dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
   Anum_pg_database_oid);
    } while (check_db_file_conflict(dboid));
}

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-07 Thread Nazir Bilal Yavuz
Hi,

On Wed, 6 Mar 2024 at 05:17, Thomas Munro  wrote:
>
> The main thing that is missing is support for redo.  It's mostly
> trivial I think, probably just a record type for "try cloning first"
> and then teaching that clone function to fall back to the regular copy
> path if it fails in recovery, do you agree with that idea?  Another
> approach would be to let it fail if it doesn't work on the replica, so
> you don't finish up using dramatically different amounts of disk
> space, but that seems terrible because now your replica is broken.  So
> probably fallback with logged warning (?), though I'm not sure exactly
> which errnos to give that treatment to.

We had an off-list talk with Thomas, and we thought that making this
option a GUC instead of an SQL-command-level option could solve this
problem.

I am posting a new rebased version of the patch with some important changes:

* A 'createdb_file_copy_method' GUC is created. Possible values are
'copy' and 'clone'. 'copy' is the default option. 'clone' can be
chosen if the system supports it; otherwise it raises an error at
startup.

* The #else part of the clone_file() function calls pg_unreachable()
instead of ereport().

* Documentation updates.

Also, what should happen when the kernel has clone support but the
file system does not?

- I tested this on Linux, and copy_file_range() silently falls back
to a normal copy when this happens. I assume the same will happen on
FreeBSD because it uses the copy_file_range() function as well.

- I am not sure about macOS, since the force flag
(COPYFILE_CLONE_FORCE) is used. I do not have macOS so I cannot test
it, but I assume it will error out when this happens. If that is the
case, is this a problem? I am asking because, since this is a GUC now,
the user has full responsibility.
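
For reference, the Linux path I tested boils down to a loop like this
(a minimal sketch of copy_file_range() usage, not the attached patch;
error handling trimmed):

#define _GNU_SOURCE
#include <unistd.h>

/*
 * Copy 'size' bytes from srcfd to dstfd.  On copy-on-write
 * filesystems the kernel may share blocks; on filesystems without
 * reflink support it silently falls back to a plain kernel-side copy
 * instead of failing.
 */
static int
copy_range_loop(int srcfd, int dstfd, off_t size)
{
    while (size > 0)
    {
        ssize_t nbytes = copy_file_range(srcfd, NULL, dstfd, NULL,
                                         (size_t) size, 0);

        if (nbytes <= 0)
            return -1;  /* caller can fall back to read()/write() */
        size -= nbytes;
    }
    return 0;
}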

Any kind of feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 404e301dbdb252c23ea9d451b817cf6e372d0d9a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 7 May 2024 14:16:09 +0300
Subject: [PATCH v5] Use CLONE method with GUC on CREATE DATABASE ...
 STRATEGY=FILE_COPY.

createdb_file_copy_method GUC option is introduced. It can be set to
either COPY (default) or CLONE (if system supports it).

If CLONE option is chosen, similar to STRATEGY=FILE_COPY; but attempting
to use efficient file copying system calls.  The kernel has the
opportunity to share block ranges in copy-on-write file systems, or
maybe push down the copy to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/commands/dbcommands.h |  9 ++
 src/include/storage/copydir.h |  3 +-
 src/backend/commands/dbcommands.c | 21 -
 src/backend/storage/file/copydir.c| 83 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 13 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 ++
 doc/src/sgml/config.sgml  | 17 
 doc/src/sgml/ref/create_database.sgml | 24 --
 src/tools/pgindent/typedefs.list  |  1 +
 10 files changed, 162 insertions(+), 15 deletions(-)

diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 92e17c71158..2e1d3565edc 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -14,6 +14,15 @@
 #ifndef DBCOMMANDS_H
 #define DBCOMMANDS_H
 
+typedef enum CreateDBFileCopyMethod
+{
+	CREATEDB_FILE_COPY_METHOD_COPY,
+	CREATEDB_FILE_COPY_METHOD_CLONE,
+} CreateDBFileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int createdb_file_copy_method;
+
 #include "access/xlogreader.h"
 #include "catalog/objectaddress.h"
 #include "lib/stringinfo.h"
diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..9ff28f2eec9 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,7 +13,8 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
-extern void copydir(const char *fromdir, const char *todir, bool recurse);
+extern void copydir(const char *fromdir, const char *todir, bool recurse,
+	bool clone_files);
 extern void copy_file(const char *fromfile, const char *tofile);
 
 #endif			/* COPYDIR_H */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index be629ea92cf..cf0dff65e69 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -69,6 +69,20 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/* GUCs */
+int			createdb_file_copy_method = CREATEDB_FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const str

Re: Use streaming read API in ANALYZE

2024-04-29 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
>
> Pushed.  Thanks Bilal and reviewers!

I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread, but I can create
another thread if you prefer.

After the revert of 27bc1772fc8, acquire_sample_rows() became table-AM
agnostic again. So the read stream changes might have to be pushed
down now, but there are a couple of roadblocks, as Melanie mentioned
[1] before.

Quote from Melanie [1]:

On Thu, 11 Apr 2024 at 19:19, Melanie Plageman
 wrote:
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).
>
> read_stream_begin_relation() doesn't just save the
> BufferAccessStrategy in the ReadStream, it uses it to set various
> other things in the ReadStream object. callback_private_data (which in
> ANALYZE's case is the BlockSamplerData) is simply saved in the
> ReadStream, so it could be set later, but that doesn't sound very
> clean to me.
>
> As such, it seems like a cleaner alternative would be to add a table
> AM callback for creating a read stream object that takes the
> parameters of read_stream_begin_relation(). But, perhaps it is a bit
> late for such additions.

If we do not want to add a new table AM callback like Melanie
mentioned, it is pretty much required to pass BufferAccessStrategy and
BlockSamplerData to initscan().

> It also opens us up to the question of whether or not sequential scan
> should use such a callback instead of making the read stream object in
> heap_beginscan().
>
> I am happy to write a patch that does any of the above. But, I want to
> raise these questions, because perhaps I am simply missing an obvious
> alternative solution.

I wonder the same, I could not think of any alternative solution to
this problem.

Another quote from Melanie [2] in the same thread:

On Thu, 11 Apr 2024 at 20:48, Melanie Plageman
 wrote:
>
> I will also say that, had this been 6 months ago, I would probably
> suggest we restructure ANALYZE's table AM interface to accommodate
> read stream setup and to address a few other things I find odd about
> the current code. For example, I think creating a scan descriptor for
> the analyze scan in acquire_sample_rows() is quite odd. It seems like
> it would be better done in the relation_analyze callback. The
> relation_analyze callback saves some state like the callbacks for
> acquire_sample_rows() and the Buffer Access Strategy. But at least in
> the heap implementation, it just saves them in static variables in
> analyze.c. It seems like it would be better to save them in a useful
> data structure that could be accessed later. We have access to pretty
> much everything we need at that point (in the relation_analyze
> callback). I also think heap's implementation of
> table_beginscan_analyze() doesn't need most of
> heap_beginscan()/initscan(), so doing this instead of something
> ANALYZE specific seems more confusing than helpful.

If we want to implement ANALYZE-specific counterparts of
heap_beginscan()/initscan(), we may think of passing
BufferAccessStrategy and BlockSamplerData to them.

Also, there is an ongoing(?) discussion about a few problems /
improvements regarding acquire_sample_rows(), mentioned at the end of
the 'Table AM Interface Enhancements' thread [3]. Should we wait for
these discussions to be resolved, or can we resume working on this
patch?

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: New committers: Melanie Plageman, Richard Guo

2024-04-26 Thread Nazir Bilal Yavuz
On Fri, 26 Apr 2024 at 14:54, Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!

Congratulations to both of you!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: ci: Allow running mingw tests by default via environment variable

2024-04-25 Thread Nazir Bilal Yavuz
Hi,

Thank you for working on this!

On Sat, 13 Apr 2024 at 05:12, Andres Freund  wrote:
>
> Hi,
>
> We have CI support for mingw, but don't run the task by default, as it eats up
> precious CI credits.  However, for cfbot we are using custom compute resources
> (currently donated by google), so we can afford to run the mingw tests. Right
> now that'd require cfbot to patch .cirrus.tasks.yml.

And I think the mingw task ends up not running most of the time. +1
to running it by default, at least on cfbot. Also, this gives people a
chance to run mingw by default on their personal repositories (which I
would like to do).

> While one can manually trigger manual task in one one's own repo, most won't
> have the permissions to do so for cfbot.
>
>
> I propose that we instead run the task automatically if
> $REPO_MINGW_TRIGGER_BY_DEFAULT is set, typically in cirrus' per-repository
> configuration.
>
> Unfortunately that's somewhat awkward to do in the cirrus-ci yaml
> configuration, the set of conditional expressions supported is very
> simplistic.
>
> To deal with that, I extended .cirrus.star to compute the required environment
> variable. If $REPO_MINGW_TRIGGER_BY_DEFAULT is set, CI_MINGW_TRIGGER_TYPE is
> set to 'automatic', if not it's 'manual'.
>

The changes look good to me. My only complaint would be accepting
only 'true' for REPO_MINGW_TRIGGER_BY_DEFAULT rather than a list of
possible values, but this is not important.

> We've also talked in other threads about adding CI support for
> 1) windows, building with visual studio
> 2) linux, with musl libc
> 3) free/netbsd
>
> That becomes more enticing, if we can enable them by default on cfbot but not
> elsewhere. With this change, it'd be easy to add further variables to control
> such future tasks.

I agree.

> I also attached a second commit, that makes the "code" dealing with ci-os-only
> in .cirrus.tasks.yml simpler. While I think it does nicely simplify
> .cirrus.tasks.yml, overall it adds lines, possibly making this not worth it.
> I'm somewhat on the fence.
>
>
> Thoughts?

Although it adds more lines, this makes the .cirrus.tasks.yml file
more readable and understandable, which is more important in my
opinion.

> On the code level, I thought if it'd be good to have a common prefix for all
> the automatically set variables. Right now that's CI_, but I'm not at all
> wedded to that.

I agree with your thoughts and CI_ prefix.

I tested both patches and they work as expected.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-04-24 Thread Nazir Bilal Yavuz
Hi,

On Fri, 19 Apr 2024 at 11:01, Nazir Bilal Yavuz  wrote:
> > On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
> > >
> > > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > > I agree with your points. While the other I/O related work is
> > > > happening we can discuss what we should do in the variable op_byte
> > > > cases. Also, this is happening only for WAL right now but if we try to
> > > > extend pg_stat_io in the future, that problem possibly will rise
> > > > again. So, it could be good to come up with a general solution, not
> > > > only for WAL.
> > >
> > > Okay, I've marked the patch as RwF for this CF.

Since the last commitfest entry was returned with feedback, I created
a new commitfest entry: https://commitfest.postgresql.org/48/4950/

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: slightly misleading Error message in guc.c

2024-04-22 Thread Nazir Bilal Yavuz
Hi,

On Mon, 22 Apr 2024 at 11:44, jian he  wrote:
>
> hi.
> minor issue in guc.c.
>
> set work_mem to '1kB';
> ERROR:  1 kB is outside the valid range for parameter "work_mem" (64
> .. 2147483647)
> should it be
> ERROR:  1 kB is outside the valid range for parameter "work_mem" (64
> kB .. 2147483647 kB)
> ?
> since the units for work_mem are { "B", "kB", "MB", "GB", and "TB"}
>
> search `outside the valid range for parameter`,
> there are two occurrences in guc.c.

Nice find. I agree it could cause confusion.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-04-19 Thread Nazir Bilal Yavuz
Hi,

On Mon, 19 Feb 2024 at 10:28, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
> >
> > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > I agree with your points. While the other I/O related work is
> > > happening we can discuss what we should do in the variable op_byte
> > > cases. Also, this is happening only for WAL right now but if we try to
> > > extend pg_stat_io in the future, that problem possibly will rise
> > > again. So, it could be good to come up with a general solution, not
> > > only for WAL.
> >
> > Okay, I've marked the patch as RwF for this CF.
>
> I wanted to inform you that the 73f0a13266 commit changed all WALRead
> calls to read variable bytes, only the WAL receiver was reading
> variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We had
been discussing what to do once this was merged. It was decided that
WALReadFromBuffers() should not call pgstat_report_wait_start()
because it does not perform any I/O [1]. We may follow the same logic
and not include these reads in pg_stat_io?

* With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
that this means some variable byte I/O is happening.

I kind of dislike this solution because even if only the read I/O
happens in variable bytes, it will look like write and extend I/Os
happen in variable bytes as well. As a mitigation, it could be
documented that, for now, only read I/Os can happen in variable bytes.

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, combined with the first solution (see the sketch
after the list below).

This solves the first solution's weakness, but it introduces two more
columns. It is more future proof compared to the first solution if
there is a chance that some variable I/O could happen in write and
extend calls as well in the future.

3- Create a new pg_stat_io_wal view to put WAL I/Os in, instead of
pg_stat_io.

pg_stat_io could remain untouched and we would have the flexibility to
shape this new view as much as we want. But the original aim of
pg_stat_io is to evaluate all I/O from a single view, and adding a new
view breaks that aim.
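
To illustrate the second solution, here is a hypothetical shape for
the tracked per-operation values (the names are invented for
illustration; this is not from any patch):

/* Hypothetical sketch: one op_bytes per operation type, where a
 * sentinel of 0 (NULL at the SQL level) means "variable-sized". */
typedef struct IOOpByteSizes
{
    int64       op_bytes_read;      /* e.g. BLCKSZ, or 0 for variable */
    int64       op_bytes_write;     /* e.g. BLCKSZ */
    int64       op_bytes_extend;    /* e.g. BLCKSZ */
} IOOpByteSizes;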

I hope that I did not miss anything and my explanations are clear.

Any kind of feedback would be appreciated.


[1] 
https://www.postgresql.org/message-id/CAFiTN-sE7CJn-ZFj%2B-0Wv6TNytv_fp4n%2BeCszspxJ3mt77t5ig%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 22:25, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thank you for looking into this!
>
> On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
> >
> > On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > > Streaming API has been committed but the committed version has a minor
> > > change, the read_stream_begin_relation function takes Relation instead
> > > of BufferManagerRelation now. So, here is a v5 which addresses this
> > > change.
> >
> > I'm getting a repeatable segfault / assertion failure with this:
> >
> > postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> > CREATE TABLE
> > postgres=# insert into tengiga select g, repeat('x', 900) from
> > generate_series(1, 140) g;
> > INSERT 0 140
> > postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> > SET
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >
> > TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> > "heapam_handler.c", Line: 1079, PID: 262232
> > postgres: heikki postgres [local]
> > ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> > postgres: heikki postgres [local]
> > ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> > postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> > postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> > postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> > postgres: heikki postgres [local]
> > ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> > postgres: heikki postgres [local]
> > ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> > postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> > postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> > postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> > postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> > postgres: heikki postgres [local]
> > ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> > postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> > postgres: heikki postgres [local]
> > ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> > postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> > postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> > postgres: heikki postgres [local]
> > ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> > postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> > postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> > 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> > was terminated by signal 6: Aborted
>
> I realized the same error while working on Jakub's benchmarking results.
>
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return as BlockSampling algorithm
> decides how many blocks to return on the fly by using some random
> seeds.

I wanted to re-check this problem and I realized that I was wrong. I
tried using nblocks again and this time there was no failure. I looked
at block sampling logic and I am pretty sure that BlockSampler_Init()
function correctly returns the number of blocks that
BlockSampler_Next() will return. It seems 158f581923 fixed this issue
as well.
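
For reference, the usage pattern in question looks roughly like this
(a sketch of the block sampling API; 'totalblocks', 'targrows' and
'seed' are placeholders):

BlockSamplerData bs;
BlockNumber nblocks;

/* returns exactly the number of blocks BlockSampler_Next() will
 * hand back over the whole loop */
nblocks = BlockSampler_Init(&bs, totalblocks, targrows, seed);

while (BlockSampler_HasMore(&bs))
{
    BlockNumber blockno = BlockSampler_Next(&bs);

    /* ... read and sample rows from blockno ... */
}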

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Typos in the code and README

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

On Mon, 15 Apr 2024 at 15:26, Daniel Gustafsson  wrote:
>
> > On 14 Apr 2024, at 13:19, David Rowley  wrote:
> >
> > On Sat, 13 Apr 2024 at 09:17, Daniel Gustafsson  wrote:
> >>
> >>> On 12 Apr 2024, at 23:15, Heikki Linnakangas  wrote:
> >>> Here's a few more. I've accumulate these over the past couple of months, 
> >>> keeping them stashed in a branch, adding to it whenever I've spotted a 
> >>> minor typo while reading the code.
> >>
> >> Nice, let's lot all these together.
> >
> > Here are a few additional ones to add to that.
>
> Thanks.  Collecting all the ones submitted here, as well as a few submitted
> off-list by Alexander, the patch is now a 3-part patchset of cleanups:
>
> 0001 contains the typos and duplicate words fixups, 0002 fixes a parameter 
> with
> the wrong name in the prototype and 0003 removes a leftover prototype which 
> was
> accidentally left in a refactoring.

I realized two small typos: 'sgmr' -> 'smgr'. You may want to include
them in 0001.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

I am working on using read streams in the CREATE DATABASE command when
the strategy is wal_log. The RelationCopyStorageUsingBuffer() function
is used in this context. It reads the source buffers and then copies
them to the destination buffers. I used read streams only when reading
the source buffers, because the destination buffers are read with the
'RBM_ZERO_AND_LOCK' option, so streaming them is not important.
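
To give an idea of the shape of the change, here is a rough sketch of
how the source side can be streamed (illustrative; the names are mine
and not necessarily the attached patch's):

/* per-stream state: which source block to return next */
typedef struct
{
    BlockNumber next;
    BlockNumber nblocks;
} CopyBlockState;

static BlockNumber
copy_storage_next_block(ReadStream *stream,
                        void *callback_private_data)
{
    CopyBlockState *state = (CopyBlockState *) callback_private_data;

    if (state->next >= state->nblocks)
        return InvalidBlockNumber;  /* exhausts the stream */
    return state->next++;
}

The source buffers then come from read_stream_next_buffer() in a loop
instead of per-block read calls; the destination buffers are still
read with RBM_ZERO_AND_LOCK as before.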

I created a ~6 GB table [1] and then created a new database with the
wal_log strategy, using the database the table was created in as a
template [2]. My benchmarking results are:

a. Timings:

patched:
12955.027 ms
12917.475 ms
13177.846 ms
12971.308 ms
13059.985 ms

master:
13156.375 ms
13054.071 ms
13151.607 ms
13152.633 ms
13160.538 ms

There is no meaningful difference in timings; the patched version is a
tiny bit faster, but the difference is negligible. I had actually
expected the patched version to do better, because there was no
prefetching before; however, the read stream API detects sequential
access and disables prefetching.

b. strace:

patched:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.02    3.749359           2   1285782           pwrite64
 18.54    1.021734          21     46730           preadv
  9.49    0.522889         826       633           fdatasync
  2.55    0.140339          59      2368           pwritev
  1.14    0.062583         409       153           fsync

master:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.71    3.831542           2   1288365           pwrite64
 29.84    1.914540           2    747936           pread64
  7.90    0.506843         837       605           fdatasync
  1.58    0.101575          54      1856           pwritev
  0.75    0.048431         400       121           fsync

The patched version issues roughly 1/16 as many read system calls
(46,730 preadv calls versus 747,936 pread64 calls on master).

c. perf:

patched:

-   97.83%  1.13%  postgres  postgres  [.] RelationCopyStorageUsingBuffer
   - 97.83% RelationCopyStorageUsingBuffer
      - 44.28% ReadBufferWithoutRelcache
         + 42.20% GetVictimBuffer
           0.81% ZeroBuffer
      + 31.86% log_newpage_buffer
      - 19.51% read_stream_next_buffer
         - 17.92% WaitReadBuffers
            + 17.61% mdreadv
         - 1.47% read_stream_start_pending_read
            + 1.46% StartReadBuffers

master:

-   97.68%  0.57%  postgres  postgres  [.] RelationCopyStorageUsingBuffer
   - RelationCopyStorageUsingBuffer
      - 65.48% ReadBufferWithoutRelcache
         + 41.16% GetVictimBuffer
         - 20.42% WaitReadBuffers
            + 19.90% mdreadv
         + 1.85% StartReadBuffer
           0.75% ZeroBuffer
      + 30.82% log_newpage_buffer

The patched version spends less CPU time in read calls and
correspondingly more in other calls, such as writes.

Three patch files are attached. The first two are an optimization and a
way to create a read stream object from an SMgrRelation; both were
already proposed in the streaming I/O thread [3]. The third is the
actual patch.

Any kind of feedback would be appreciated.

[1] CREATE TABLE t as select repeat('a', 100) || i || repeat('b', 500) as
filler from generate_series(1, 900) as i;
[2] CREATE DATABASE test_1 STRATEGY 'wal_log' TEMPLATE test;
[3]
https://www.postgresql.org/message-id/CAN55FZ1yGvCzCW_aufu83VimdEYHbG_zuOY3J9JL-nBptyJyKA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From edcfacec40a70a8747c5f18777b6c28b0fccba7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH v1 1/3] Refactor PinBufferForBlock() to remove if checks about
 persistence

PinBufferForBlock() contains if checks to work out the persistence of
the relation, and this function is called for each block in the
relation. Instead, determine the relation's persistence once, before
PinBufferForBlock() is called.
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 901b7230fb9..17cb7127f99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -

Re: Table AM Interface Enhancements

2024-04-15 Thread Nazir Bilal Yavuz
Hi,

On Mon, 15 Apr 2024 at 18:36, Robert Haas  wrote:
>
> On Sat, Apr 13, 2024 at 5:28 AM Alexander Korotkov  
> wrote:
> > Yes, I think so.  Table AM API deals with TIDs and block numbers, but
> > doesn't force on what they actually mean.  For example, in ZedStore
> > [1], data is stored on per-column B-trees, where TID used in table AM
> > is just a logical key of that B-trees.  Similarly, blockNumber is a
> > range for B-trees.
> >
> > c6fc50cb4028 and 041b96802ef are putting to acquire_sample_rows() an
> > assumption that we are sampling physical blocks as they are stored in
> > data files.  That couldn't anymore be some "logical" block numbers
> > with meaning only table AM implementation knows.  That was pointed out
> > by Andres [2].  I'm not sure if ZedStore is alive, but there could be
> > other table AM implementations like this, or other implementations in
> > development, etc.  Anyway, I don't feel good about narrowing the API,
> > which is there from pg12.
>
> I spent some time looking at this. I think it's valid to complain
> about the tighter coupling, but c6fc50cb4028 is there starting in v14,
> so I don't think I understand why the situation after 041b96802ef is
> materially worse than what we've had for the last few releases. I
> think it is worse in the sense that, before, you could dodge the
> problem without defining USE_PREFETCH, and now you can't, but I don't
> think we can regard nonphysical block numbers as a supported scenario
> on that basis.

I agree with you, but one thing is unclear to me. If out-of-core AMs
are used, doesn't all of the block sampling logic (BlockSampler_Init(),
BlockSampler_Next(), etc.) need to be edited as well, since these
functions assume block numbers are actual physical on-disk locations?
If the block number is something other than the physical on-disk
location, the acquire_sample_rows() function looks wrong to me before
c6fc50cb4028 as well.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: gcc 12.1.0 warning

2024-04-15 Thread Nazir Bilal Yavuz
Hi,

On Sat, 13 Apr 2024 at 05:25, Andres Freund  wrote:
>
> Hi,
>
> On 2023-10-27 13:09:01 +0300, Nazir Bilal Yavuz wrote:
> > I was testing 'upgrading CI Debian images to bookworm'. I tested the
> > bookworm on REL_15, REL_16 and upstream. REL_16 and upstream finished
> > successfully but the CompilerWarnings task failed on REL_15 with the
> > same error.
>
> Is that still the case?  I briefly tried to repro this outside of CI and
> couldn't reproduce the warning.
>
> I'd really like to upgrade the CI images, it doesn't seem great to continue
> using oldstable.
>
>
> > gcc version: 12.2.0
> >
> > CI Run: https://cirrus-ci.com/task/6151742664998912
>
> Unfortunately the logs aren't accessible anymore, so I can't check the precise
> patch level of the compiler and/or the precise invocation used.

I am able to reproduce this. I regenerated the Debian bookworm image
and ran CI on REL_15_STABLE with this image.

CI Run: https://cirrus-ci.com/task/4978799442395136

--

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Speed up clean meson builds by ~25%

2024-04-08 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin  wrote:
>
> Hello Michael,
>
> 08.04.2024 08:23, Michael Paquier wrote:
> > On Fri, Apr 05, 2024 at 06:19:20PM +0200, Jelte Fennema-Nio wrote:
> >> Agreed. While not a full solution, I think this patch is still good to
> >> address some of the pain: Waiting 10 seconds at the end of my build
> >> with only 1 of my 10 cores doing anything.
> >>
> >> So while this doesn't decrease CPU time spent it does decrease
> >> wall-clock time significantly in some cases, with afaict no downsides.
> > Well, this is also painful with ./configure.  So, even if we are going
> > to move away from it at this point, we still need to support it for a
> > couple of years.  It looks to me that letting the clang folks know
> > about the situation is the best way forward.
> >
>
> As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
> is fixed already.
> Perhaps it's worth rechecking...
>
> [1] 
> https://www.postgresql.org/message-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8%40gmail.com

I had this problem on my local computer. My build times are:

gcc: 20s
clang-15: 24s
clang-16: 105s
clang-17: 111s
clang-18: 25s

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> > >
> > > I had been planning to commit v14 this morning but got cold feet with
> > > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > > neither did I.  I have now removed it, and it seems much better.  No
> > > other significant changes, just parameter types and inlining details.
> > > For example:
> > >
> > >  * read_stream_begin_relation() now takes a Relation, likes its name says
> > >  * StartReadBuffers()'s operation takes smgr and optional rel
> > >  * ReadBuffer_common() takes smgr and optional rel
> >
> > Read stream objects can currently be created only from Relations. There
> > could be read stream users that have no Relation, only an SMgrRelation.
> > So, I created another constructor for read streams that takes an
> > SMgrRelation instead of a Relation. The related patch is attached.
>
> After sending this, I realized that I had forgotten to add a persistence
> value to the new constructor. While working on that, I also realized that
> the current code sets persistence in PinBufferForBlock(), and this
> function is called for each block, which can be costly. So, I moved the
> persistence setup out of PinBufferForBlock().
>
> The patches for setting persistence outside of PinBufferForBlock() (0001)
> and for the new constructor that takes an SMgrRelation (0002) are
> attached.

Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

PinBufferForBlock() contains if checks to work out the persistence of
the relation, and this function is called for each block in the
relation. Instead, determine the relation's persistence once, before
PinBufferForBlock() is called.
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBuffere

Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> >  * read_stream_begin_relation() now takes a Relation, likes its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can currently be created only from Relations. There
> could be read stream users that have no Relation, only an SMgrRelation.
> So, I created another constructor for read streams that takes an
> SMgrRelation instead of a Relation. The related patch is attached.

After sending this, I realized that I had forgotten to add a persistence
value to the new constructor. While working on that, I also realized that
the current code sets persistence in PinBufferForBlock(), and this
function is called for each block, which can be costly. So, I moved the
persistence setup out of PinBufferForBlock().

The patches for setting persistence outside of PinBufferForBlock() (0001)
and for the new constructor that takes an SMgrRelation (0002) are
attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

PinBufferForBlock() contains if checks to work out the persistence of
the relation, and this function is called for each block in the
relation. Instead, determine the relation's persistence once, before
PinBufferForBlock() is called.
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;
+	else if (smgr_persistence == 0)
+		

Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
>
> I had been planning to commit v14 this morning but got cold feet with
> the BMR-based interface.  Heikki didn't like it much, and in the end,
> neither did I.  I have now removed it, and it seems much better.  No
> other significant changes, just parameter types and inlining details.
> For example:
>
>  * read_stream_begin_relation() now takes a Relation, likes its name says
>  * StartReadBuffers()'s operation takes smgr and optional rel
>  * ReadBuffer_common() takes smgr and optional rel

Read stream objects can currently be created only from Relations. There
could be read stream users that have no Relation, only an SMgrRelation.
So, I created another constructor for read streams that takes an
SMgrRelation instead of a Relation. The related patch is attached.
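
For example, a caller that holds only an SMgrRelation could start a
stream roughly like this (a sketch against the API in the attached
patch; the callback and state names are made up for illustration):

stream = read_stream_begin_sgmr_relation(READ_STREAM_DEFAULT,
										 bstrategy,
										 src_smgr,
										 MAIN_FORKNUM,
										 next_block_cb,	/* illustrative */
										 &cb_state,		/* illustrative */
										 0);			/* no per-buffer data */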

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 38b57ec7e54a54a0c7b0117a0ecaaf68c643e1b0 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 20:16:00 +0300
Subject: [PATCH] Add a way to create read stream object by using SMgrRelation

Currently a read stream object can be created only from a Relation;
there is no way to create one from an SMgrRelation.

To achieve that, a read_stream_begin_impl() function is created and the
contents of read_stream_begin_relation() are moved into it.

Both read_stream_begin_relation() and read_stream_begin_sgmr_relation()
call read_stream_begin_impl(), passing a Relation and an SMgrRelation
respectively.
---
 src/include/storage/read_stream.h |  8 +++
 src/backend/storage/aio/read_stream.c | 74 ++-
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..601a7fcf92b 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,13 @@ extern ReadStream *read_stream_begin_relation(int flags,
 			  ReadStreamBlockNumberCB callback,
 			  void *callback_private_data,
 			  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_sgmr_relation(int flags,
+   BufferAccessStrategy strategy,
+   SMgrRelation smgr,
+   ForkNumber forknum,
+   ReadStreamBlockNumberCB callback,
+   void *callback_private_data,
+   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..a9a3b0de6c9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-		   BufferAccessStrategy strategy,
-		   Relation rel,
-		   ForkNumber forknum,
-		   ReadStreamBlockNumberCB callback,
-		   void *callback_private_data,
-		   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+	   BufferAccessStrategy strategy,
+	   Relation rel,
+	   SMgrRelation smgr,
+	   ForkNumber forknum,
+	   ReadStreamBlockNumberCB callback,
+	   void *callback_private_data,
+	   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +423,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +432,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,7 +546,7 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
+		stream->ios[i].op.smgr = smgr;
 		stream->ios[i].op.smgr_persistence = 0;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
@@ -557,6 +555,56 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begi

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-04-07 Thread Nazir Bilal Yavuz
Hi Andrey,

On Sun, 7 Apr 2024 at 08:29, Andrey M. Borodin  wrote:
>
>
>
> > On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz  wrote:
> >
> > I did not have the time to check other things you mentioned but I
> > tested the read performance. The table size is 5.5GB, I did 20 runs in
> > total.
>
> Hi Nazir!
>
> Do you plan to review anything else? Or do you think it worth to look at by 
> someone else? Or is the patch Ready for Committer? If so, please swap CF 
> entry [0] to status accordingly, currently it's "Waiting on Author".

Thanks for reminding me! I think this needs a review from someone else
(especially the prefetch part), so I changed it to 'Needs review'.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-04 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  wrote:
>
>
> I've reviewed the patches inline below and attached a patch that has
> some of my ideas on top of your patch.

Thank you!

>
> > From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Wed, 3 Apr 2024 15:14:15 +0300
> > Subject: [PATCH v6] Use streaming read API in ANALYZE
> >
> > ANALYZE command gets random tuples using BlockSampler algorithm. Use
> > streaming reads to get these tuples by using BlockSampler algorithm in
> > streaming read API prefetch logic.
> > ---
> >  src/include/access/heapam.h  |  6 +-
> >  src/backend/access/heap/heapam_handler.c | 22 +++---
> >  src/backend/commands/analyze.c   | 85 
> >  3 files changed, 42 insertions(+), 71 deletions(-)
> >
> > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> > index a307fb5f245..633caee9d95 100644
> > --- a/src/include/access/heapam.h
> > +++ b/src/include/access/heapam.h
> > @@ -25,6 +25,7 @@
> >  #include "storage/bufpage.h"
> >  #include "storage/dsm.h"
> >  #include "storage/lockdefs.h"
> > +#include "storage/read_stream.h"
> >  #include "storage/shm_toc.h"
> >  #include "utils/relcache.h"
> >  #include "utils/snapshot.h"
> > @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
> > struct 
> > GlobalVisState *vistest);
> >
> >  /* in heap/heapam_handler.c*/
> > -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> > -   
> >  BlockNumber blockno,
> > -   
> >  BufferAccessStrategy bstrategy);
> > +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> > +   
> >  ReadStream *stream);
> >  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
> > 
> >  TransactionId OldestXmin,
> > 
> >  double *liverows, double *deadrows,
> > diff --git a/src/backend/access/heap/heapam_handler.c 
> > b/src/backend/access/heap/heapam_handler.c
> > index 0952d4a98eb..d83fbbe6af3 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> > Relation NewHeap,
> >  }
> >
> >  /*
> > - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> > - * with SO_TYPE_ANALYZE option.
> > + * Prepare to analyze block returned from streaming object.  If the block 
> > returned
> > + * from streaming object is valid, true is returned; otherwise false is 
> > returned.
> > + * The scan has been started with SO_TYPE_ANALYZE option.
> >   *
> >   * This routine holds a buffer pin and lock on the heap page.  They are 
> > held
> >   * until heapam_scan_analyze_next_tuple() returns false.  That is until 
> > all the
> >   * items of the heap page are analyzed.
> >   */
> > -void
> > -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> > -
> > BufferAccessStrategy bstrategy)
> > +bool
> > +heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
> >  {
> >   HeapScanDesc hscan = (HeapScanDesc) scan;
> >
> > @@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, 
> > BlockNumber blockno,
> >* doing much work per tuple, the extra lock traffic is probably 
> > better
> >* avoided.
>
> Personally I think heapam_scan_analyze_next_block() should be inlined.
> It only has a few lines. I would find it clearer inline. At the least,
> there is no reason for it (or heapam_scan_analyze_next_tuple()) to take
> a TableScanDesc instead of a HeapScanDesc.

I agree.

>
> >*/
> > - hscan->rs_cblock = blockno;
> > - hscan->rs_cindex = FirstOffsetNumber;
> > - hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
> > -  

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
>
> On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > The streaming API has been committed, but the committed version has a
> > minor change: the read_stream_begin_relation function now takes a
> > Relation instead of a BufferManagerRelation. So, here is a v5 which
> > addresses this change.
>
> I'm getting a repeatable segfault / assertion failure with this:
>
> postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> CREATE TABLE
> postgres=# insert into tengiga select g, repeat('x', 900) from
> generate_series(1, 140) g;
> INSERT 0 140
> postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> SET
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> "heapam_handler.c", Line: 1079, PID: 262232
> postgres: heikki postgres [local]
> ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> postgres: heikki postgres [local]
> ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> postgres: heikki postgres [local]
> ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> postgres: heikki postgres [local]
> ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> postgres: heikki postgres [local]
> ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> postgres: heikki postgres [local]
> ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> postgres: heikki postgres [local]
> ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> was terminated by signal 6: Aborted

I hit the same error while working on Jakub's benchmarking results.

Cause: I was using the nblocks variable to determine how many blocks
the streaming API would return. But sometimes the number returned by
BlockSampler_Init() is not equal to the number of blocks that
BlockSampler_Next() will actually return, because the block sampling
algorithm decides how many blocks to return on the fly, using random
seeds.

There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop of
acquire_sample_rows():

The streaming API uses this function to prefetch block numbers. Because
it is called while prefetching, BlockSampler_HasMore() reaches the end
first and starts returning false while there are still buffers left to
return from the streaming API. That would cause some buffers at the end
to never be processed.

2- Expose something (a function, a variable, etc.) from the streaming
API to tell whether the read is finished and there is no buffer left to
return:

I think this would work, but I am not sure whether the streaming API
should expose something like that.

3- Check every buffer returned from the streaming API and stop the main
loop in acquire_sample_rows() when an invalid buffer is returned (see
the sketch below):

This solves the problem, but there would be two if checks for each
returned buffer:
- one in heapam_scan_analyze_next_block() to check whether the returned
buffer is invalid
- one to break the main loop in acquire_sample_rows() if
heapam_scan_analyze_next_block() returns false
One of the checks can be avoided by moving
heapam_scan_analyze_next_block()'s code into the main loop in
acquire_sample_rows().

I implemented the third solution, here is v6.
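
A minimal sketch of the resulting loop (assumed shape, not the exact v6
code):

/* Drive the loop off the stream itself instead of a block count. */
for (;;)
{
	Buffer		buf = read_stream_next_buffer(stream, NULL);

	if (!BufferIsValid(buf))
		break;					/* sampler exhausted, stream is done */

	/* ... lock the buffer, sample its tuples, update blksdone ... */
}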

-- 
Regards,
Nazir Bilal Yavu

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi Jakub,

Thank you for looking into this and doing a performance analysis.

On Wed, 3 Apr 2024 at 11:42, Jakub Wartak  wrote:
>
> On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
> [..]
> > v4 is rebased on top of v14 streaming read API changes.
>
> Hi Nazir, so with streaming API committed, I gave a try to this patch.
> With autovacuum=off and 30GB table on NVMe (with standard readahead of
> 256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB
> default) created using: create table t as select repeat('a', 100) || i
> || repeat('b', 500) as filler from generate_series(1, 4500) as i;
>
> on master, effect of mainteance_io_concurency [default 10] is like
> that (when resetting the fs cache after each ANALYZE):
>
> m_io_c = 0:
> Time: 3137.914 ms (00:03.138)
> Time: 3094.540 ms (00:03.095)
> Time: 3452.513 ms (00:03.453)
>
> m_io_c = 1:
> Time: 2972.751 ms (00:02.973)
> Time: 2939.551 ms (00:02.940)
> Time: 2904.428 ms (00:02.904)
>
> m_io_c = 2:
> Time: 1580.260 ms (00:01.580)
> Time: 1572.132 ms (00:01.572)
> Time: 1558.334 ms (00:01.558)
>
> m_io_c = 4:
> Time: 938.304 ms
> Time: 931.772 ms
> Time: 920.044 ms
>
> m_io_c = 8:
> Time: 666.025 ms
> Time: 660.241 ms
> Time: 648.848 ms
>
> m_io_c = 16:
> Time: 542.450 ms
> Time: 561.155 ms
> Time: 539.683 ms
>
> m_io_c = 32:
> Time: 538.487 ms
> Time: 541.705 ms
> Time: 538.101 ms
>
> with patch applied:
>
> m_io_c = 0:
> Time: 3106.469 ms (00:03.106)
> Time: 3140.343 ms (00:03.140)
> Time: 3044.133 ms (00:03.044)
>
> m_io_c = 1:
> Time: 2959.817 ms (00:02.960)
> Time: 2920.265 ms (00:02.920)
> Time: 2911.745 ms (00:02.912)
>
> m_io_c = 2:
> Time: 1581.912 ms (00:01.582)
> Time: 1561.444 ms (00:01.561)
> Time: 1558.251 ms (00:01.558)
>
> m_io_c = 4:
> Time: 908.116 ms
> Time: 901.245 ms
> Time: 901.071 ms
>
> m_io_c = 8:
> Time: 619.870 ms
> Time: 620.327 ms
> Time: 614.266 ms
>
> m_io_c = 16:
> Time: 529.885 ms
> Time: 526.958 ms
> Time: 528.474 ms
>
> m_io_c = 32:
> Time: 521.185 ms
> Time: 520.713 ms
> Time: 517.729 ms
>
> No difference to me, which seems to be good. I've double checked and
> patch used the new way
>
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
> -> mdreadv -> FileReadV -> pg_preadv (inlined)
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
> -> ...
>
> I gave also io_combine_limit to 32 (max, 256kB) a try and got those
> slightly better results:
>
> [..]
> m_io_c = 16:
> Time: 494.599 ms
> Time: 496.345 ms
> Time: 973.500 ms
>
> m_io_c = 32:
> Time: 461.031 ms
> Time: 449.037 ms
> Time: 443.375 ms
>
> and that (last one) apparently was able to push it to ~50-60k still
> random IOPS range, the rareq-sz was still ~8 (9.9) kB as analyze was
> still reading random , so I assume no merging was done:
>
> Device             r/s   rMB/s  rrqm/s  %rrqm  r_await  rareq-sz   w/s  wMB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dMB/s  drqm/s  %drqm  d_await  dareq-sz   f/s  f_await  aqu-sz  %util
> nvme0n1       61212.00  591.82    0.00   0.00     0.10      9.90  2.00   0.02    0.00   0.00     0.00     12.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    6.28  85.20
>
> So in short it looks good to me.

My results are similar to yours. Also, I found a bug while working
through your benchmarking cases; I will share the cause and the fix
soon.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 12:11, Daniel Gustafsson  wrote:
>
> > On 6 Mar 2024, at 09:59, Daniel Gustafsson  wrote:
>
> > Nothing sticks out from reading through these patches, they seem quite 
> > ready to
> > me.
>
> Took another look at this today and committed it. Thanks!

Thanks for the commit!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 10:23, Nazir Bilal Yavuz  wrote:
>
> v4 is rebased on top of v14 streaming read API changes.

The streaming API has been committed, but the committed version has a
minor change: the read_stream_begin_relation function now takes a
Relation instead of a BufferManagerRelation. So, here is a v5 which
addresses this change.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 43e2f2b32e2fdb7e1fd787b1d8595768741f4792 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 01:22:59 +0300
Subject: [PATCH v5] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
 src/include/access/heapam.h  |  4 +-
 src/backend/access/heap/heapam_handler.c | 16 ++---
 src/backend/commands/analyze.c   | 85 
 3 files changed, 36 insertions(+), 69 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b632fe953c4..4e35caeb42e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -374,8 +375,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 
 /* in heap/heapam_handler.c*/
 extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-		   BlockNumber blockno,
-		   BufferAccessStrategy bstrategy);
+		   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 		   TransactionId OldestXmin,
 		   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c86000d245b..0533d9660c4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  The scan has been
+ * started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
 void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..105285c3ea2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+   void *user_data,
+   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSam

Re: Use streaming read API in ANALYZE

2024-04-02 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

On Wed, 27 Mar 2024 at 23:15, Melanie Plageman
 wrote:
>
> On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
> > >
> > >
> > > The new version of the streaming read API [1] is posted. I updated the
> > > streaming read API changes patch (0001), using the streaming read API
> > > in ANALYZE patch (0002) remains the same. This should make it easier
> > > to review as it can be applied on top of master
> > >
> > >
> >
> > The new version of the streaming read API is posted [1]. I rebased the
> > patch on top of master and v9 of the streaming read API.
> >
> > There is a minimal change in the 'using the streaming read API in ANALYZE
> > patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> > to copy exactly the same behavior as before. Also, some benchmarking
> > results:
> >
> > I created a 22 GB table and set the size of shared buffers to 30GB, the
> > rest is default.
> >
> > ╔═══════════════════════════╦═════════════════════╦════════════╗
> > ║                           ║  Avg Timings in ms  ║            ║
> > ╠═══════════════════════════╬══════════╦══════════╬════════════╣
> > ║                           ║  master  ║  patched ║ percentage ║
> > ╠═══════════════════════════╬══════════╬══════════╬════════════╣
> > ║ Both OS cache and         ║          ║          ║            ║
> > ║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║    %9.9    ║
> > ╠═══════════════════════════╬══════════╬══════════╬════════════╣
> > ║   OS cache is loaded but  ║          ║          ║            ║
> > ║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║   %16.3    ║
> > ╠═══════════════════════════╬══════════╬══════════╬════════════╣
> > ║ Shared buffers are loaded ║          ║          ║            ║
> > ║                           ║  89.2846 ║  84.6952 ║    %5.1    ║
> > ╚═══════════════════════════╩══════════╩══════════╩════════════╝
> >
> > Any kind of feedback would be appreciated.
>
> Thanks for working on this!
>
> A general review comment: I noticed you have the old streaming read
> (pgsr) naming still in a few places (including comments) -- so I would
> just make sure and update everywhere when you rebase in Thomas' latest
> version of the read stream API.

Done.

>
> > From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Mon, 19 Feb 2024 14:30:47 +0300
> > Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
> >
> > --- a/src/backend/commands/analyze.c
> > +++ b/src/backend/commands/analyze.c
> > @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node 
> > *index_expr)
> >   return stats;
> >  }
> >
> > +/*
> > + * Prefetch callback function to get next block number while using
> > + * BlockSampling algorithm
> > + */
> > +static BlockNumber
> > +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> > +   
> > void *user_data,
> > +   
> > void *per_buffer_data)
>
> I don't think you need the pg_ prefix

Done.

>
> > +{
> > + BlockSamplerData *bs = user_data;
> > + BlockNumber *current_block = per_buffer_data;
>
> Why can't you just do BufferGetBlockNumber() on the buffer returned from
> the read stream API instead of allocating per_buffer_data for the block
> number?

Done.
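
The consumer side now simply recovers the block number from the buffer
returned by the stream, roughly (a sketch of the intended shape):

hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);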

>
> > +
> > + if (BlockSampler_HasMore(bs))
> > + *current_block = BlockSampler_Next(bs);
> > + else
> > + *current_block = InvalidBlockNumber;
> > +
> > + return *current_block;
>
>
> I think we'd like to keep the read stream code in heapam-specific code.
> Instead of doing streaming_read_buffer_begin() here, you could put this
> in heap_beginscan() or initscan() guarded by
> scan->rs_base.rs_flags & SO_TYPE_ANALYZE

In the recent changes [1], heapam_scan_analyze_next_[block|tuple] were
removed from tableam; they are now called directly from heapam-specific
code. So, IMO, there is no need to do this now.

v4 is rebased on top of v14 streaming read API changes.

[1] 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 725a3d875fb1d675f5d99d8602a87b5e47b765db Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subjec

Re: Building with meson on NixOS/nixpkgs

2024-04-01 Thread Nazir Bilal Yavuz
Hi,

From your prior reply:

On Thu, 21 Mar 2024 at 23:44, Wolfgang Walther  wrote:
>
> Nazir Bilal Yavuz:
> > 0001 & 0002: Adding code comments to explain why they have fallback
> > could be nice.
> > 0003: Looks good to me.
>
> Added some comments in the attached.

Comments look good, thanks.

On Fri, 29 Mar 2024 at 21:48,  wrote:
>
> In v3, I added another small patch for meson, this one about proper
> handling of -Dlibedit_preferred when used together with -Dreadline=enabled.

You are right. I can confirm the bug, and your proposed patch fixes it.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



