Re: CREATE DATABASE with filesystem cloning

2024-06-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 21 May 2024 at 15:08, Ranier Vilela  wrote:
>
> On Tue, 21 May 2024 at 05:18, Nazir Bilal Yavuz
> wrote:
>>
>> Hi,
>>
>> On Thu, 16 May 2024 at 17:47, Robert Haas  wrote:
>> >
>> > On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz  
>> > wrote:
>> > > Actually, the documentation for file_copy_method already mentioned
>> > > the things it controls; I converted it to an itemized list now. Also,
>> > > I changed the comment to: 'Further uses of this function requires
>> > > updates to the list that GUC controls in its documentation.'. v7 is
>> > > attached.
>> >
>> > I think the comments need some wordsmithing.
>>
>> I changed it to 'Uses of this function must be documented in the list
>> of places affected by this GUC.'; I am open to any suggestions.
>>
>> > I don't see why this parameter should be PGC_POSTMASTER.
>>
>> I changed it to 'PGC_USERSET' now. My initial point was that the database
>> or tablespace should be copied with the same method; I thought copying
>> some portion of the database with copy and the rest with clone
>> could cause potential problems. After a bit of searching, I could not
>> find any problems related to that.
>>
>> v8 is attached.
>
> Hi,
> I did some research on the subject and, despite this being an improvement,
> I believe that some terminology should be changed in this patch.
> Although the new function is called *clone_file*, I'm not sure it really
> is a "clone".
> After all, macOS added an API called clonefile [1], and Linux has
> another one called FICLONE. [2]
> So perhaps it should be treated here as a copy and not a clone,
> leaving open the possibility of implementing a true clone API?
> Thoughts?

Thank you for mentioning this.

1- I do not know where to look for macOS's function definitions, so I
followed the same links you shared. They say copyfile(...,
COPYFILE_CLONE_FORCE) means 'Clone the file instead. This is a force
flag i.e. if cloning fails, an error is returned' [1]. So it does the
cloning, but I still do not know whether there is a difference between
the 'copyfile() function with the COPYFILE_CLONE_FORCE flag' and the
'clonefile()' function.

2- I read a couple of discussions about copy_file_range() and FICLONE.
It seems that the copy_file_range() function is a slightly upgraded
version of FICLONE, but it is less widely available since it is newer.
It still does the clone [2]. Also, FICLONE is already used in
pg_upgrade and pg_combinebackup.

Both the copyfile(..., COPYFILE_CLONE_FORCE) and copy_file_range()
functions do the cloning, so I think using the clone terminology is
correct. But using FICLONE instead of copy_file_range() could be better
since it is more widely available.
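
To make the terminology concrete, here is a minimal sketch (my own
illustration, not the patch's code) of how the primitives mentioned
above are invoked; error handling is reduced to perror() for brevity:

/* clone_sketch.c - how the cloning primitives above are invoked */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifdef __APPLE__
#include <copyfile.h>		/* copyfile(), COPYFILE_CLONE_FORCE */
#else
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */
#endif

static void
clone_one(const char *from, const char *to)
{
#ifdef __APPLE__
	/* Force a clone; fail instead of falling back to a plain copy. */
	if (copyfile(from, to, NULL, COPYFILE_CLONE_FORCE) < 0)
	{
		perror("copyfile");
		exit(1);
	}
#else
	int			srcfd = open(from, O_RDONLY);
	int			dstfd = open(to, O_WRONLY | O_CREAT | O_EXCL, 0600);

	/*
	 * Share the source's block ranges instead of copying the bytes.
	 * copy_file_range(srcfd, NULL, dstfd, NULL, size, 0) is the newer
	 * interface, which may do the same thing on copy-on-write file
	 * systems.
	 */
	if (srcfd < 0 || dstfd < 0 || ioctl(dstfd, FICLONE, srcfd) < 0)
	{
		perror("clone");
		exit(1);
	}
	close(srcfd);
	close(dstfd);
#endif
}

int
main(int argc, char **argv)
{
	if (argc == 3)
		clone_one(argv[1], argv[2]);
	return 0;
}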

[1] https://www.unix.com/man-page/mojave/3/copyfile/
[2] https://elixir.bootlin.com/linux/v5.15-rc5/source/fs/read_write.c#L1495

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: meson and check-tests

2024-06-02 Thread Nazir Bilal Yavuz
Hi,

On Sun, 2 Jun 2024 at 07:06, jian he  wrote:
>
> On Sun, Jun 2, 2024 at 4:48 AM Tristan Partin  wrote:
> >
> > On Fri May 31, 2024 at 12:02 PM CDT, Ashutosh Bapat wrote:
> > > Hi Tristan,
> > > Using make I can run only selected tests under src/test/regress using
> > > TESTS=... make check-tests. I didn't find any way to do that with meson.
> > > meson test --suite regress runs all the regression tests.
> > >
> > > We talked this off-list at the conference. It seems we have to somehow
> > > avoid passing pg_regress --schedule argument and instead pass the list of
> > > tests. Any idea how to do that?
> >
> > I think there are 2 solutions to this.
> >
> > 1. Avoid passing --schedule by default, which doesn't sound like a great
> >solution.
> >
> > 2. Teach pg_regress to ignore the --schedule option if specific tests
> >are passed instead.
> >
> > 3. Add a --no-schedule option to pg_regress which would override the
> >previously added --schedule option.
> >
> > I personally prefer 2 or 3.
> >
> > 2: meson test -C build regress/regress --test-args my_specific_test
> > 3: meson test -C build regress/regress --test-args "--no-schedule 
> > my_specific_test"
> >
> > Does anyone have an opinion?
> >
>
> if each individual sql file can be tested separately, will
> `test: test_setup`
> also be executed as a prerequisite?

Yes, it is required to run at least two setup tests, because the
regress tests need a database to be created:

1- The initdb executable is needed, and it is installed by the
'tmp_install' test.

2- Although the initdb executable exists, it is not enough by itself.
The regress tests copy their database contents from the template
directory instead of running initdb for each test [1] (this is the
default behaviour of the regress tests in meson builds). This template
directory is created by the 'initdb_cache' test.

[1] 252dcb3239

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: zlib detection in Meson on Windows broken?

2024-05-22 Thread Nazir Bilal Yavuz
Hi,

On Wed, 22 May 2024 at 17:21, Dave Page  wrote:
>
> Hi
>
> On Wed, 22 May 2024 at 14:11, Nazir Bilal Yavuz  wrote:
>>
>>
>> I tried to install your latest zlib artifact (the nmake one) on the
>> Windows CI images (not the official ones) [1]. Then, I used the
>> default meson.build file to build, but meson could not find zlib.
>> After that, I modified it as you suggested before; I used
>> 'cc.find_library()' to find zlib as a fallback method and it seems it
>> worked [2]. Please see the meson setup logs below [3]; does something
>> similar to the attached solve your problem?
>
>
> That patch does solve my problem - thank you!

I am glad that it worked!

Do you think that we need to have this patch in upstream Postgres?
I am not sure, because:
- There is a case where meson is able to find zlib but the tests fail.
- This might be a band-aid fix rather than a permanent one.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: zlib detection in Meson on Windows broken?

2024-05-22 Thread Nazir Bilal Yavuz
Hi,

On Tue, 21 May 2024 at 18:24, Dave Page  wrote:
>
>
>
> On Tue, 21 May 2024 at 16:04, Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2024-05-20 11:58:05 +0100, Dave Page wrote:
>> > I have very little experience with Meson, and even less interpreting its
>> > logs, but it seems to me that it's not including the extra lib and include
>> > directories when it runs the test compile, given the command line it's
>> > reporting:
>> >
>> > cl C:\Users\dpage\git\postgresql\build\meson-private\tmpg_h4xcue\testfile.c
>> > /nologo /showIncludes /utf-8 /EP /nologo /showIncludes /utf-8 /EP /Od /Oi-
>> >
>> > Bug, or am I doing something silly?
>>
>> It's a buglet. We rely on meson's internal fallback detection of zlib, if 
>> it's
>> not provided via pkg-config or cmake. But it doesn't know about our
>> extra_include_dirs parameter. We should probably fix that...
>
>
> Oh good, then I'm not going bonkers. I'm still curious about how it works for
> Andrew but not for me; however, fixing that buglet should solve my issue and
> would be sensible behaviour.
>
> Thanks!

I tried to install your latest zlib artifact (the nmake one) on the
Windows CI images (not the official ones) [1]. Then, I used the
default meson.build file to build, but meson could not find zlib.
After that, I modified it as you suggested before; I used
'cc.find_library()' to find zlib as a fallback method and it seems it
worked [2]. Please see the meson setup logs below [3]; does something
similar to the attached solve your problem?

The interesting thing is that I also tried this 'cc.find_library()'
method with your old artifact (the cmake one). It was able to find zlib
but all of the tests failed [4].

Experimental zlib meson.build diff is attached.

[1] https://cirrus-ci.com/task/6736867247259648
[2] https://cirrus-ci.com/build/5286228755480576
[3]
Run-time dependency zlib found: NO (tried pkgconfig, cmake and system)
Has header "zlib.h" : YES
Library zlib found: YES
...
  External libraries
...
zlib   : YES
...
[4] https://cirrus-ci.com/task/5208433811521536

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/meson.build b/meson.build
index f5ca5cfed49..d89a7a3e277 100644
--- a/meson.build
+++ b/meson.build
@@ -1373,20 +1373,23 @@ endif
 zlibopt = get_option('zlib')
 zlib = not_found_dep
 if not zlibopt.disabled()
-  zlib_t = dependency('zlib', required: zlibopt)
+  zlib = dependency('zlib', required: false)
 
-  if zlib_t.type_name() == 'internal'
-# if fallback was used, we don't need to test if headers are present (they
-# aren't built yet, so we can't test)
-zlib = zlib_t
-  elif not zlib_t.found()
-warning('did not find zlib')
-  elif not cc.has_header('zlib.h',
-  args: test_c_args, include_directories: postgres_inc,
-  dependencies: [zlib_t], required: zlibopt)
-warning('zlib header not found')
-  else
-zlib = zlib_t
+  if zlib.found() and zlib.type_name() != 'internal'
+if not cc.has_header('zlib.h',
+args: test_c_args, include_directories: postgres_inc,
+dependencies: zlib, required: false)
+  zlib = not_found_dep
+endif
+  elif not zlib.found()
+zlib_lib = cc.find_library('zlib',
+  dirs: test_lib_d,
+  header_include_directories: postgres_inc,
+  has_headers: ['zlib.h'],
+  required: zlibopt)
+if zlib_lib.found()
+  zlib = declare_dependency(dependencies: zlib_lib, include_directories: postgres_inc)
+endif
   endif
 
   if zlib.found()


Re: zlib detection in Meson on Windows broken?

2024-05-21 Thread Nazir Bilal Yavuz
Hi,

On Tue, 21 May 2024 at 10:20, Sandeep Thakkar
 wrote:
>
> Hi Dave,
>
> Is the .pc file generated after the successful build of zlib? If yes, then
> meson should ideally be able to detect the installation.

If meson is not able to find the .pc file automatically, using 'meson
setup ... --pkg-config-path $ZLIB_PC_PATH' might help.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-21 Thread Nazir Bilal Yavuz
Hi,

On Thu, 16 May 2024 at 17:47, Robert Haas  wrote:
>
> On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz  wrote:
> > Actually, the documentation for file_copy_method already mentioned
> > the things it controls; I converted it to an itemized list now. Also,
> > I changed the comment to: 'Further uses of this function requires
> > updates to the list that GUC controls in its documentation.'. v7 is
> > attached.
>
> I think the comments need some wordsmithing.

I changed it to 'Uses of this function must be documented in the list
of places affected by this GUC.'; I am open to any suggestions.

> I don't see why this parameter should be PGC_POSTMASTER.

I changed it to 'PGC_USERSET' now. My initial point was that the database
or tablespace should be copied with the same method; I thought copying
some portion of the database with copy and the rest with clone
could cause potential problems. After a bit of searching, I could not
find any problems related to that.

v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 54dac96e493f482581c09ec14e67196e73e371ca Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 21 May 2024 10:19:29 +0300
Subject: [PATCH v8] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if the system
supports it).

If the CLONE method is chosen, it behaves similarly to COPY but
attempts to use efficient file copying system calls.  The kernel has
the opportunity to share block ranges in copy-on-write file systems, or
maybe push down the copy to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Reviewed-by: Robert Haas 
Reviewed-by: Ranier Vilela 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 98 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  4 +
 doc/src/sgml/config.sgml  | 46 +
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..6d8f05199ca 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,42 @@
 #include 
 #include 
 
+#ifdef HAVE_COPYFILE_H
+#include 
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
+#include "utils/guc.h"
+
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const struct config_enum_entry file_copy_method_options[] = {
+	{"copy", FILE_COPY_METHOD_COPY, false},
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE) || defined(HAVE_COPY_FILE_RANGE)
+	{"clone", FILE_COPY_METHOD_CLONE, false},
+#endif
+	{NULL, 0, false}
+};
+
+static void clone_file(const char *fromfile, const char *tofile);
 
 /*
  * copydir: copy a directory
  *
  * If recurse is false, subdirectories are ignored.  Anything that's not
  * a directory or a regular file is ignored.
+ *
+ * This function uses a file_copy_method GUC to determine copy method.
+ * Uses of this function must be documented in the list of places
+ * affected by this GUC.
  */
 void
 copydir(const char *fromdir, const char *todir, bool recurse)
@@ -71,7 +96,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+		{
+			if (file_copy_method == FILE_COPY_METHOD_CLONE)
+clone_file(fromfile, tofile);
+			else
+copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -214,3 +244,69 @@ copy_file(const char *fromfile, const char *tofile)
 
 	pfree(buffer);
 }
+
+/*
+ * clone one file
+ */
+static void
+clone_file(const char *fromfile, const char *tofile)
+{
+#if defined(HAVE_COPYFILE) && de

Re: Requiring LLVM 14+ in PostgreSQL 18

2024-05-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 16 May 2024 at 05:34, Thomas Munro  wrote:
>
> On Wed, May 15, 2024 at 5:20 PM Peter Eisentraut  wrote:
> > Yes, let's get that v3-0001 patch into PG17.
>
> Done.
>
> Bilal recently created the CI images for Debian Bookworm[1].  You can
> try them with s/bullseye/bookworm/ in .cirrus.tasks.yml, but it looks
> like he is still wrestling with a perl installation problem[2] in the
> 32 bit build, so here is a temporary patch to do that and also delete
> the 32 bit tests for now.  This way cfbot should succeed with the
> remaining patches.  Parked here for v18.

Actually, the 32 bit builds are working, but the Perl version needs to
be updated to 'perl5.36-i386-linux-gnu' in .cirrus.tasks.yml. I changed
0001 to use the working version of the 32 bit builds [1] and the rest
is the same. All tests pass now [2].

[1] 
postgr.es/m/CAN55FZ0fY5EFHXLKCO_=p4pwfmhrovom_qse_7b48gpchfa...@mail.gmail.com
[2] https://cirrus-ci.com/task/4969910856581120

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 976d2c7ad0e470b24875ee27171359f54078a761 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 13 May 2024 10:56:28 +0300
Subject: [PATCH v5 1/3] Upgrade Debian CI images to Bookworm

A new Debian version, namely Bookworm, has been released. Use the new
images in CI tasks.

The Perl version is upgraded in the Bookworm images, so update the Perl
version in the 'Linux - Debian Bookworm - Meson' task as well.

Upgrading Debian CI images to Bookworm PR: https://github.com/anarazel/pg-vm-images/pull/91
---
 .cirrus.tasks.yml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index a2388cd5036..47a60aa7c6f 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -65,7 +65,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 8
 TEST_JOBS: 8
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir
 # no options enabled, should be small
 CCACHE_MAXSIZE: "150M"
@@ -243,7 +243,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
DEBUGINFOD_URLS: "https://debuginfod.debian.net"
@@ -314,7 +314,7 @@ task:
 #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
 
   matrix:
-- name: Linux - Debian Bullseye - Autoconf
+- name: Linux - Debian Bookworm - Autoconf
 
   env:
 SANITIZER_FLAGS: -fsanitize=address
@@ -348,7 +348,7 @@ task:
   on_failure:
 <<: *on_failure_ac
 
-- name: Linux - Debian Bullseye - Meson
+- name: Linux - Debian Bookworm - Meson
 
   env:
 CCACHE_MAXSIZE: "400M" # tests two different builds
@@ -375,7 +375,7 @@ task:
 ${LINUX_MESON_FEATURES} \
 -Dllvm=disabled \
 --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
--DPERL=perl5.32-i386-linux-gnu \
+-DPERL=perl5.36-i386-linux-gnu \
 -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 build-32
 EOF
@@ -652,7 +652,7 @@ task:
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.43.0

From a18120ace64bcde41c2ed23a050eb86f9b8772e0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 19 Oct 2023 04:45:46 +1300
Subject: [PATCH v5 2/3] jit: Require at least LLVM 14, if enabled.

Remove support for LLVM versions 10-13.  The default on all non-EOL'd
OSes represented in our build farm will be at least LLVM 14 when
PostgreSQL 18 ships.

Reviewed-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
---
 src/backend/jit/llvm/llvmjit.c  | 101 
 src/backend/jit/llvm/llvmjit_error.cpp  |  25 --
 src/backend/jit/llvm/llvmjit_inline.cpp |  13 ---
 src/backend/jit/llvm/llvmjit_wrap.cpp   |   4 -
 config/llvm.m4  |   4 +-
 doc/src/sgml/installation.sgml  |   4 +-
 configure   |   4 +-
 meson.build |   2 +-
 8 files changed, 7 insertions(+), 150 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 1d439f24554..8f9c77eedc1 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -21,13 +21,9 @@
 #if LLVM_VERSION_MAJOR > 16
 #include 
 #endif
-#if LLVM_VERSION_MAJOR > 11
 #include 
 #include 
 #include 
-#else
-#include 
-#endif
 #include 
 #include 
 #if LLVM_VERSION_MAJOR < 17
@@ -50,13 +46,8 @@
 /* Handle of a module emitted via ORC JIT */
 typedef struct LLVMJitHandle
 {
-#if LLVM_VERSION_MAJOR > 11
 	LLVMOrcLLJITRef lljit;
 	LLVMOrcResourceTrackerRef re

Re: CREATE DATABASE with filesystem cloning

2024-05-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 16 May 2024 at 15:40, Robert Haas  wrote:
>
> On Thu, May 16, 2024 at 8:35 AM Nazir Bilal Yavuz  wrote:
> > I updated the documentation and put a comment on top of the copydir()
> > function to inform that further changes and uses of this function may
> > require documentation updates.
>
> I was assuming that the documentation for the file_copy_method was
> going to list the things that it controlled, and that the comment was
> going to say that you should update that list specifically. Just
> saying that you may need to update some part of the documentation in
> some way is fairly useless, IMHO.

Actually, the documentation for file_copy_method already mentioned
the things it controls; I converted it to an itemized list now. Also,
I changed the comment to: 'Further uses of this function requires
updates to the list that GUC controls in its documentation.'. v7 is
attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 197ce921c20d49b524e306bf082701434b71b7af Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 16 May 2024 16:21:46 +0300
Subject: [PATCH v7] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if the system
supports it).

If the CLONE option is chosen, it behaves similarly to COPY but
attempts to use efficient file copying system calls.  The kernel has
the opportunity to share block ranges in copy-on-write file systems, or
maybe push down the copy to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 98 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 +
 doc/src/sgml/config.sgml  | 48 +
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..651c74672e0 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,42 @@
 #include 
 #include 
 
+#ifdef HAVE_COPYFILE_H
+#include 
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
+#include "utils/guc.h"
+
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const struct config_enum_entry file_copy_method_options[] = {
+	{"copy", FILE_COPY_METHOD_COPY, false},
+#if (defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)) || defined(HAVE_COPY_FILE_RANGE)
+	{"clone", FILE_COPY_METHOD_CLONE, false},
+#endif
+	{NULL, 0, false}
+};
+
+static void clone_file(const char *fromfile, const char *tofile);
 
 /*
  * copydir: copy a directory
  *
  * If recurse is false, subdirectories are ignored.  Anything that's not
  * a directory or a regular file is ignored.
+ *
+ * This function uses a file_copy_method GUC to determine copy method.
+ * Further uses of this function requires updates to the list that GUC controls
+ * in its documentation.
  */
 void
 copydir(const char *fromdir, const char *todir, bool recurse)
@@ -71,7 +96,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+		{
+			if (file_copy_method == FILE_COPY_METHOD_CLONE)
+clone_file(fromfile, tofile);
+			else
+copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -214,3 +244,69 @@ copy_file(const char *fromfile, const char *tofile)
 
 	pfree(buffer);
 }
+
+/*
+ * clone one file
+ */
+static void
+clone_file(const char *fromfile, const char *tofile)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(fromfile, tofile, NULL, COPYFILE_CLONE_FORCE) < 0)
+		ereport(ERROR,
+(errcode

Re: CREATE DATABASE with filesystem cloning

2024-05-16 Thread Nazir Bilal Yavuz
Hi,

On Thu, 9 May 2024 at 19:29, Robert Haas  wrote:
>
> On Wed, May 8, 2024 at 10:34 AM Nazir Bilal Yavuz  wrote:
> > > I'm not so sure about the GUC name. On the one hand, it feels like
> > > createdb should be spelled out as create_database, but on the other
> > > hand, the GUC name is quite long already. Then again, why would we
> > > make this specific to CREATE DATABASE in the first place? Would we
> > > also want alter_tablespace_file_copy_method and
> > > basic_archive.file_copy_method?
> >
> > I agree that it is already quite long; because of that, I chose
> > createdb as the prefix. I did not think that file cloning was planned
> > to be used in other places. If that is the case, does something like
> > 'preferred_copy_method' work? Then we would mention which places are
> > affected by this GUC in the docs.
>
> I would go with file_copy_method rather than preferred_copy_method,
> because (1) there's nothing "preferred" about it if we're using it
> unconditionally and (2) it's nice to clarify that we're talking about
> copying files rather than anything else.

Done.

>
> My personal enthusiasm for making platform-specific file copy methods
> usable all over PostgreSQL is quite limited. However, it is my
> observation that other people seem far more enthusiastic about it than
> I am. For example, consider how quickly it got added to
> pg_combinebackup. So, I suspect it's smart to plan on anything we add
> in this area getting used in a bunch of places. And perhaps it is even
> best to think about making it work in all of those places right from
> the start. If we build support into copydir and copy_file, then we
> just get everything that uses those, and all that remains is to
> document was is covered (and add comments so that future patch authors
> know they should further update the documentation).

I updated the documentation and put a comment on top of the copydir()
function noting that further changes and uses of this function may
require documentation updates.

I also changed the O_RDWR flag to O_WRONLY in the clone_file() function
based on Ranier's feedback. Also, does this feature need tests? I
thought about possible test cases, but since this feature requires
specific file systems to run, I could not find any.

v6 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 33fcb21730683c628f81451e21a0cf2455fd41be Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 16 May 2024 15:22:46 +0300
Subject: [PATCH v6] Introduce file_copy_method GUC

This GUC can be set to either COPY (default) or CLONE (if the system
supports it).

If the CLONE option is chosen, it behaves similarly to COPY but
attempts to use efficient file copying system calls.  The kernel has
the opportunity to share block ranges in copy-on-write file systems, or
maybe push down the copy to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/storage/copydir.h |  9 ++
 src/backend/storage/file/copydir.c| 98 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 +
 doc/src/sgml/config.sgml  | 35 +++
 doc/src/sgml/ref/alter_database.sgml  |  3 +-
 doc/src/sgml/ref/create_database.sgml |  4 +-
 src/tools/pgindent/typedefs.list  |  1 +
 9 files changed, 165 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..6edc3ea4f69 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,6 +13,15 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
+typedef enum FileCopyMethod
+{
+	FILE_COPY_METHOD_COPY,
+	FILE_COPY_METHOD_CLONE,
+} FileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int file_copy_method;
+
 extern void copydir(const char *fromdir, const char *todir, bool recurse);
 extern void copy_file(const char *fromfile, const char *tofile);
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..48e756ce284 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,17 +21,42 @@
 #include 
 #include 
 
+#ifdef HAVE_COPYFILE_H
+#include 
+#endif
+
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
+#include "utils/guc.h"
+
+/* GUCs */
+int			file_copy_method = FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC suppor

Re: Use streaming read API in ANALYZE

2024-05-15 Thread Nazir Bilal Yavuz
Hi,

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
> >
> > Pushed.  Thanks Bilal and reviewers!
>
> I wanted to discuss what will happen to this patch now that
> 27bc1772fc8 is reverted. I am continuing this thread but I can create
> another thread if you prefer so.

041b96802ef is discussed in the 'Table AM Interface Enhancements'
thread [1]. The main problem discussed about this commit is that the
read stream API is not pushed down to the heap-specific code and,
because of that, the other AM implementations need to use read
streams. To push the read stream API down to the heap-specific code, it
is pretty much required to pass BufferAccessStrategy and
BlockSamplerData to initscan().

I am sharing an alternative version of this patch. The first patch
just reverts 041b96802ef and the second patch is the alternative
version.

In this alternative version, the read stream API is not pushed down to
the heap-specific code, but it is controlled by the heap-specific code.
The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
heap-specific code if the scan type is 'ANALYZE'. This flag is used to
decide whether the streaming API will be used in ANALYZE. If the flag
is set, the heap AM is in use and the read stream API will be used; if
it is not set, a non-heap AM is in use and the code falls back to the
version before read streams.
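
Roughly, the control flow looks like the sketch below (my paraphrase,
not the patch itself; only the flag name is taken from the patch, and
the surrounding calls are simplified, so see the attachments for the
real code):

	/* Heap-specific code: opt in when an ANALYZE scan is started. */
	if (scan->rs_base.rs_flags & SO_TYPE_ANALYZE)
		scan->rs_base.rs_flags |= SO_USE_READ_STREAMS_IN_ANALYZE;

	/* acquire_sample_rows(): branch on the flag for each sample block. */
	if (scan->rs_flags & SO_USE_READ_STREAMS_IN_ANALYZE)
		buf = read_stream_next_buffer(stream, NULL);	/* streaming path */
	else
		buf = ReadBufferExtended(rel, MAIN_FORKNUM,
								 BlockSampler_Next(&bs),
								 RBM_NORMAL, bstrategy); /* pre-streams path */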

Pros of the alternative version:

* The existing AM implementations other than heap can continue to
work without any change.
* AM implementations other than heap do not need to use read streams.
* Upstream code uses the read stream API and benefits from that.

Cons of the alternative version:

* Six if branches are added to the acquire_sample_rows() function, and
three of them are in the while loop.
* Because of these changes, the code looks messy.

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 323f28ff979cde8e4dbde8b4654bded74abf1fbc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 15 May 2024 00:03:56 +0300
Subject: [PATCH v13 1/2] Revert "Use streaming I/O in ANALYZE."

This commit reverts 041b96802ef.

041b96802ef revised the changes of 27bc1772fc8, but 27bc1772fc8 and
dd1f6b0c172 were reverted together in 6377e12a5a5. So, this commit
reverts all of 27bc1772fc, 041b96802ef and dd1f6b0c172 together.

Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/include/access/tableam.h | 26 +++
 src/backend/access/heap/heapam_handler.c | 38 +-
 src/backend/commands/analyze.c   | 96 ++--
 3 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd5..e08b9627f30 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,7 +21,6 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
-#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -655,16 +654,6 @@ typedef struct TableAmRoutine
 	struct VacuumParams *params,
 	BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze the next block in the read stream.  Returns false if
-	 * the stream is exhausted and true otherwise. The scan must have been
-	 * started with SO_TYPE_ANALYZE option.
-	 *
-	 * This routine holds a buffer pin and lock on the heap page.  They are
-	 * held until heapam_scan_analyze_next_tuple() returns false.  That is
-	 * until all the items of the heap page are analyzed.
-	 */
-
 	/*
 	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
 	 * with table_beginscan_analyze().  See also
@@ -683,7 +672,8 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-			ReadStream *stream);
+			BlockNumber blockno,
+			BufferAccessStrategy bstrategy);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1721,17 +1711,19 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 }
 
 /*
- * Prepare to analyze the next block in the read stream. The scan needs to
- * have been  started with table_beginscan_analyze().  Note that this routine
- * might acquire resources like locks that are held until
+ * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
+ * started with table_beginscan_analyze().  Note that this routine might
+ * acquire resources like locks that are held until
  * table_scan_analyze_next_tuple() returns false.
  *
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static inline bool
-table_scan_an

Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs

2024-05-15 Thread Nazir Bilal Yavuz
Hi,

On Sun, 12 May 2024 at 14:53, Peter Eisentraut  wrote:
>
> On 14.12.23 14:40, Nazir Bilal Yavuz wrote:
> > On Fri, 6 Oct 2023 at 17:07, Tom Lane  wrote:
> >>
> >> As a quick cross-check, I searched our commit log to see how many
> >> README-only commits there were so far this year.  I found 11 since
> >> January.  (Several were triggered by the latest round of pgindent
> >> code and process changes, so maybe this is more than typical.)
> >>
> >> Not sure what that tells us about the value of changing the CI
> >> logic, but it does seem like it could be worth the one-liner
> >> change needed to teach buildfarm animals to ignore READMEs.
> >
> > I agree that it could be worth implementing this logic on buildfarm animals.
> >
> > In case we want to implement the same logic on the CI, I added a new
> > version of the patch; it skips CI completely if the changes are only
> > in the README files.
>
> I don't see how this could be applicable widely enough to be useful:
>
> - While there are some patches that touch on README files, very few of
> those end up in a commit fest.
>
> - If someone manually pushes a change to their own CI environment, I
> don't see why we need to second-guess that.
>
> - Buildfarm runs generally batch several commits together, so it is very
> unlikely that this would be hit.
>
> I think unless some concrete reason for this change can be shown, we
> should drop it.

These points make sense. I thought that it would be useful given Tom's
'11 README-only commits since January' analysis (from 6 Oct 2023), but
that may not be enough on its own. If there are no objections, I will
withdraw this soon.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Fix parallel vacuum buffer usage reporting

2024-05-13 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 19:09, Masahiko Sawada  wrote:
>
> On Fri, May 10, 2024 at 7:26 PM Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > Thank you for working on this!
> >
> > On Wed, 1 May 2024 at 06:37, Masahiko Sawada  wrote:
> > >
> > > Thank you for further testing! I've pushed the patch.
> >
> > I noticed a behaviour change while looking at the 'Use pgBufferUsage for
> > block reporting in analyze' thread [1]. Since that change applies here
> > as well, I thought it would be better to mention it here.
> >
> > Before this commit, VacuumPageMiss did not count blocks whose reads
> > were already completed by other backends [2]. Now,
> > 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> > counts every block attempted to be read, possibly double counting if
> > someone else has already completed the read.
>
> True. IIUC there is such a difference only in HEAD but not in v15 and
> v16. The following comment in WaitReadBuffers() says that it's a
> traditional behavior that we count blocks as read even if someone else
> has already completed its I/O:
>
> /*
>  * We count all these blocks as read by this backend.  This is traditional
>  * behavior, but might turn out to be not true if we find that someone
>  * else has beaten us and completed the read of some of these blocks.  In
>  * that case the system globally double-counts, but we traditionally don't
>  * count this as a "hit", and we don't have a separate counter for "miss,
>  * but another backend completed the read".
>  */
>
> So I think using pgBufferUsage for (parallel) vacuum is a legitimate
> usage and makes it more consistent with other parallel operations.

That sounds logical. Thank you for the clarification.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Upgrade Debian CI images to Bookworm

2024-05-13 Thread Nazir Bilal Yavuz
Hi,

Bookworm versions of the Debian CI images are available now [0]. The
patches to use these images are attached.

The 'v1-0001-Upgrade-Debian-CI-images-to-Bookworm_REL_16+.patch' patch
can be applied to both upstream and REL_16, and all of the tasks finish
successfully.

The 'v1-0001-Upgrade-Debian-CI-images-to-Bookworm_REL_15.patch' patch
can be applied to REL_15, but it gives a compiler warning. The fix for
this warning is proposed here [1]. After the fix is applied, all of the
tasks finish successfully.

Any kind of feedback would be appreciated.

[0] https://github.com/anarazel/pg-vm-images/pull/91

[1] 
postgr.es/m/CAN55FZ0o9wqVoMTh_gJCmj_%2B4XbX9VXzQF8OySPZ0R1saxV3bA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 26f4b2425cb04dc7218142f24f151d3698f33191 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 13 May 2024 10:56:28 +0300
Subject: [PATCH v1] Upgrade Debian CI images to Bookworm

A new Debian version, namely Bookworm, has been released. Use the new
images in CI tasks.

The Perl version is upgraded in the Bookworm images, so update the Perl
version in the 'Linux - Debian Bookworm - Meson' task as well.

Upgrading Debian CI images to Bookworm PR: https://github.com/anarazel/pg-vm-images/pull/91
---
 .cirrus.tasks.yml | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 82261eccb85..d2ff833fcad 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -65,7 +65,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 8
 TEST_JOBS: 8
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir
 # no options enabled, should be small
 CCACHE_MAXSIZE: "150M"
@@ -241,7 +241,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
DEBUGINFOD_URLS: "https://debuginfod.debian.net"
@@ -312,7 +312,7 @@ task:
 #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
 
   matrix:
-- name: Linux - Debian Bullseye - Autoconf
+- name: Linux - Debian Bookworm - Autoconf
 
   env:
 SANITIZER_FLAGS: -fsanitize=address
@@ -346,7 +346,7 @@ task:
   on_failure:
 <<: *on_failure_ac
 
-- name: Linux - Debian Bullseye - Meson
+- name: Linux - Debian Bookworm - Meson
 
   env:
 CCACHE_MAXSIZE: "400M" # tests two different builds
@@ -373,7 +373,7 @@ task:
 ${LINUX_MESON_FEATURES} \
 -Dllvm=disabled \
 --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
--DPERL=perl5.32-i386-linux-gnu \
+-DPERL=perl5.36-i386-linux-gnu \
 -DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 build-32
 EOF
@@ -648,7 +648,7 @@ task:
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.43.0

From d8f4b660bdd408fa22c857e24d1b4d05ff41df03 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 13 May 2024 11:55:48 +0300
Subject: [PATCH v1] Upgrade Debian CI images to Bookworm

A new Debian version, namely Bookworm, has been released. Use the new
images in CI tasks.

Upgrading Debian CI images to Bookworm PR: https://github.com/anarazel/pg-vm-images/pull/91
---
 .cirrus.tasks.yml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index b03d128184d..a2735f8bb60 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -138,13 +138,13 @@ LINUX_CONFIGURE_FEATURES: &LINUX_CONFIGURE_FEATURES >-
 
 
 task:
-  name: Linux - Debian Bullseye
+  name: Linux - Debian Bookworm
 
   env:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
DEBUGINFOD_URLS: "https://debuginfod.debian.net"
@@ -451,12 +451,12 @@ task:
 
   # To limit unnecessary work only run this once the normal linux test succeeds
   depends_on:
-- Linux - Debian Bullseye
+- Linux - Debian Bookworm
 
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.43.0



Re: CFbot does not recognize patch contents

2024-05-12 Thread Nazir Bilal Yavuz
Hi,

On Sun, 12 May 2024 at 09:50, Tatsuo Ishii  wrote:
>
> After sending out my v18 patches:
> https://www.postgresql.org/message-id/20240511.162307.2246647987352188848.t-ishii%40sranhm.sra.co.jp
>
> CFbot complains that the patch was broken:
> http://cfbot.cputube.org/patch_48_4460.log
>
> === Applying patches on top of PostgreSQL commit ID 
> 31e8f4e619d9b5856fa2bd5713cb1e2e170a9c7d ===
> === applying patch 
> ./v18-0001-Row-pattern-recognition-patch-for-raw-parser.patch
> gpatch:  Only garbage was found in the patch input.
>
> The patch was generated by git-format-patch (same as previous
> patches).  I failed to find any patch format problem in the
> patch. Does anybody know what's wrong here?

I am able to apply all your patches. I found that a similar thing
happened before [0], and I guess your case is similar. Adding Thomas to
CC; he may be able to help more.

Nitpick: There is a trailing space warning while applying one of your patches:
Applying: Row pattern recognition patch (docs).
.git/rebase-apply/patch:81: trailing whitespace.
 company  |   tdate| price | first_value | max | count

[0] 
postgr.es/m/CA%2BhUKGLiY1e%2B1%3DpB7hXJOyGj1dJOfgde%2BHmiSnv3gDKayUFJMA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 16:55, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Fri, 10 May 2024 at 16:21, Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > On Fri, 10 May 2024 at 14:49, Alena Rybakina  
> > wrote:
> > >
> > > Hi! I could try to check it with the test, but I want to ask you about
> > > details, because I'm not sure that I completely understand the test case.
> > >
> > > You mean that we need to have two backends and on one of them we deleted
> > > the tuples before vacuum called the other, do you?
> > >

There need to be some other backend(s) trying to read the same buffers
as the ongoing VACUUM operation. I think it works now, but the
reproduction steps are a bit racy. See:

1- Build Postgres with attached diff, it is the same
see_previous_output.diff that I shared two mails ago.

2- Run Postgres, all settings are default.

3- Use two client backends, let's name them as A and B client backends.

4- On A client backend, run:

CREATE TABLE vacuum_fix (aid int, bid int) with (autovacuum_enabled=false);
INSERT INTO vacuum_fix SELECT *, * FROM generate_series(1, 2000);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET aid = aid + 1, bid = bid + 1;

5- Now it gets a bit racy: the SQL commands below need to be run at the
same time. The aim is for the VACUUM on client backend A and the SELECT
on client backend B to read the same buffers at the same time, so some
of the buffers will be double counted.

First, run the VACUUM on client backend A; immediately after running
the VACUUM, run the SELECT on client backend B.

A client backend:
VACUUM VERBOSE vacuum_fix;

B client backend:
SELECT * from vacuum_fix WHERE aid = -1;

This is the output of the VACUUM VERBOSE on my end:

INFO:  vacuuming "test.public.vacuum_fix"
INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 0
pages: 0 removed, 176992 remain, 176992 scanned (100.00% of total)
...
...
buffer usage: 254181 hits, 99030 misses in the previous version, 99865
misses in the patched version, 106830 dirtied
...
VACUUM
Time: 2578.217 ms (00:02.578)

This VACUUM does not run in parallel, so this test case does not
trigger what is fixed in this thread. As can be seen, there is a
difference of ~1000 buffers.

I am not sure if there is an easier way to reproduce this, but I hope
it helps.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 84cc983b6e6..582973d575b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -309,6 +309,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	PgStat_Counter startreadtime = 0,
 startwritetime = 0;
 	WalUsage	startwalusage = pgWalUsage;
+	int64		StartPageMiss = VacuumPageMiss;
 	BufferUsage startbufferusage = pgBufferUsage;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
@@ -606,6 +607,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			StringInfoData buf;
 			char	   *msgfmt;
 			int32		diff;
+			int64		PageMissOp = VacuumPageMiss - StartPageMiss;
 			double		read_rate = 0,
 		write_rate = 0;
 
@@ -748,8 +750,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
			appendStringInfo(&buf,
-			 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
+			 _("buffer usage: %lld hits, %lld misses in the previous version, %lld misses in the patched version, %lld dirtied\n"),
 			 (long long) (bufferusage.shared_blks_hit + bufferusage.local_blks_hit),
+			 (long long) (PageMissOp),
 			 (long long) (bufferusage.shared_blks_read + bufferusage.local_blks_read),
 			 (long long) (bufferusage.shared_blks_dirtied + bufferusage.local_blks_dirtied));
			appendStringInfo(&buf,


Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 16:21, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Fri, 10 May 2024 at 14:49, Alena Rybakina  wrote:
> >
> > Hi! I could try to check it with the test, but I want to ask you about
> > details, because I'm not sure that I completely understand the test case.
> >
> > You mean that we need to have two backends and on one of them we deleted
> > the tuples before vacuum called the other, do you?
> >
>
> I think triggering a parallel vacuum is enough. I am able to see the
> differences with the following:
>
> You can apply the attached diff file to see the differences between
> the previous version and the patched version. Then, run these queries:
>
> CREATE TABLE vacuum_fix (aid int, bid int, cid int) with
> (autovacuum_enabled=false);
> INSERT INTO vacuum_fix SELECT *, *, * FROM generate_series(1, 100);
> CREATE INDEX a_idx on vacuum_fix (aid);
> CREATE INDEX b_idx on vacuum_fix (bid);
> CREATE INDEX c_idx on vacuum_fix (cid);
> VACUUM vacuum_fix;
> UPDATE vacuum_fix SET aid = aid + 1;
> VACUUM (VERBOSE, PARALLEL 2) vacuum_fix ;
>
> After that I saw:
>
> INFO:  vacuuming "test.public.vacuum_fix"
> INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
> INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 1
> ...
> ...
> buffer usage: 29343 hits, 9580 misses in the previous version, 14165
> misses in the patched version, 14262 dirtied
>
> Patched version counts 14165 misses but the previous version counts
> 9580 misses in this specific example.

I am sorry, I showed the wrong thing: this is exactly what is fixed in
this patch. Actually, I do not know how to trigger the original issue;
I am currently looking for a way and will share if anything comes to
mind.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..a6f1df11066 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1355,6 +1355,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	IOContext	io_context;
 	IOObject	io_object;
 	char		persistence;
+	static int  double_counts = 0;
 
 	/*
 	 * Currently operations are only allowed to include a read of some range,
@@ -1426,6 +1427,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			  operation->smgr->smgr_rlocator.locator.relNumber,
 			  operation->smgr->smgr_rlocator.backend,
 			  true);
+			double_counts++;
 			continue;
 		}
 
@@ -1523,6 +1525,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
 	}
+	elog(LOG, "Double counts = %d", double_counts);
 }
 
 /*


Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Fri, 10 May 2024 at 14:49, Alena Rybakina  wrote:
>
> Hi! I could try to check it with the test, but I want to ask you about
> details, because I'm not sure that I completely understand the test case.
>
> You mean that we need to have two backends and on one of them we deleted
> the tuples before vacuum called the other, do you?
>

I think triggering a parallel vacuum is enough. I am able to see the
differences with the following:

You can apply the attached diff file to see the differences between
the previous version and the patched version. Then, run these queries:

CREATE TABLE vacuum_fix (aid int, bid int, cid int) with
(autovacuum_enabled=false);
INSERT INTO vacuum_fix SELECT *, *, * FROM generate_series(1, 100);
CREATE INDEX a_idx on vacuum_fix (aid);
CREATE INDEX b_idx on vacuum_fix (bid);
CREATE INDEX c_idx on vacuum_fix (cid);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET aid = aid + 1;
VACUUM (VERBOSE, PARALLEL 2) vacuum_fix ;

After that I saw:

INFO:  vacuuming "test.public.vacuum_fix"
INFO:  launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO:  finished vacuuming "test.public.vacuum_fix": index scans: 1
...
...
buffer usage: 29343 hits, 9580 misses in the previous version, 14165
misses in the patched version, 14262 dirtied

Patched version counts 14165 misses but the previous version counts
9580 misses in this specific example.

--
Regards,
Nazir Bilal Yavuz
Microsoft
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 84cc983b6e6..582973d575b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -309,6 +309,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	PgStat_Counter startreadtime = 0,
 startwritetime = 0;
 	WalUsage	startwalusage = pgWalUsage;
+	int64		StartPageMiss = VacuumPageMiss;
 	BufferUsage startbufferusage = pgBufferUsage;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
@@ -606,6 +607,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			StringInfoData buf;
 			char	   *msgfmt;
 			int32		diff;
+			int64		PageMissOp = VacuumPageMiss - StartPageMiss;
 			double		read_rate = 0,
 		write_rate = 0;
 
@@ -748,8 +750,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
			appendStringInfo(&buf,
-			 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
+			 _("buffer usage: %lld hits, %lld misses in the previous version, %lld misses in the patched version, %lld dirtied\n"),
 			 (long long) (bufferusage.shared_blks_hit + bufferusage.local_blks_hit),
+			 (long long) (PageMissOp),
 			 (long long) (bufferusage.shared_blks_read + bufferusage.local_blks_read),
 			 (long long) (bufferusage.shared_blks_dirtied + bufferusage.local_blks_dirtied));
			appendStringInfo(&buf,


Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

Thank you for working on this!

On Wed, 1 May 2024 at 06:37, Masahiko Sawada  wrote:
>
> Thank you for further testing! I've pushed the patch.

I noticed a behaviour change while looking at the 'Use pgBufferUsage for
block reporting in analyze' thread [1]. Since that change applies here
as well, I thought it would be better to mention it here.

Before this commit, VacuumPageMiss did not count blocks whose reads
were already completed by other backends [2]. Now,
'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
counts every block attempted to be read, possibly double counting if
someone else has already completed the read. I do not know which
behaviour is correct, but I wanted to mention it.

[1] 
https://postgr.es/m/CAO6_Xqr__kTTCLkftqS0qSCm-J7_xbRG3Ge2rWhucxQJMJhcRA%40mail.gmail.com

[2] In the WaitReadBuffers() function, see comment:
/*
 * Skip this block if someone else has already completed it.  If an
 * I/O is already in progress in another backend, this will wait for
 * the outcome: either done, or something went wrong and we will
 * retry.
 */

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: gcc 12.1.0 warning

2024-05-10 Thread Nazir Bilal Yavuz
Hi,

On Tue, 23 Apr 2024 at 19:59, Andres Freund  wrote:
>
>
> Which seems entirely legitimate. ISTM that guc_var_compare() ought to only
> cast the pointers to the key type, i.e. char *.  And incidentally that does
> prevent the warning.
>
> The reason it doesn't happen in newer versions of postgres is that we aren't
> using guc_var_compare() in the relevant places anymore...

The fix is attached. It applies cleanly from REL_15_STABLE down to
REL_12_STABLE, fixes the warnings, and the tests pass.
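
To illustrate why casting only as far as the key type is enough, here
is a small self-contained sketch of the same bsearch() pattern (an
analogy I wrote for this mail, not the guc.c code; like find_option(),
it relies on 'name' being the first member of the struct):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct config_generic { const char *name; /* ... more fields ... */ };

static int
cmp(const void *a, const void *b)
{
	/* Cast both sides only as far as the key type, as in the fix. */
	const char *name_a = **(char **const *) a;
	const char *name_b = **(char **const *) b;

	return strcmp(name_a, name_b);
}

int
main(void)
{
	struct config_generic v1 = {"work_mem"}, v2 = {"zlib"};
	struct config_generic *vars[] = {&v1, &v2};	/* sorted by name */
	const char *name = "zlib";
	const char **key = &name;	/* key points at a name ... */
	struct config_generic **res;

	/*
	 * ... while the array elements are struct pointers.  cmp() reads the
	 * first member ('name') through either pointer, so it never assumes
	 * a full struct-sized object behind the key, which is what silenced
	 * the -Warray-bounds warning.
	 */
	res = (struct config_generic **) bsearch(&key, vars, 2,
											 sizeof(vars[0]), cmp);
	printf("%s\n", res ? (*res)->name : "not found");
	return 0;
}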

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 588c99f5c402fc41414702133636e5a51f9e3470 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 10 May 2024 10:43:45 +0300
Subject: [PATCH v1] Fix -Warray-bounds warning in guc_var_compare() function

There are a couple of places where the guc_var_compare() function takes
a 'const char ***' argument and then casts it to the
'const struct config_generic *' type. This triggers a '-Warray-bounds'
warning. So, instead, cast the pointers only as far as the
'const char *' type.

Author: Nazir Bilal Yavuz 
Reported-by: Erik Rijkers 
Suggested-by: Andres Freund 
Discussion: https://postgr.es/m/a74a1a0d-0fd2-3649-5224-4f754e8f91aa%40xs4all.nl
---
 src/backend/utils/misc/guc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c410ba532d2..eec97dea659 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5721,10 +5721,10 @@ find_option(const char *name, bool create_placeholders, bool skip_errors,
 static int
 guc_var_compare(const void *a, const void *b)
 {
-	const struct config_generic *confa = *(struct config_generic *const *) a;
-	const struct config_generic *confb = *(struct config_generic *const *) b;
+	const char *confa_name = **(char **const *) a;
+	const char *confb_name = **(char **const *) b;
 
-	return guc_name_compare(confa->name, confb->name);
+	return guc_name_compare(confa_name, confb_name);
 }
 
 /*
-- 
2.43.0



Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 May 2024 at 16:58, Robert Haas  wrote:
>
> On Tue, May 7, 2024 at 8:00 AM Nazir Bilal Yavuz  wrote:
> > We had an off-list talk with Thomas, and we thought that making this option
> > a GUC instead of an SQL-command-level option could solve this problem.
> >
> > I am posting a new rebased version of the patch with some important changes:
> >
> > * A 'createdb_file_copy_method' GUC is created. Possible values are
> > 'copy' and 'clone'. Copy is the default option. The clone option can be
> > chosen if the system supports it; otherwise it gives an error at
> > startup.
>
> This seems like a smart idea, because the type of file copying that we
> do during redo need not be the same as what was done when the
> operation was originally performed.
>
> I'm not so sure about the GUC name. On the one hand, it feels like
> createdb should be spelled out as create_database, but on the other
> hand, the GUC name is quite long already. Then again, why would we
> make this specific to CREATE DATABASE in the first place? Would we
> also want alter_tablespace_file_copy_method and
> basic_archive.file_copy_method?

I agree that it is already quite long; that is why I chose createdb as
the prefix. I did not think file cloning was planned to be used in
other places. If that is the case, would something like
'preferred_copy_method' work? Then we would mention in the docs which
places are affected by this GUC.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 May 2024 at 15:23, Ranier Vilela  wrote:
>
> Em qua., 8 de mai. de 2024 às 08:42, Nazir Bilal Yavuz  
> escreveu:
>>
>> Hi,
>>
>> On Wed, 8 May 2024 at 14:16, Ranier Vilela  wrote:
>> >
>> >
>> > Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz 
>> >  escreveu:
>> >>
>> >> Hi Ranier,
>> >>
>> >> Thanks for looking into this!
>> >>
>> >> I am not sure why but your reply does not show up in the thread, so I
>> >> copied your reply and answered it in the thread for visibility.
>> >>
>> >> On Tue, 7 May 2024 at 16:28, Ranier Vilela  wrote:
>> >> >
>> >> > I know it's coming from copy-and-paste, but
>> >> > I believe the flags could be:
>> >> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | 
>> >> > PG_BINARY);
>> >> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | 
>> >> > O_EXCL | PG_BINARY);
>> >> >
>> >> > The flags:
>> >> > O_WRONLY | O_TRUNC
>> >> >
>> >> > Allow the OS to make some optimizations, if you deem it possible.
>> >> >
>> >> > The flag O_RDWR indicates that the file can be read, which is not true 
>> >> > in this case.
>> >> > The destination file will just be written.
>> >>
>> >> You may be right about the O_WRONLY flag but I am not sure about the
>> >> O_TRUNC flag.
>> >>
>> >> O_TRUNC from the linux man page [1]: If the file already exists and is
>> >> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> >> O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
>> >> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> >> effect of O_TRUNC is unspecified.
>> >
>> > O_TRUNC is usually used in conjunction with O_WRONLY.
>> > See the example at:
>> > open.html
>> >
>> > O_TRUNC signals the OS to forget the current contents of the file, if it 
>> > happens to exist.
>> > In other words, there will be no seeks, only and exclusively writes.
>>
>> You are right; the O_TRUNC flag truncates the file, if it happens to
>> exist. But it should not exist in the first place because the code at
>> [2] should check this before the OpenTransientFile() call. Even if we
>> assume somehow the check at [2] does not work, I do not think the
>> correct operation is to truncate the contents of the existing file.
>
> I don't know if there is a communication problem here.
> But what I'm trying to suggest refers to the destination file,
> which doesn't matter if it exists or not?

I do not think there is a communication problem. It actually matters,
because the destination file should not exist: the code at [2] already
checks and confirms that it does not exist.

>
> If the destination file does not exist, O_TRUNC is ignored.
> If the destination file exists, O_TRUNC truncates the current contents of the 
> file.
> I don't see why you think it's a problem to truncate the current content if 
> the destination file exists.
> Isn't he going to be replaced anyway?

'If the destination file does not exist' means the code at [2] works
correctly, and then we do not need the O_TRUNC flag.

'If the destination file already exists' means the code at [2] is
somehow broken, and then there is a high chance that we could truncate
something we do not want to. For example, suppose there is a foo db and
we want to create a bar db, and Postgres chooses the foo db's location
as the destination of the bar db (which should not happen, but assume
the checks at [2] somehow failed); then we would wrongly truncate the
foo db's contents.

Hence, if Postgres works correctly, the O_TRUNC flag is unnecessary;
and if it does not, the O_TRUNC flag could cause harm.

>
> Unless we want to preserve the current content (destination file), in case 
> the copy/clone fails?

As I said above, Postgres should never choose an existing file as the
destination file.

Also, we pass the O_CREAT | O_EXCL flags together; from the link [3]
you shared before: 'If O_CREAT and O_EXCL are set, open() shall fail if
the file exists.' So overwriting an already existing file is prevented
in any case.

[3] https://pubs.opengroup.org/onlinepubs/009696699/functions/open.html
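
To illustrate that last point, here is a minimal standalone sketch (a
hypothetical /tmp path, not Postgres code) of the O_CREAT | O_EXCL
behavior:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/tmp/dst_file"; /* hypothetical destination */

    /* First open creates the file; same flags as OpenTransientFile() above. */
    int         fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        perror("first open");
        return 1;
    }
    close(fd);

    /* Second open fails with EEXIST; the existing file is never truncated. */
    if (open(path, O_RDWR | O_CREAT | O_EXCL, 0600) < 0)
        printf("second open failed as expected: %s\n", strerror(errno));

    unlink(path);
    return 0;
}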

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 May 2024 at 14:16, Ranier Vilela  wrote:
>
>
> Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz  
> escreveu:
>>
>> Hi Ranier,
>>
>> Thanks for looking into this!
>>
>> I am not sure why but your reply does not show up in the thread, so I
>> copied your reply and answered it in the thread for visibility.
>>
>> On Tue, 7 May 2024 at 16:28, Ranier Vilela  wrote:
>> >
>> > I know it's coming from copy-and-paste, but
>> > I believe the flags could be:
>> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
>> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL 
>> > | PG_BINARY);
>> >
>> > The flags:
>> > O_WRONLY | O_TRUNC
>> >
>> > Allow the OS to make some optimizations, if you deem it possible.
>> >
>> > The flag O_RDWR indicates that the file can be read, which is not true in 
>> > this case.
>> > The destination file will just be written.
>>
>> You may be right about the O_WRONLY flag but I am not sure about the
>> O_TRUNC flag.
>>
>> O_TRUNC from the linux man page [1]: If the file already exists and is
>> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
>> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> effect of O_TRUNC is unspecified.
>
> O_TRUNC is usually used in conjunction with O_WRONLY.
> See the example at:
> open.html
>
> O_TRUNC signals the OS to forget the current contents of the file, if it 
> happens to exist.
> In other words, there will be no seeks, only and exclusively writes.

You are right; the O_TRUNC flag truncates the file if it happens to
exist. But it should not exist in the first place, because the code at
[2] checks this before the OpenTransientFile() call. Even if we assume
the check at [2] somehow does not work, I do not think the correct
behavior is to truncate the contents of the existing file.

>>
>> But it should be already checked if the destination directory already
>> exists in dbcommands.c file in createdb() function [2], so this should
>> not happen.
>
> I'm not sure what you're referring to here.

Sorry, I meant that the destination directory / file should not exist,
because the code at [2] confirms that it does not exist; otherwise it
errors out.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-08 Thread Nazir Bilal Yavuz
Hi Ranier,

Thanks for looking into this!

I am not sure why but your reply does not show up in the thread, so I
copied your reply and answered it in the thread for visibility.

On Tue, 7 May 2024 at 16:28, Ranier Vilela  wrote:
>
> I know it's coming from copy-and-paste, but
> I believe the flags could be:
> - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
> + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | 
> PG_BINARY);
>
> The flags:
> O_WRONLY | O_TRUNC
>
> Allow the OS to make some optimizations, if you deem it possible.
>
> The flag O_RDWR indicates that the file can be read, which is not true in 
> this case.
> The destination file will just be written.

You may be right about the O_WRONLY flag but I am not sure about the
O_TRUNC flag.

O_TRUNC from the linux man page [1]: If the file already exists and is
a regular file and the access mode allows writing (i.e., is O_RDWR or
O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
terminal device file, the O_TRUNC flag is ignored. Otherwise, the
effect of O_TRUNC is unspecified.

But the createdb() function in dbcommands.c already checks whether the
destination directory exists [2], so this should not happen.

[1] https://man7.org/linux/man-pages/man2/open.2.html

[2]
/*
 * If database OID is configured, check if the OID is already in use or
 * data directory already exists.
 */
if (OidIsValid(dboid))
{
    char       *existing_dbname = get_database_name(dboid);

    if (existing_dbname != NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
                errmsg("database OID %u is already in use by database \"%s\"",
                       dboid, existing_dbname));

    if (check_db_file_conflict(dboid))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
                errmsg("data directory with the specified OID %u already exists",
                       dboid));
}
else
{
    /* Select an OID for the new database if is not explicitly configured. */
    do
    {
        dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
                                   Anum_pg_database_oid);
    } while (check_db_file_conflict(dboid));
}

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: CREATE DATABASE with filesystem cloning

2024-05-07 Thread Nazir Bilal Yavuz
Hi,

On Wed, 6 Mar 2024 at 05:17, Thomas Munro  wrote:
>
> The main thing that is missing is support for redo.  It's mostly
> trivial I think, probably just a record type for "try cloning first"
> and then teaching that clone function to fall back to the regular copy
> path if it fails in recovery, do you agree with that idea?  Another
> approach would be to let it fail if it doesn't work on the replica, so
> you don't finish up using dramatically different amounts of disk
> space, but that seems terrible because now your replica is broken.  So
> probably fallback with logged warning (?), though I'm not sure exactly
> which errnos to give that treatment to.

Thomas and I had an off-list talk, and we thought that making this
option a GUC instead of a SQL-command-level option could solve this
problem.

I am posting a new rebased version of the patch with some important changes:

* The 'createdb_file_copy_method' GUC is created. Possible values are
'copy' and 'clone'; 'copy' is the default. 'clone' can be chosen if the
system supports it, otherwise it raises an error at startup.

* #else part of the clone_file() function calls pg_unreachable()
instead of ereport().

* Documentation updates.

Also, what should happen when the kernel has clone support but the
file system does not?

- I tested this on Linux, and copy_file_range() silently falls back to
a normal copy when this happens. I assume the same will happen on
FreeBSD because it uses the copy_file_range() function as well.

- I am not sure about macOS, since the force flag
(COPYFILE_CLONE_FORCE) is used. I do not have macOS so I cannot test
it, but I assume it will error out when this happens. If that is the
case, is this a problem? I am asking because, since this is a GUC now,
the user has full responsibility.

Any kind of feedback would be appreciated.
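
For reference, the intended usage with the attached patch would look
roughly like this (a sketch; the template and database names are made
up):

# in postgresql.conf (errors at startup if the system lacks clone support):
createdb_file_copy_method = 'clone'

-- then, in SQL:
CREATE DATABASE test_clone STRATEGY 'file_copy' TEMPLATE test;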

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 404e301dbdb252c23ea9d451b817cf6e372d0d9a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 7 May 2024 14:16:09 +0300
Subject: [PATCH v5] Use CLONE method with GUC on CREATE DATABASE ...
 STRATEGY=FILE_COPY.

The createdb_file_copy_method GUC option is introduced. It can be set
to either COPY (the default) or CLONE (if the system supports it).

If CLONE is chosen, the behavior is similar to STRATEGY=FILE_COPY, but
efficient file-copying system calls are attempted.  The kernel then has
the opportunity to share block ranges in copy-on-write file systems, or
maybe push down the copy to network file systems and storage devices.

This currently works on Linux, FreeBSD and macOS.  More systems could
be supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/commands/dbcommands.h |  9 ++
 src/include/storage/copydir.h |  3 +-
 src/backend/commands/dbcommands.c | 21 -
 src/backend/storage/file/copydir.c| 83 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 13 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 ++
 doc/src/sgml/config.sgml  | 17 
 doc/src/sgml/ref/create_database.sgml | 24 --
 src/tools/pgindent/typedefs.list  |  1 +
 10 files changed, 162 insertions(+), 15 deletions(-)

diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 92e17c71158..2e1d3565edc 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -14,6 +14,15 @@
 #ifndef DBCOMMANDS_H
 #define DBCOMMANDS_H
 
+typedef enum CreateDBFileCopyMethod
+{
+	CREATEDB_FILE_COPY_METHOD_COPY,
+	CREATEDB_FILE_COPY_METHOD_CLONE,
+} CreateDBFileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int createdb_file_copy_method;
+
 #include "access/xlogreader.h"
 #include "catalog/objectaddress.h"
 #include "lib/stringinfo.h"
diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..9ff28f2eec9 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,7 +13,8 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
-extern void copydir(const char *fromdir, const char *todir, bool recurse);
+extern void copydir(const char *fromdir, const char *todir, bool recurse,
+	bool clone_files);
 extern void copy_file(const char *fromfile, const char *tofile);
 
 #endif			/* COPYDIR_H */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index be629ea92cf..cf0dff65e69 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -69,6 +69,20 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/* GUCs */
+int			createdb_file_copy_method = CREATEDB_FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const struct config_enum_entry createdb_file_copy_method_op

Re: Use streaming read API in ANALYZE

2024-04-29 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
>
> Pushed.  Thanks Bilal and reviewers!

I wanted to discuss what will happen to this patch now that
27bc1772fc8 has been reverted. I am continuing this thread, but I can
create another thread if you prefer.

After the revert of 27bc1772fc8, acquire_sample_rows() became table-AM
agnostic again. So, the read stream changes might have to be pushed
down into the table AM now, but there are a couple of roadblocks, as
Melanie mentioned [1] before.

Quote from Melanie [1]:

On Thu, 11 Apr 2024 at 19:19, Melanie Plageman
 wrote:
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).
>
> read_stream_begin_relation() doesn't just save the
> BufferAccessStrategy in the ReadStream, it uses it to set various
> other things in the ReadStream object. callback_private_data (which in
> ANALYZE's case is the BlockSamplerData) is simply saved in the
> ReadStream, so it could be set later, but that doesn't sound very
> clean to me.
>
> As such, it seems like a cleaner alternative would be to add a table
> AM callback for creating a read stream object that takes the
> parameters of read_stream_begin_relation(). But, perhaps it is a bit
> late for such additions.

If we do not want to add a new table AM callback as Melanie mentioned,
it is pretty much required to pass BufferAccessStrategy and
BlockSamplerData to initscan().

> It also opens us up to the question of whether or not sequential scan
> should use such a callback instead of making the read stream object in
> heap_beginscan().
>
> I am happy to write a patch that does any of the above. But, I want to
> raise these questions, because perhaps I am simply missing an obvious
> alternative solution.

I wonder the same; I could not think of an alternative solution to
this problem either.

Another quote from Melanie [2] in the same thread:

On Thu, 11 Apr 2024 at 20:48, Melanie Plageman
 wrote:
>
> I will also say that, had this been 6 months ago, I would probably
> suggest we restructure ANALYZE's table AM interface to accommodate
> read stream setup and to address a few other things I find odd about
> the current code. For example, I think creating a scan descriptor for
> the analyze scan in acquire_sample_rows() is quite odd. It seems like
> it would be better done in the relation_analyze callback. The
> relation_analyze callback saves some state like the callbacks for
> acquire_sample_rows() and the Buffer Access Strategy. But at least in
> the heap implementation, it just saves them in static variables in
> analyze.c. It seems like it would be better to save them in a useful
> data structure that could be accessed later. We have access to pretty
> much everything we need at that point (in the relation_analyze
> callback). I also think heap's implementation of
> table_beginscan_analyze() doesn't need most of
> heap_beginscan()/initscan(), so doing this instead of something
> ANALYZE specific seems more confusing than helpful.

If we want to implement ANALYZE-specific counterparts of
heap_beginscan()/initscan(), we may think about passing
BufferAccessStrategy and BlockSamplerData to them.

Also, there is an ongoing(?) discussion about a few problems /
improvements in acquire_sample_rows(), mentioned at the end of the
'Table AM Interface Enhancements' thread [3]. Should we wait for those
discussions to be resolved, or can we resume working on this patch?

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: New committers: Melanie Plageman, Richard Guo

2024-04-26 Thread Nazir Bilal Yavuz
On Fri, 26 Apr 2024 at 14:54, Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!

Congratulations to both of you!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: ci: Allow running mingw tests by default via environment variable

2024-04-25 Thread Nazir Bilal Yavuz
Hi,

Thank you for working on this!

On Sat, 13 Apr 2024 at 05:12, Andres Freund  wrote:
>
> Hi,
>
> We have CI support for mingw, but don't run the task by default, as it eats up
> precious CI credits.  However, for cfbot we are using custom compute resources
> (currently donated by google), so we can afford to run the mingw tests. Right
> now that'd require cfbot to patch .cirrus.tasks.yml.

And I think mingw ends up not being run most of the time. +1 to
running it by default, at least on cfbot. Also, this gives people a
chance to run mingw by default on their personal repositories (which I
would like to do).

> While one can manually trigger manual task in one one's own repo, most won't
> have the permissions to do so for cfbot.
>
>
> I propose that we instead run the task automatically if
> $REPO_MINGW_TRIGGER_BY_DEFAULT is set, typically in cirrus' per-repository
> configuration.
>
> Unfortunately that's somewhat awkward to do in the cirrus-ci yaml
> configuration, the set of conditional expressions supported is very
> simplistic.
>
> To deal with that, I extended .cirrus.star to compute the required environment
> variable. If $REPO_MINGW_TRIGGER_BY_DEFAULT is set, CI_MINGW_TRIGGER_TYPE is
> set to 'automatic', if not it's 'manual'.
>

The changes look good to me. My only complaint would be that
REPO_MINGW_TRIGGER_BY_DEFAULT accepts only 'true' rather than a list of
possible values, but this is not important.

> We've also talked in other threads about adding CI support for
> 1) windows, building with visual studio
> 2) linux, with musl libc
> 3) free/netbsd
>
> That becomes more enticing, if we can enable them by default on cfbot but not
> elsewhere. With this change, it'd be easy to add further variables to control
> such future tasks.

I agree.

> I also attached a second commit, that makes the "code" dealing with ci-os-only
> in .cirrus.tasks.yml simpler. While I think it does nicely simplify
> .cirrus.tasks.yml, overall it adds lines, possibly making this not worth it.
> I'm somewhat on the fence.
>
>
> Thoughts?

Although it adds more lines, this makes the .cirrus.tasks.yml file
more readable and understandable, which is more important in my
opinion.

> On the code level, I thought if it'd be good to have a common prefix for all
> the automatically set variables. Right now that's CI_, but I'm not at all
> wedded to that.

I agree with your thoughts and CI_ prefix.

I tested both patches and they work as expected.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-04-24 Thread Nazir Bilal Yavuz
Hi,

On Fri, 19 Apr 2024 at 11:01, Nazir Bilal Yavuz  wrote:
> > On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
> > >
> > > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > > I agree with your points. While the other I/O related work is
> > > > happening we can discuss what we should do in the variable op_byte
> > > > cases. Also, this is happening only for WAL right now but if we try to
> > > > extend pg_stat_io in the future, that problem possibly will rise
> > > > again. So, it could be good to come up with a general solution, not
> > > > only for WAL.
> > >
> > > Okay, I've marked the patch as RwF for this CF.

Since the last commitfest entry was returned with feedback, I created
a new commitfest entry: https://commitfest.postgresql.org/48/4950/

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: slightly misleading Error message in guc.c

2024-04-22 Thread Nazir Bilal Yavuz
Hi,

On Mon, 22 Apr 2024 at 11:44, jian he  wrote:
>
> hi.
> minor issue in guc.c.
>
> set work_mem to '1kB';
> ERROR:  1 kB is outside the valid range for parameter "work_mem" (64
> .. 2147483647)
> should it be
> ERROR:  1 kB is outside the valid range for parameter "work_mem" (64
> kB .. 2147483647 kB)
> ?
> since the units for work_mem are { "B", "kB", "MB", "GB", and "TB"}
>
> search `outside the valid range for parameter`,
> there are two occurrences in guc.c.

Nice find. I agree it could cause confusion.
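
To make the suggestion concrete, the corrected message would presumably
read (hypothetical output, mirroring the proposal above):

set work_mem to '1kB';
ERROR:  1 kB is outside the valid range for parameter "work_mem" (64 kB .. 2147483647 kB)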

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-04-19 Thread Nazir Bilal Yavuz
Hi,

On Mon, 19 Feb 2024 at 10:28, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
> >
> > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > I agree with your points. While the other I/O related work is
> > > happening we can discuss what we should do in the variable op_byte
> > > cases. Also, this is happening only for WAL right now but if we try to
> > > extend pg_stat_io in the future, that problem possibly will rise
> > > again. So, it could be good to come up with a general solution, not
> > > only for WAL.
> >
> > Okay, I've marked the patch as RwF for this CF.
>
I wanted to inform you that commit 73f0a13266 changed all WALRead()
calls to read a variable number of bytes; only the WAL receiver was
reading variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() was merged. We had
been discussing what to do once it was merged. It was decided that
WALReadFromBuffers() should not call pgstat_report_wait_start(),
because this function does not perform any I/O [1]. Should we follow
the same logic and not include these in pg_stat_io?

* With the b5a9b18cd0 commit, streaming I/O was merged, but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only remaining problem is
reading variable bytes. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, or NULL, and document
that this means some variable-byte I/O is happening.

I kind of dislike this solution because, if only the read I/O happens
in variable bytes, it will look as if write and extend I/Os happen in
variable bytes as well. As a workaround, it could be documented that,
for now, only read I/Os can happen in variable bytes.

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, combined with the first solution.

This solves the first solution's weakness, but it introduces two more
columns. It is more future-proof than the first solution if there is a
chance that some variable I/O could happen in write and extend calls in
the future as well.

3- Create a new pg_stat_io_wal view and put WAL I/Os there instead of
in pg_stat_io.

pg_stat_io would remain untouched, and we would have the flexibility to
shape the new view as much as we want. But the original aim of
pg_stat_io is to evaluate all I/O from a single view, and adding a new
view breaks that aim.
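
To make option 2 concrete, the view could hypothetically be queried
like this (the op_bytes_* column names are my assumptions, not an
existing or committed API; a NULL would mean a variable number of
bytes):

SELECT backend_type, object, context,
       reads, op_bytes_read,
       writes, op_bytes_write,
       extends, op_bytes_extend
FROM pg_stat_io
WHERE object = 'wal';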

I hope that I did not miss anything and my explanations are clear.

Any kind of feedback would be appreciated.


[1] 
https://www.postgresql.org/message-id/CAFiTN-sE7CJn-ZFj%2B-0Wv6TNytv_fp4n%2BeCszspxJ3mt77t5ig%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 22:25, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thank you for looking into this!
>
> On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
> >
> > On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > > Streaming API has been committed but the committed version has a minor
> > > change, the read_stream_begin_relation function takes Relation instead
> > > of BufferManagerRelation now. So, here is a v5 which addresses this
> > > change.
> >
> > I'm getting a repeatable segfault / assertion failure with this:
> >
> > postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> > CREATE TABLE
> > postgres=# insert into tengiga select g, repeat('x', 900) from
> > generate_series(1, 140) g;
> > INSERT 0 140
> > postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> > SET
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >
> > TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> > "heapam_handler.c", Line: 1079, PID: 262232
> > postgres: heikki postgres [local]
> > ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> > postgres: heikki postgres [local]
> > ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> > postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> > postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> > postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> > postgres: heikki postgres [local]
> > ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> > postgres: heikki postgres [local]
> > ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> > postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> > postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> > postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> > postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> > postgres: heikki postgres [local]
> > ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> > postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> > postgres: heikki postgres [local]
> > ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> > postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> > postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> > postgres: heikki postgres [local]
> > ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> > postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> > postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> > 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> > was terminated by signal 6: Aborted
>
> I realized the same error while working on Jakub's benchmarking results.
>
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return as BlockSampling algorithm
> decides how many blocks to return on the fly by using some random
> seeds.

I wanted to re-check this problem and realized that I was wrong. I
tried using nblocks again, and this time there was no failure. I looked
at the block sampling logic and I am pretty sure that the
BlockSampler_Init() function correctly returns the number of blocks
that BlockSampler_Next() will return. It seems 158f581923 fixed this
issue as well.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Typos in the code and README

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

On Mon, 15 Apr 2024 at 15:26, Daniel Gustafsson  wrote:
>
> > On 14 Apr 2024, at 13:19, David Rowley  wrote:
> >
> > On Sat, 13 Apr 2024 at 09:17, Daniel Gustafsson  wrote:
> >>
> >>> On 12 Apr 2024, at 23:15, Heikki Linnakangas  wrote:
> >>> Here's a few more. I've accumulate these over the past couple of months, 
> >>> keeping them stashed in a branch, adding to it whenever I've spotted a 
> >>> minor typo while reading the code.
> >>
> >> Nice, let's lot all these together.
> >
> > Here are a few additional ones to add to that.
>
> Thanks.  Collecting all the ones submitted here, as well as a few submitted
> off-list by Alexander, the patch is now a 3-part patchset of cleanups:
>
> 0001 contains the typos and duplicate words fixups, 0002 fixes a parameter 
> with
> the wrong name in the prototype and 0003 removes a leftover prototype which 
> was
> accidentally left in a refactoring.

I noticed two small typos: 'sgmr' -> 'smgr'. You may want to include
them in 0001.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

I am working on using read streams in the CREATE DATABASE command when
the strategy is wal_log. The RelationCopyStorageUsingBuffer() function
is used in this context. This function reads source buffers and then
copies them to the destination buffers. I used read streams only when
reading the source buffers, because the destination buffers are read
with the 'RBM_ZERO_AND_LOCK' option, so prefetching them does not
matter.

I created a ~6 GB table [1] and created a new database with the wal_log
strategy, using the database the table was created in as a template
[2]. My benchmarking results are:

a. Timings:

patched:
12955.027 ms
12917.475 ms
13177.846 ms
12971.308 ms
13059.985 ms

master:
13156.375 ms
13054.071 ms
13151.607 ms
13152.633 ms
13160.538 ms

There is no difference in timings; the patched version is a tiny bit
better, but negligibly so. I actually expected the patched version to
be clearly better, because there was no prefetching before; but the
read stream API detects sequential access and disables prefetching.

b. strace:

patched:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.02    3.749359           2   1285782           pwrite64
 18.54    1.021734          21     46730           preadv
  9.49    0.522889         826       633           fdatasync
  2.55    0.140339          59      2368           pwritev
  1.14    0.062583         409       153           fsync

master:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.71    3.831542           2   1288365           pwrite64
 29.84    1.914540           2    747936           pread64
  7.90    0.506843         837       605           fdatasync
  1.58    0.101575          54      1856           pwritev
  0.75    0.048431         400       121           fsync

There are fewer (~1/16) read system calls in the patched version.

c. perf:

patched:

-   97.83%  1.13%  postgres  postgres  [.] RelationCopyStorageUsingBuffer
   - 97.83% RelationCopyStorageUsingBuffer
      - 44.28% ReadBufferWithoutRelcache
         + 42.20% GetVictimBuffer
           0.81% ZeroBuffer
      + 31.86% log_newpage_buffer
      - 19.51% read_stream_next_buffer
         - 17.92% WaitReadBuffers
            + 17.61% mdreadv
         - 1.47% read_stream_start_pending_read
            + 1.46% StartReadBuffers

master:

-   97.68%  0.57%  postgres  postgres  [.] RelationCopyStorageUsingBuffer
   - RelationCopyStorageUsingBuffer
      - 65.48% ReadBufferWithoutRelcache
         + 41.16% GetVictimBuffer
         - 20.42% WaitReadBuffers
            + 19.90% mdreadv
         + 1.85% StartReadBuffer
           0.75% ZeroBuffer
      + 30.82% log_newpage_buffer

The patched version spends less CPU time in read calls and more CPU
time in other calls, such as writes.

Three patch files are attached. The first two are an optimization and a
way to create a read stream object by using an SMgrRelation; these are
already proposed in the streaming I/O thread [3]. The third one is the
actual patch.

Any kind of feedback would be appreciated.

[1] CREATE TABLE t as select repeat('a', 100) || i || repeat('b', 500) as
filler from generate_series(1, 900) as i;
[2] CREATE DATABASE test_1 STRATEGY 'wal_log' TEMPLATE test;
[3]
https://www.postgresql.org/message-id/CAN55FZ1yGvCzCW_aufu83VimdEYHbG_zuOY3J9JL-nBptyJyKA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From edcfacec40a70a8747c5f18777b6c28b0fccba7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH v1 1/3] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 901b7230fb9..17cb7127f99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Re

Re: Table AM Interface Enhancements

2024-04-15 Thread Nazir Bilal Yavuz
Hi,

On Mon, 15 Apr 2024 at 18:36, Robert Haas  wrote:
>
> On Sat, Apr 13, 2024 at 5:28 AM Alexander Korotkov  
> wrote:
> > Yes, I think so.  Table AM API deals with TIDs and block numbers, but
> > doesn't force on what they actually mean.  For example, in ZedStore
> > [1], data is stored on per-column B-trees, where TID used in table AM
> > is just a logical key of that B-trees.  Similarly, blockNumber is a
> > range for B-trees.
> >
> > c6fc50cb4028 and 041b96802ef are putting to acquire_sample_rows() an
> > assumption that we are sampling physical blocks as they are stored in
> > data files.  That couldn't anymore be some "logical" block numbers
> > with meaning only table AM implementation knows.  That was pointed out
> > by Andres [2].  I'm not sure if ZedStore is alive, but there could be
> > other table AM implementations like this, or other implementations in
> > development, etc.  Anyway, I don't feel good about narrowing the API,
> > which is there from pg12.
>
> I spent some time looking at this. I think it's valid to complain
> about the tighter coupling, but c6fc50cb4028 is there starting in v14,
> so I don't think I understand why the situation after 041b96802ef is
> materially worse than what we've had for the last few releases. I
> think it is worse in the sense that, before, you could dodge the
> problem without defining USE_PREFETCH, and now you can't, but I don't
> think we can regard nonphysical block numbers as a supported scenario
> on that basis.

I agree with you, but there is one thing I did not understand. If
out-of-core AMs are used, doesn't all the block sampling logic
(BlockSampler_Init(), BlockSampler_Next(), etc.) need to be edited as
well, since these functions assume block numbers are actual physical
on-disk locations? I mean, if the block number is something different
from the actual physical on-disk location, the acquire_sample_rows()
function looks wrong to me before c6fc50cb4028 as well.
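
For context, the sampling loop in acquire_sample_rows() has roughly the
following shape (paraphrased from src/backend/commands/analyze.c as of
the pre-streaming code; identifiers as in that file, details vary
across versions), which is why the produced numbers end up being
treated as physical block numbers:

/* Paraphrased sketch, not the exact source. */
nblocks = RelationGetNumberOfBlocks(onerel);
BlockSampler_Init(&bs, nblocks, targrows, randseed);

while (BlockSampler_HasMore(&bs))
{
    BlockNumber targblock = BlockSampler_Next(&bs);

    /* targblock is read as a physical on-disk block of the relation */
    if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
        continue;

    while (table_scan_analyze_next_tuple(scan, OldestXmin,
                                         &liverows, &deadrows, slot))
    {
        /* ... collect the sampled rows ... */
    }
}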

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: gcc 12.1.0 warning

2024-04-15 Thread Nazir Bilal Yavuz
Hi,

On Sat, 13 Apr 2024 at 05:25, Andres Freund  wrote:
>
> Hi,
>
> On 2023-10-27 13:09:01 +0300, Nazir Bilal Yavuz wrote:
> > I was testing 'upgrading CI Debian images to bookworm'. I tested the
> > bookworm on REL_15, REL_16 and upstream. REL_16 and upstream finished
> > successfully but the CompilerWarnings task failed on REL_15 with the
> > same error.
>
> Is that still the case?  I briefly tried to repro this outside of CI and
> couldn't reproduce the warning.
>
> I'd really like to upgrade the CI images, it doesn't seem great to continue
> using oldstable.
>
>
> > gcc version: 12.2.0
> >
> > CI Run: https://cirrus-ci.com/task/6151742664998912
>
> Unfortunately the logs aren't accessible anymore, so I can't check the precise
> patch level of the compiler and/or the precise invocation used.

I am able to reproduce this. I regenerated the Debian bookworm image
and ran CI on REL_15_STABLE with this image.

CI Run: https://cirrus-ci.com/task/4978799442395136

--

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Speed up clean meson builds by ~25%

2024-04-08 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin  wrote:
>
> Hello Michael,
>
> 08.04.2024 08:23, Michael Paquier wrote:
> > On Fri, Apr 05, 2024 at 06:19:20PM +0200, Jelte Fennema-Nio wrote:
> >> Agreed. While not a full solution, I think this patch is still good to
> >> address some of the pain: Waiting 10 seconds at the end of my build
> >> with only 1 of my 10 cores doing anything.
> >>
> >> So while this doesn't decrease CPU time spent it does decrease
> >> wall-clock time significantly in some cases, with afaict no downsides.
> > Well, this is also painful with ./configure.  So, even if we are going
> > to move away from it at this point, we still need to support it for a
> > couple of years.  It looks to me that letting the clang folks know
> > about the situation is the best way forward.
> >
>
> As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
> is fixed already.
> Perhaps it's worth rechecking...
>
> [1] 
> https://www.postgresql.org/message-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8%40gmail.com

I had this problem on my local computer. My build times are:

gcc: 20s
clang-15: 24s
clang-16: 105s
clang-17: 111s
clang-18: 25s

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
> >
> > Hi,
> >
> > On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> > >
> > > I had been planning to commit v14 this morning but got cold feet with
> > > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > > neither did I.  I have now removed it, and it seems much better.  No
> > > other significant changes, just parameter types and inlining details.
> > > For example:
> > >
> > >  * read_stream_begin_relation() now takes a Relation, likes its name says
> > >  * StartReadBuffers()'s operation takes smgr and optional rel
> > >  * ReadBuffer_common() takes smgr and optional rel
> >
> > Read stream objects can be created only using Relations now. There
> > could be read stream users which do not have a Relation but
> > SMgrRelations. So, I created another constructor for the read streams
> > which use SMgrRelations instead of Relations. Related patch is
> > attached.
>
> After sending this, I realized that I forgot to add persistence value
> to the new constructor. While working on it I also realized that
> current code sets persistence in PinBufferForBlock() function and this
> function is called for each block, which can be costly. So, I moved
> setting persistence to the out of PinBufferForBlock() function.
>
> Setting persistence outside of the PinBufferForBlock() function (0001)
> and creating the new constructor that uses SMgrRelations (0002) are
> attached.

Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,1

Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> >  * read_stream_begin_relation() now takes a Relation, likes its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users which do not have a Relation but
> SMgrRelations. So, I created another constructor for the read streams
> which use SMgrRelations instead of Relations. Related patch is
> attached.

After sending this, I realized that I had forgotten to add the
persistence value to the new constructor. While working on it, I also
realized that the current code sets the persistence in the
PinBufferForBlock() function, and this function is called for each
block, which can be costly. So, I moved setting the persistence out of
the PinBufferForBlock() function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in the PinBufferForBlock() function to set the
persistence of the relation, and this function is called for each block
in the relation. Instead, set the persistence of the relation before
calling PinBufferForBlock().
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 	   smgr->smgr_rlocator.locator.relNumber,
 	   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 			 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;
+	else if (smgr_persistence == 0)
+		persistence = R

Re: Streaming I/O, vectored I/O (WIP)

2024-04-07 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro  wrote:
>
> I had been planning to commit v14 this morning but got cold feet with
> the BMR-based interface.  Heikki didn't like it much, and in the end,
> neither did I.  I have now removed it, and it seems much better.  No
> other significant changes, just parameter types and inlining details.
> For example:
>
>  * read_stream_begin_relation() now takes a Relation, likes its name says
>  * StartReadBuffers()'s operation takes smgr and optional rel
>  * ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only by using Relations now. There
could be read stream users that do not have a Relation but an
SMgrRelation. So, I created another constructor for read streams that
uses SMgrRelations instead of Relations. The related patch is
attached.
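
A rough usage sketch of the new constructor (the function name is as in
the attached patch; the callback, the block-range bookkeeping, and the
copy loop are illustrative, not part of the patch):

typedef struct
{
    BlockNumber next;
    BlockNumber nblocks;
} BlockRange;

static BlockNumber
block_range_cb(ReadStream *stream, void *callback_private_data,
               void *per_buffer_data)
{
    BlockRange *range = (BlockRange *) callback_private_data;

    if (range->next < range->nblocks)
        return range->next++;
    return InvalidBlockNumber;
}

/* ... in a caller that has only an SMgrRelation, e.g. during recovery ... */
BlockRange  range = {0, smgrnblocks(src_smgr, MAIN_FORKNUM)};
ReadStream *stream;
Buffer      buf;

stream = read_stream_begin_sgmr_relation(READ_STREAM_DEFAULT,
                                         bstrategy,
                                         src_smgr,
                                         MAIN_FORKNUM,
                                         block_range_cb,
                                         &range,
                                         0);
while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* use the source buffer, then release it */
    ReleaseBuffer(buf);
}
read_stream_end(stream);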

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 38b57ec7e54a54a0c7b0117a0ecaaf68c643e1b0 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Sun, 7 Apr 2024 20:16:00 +0300
Subject: [PATCH] Add a way to create read stream object by using SMgrRelation

Currently a read stream object can be created only by using a
Relation; there is no way to create one by using an SMgrRelation.

To achieve that, the read_stream_begin_impl() function is created and
the contents of read_stream_begin_relation() are moved into it.

Both read_stream_begin_relation() and read_stream_begin_sgmr_relation()
call read_stream_begin_impl(), with a Relation and an SMgrRelation
respectively.
---
 src/include/storage/read_stream.h |  8 +++
 src/backend/storage/aio/read_stream.c | 74 ++-
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..601a7fcf92b 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,13 @@ extern ReadStream *read_stream_begin_relation(int flags,
 			  ReadStreamBlockNumberCB callback,
 			  void *callback_private_data,
 			  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_sgmr_relation(int flags,
+   BufferAccessStrategy strategy,
+   SMgrRelation smgr,
+   ForkNumber forknum,
+   ReadStreamBlockNumberCB callback,
+   void *callback_private_data,
+   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..a9a3b0de6c9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-		   BufferAccessStrategy strategy,
-		   Relation rel,
-		   ForkNumber forknum,
-		   ReadStreamBlockNumberCB callback,
-		   void *callback_private_data,
-		   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+	   BufferAccessStrategy strategy,
+	   Relation rel,
+	   SMgrRelation smgr,
+	   ForkNumber forknum,
+	   ReadStreamBlockNumberCB callback,
+	   void *callback_private_data,
+	   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +423,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +432,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,7 +546,7 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
+		stream->ios[i].op.smgr = smgr;
 		stream->ios[i].op.smgr_persistence = 0;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
@@ -557,6 +555,56 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begin_impl() f

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-04-07 Thread Nazir Bilal Yavuz
Hi Andrey,

On Sun, 7 Apr 2024 at 08:29, Andrey M. Borodin  wrote:
>
>
>
> > On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz  wrote:
> >
> > I did not have the time to check other things you mentioned but I
> > tested the read performance. The table size is 5.5GB, I did 20 runs in
> > total.
>
> Hi Nazir!
>
> Do you plan to review anything else? Or do you think it worth to look at by 
> someone else? Or is the patch Ready for Committer? If so, please swap CF 
> entry [0] to status accordingly, currently it's "Waiting on Author".

Thanks for reminding me! I think this needs a review by someone else
(especially the prefetch part), so I changed it to 'Needs review'.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-04 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  wrote:
>
>
> I've reviewed the patches inline below and attached a patch that has
> some of my ideas on top of your patch.

Thank you!

>
> > From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Wed, 3 Apr 2024 15:14:15 +0300
> > Subject: [PATCH v6] Use streaming read API in ANALYZE
> >
> > ANALYZE command gets random tuples using BlockSampler algorithm. Use
> > streaming reads to get these tuples by using BlockSampler algorithm in
> > streaming read API prefetch logic.
> > ---
> >  src/include/access/heapam.h  |  6 +-
> >  src/backend/access/heap/heapam_handler.c | 22 +++---
> >  src/backend/commands/analyze.c   | 85 
> >  3 files changed, 42 insertions(+), 71 deletions(-)
> >
> > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> > index a307fb5f245..633caee9d95 100644
> > --- a/src/include/access/heapam.h
> > +++ b/src/include/access/heapam.h
> > @@ -25,6 +25,7 @@
> >  #include "storage/bufpage.h"
> >  #include "storage/dsm.h"
> >  #include "storage/lockdefs.h"
> > +#include "storage/read_stream.h"
> >  #include "storage/shm_toc.h"
> >  #include "utils/relcache.h"
> >  #include "utils/snapshot.h"
> > @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
> > struct 
> > GlobalVisState *vistest);
> >
> >  /* in heap/heapam_handler.c*/
> > -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> > -   
> >  BlockNumber blockno,
> > -   
> >  BufferAccessStrategy bstrategy);
> > +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> > +   
> >  ReadStream *stream);
> >  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
> > 
> >  TransactionId OldestXmin,
> > 
> >  double *liverows, double *deadrows,
> > diff --git a/src/backend/access/heap/heapam_handler.c 
> > b/src/backend/access/heap/heapam_handler.c
> > index 0952d4a98eb..d83fbbe6af3 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> > Relation NewHeap,
> >  }
> >
> >  /*
> > - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> > - * with SO_TYPE_ANALYZE option.
> > + * Prepare to analyze block returned from streaming object.  If the block 
> > returned
> > + * from streaming object is valid, true is returned; otherwise false is 
> > returned.
> > + * The scan has been started with SO_TYPE_ANALYZE option.
> >   *
> >   * This routine holds a buffer pin and lock on the heap page.  They are 
> > held
> >   * until heapam_scan_analyze_next_tuple() returns false.  That is until 
> > all the
> >   * items of the heap page are analyzed.
> >   */
> > -void
> > -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> > -
> > BufferAccessStrategy bstrategy)
> > +bool
> > +heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
> >  {
> >   HeapScanDesc hscan = (HeapScanDesc) scan;
> >
> > @@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, 
> > BlockNumber blockno,
> >* doing much work per tuple, the extra lock traffic is probably 
> > better
> >* avoided.
>
> Personally I think heapam_scan_analyze_next_block() should be inlined.
> It only has a few lines. I would find it clearer inline. At the least,
> there is no reason for it (or heapam_scan_analyze_next_tuple()) to take
> a TableScanDesc instead of a HeapScanDesc.

I agree.

>
> >*/
> > - hscan->rs_cblock = blockno;
> > - hscan->rs_cindex = FirstOffsetNumber;
> > - hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
> > -   

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
>
> On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > Streaming API has been committed, but the committed version has a minor
> > change: the read_stream_begin_relation function now takes a Relation
> > instead of a BufferManagerRelation. So, here is a v5 which addresses this
> > change.
>
> I'm getting a repeatable segfault / assertion failure with this:
>
> postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> CREATE TABLE
> postgres=# insert into tengiga select g, repeat('x', 900) from
> generate_series(1, 140) g;
> INSERT 0 140
> postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> SET
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> "heapam_handler.c", Line: 1079, PID: 262232
> postgres: heikki postgres [local]
> ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> postgres: heikki postgres [local]
> ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> postgres: heikki postgres [local]
> ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> postgres: heikki postgres [local]
> ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> postgres: heikki postgres [local]
> ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> postgres: heikki postgres [local]
> ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> postgres: heikki postgres [local]
> ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> was terminated by signal 6: Aborted

I ran into the same error while working on Jakub's benchmarking results.

Cause: I was using the nblocks variable to check how many blocks would
be returned from the streaming API. But it turns out that the number
returned from BlockSampler_Init() is not always equal to the number of
blocks that BlockSampler_Next() will return, because the block sampling
algorithm decides on the fly how many blocks to return, using some
random seeds.

There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
acquire_sample_rows():

The streaming API uses this function to prefetch block numbers.
BlockSampler_HasMore() will reach the end first, as it is used while
prefetching, so it will start to return false while there are still
buffers to return from the streaming API. That will cause some buffers
at the end to not be processed.

2- Expose something (a function, variable, etc.) from the streaming API
to indicate whether the read is finished and there are no more buffers
to return:

I think this works, but I am not sure if the streaming API allows
something like that.

3- Check every buffer returned from the streaming API; if it is
invalid, stop the main loop in acquire_sample_rows():

This solves the problem, but there will be two if checks for each
buffer returned:
- one in heapam_scan_analyze_next_block(), to check whether the returned buffer is invalid
- one to break the main loop in acquire_sample_rows() if
heapam_scan_analyze_next_block() returns false
One of the checks can be avoided by moving
heapam_scan_analyze_next_block()'s code into the main loop in
acquire_sample_rows().

I implemented the third solution, here is v6.
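
To make the idea concrete, here is a rough sketch of the loop shape I
have in mind (not the patch itself; the loop body is abbreviated and
names are illustrative):

for (;;)
{
    Buffer      buf;

    /*
     * The stream returns InvalidBuffer once the callback has reported
     * InvalidBlockNumber and all remaining I/O has been consumed.
     */
    buf = read_stream_next_buffer(stream, NULL);
    if (!BufferIsValid(buf))
        break;                  /* all sampled blocks are processed */

    /* ... lock the buffer, sample its tuples, then release it ... */
}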

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 8d396a42

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi Jakub,

Thank you for looking into this and doing a performance analysis.

On Wed, 3 Apr 2024 at 11:42, Jakub Wartak  wrote:
>
> On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
> [..]
> > v4 is rebased on top of v14 streaming read API changes.
>
> Hi Nazir, so with streaming API committed, I gave a try to this patch.
> With autovacuum=off and 30GB table on NVMe (with standard readahead of
> 256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB
> default) created using: create table t as select repeat('a', 100) || i
> || repeat('b', 500) as filler from generate_series(1, 4500) as i;
>
> on master, effect of mainteance_io_concurency [default 10] is like
> that (when resetting the fs cache after each ANALYZE):
>
> m_io_c = 0:
> Time: 3137.914 ms (00:03.138)
> Time: 3094.540 ms (00:03.095)
> Time: 3452.513 ms (00:03.453)
>
> m_io_c = 1:
> Time: 2972.751 ms (00:02.973)
> Time: 2939.551 ms (00:02.940)
> Time: 2904.428 ms (00:02.904)
>
> m_io_c = 2:
> Time: 1580.260 ms (00:01.580)
> Time: 1572.132 ms (00:01.572)
> Time: 1558.334 ms (00:01.558)
>
> m_io_c = 4:
> Time: 938.304 ms
> Time: 931.772 ms
> Time: 920.044 ms
>
> m_io_c = 8:
> Time: 666.025 ms
> Time: 660.241 ms
> Time: 648.848 ms
>
> m_io_c = 16:
> Time: 542.450 ms
> Time: 561.155 ms
> Time: 539.683 ms
>
> m_io_c = 32:
> Time: 538.487 ms
> Time: 541.705 ms
> Time: 538.101 ms
>
> with patch applied:
>
> m_io_c = 0:
> Time: 3106.469 ms (00:03.106)
> Time: 3140.343 ms (00:03.140)
> Time: 3044.133 ms (00:03.044)
>
> m_io_c = 1:
> Time: 2959.817 ms (00:02.960)
> Time: 2920.265 ms (00:02.920)
> Time: 2911.745 ms (00:02.912)
>
> m_io_c = 2:
> Time: 1581.912 ms (00:01.582)
> Time: 1561.444 ms (00:01.561)
> Time: 1558.251 ms (00:01.558)
>
> m_io_c = 4:
> Time: 908.116 ms
> Time: 901.245 ms
> Time: 901.071 ms
>
> m_io_c = 8:
> Time: 619.870 ms
> Time: 620.327 ms
> Time: 614.266 ms
>
> m_io_c = 16:
> Time: 529.885 ms
> Time: 526.958 ms
> Time: 528.474 ms
>
> m_io_c = 32:
> Time: 521.185 ms
> Time: 520.713 ms
> Time: 517.729 ms
>
> No difference to me, which seems to be good. I've double checked and
> patch used the new way
>
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
> -> mdreadv -> FileReadV -> pg_preadv (inlined)
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
> -> ...
>
> I gave also io_combine_limit to 32 (max, 256kB) a try and got those
> slightly better results:
>
> [..]
> m_io_c = 16:
> Time: 494.599 ms
> Time: 496.345 ms
> Time: 973.500 ms
>
> m_io_c = 32:
> Time: 461.031 ms
> Time: 449.037 ms
> Time: 443.375 ms
>
> and that (last one) apparently was able to push it to the ~50-60k random
> IOPS range; the rareq-sz was still ~8 (9.9) kB as analyze was
> still reading randomly, so I assume no merging was done:
>
> Device             r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
> nvme0n1       61212.00    591.82     0.00   0.00    0.10     9.90    2.00      0.02     0.00   0.00    0.00    12.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    6.28  85.20
>
> So in short it looks good to me.

My results are similar to yours. Also, I found a bug while working
on your benchmarking cases; I will share the cause and the fix soon.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 12:11, Daniel Gustafsson  wrote:
>
> > On 6 Mar 2024, at 09:59, Daniel Gustafsson  wrote:
>
> > Nothing sticks out from reading through these patches, they seem quite 
> > ready to
> > me.
>
> Took another look at this today and committed it. Thanks!

Thanks for the commit!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 10:23, Nazir Bilal Yavuz  wrote:
>
> v4 is rebased on top of v14 streaming read API changes.

Streaming API has been committed, but the committed version has a minor
change: the read_stream_begin_relation function now takes a Relation
instead of a BufferManagerRelation. So, here is a v5 which addresses this
change.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 43e2f2b32e2fdb7e1fd787b1d8595768741f4792 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 01:22:59 +0300
Subject: [PATCH v5] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
 src/include/access/heapam.h  |  4 +-
 src/backend/access/heap/heapam_handler.c | 16 ++---
 src/backend/commands/analyze.c   | 85 
 3 files changed, 36 insertions(+), 69 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b632fe953c4..4e35caeb42e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -374,8 +375,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 
 /* in heap/heapam_handler.c*/
 extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-		   BlockNumber blockno,
-		   BufferAccessStrategy bstrategy);
+		   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 		   TransactionId OldestXmin,
 		   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c86000d245b..0533d9660c4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  The scan has been
+ * started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
 void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..105285c3ea2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+   void *user_data,
+   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the sam

Re: Use streaming read API in ANALYZE

2024-04-02 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

On Wed, 27 Mar 2024 at 23:15, Melanie Plageman
 wrote:
>
> On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
> > >
> > >
> > > The new version of the streaming read API [1] is posted. I updated the
> > > streaming read API changes patch (0001), using the streaming read API
> > > in ANALYZE patch (0002) remains the same. This should make it easier
> > > to review as it can be applied on top of master
> > >
> > >
> >
> > The new version of the streaming read API is posted [1]. I rebased the
> > patch on top of master and v9 of the streaming read API.
> >
> > There is a minimal change in the 'use the streaming read API in ANALYZE'
> > patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> > to copy exactly the same behavior as before. Also, some benchmarking
> > results:
> >
> > I created a 22 GB table and set the size of shared buffers to 30GB, the
> > rest is default.
> >
> > ╔═══════════════════════════╦═════════════════════╦════════════╗
> > ║                           ║  Avg Timings in ms  ║            ║
> > ╠═══════════════════════════╬══════════╦══════════╬════════════╣
> > ║                           ║  master  ║  patched ║ percentage ║
> > ╠═══════════════════════════╬══════════╬══════════╬════════════╣
> > ║     Both OS cache and     ║          ║          ║            ║
> > ║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║    %9.9    ║
> > ╠═══════════════════════════╬══════════╬══════════╬════════════╣
> > ║   OS cache is loaded but  ║          ║          ║            ║
> > ║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║   %16.3    ║
> > ╠═══════════════════════════╬══════════╬══════════╬════════════╣
> > ║ Shared buffers are loaded ║          ║          ║            ║
> > ║                           ║  89.2846 ║  84.6952 ║    %5.1    ║
> > ╚═══════════════════════════╩══════════╩══════════╩════════════╝
> >
> > Any kind of feedback would be appreciated.
>
> Thanks for working on this!
>
> A general review comment: I noticed you have the old streaming read
> (pgsr) naming still in a few places (including comments) -- so I would
> just make sure and update everywhere when you rebase in Thomas' latest
> version of the read stream API.

Done.

>
> > From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Mon, 19 Feb 2024 14:30:47 +0300
> > Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
> >
> > --- a/src/backend/commands/analyze.c
> > +++ b/src/backend/commands/analyze.c
> > @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node 
> > *index_expr)
> >   return stats;
> >  }
> >
> > +/*
> > + * Prefetch callback function to get next block number while using
> > + * BlockSampling algorithm
> > + */
> > +static BlockNumber
> > +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> > +   
> > void *user_data,
> > +   
> > void *per_buffer_data)
>
> I don't think you need the pg_ prefix

Done.

>
> > +{
> > + BlockSamplerData *bs = user_data;
> > + BlockNumber *current_block = per_buffer_data;
>
> Why can't you just do BufferGetBlockNumber() on the buffer returned from
> the read stream API instead of allocating per_buffer_data for the block
> number?

Done.
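
For reference, the callback now returns only the block number, and the
caller derives it from the returned buffer; roughly (a sketch of the
resulting shape, not the exact committed hunk):

hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
hscan->rs_cindex = FirstOffsetNumber;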

>
> > +
> > + if (BlockSampler_HasMore(bs))
> > + *current_block = BlockSampler_Next(bs);
> > + else
> > + *current_block = InvalidBlockNumber;
> > +
> > + return *current_block;
>
>
> I think we'd like to keep the read stream code in heapam-specific code.
> Instead of doing streaming_read_buffer_begin() here, you could put this
> in heap_beginscan() or initscan() guarded by
> scan->rs_base.rs_flags & SO_TYPE_ANALYZE

In the recent changes [1], heapam_scan_analyze_next_[block | tuple]
are removed from tableam. They are directly called from
heapam-specific code now. So, IMO, no need to do this now.

v4 is rebased on top of v14 streaming read API changes.

[1] 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 725a3d875fb1d675f5d99d8602a87b5e47b765db Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v4 1/2] v14 Streamin

Re: Building with meson on NixOS/nixpkgs

2024-04-01 Thread Nazir Bilal Yavuz
Hi,

From your prior reply:

On Thu, 21 Mar 2024 at 23:44, Wolfgang Walther  wrote:
>
> Nazir Bilal Yavuz:
> > 0001 & 0002: Adding code comments to explain why they have fallback
> > could be nice.
> > 0003: Looks good to me.
>
> Added some comments in the attached.

Comments look good, thanks.

On Fri, 29 Mar 2024 at 21:48,  wrote:
>
> In v3, I added another small patch for meson, this one about proper
> handling of -Dlibedit_preferred when used together with -Dreadline=enabled.

You are right. I confirm the bug and your proposed patch fixes this.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-03-26 Thread Nazir Bilal Yavuz
Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
>
>
> The new version of the streaming read API [1] is posted. I updated the
> streaming read API changes patch (0001); the 'use streaming read API
> in ANALYZE' patch (0002) remains the same. This should make it easier
> to review, as it can be applied on top of master.
>
>

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'use the streaming read API in ANALYZE'
patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to copy exactly the same behavior as before. Also, some benchmarking
results:

I created a 22 GB table and set the size of shared buffers to 30GB, the
rest is default.

╔═══════════════════════════╦═════════════════════╦════════════╗
║                           ║  Avg Timings in ms  ║            ║
╠═══════════════════════════╬══════════╦══════════╬════════════╣
║                           ║  master  ║  patched ║ percentage ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║     Both OS cache and     ║          ║          ║            ║
║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║    %9.9    ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║   OS cache is loaded but  ║          ║          ║            ║
║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║   %16.3    ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Shared buffers are loaded ║          ║          ║            ║
║                           ║  89.2846 ║  84.6952 ║    %5.1    ║
╚═══════════════════════════╩══════════╩══════════╩════════════╝

Any kind of feedback would be appreciated.

[1]:
https://www.postgresql.org/message-id/CA%2BhUKGL-ONQnnnp-SONCFfLJzqcpAheuzZ%2B-yTrD9WBM-GmAcg%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 7278c354d80e4d1f21c6fa0d810a723789f2d722 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v3 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h  |  45 +-
 src/include/storage/streaming_read.h  |  50 ++
 src/backend/storage/Makefile  |   2 +-
 src/backend/storage/aio/Makefile  |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/streaming_read.c  | 678 +
 src/backend/storage/buffer/bufmgr.c   | 706 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/storage/meson.build   |   1 +
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml  |  14 +
 src/tools/pgindent/typedefs.list  |   2 +
 13 files changed, 1309 insertions(+), 237 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 		ForkNumber forkNum, BlockNumber blockNum,
 		ReadBufferMode mode, BufferAccessStrategy strategy,
 		bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers

Re: Streaming I/O, vectored I/O (WIP)

2024-03-18 Thread Nazir Bilal Yavuz
Hi,

On Sat, 16 Mar 2024 at 02:53, Thomas Munro  wrote:
>
> I am planning to push the bufmgr.c patch soon.  At that point the new
> API won't have any direct callers yet, but the traditional
> ReadBuffer() family of functions will internally reach
> StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
> ZeroBuffer() or nothing as appropriate.  Any more thoughts or
> objections?  Naming, semantics, correctness of buffer protocol,
> sufficiency of comments, something else?

+if (StartReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ &nblocks,
+ strategy,
+ flags,
+ &operation))
+WaitReadBuffers(&operation);

I think we need to call WaitReadBuffers when 'mode !=
RBM_ZERO_AND_CLEANUP_LOCK && mode != RBM_ZERO_AND_LOCK' or am I
missing something?
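
In other words, something along these lines (a sketch of the suggested
guard only, assuming the zero-and-lock modes discard the page contents
and therefore have nothing to wait for):

if (StartReadBuffers(bmr, &buffer, forkNum, blockNum, &nblocks,
                     strategy, flags, &operation))
{
    /* Skip the wait for modes that zero the buffer anyway (sketch). */
    if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
        WaitReadBuffers(&operation);
}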

Couple of nitpicks:

It would be nice to explain what the PrepareReadBuffer function does
with a comment.

+if (nblocks == 0)
+return;/* nothing to do */
It is guaranteed that nblocks will be bigger than 0. Can't we just use
Assert(operation->io_buffers_len > 0);?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Building with meson on NixOS/nixpkgs

2024-03-18 Thread Nazir Bilal Yavuz
Hi,

Thank you for the patches!

On Sat, 16 Mar 2024 at 14:48, Wolfgang Walther  wrote:
>
> To build on NixOS/nixpkgs I came up with a few small patches to
> meson.build. All of this works fine with Autoconf/Make already.

I do not have NixOS but I confirm that patches cleanly apply to master
and do pass CI. I have a small feedback:

0001 & 0002: Adding code comments to explain why they have fallback
could be nice.
0003: Looks good to me.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-03-15 Thread Nazir Bilal Yavuz
Hi,

On Thu, 7 Mar 2024 at 15:26, Cédric Villemain
 wrote:
>
> On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:
> >
> > I did not test read performance but I am planning to do that soon.

I did not have the time to check other things you mentioned but I
tested the read performance. The table size is 5.5GB, I did 20 runs in
total.

When the OS cache is cleared:

Master -> Median: 2266.293ms, Avg: 2265.5038ms
Patched -> Median: 2166.493ms, Avg: 2183.6208ms

When the buffers are in the OS cache:

Master -> Median: 585.719ms, Avg: 583.5032ms
Patched -> Median: 533.071ms, Avg: 532.7119ms

Patched version is better on both. ~4% when the OS cache is cleared,
~%9 when the buffers are in the OS cache.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-03-07 Thread Nazir Bilal Yavuz
Hi,

On Wed, 6 Mar 2024 at 18:23, Cédric Villemain
 wrote:
>
> Hi Nazir,
>
>
> thank you for your review. I comment below.
>
>
> On 05/03/2024 12:07, Nazir Bilal Yavuz wrote:
> >> 2. The second one does implement smgrprefetch with range and loops by
> >> default per segment to still have a check for interrupts.
> > It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
> > page of posix_fadvise [1] states that: "The amount of data read may be
> > decreased by the kernel depending on virtual memory load. (A few
> > megabytes will usually be fully satisfied, and more is rarely
> > useful.)". It is trying to prefetch 1GB data now. That could explain
> > your observation about differences between nr_cache numbers.
>
>  From an "adminsys" point of view I find it beneficial to get a single
> syscall per file, respecting the logic and behavior of the underlying system
> call.

I agree.

> The behavior is 100% OK, and in fact it might be a bad idea to prefetch
> block by block, as the result is just to put more pressure on a system
> that is already under pressure.
>
> Though there are use cases and it's nice to be able to do that too at
> this per page level.

Yes, I do not know which one is more important: caching more blocks but
creating more pressure, or creating less pressure but caching fewer
blocks. Also, pg_prewarm is designed to be run at startup, so I guess
there will not be much pressure.
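
To illustrate the two approaches being compared (a sketch, not the
patch; fd, nblocks and BLCKSZ stand in for the real values, and
posix_fadvise() comes from <fcntl.h>):

/* One advisory call per file/segment: the kernel chunks the readahead
 * internally and can back off under memory pressure. */
posix_fadvise(fd, 0, (off_t) nblocks * BLCKSZ, POSIX_FADV_WILLNEED);

/* Block by block: one syscall per 8kB block, which adds overhead but
 * gives per-page control. */
for (BlockNumber blk = 0; blk < nblocks; blk++)
    posix_fadvise(fd, (off_t) blk * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);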

> About [1], it's a very old statement about resources. And Linux manages a
> part of the problem for us here, I think [2]:
>
> /*
>   * Chunk the readahead into 2 megabyte units, so that we don't pin too much
>   * memory at once.
>   */
> void force_page_cache_ra()

Thanks for pointing out the actual code. Yes, it looks like the kernel
is already doing that. I would like to do more testing when you
forward vm_relation functions into pgfincore.

> >> Q: posix_fadvise may not work exactly the way you think it does, or does
> >> it ?
> >>
> >>
> >> In details, and for the question:
> >>
> >> However,  if instead you provide a real range, or the magic len=0 to
> >> posix_fadvise, then blocks are "more" loaded according to effective vm
> >> pressure (which is not the case on the previous example).
> >> As a result only a small part of the relation might be loaded, and this
> >> is probably not what end-users expect despite being probably a good
> >> choice (you can still free cache beforehand to help the kernel).
>
> I think it's a matter of documenting well the feature, and if at all
> possible, as usual, not let users be negatively impacted by default.
>
>
> >> An example, below I'm using vm_relation_cachestat() which provides linux
> >> cachestat output, and vm_relation_fadvise() to unload cache, and
> >> pg_prewarm for the demo:
> >>
> >> # clear cache: (nr_cache is the number of file system pages in cache,
> >> not postgres blocks)
> >>
> >> ```
> >> postgres=# select block_start, block_count, nr_pages, nr_cache from
> >> vm_relation_cachestat('foo',range:=1024*32);
> >>  block_start | block_count | nr_pages | nr_cache
> >> -------------+-------------+----------+----------
> >>            0 |       32768 |    65536 |        0
> >>        32768 |       32768 |    65536 |        0
> >>        65536 |       32768 |    65536 |        0
> >>        98304 |       32768 |    65536 |        0
> >>       131072 |        1672 |     3344 |        0
> >> ```
> >>
> >> # load full relation with pg_prewarm (patched)
> >>
> >> ```
> >> postgres=# select pg_prewarm('foo','prefetch');
> >> pg_prewarm
> >> 
> >>   132744
> >> (1 row)
> >> ```
> >>
> >> # Checking results:
> >>
> >> ```
> >> postgres=# select block_start, block_count, nr_pages, nr_cache from
> >> vm_relation_cachestat('foo',range:=1024*32);
> >>  block_start | block_count | nr_pages | nr_cache
> >> -------------+-------------+----------+----------
> >>            0 |       32768 |    65536 |      320
> >>        32768 |       32768 |    65536 |        0
> >>        65536 |       32768 |    65536 |        0
> >>        98304 |       32768 |    65536 |        0
> >>       131072 |        1672 |     3344 |      320  <-- segment 1
> >>
> >> ```
> >>
> >> # Load block by block and check:
> >>
> >> ```
> >> postgres=# select from generat

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-03-05 Thread Nazir Bilal Yavuz
think it's because of generate_series or whatever, but I have
> the exact same behavior with pgfincore.
> I can compare loading and unloading duration for similar "async" work,
> here each call is from block 0 with len of 132744 and a range of 1 block
> (i.e. posix_fadvise on 8kB at a time).
> So they have exactly the same number of operations doing DONTNEED or
> WILLNEED, but distinct duration on the first "load":
>
> ```
>
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 25.202 ms
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 1523.636 ms (00:01.524) <- not free !
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 24.967 ms
> ```

I confirm that there is a time difference between calling pg_prewarm
on the full relation and block by block, but IMO this is expected. When
pg_prewarm is called on the full relation, it does the initialization
part just once, but when it is called block by block, it does the
initialization for each call, right?

I ran 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively, but I
could not see a time difference between the first run (first load) and
the consecutive runs. Am I doing something wrong?

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-02-28 Thread Nazir Bilal Yavuz
Hi,

On Mon, 19 Feb 2024 at 18:13, Nazir Bilal Yavuz  wrote:
>
> I worked on using the currently proposed streaming read API [1] in ANALYZE. 
> The patch is attached. 0001 is the not yet merged streaming read API code 
> changes that can be applied to the master, 0002 is the actual code.
>
> The blocks to analyze are obtained by using the streaming read API now.
>
> - Since streaming read API is already doing prefetch, I removed the #ifdef 
> USE_PREFETCH code from acquire_sample_rows().
>
> - Changed 'while (BlockSampler_HasMore())' to 'while (nblocks)' because 
> the prefetch mechanism in the streaming read API will advance 'bs' before 
> returning buffers.
>
> - Removed BlockNumber and BufferAccessStrategy from the declaration of 
> scan_analyze_next_block(), passing pgsr (PgStreamingRead) instead of them.
>
> I counted the syscalls of analyzing a ~5GB table. It can be seen that the 
> patched version did ~1300 fewer read calls.
>
> Patched:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  39.67    0.012128           0     29809           pwrite64
>  36.96    0.011299           0     28594           pread64
>  23.24    0.007104           0     27611           fadvise64
>
> Master (21a71648d3):
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  38.94    0.016457           0     29816           pwrite64
>  36.79    0.015549           0     29850           pread64
>  23.91    0.010106           0     29848           fadvise64
>
>
> Any kind of feedback would be appreciated.
>
> [1]: 
> https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

The new version of the streaming read API [1] is posted. I updated the
streaming read API changes patch (0001), using the streaming read API
in ANALYZE patch (0002) remains the same. This should make it easier
to review as it can be applied on top of master

[1]: 
https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 21d9043501284c6bae996522ff2f3ac693f81986 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v2 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h |  45 ++
 src/include/storage/streaming_read.h |  52 ++
 src/backend/storage/Makefile |   2 +-
 src/backend/storage/aio/Makefile |  14 +
 src/backend/storage/aio/meson.build  |   5 +
 src/backend/storage/aio/streaming_read.c | 612 
 src/backend/storage/buffer/bufmgr.c  | 687 +++
 src/backend/storage/buffer/localbuf.c|  14 +-
 src/backend/storage/meson.build  |   1 +
 src/tools/pgindent/typedefs.list |   3 +
 10 files changed, 1202 insertions(+), 233 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 		ForkNumber forkNum, BlockNumber blockNum,
 		ReadBufferMode mode, BufferAccessStrategy strategy,
 		bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	Bloc

Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-02-26 Thread Nazir Bilal Yavuz
Hi,

On Fri, 23 Feb 2024 at 15:34, Daniel Gustafsson  wrote:
>
> > On 23 Feb 2024, at 13:09, Nazir Bilal Yavuz  wrote:
>
> > Does errmsg_internal() need to be used all the time when turning elogs
> > into ereports? errmsg_internal()'s comment says that "This should be
> > used for "can't happen" cases that are probably not worth spending
> > translation effort on.". Is it enough to check if the error message
> > has a translation, and then decide the use of errmsg_internal() or
> > errmsg()?
>
> If it's an elog then it won't have a translation as none are included in the
> translation set.  If the errmsg is generic enough to be translated anyways via
> another (un)related ereport call then we of course use that, but ideally not
> create new ones.

Thanks for the explanation.

All of your feedback is addressed in v2.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From fa45a69731da1488560b2749023e9b9573231c2b Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 23 Feb 2024 16:49:10 +0300
Subject: [PATCH v2 1/2] (xlog.c) Add missing error codes to PANIC/FATAL error
 reports

---
 src/backend/access/transam/xlog.c | 77 +--
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1162d55bff..d295cef9606 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1354,7 +1354,9 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 	}
 
 	if (CurrPos != EndPos)
-		elog(PANIC, "space reserved for WAL record does not match what was written");
+		ereport(PANIC,
+errcode(ERRCODE_DATA_CORRUPTED),
+errmsg_internal("space reserved for WAL record does not match what was written"));
 }
 
 /*
@@ -4040,7 +4042,8 @@ ValidateXLOGDirectoryStructure(void)
	if (stat(XLOGDIR, &stat_buf) != 0 ||
 		!S_ISDIR(stat_buf.st_mode))
 		ereport(FATAL,
-(errmsg("required WAL directory \"%s\" does not exist",
+(errcode_for_file_access(),
+ errmsg("required WAL directory \"%s\" does not exist",
 		XLOGDIR)));
 
 	/* Check for archive_status */
@@ -4050,7 +4053,8 @@ ValidateXLOGDirectoryStructure(void)
 		/* Check for weird cases where it exists but isn't a directory */
 		if (!S_ISDIR(stat_buf.st_mode))
 			ereport(FATAL,
-	(errmsg("required WAL directory \"%s\" does not exist",
+	(errcode_for_file_access(),
+	 errmsg("required WAL directory \"%s\" does not exist",
 			path)));
 	}
 	else
@@ -4059,7 +4063,8 @@ ValidateXLOGDirectoryStructure(void)
 (errmsg("creating missing WAL directory \"%s\"", path)));
 		if (MakePGDirectory(path) < 0)
 			ereport(FATAL,
-	(errmsg("could not create missing directory \"%s\": %m",
+	(errcode_for_file_access(),
+	 errmsg("could not create missing directory \"%s\": %m",
 			path)));
 	}
 
@@ -4296,7 +4301,8 @@ ReadControlFile(void)
 
 	if (ControlFile->pg_control_version != PG_CONTROL_VERSION && ControlFile->pg_control_version % 65536 == 0 && ControlFile->pg_control_version / 65536 != 0)
 		ereport(FATAL,
-(errmsg("database files are incompatible with server"),
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("database files are incompatible with server"),
  errdetail("The database cluster was initialized with PG_CONTROL_VERSION %d (0x%08x),"
 		   " but the server was compiled with PG_CONTROL_VERSION %d (0x%08x).",
 		   ControlFile->pg_control_version, ControlFile->pg_control_version,
@@ -4305,7 +4311,8 @@ ReadControlFile(void)
 
 	if (ControlFile->pg_control_version != PG_CONTROL_VERSION)
 		ereport(FATAL,
-(errmsg("database files are incompatible with server"),
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("database files are incompatible with server"),
  errdetail("The database cluster was initialized with PG_CONTROL_VERSION %d,"
 		   " but the server was compiled with PG_CONTROL_VERSION %d.",
 		   ControlFile->pg_control_version, PG_CONTROL_VERSION),
@@ -4320,7 +4327,8 @@ ReadControlFile(void)
 
 	if (!EQ_CRC32C(crc, ControlFile->crc))
 		ereport(FATAL,
-(errmsg("incorrect checksum in control file")));
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incorrect checksum in control file")));
 
 	/*
 	 * Do compatibility checking immediately.  If the database isn't
@@ -4329,68 +4337,78 @@ ReadControlFile(void)
 	 */
 	if (ControlFile->catalog_version_no != CATALOG_VERSION_NO)
 		ereport(FATAL,
-(errmsg("database files are incompatible with server"),
+(errcode(ERRCODE_OBJECT_NOT_IN_

Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-02-23 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

On Thu, 22 Feb 2024 at 16:55, Daniel Gustafsson  wrote:
>
> > On 6 Dec 2023, at 14:03, Nazir Bilal Yavuz  wrote:
>
> > There is an ongoing thread [1] for adding missing SQL error codes to
> > PANIC and FATAL error reports in xlogrecovery.c file. I did the same
> > but for xlog.c and relcache.c files.
>
> -   elog(PANIC, "space reserved for WAL record does not match what was 
> written");
> +   ereport(PANIC,
> +   (errcode(ERRCODE_DATA_CORRUPTED),
> +errmsg("space reserved for WAL record does not match 
> what was written")));
>
> elogs turned into ereports should use errmsg_internal() to keep the strings
> from being translated.

Does errmsg_internal() need to be used all the time when turning elogs
into ereports? errmsg_internal()'s comment says that "This should be
used for "can't happen" cases that are probably not worth spending
translation effort on.". Is it enough to check if the error message
has a translation, and then decide the use of errmsg_internal() or
errmsg()?

> -   elog(FATAL, "could not write init file");
> +   ereport(FATAL,
> +   (errcode_for_file_access(),
> +errmsg("could not write init file")));
>
> Is it worthwhile adding %m on these to get a little more help when debugging
> errors that shouldn't happen?

I believe it is worthwhile, so I will add.

> -   elog(FATAL, "could not write init file");
> +   ereport(FATAL,
> +   (errcode_for_file_access(),
>
> The extra parenthesis are no longer needed, I don't know if we have a policy 
> to
> remove them when changing an ereport call but we should at least not introduce
> new ones.
>
> -   elog(FATAL, "cannot read pg_class without having selected a 
> database");
> +   ereport(FATAL,
> +   (errcode(ERRCODE_INTERNAL_ERROR),
>
> ereport (and thus elog) already defaults to ERRCODE_INTERNAL_ERROR for ERROR 
> or
> higher, so unless there is a better errcode an elog() call if preferrable 
> here.

I did not know these, thanks.

> > I couldn't find a suitable error code for the "cache lookup failed for
> > relation" error in relcache.c and this error comes up in many places.
> > Would it be reasonable to create a new error code specifically for
> > this?
>
> We use ERRCODE_UNDEFINED_OBJECT for similar errors elsewhere, perhaps we can
> use that for these as well?

It looks okay to me; ERRCODE_UNDEFINED_OBJECT is mostly used for
'does not exist' errors, and it seems the main reason for the 'cache
lookup failed for relation' error is that the heap tuple does not exist
anymore.
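
So a converted call site could look roughly like this (a sketch using a
hypothetical relid variable; errmsg_internal() keeps the string out of
the translation set, per the discussion above):

ereport(ERROR,
        errcode(ERRCODE_UNDEFINED_OBJECT),
        errmsg_internal("cache lookup failed for relation %u", relid));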

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Use streaming read API in ANALYZE

2024-02-19 Thread Nazir Bilal Yavuz
Hi,

I worked on using the currently proposed streaming read API [1] in ANALYZE.
The patch is attached. 0001 is the not yet merged streaming read API code
changes that can be applied to the master, 0002 is the actual code.

The blocks to analyze are obtained by using the streaming read API now.

- Since streaming read API is already doing prefetch, I removed the #ifdef
USE_PREFETCH code from acquire_sample_rows().

- Changed 'while (BlockSampler_HasMore())' to 'while (nblocks)' because
the prefetch mechanism in the streaming read API will advance 'bs' before
returning buffers.

- Removed BlockNumber and BufferAccessStrategy from the declaration of
scan_analyze_next_block(), passing pgsr (PgStreamingRead) instead of them.

I counted the syscalls of analyzing a ~5GB table. It can be seen that the
patched version did ~1300 fewer read calls.

Patched:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.67    0.012128           0     29809           pwrite64
 36.96    0.011299           0     28594           pread64
 23.24    0.007104           0     27611           fadvise64

Master (21a71648d3):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.94    0.016457           0     29816           pwrite64
 36.79    0.015549           0     29850           pread64
 23.91    0.010106           0     29848           fadvise64


Any kind of feedback would be appreciated.

[1]:
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 509e55997c084f831fcbcb46cabe79d4f97aa93e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h |  22 +
 src/include/storage/streaming_read.h |  52 ++
 src/backend/storage/Makefile |   2 +-
 src/backend/storage/aio/Makefile |  14 +
 src/backend/storage/aio/meson.build  |   5 +
 src/backend/storage/aio/streaming_read.c | 472 ++
 src/backend/storage/buffer/bufmgr.c  | 592 +++
 src/backend/storage/buffer/localbuf.c|  14 +-
 src/backend/storage/meson.build  |   1 +
 src/tools/pgindent/typedefs.list |   2 +
 10 files changed, 953 insertions(+), 223 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 		ForkNumber forkNum, BlockNumber blockNum,
 		ReadBufferMode mode, BufferAccessStrategy strategy,
 		bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+ForkNumber forkNum,
+BlockNumber blockNum,
+BufferAccessStrategy strategy,
+bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+Buffer *buffers,
+ForkNumber forknum,
+BlockNumber blocknum,
+int nblocks,
+bool zero_on_error,
+BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern

Re: Show WAL write and fsync stats in pg_stat_io

2024-02-18 Thread Nazir Bilal Yavuz
Hi,

On Thu, 18 Jan 2024 at 04:22, Michael Paquier  wrote:
>
> On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > I agree with your points. While the other I/O related work is
> > happening, we can discuss what we should do in the variable op_byte
> > cases. Also, this is happening only for WAL right now, but if we try to
> > extend pg_stat_io in the future, that problem will possibly arise
> > again. So, it could be good to come up with a general solution, not
> > only for WAL.
>
> Okay, I've marked the patch as RwF for this CF.

I wanted to inform you that commit 73f0a13266 changed all WALRead
calls to read a variable number of bytes; previously, only the WAL
receiver was reading variable bytes.
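
For reference, the call sites now follow roughly this pattern (a sketch;
'count' stands in for the exact number of bytes the caller needs,
rather than a fixed XLOG_BLCKSZ):

if (!WALRead(state, cur_page, targetPagePtr, count, tli, &errinfo))
    WALReadRaiseError(&errinfo);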

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Do away with zero-padding assumption before WALRead()

2024-02-15 Thread Nazir Bilal Yavuz
Hi,

On Tue, 13 Feb 2024 at 09:17, Bharath Rupireddy
 wrote:
>
> Hi,
>
> I noticed an assumption [1] at WALRead() call sites expecting the
> flushed WAL page to be zero-padded after the flush LSN. I think this
> can't always be true as the WAL can get flushed after determining the
> flush LSN before reading it from the WAL file using WALRead(). I've
> hacked the code up a bit to check if that's true -
> https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
> the tests hit the Assert(false); added. Which means, the zero-padding
> comment around WALRead() call sites isn't quite right.
>
> I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> despite knowing exactly how much to read. Is it to tell the OS to
> explicitly fetch the whole page from the disk? If yes, the OS will do
> that anyway because the page transfers from disk to OS page cache are
> always in terms of disk block sizes, no?

I am curious about the same. The page size and the disk block size could
be different, so the reason could be explicitly fetching the whole
page from the disk, as you said. Is this the reason, or are there any
other benefits of always reading XLOG_BLCKSZ instead of reading only the
sufficient part? I tried to search in older threads and code comments
but could not find an explanation.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Simplify documentation related to Windows builds

2024-02-08 Thread Nazir Bilal Yavuz
Hi,

On Tue, 30 Jan 2024 at 11:02, Michael Paquier  wrote:
>
> On Fri, Jan 19, 2024 at 06:11:40AM -0500, Andrew Dunstan wrote:
> > FYI Strawberry was a bit stuck for a while at 5.32, but they are now up to
> > 5.38. See <https://strawberryperl.com/releases.html>
> >
> > I agree we shouldn't be recommending any particular perl distro, especially
> > not ASPerl which now has annoying license issues.
>
> The more I think about this thread, the more I'd tend to wipe out most
> of "windows-requirements" for the sole reason that it is the far-west
> regarding the various ways it is possible to get the dependencies we
> need for the build and at runtime.  We could keep it minimal with the
> set of requirements we are listing under meson in terms of versions:
> https://www.postgresql.org/docs/devel/install-requirements.html
>
> Then we could have one sentence recommending one, at most two
> facilities used the buildfarm, like https://www.msys2.org/ or
> chocolatey as these group basically all the dependencies we need for a
> meson build (right?) while linking back to the meson page about the
> version requirements.

I tested both msys2 and chocolatey on fresh Windows containers and
I confirm that Postgres can be built using them. I tested the
dependencies that are required to build and run Postgres. If more
dependencies need to be checked, I can test again.

As these will be continuously tested by the buildfarm, I agree that
what you suggested looks better.

> One issue I have with the meson page listing the requirements is that
> we don't directly mention Diff, but that's critical for the tests.

I think that is because most distros come with a preinstalled
diffutils package. It is mentioned under the Windows requirements page
[1] since it does not come preinstalled. However, I agree that it
could be good to mention it under the meson page listing the
requirements.

[1] 
https://www.postgresql.org/docs/devel/installation-platform-notes.html#WINDOWS-REQUIREMENTS

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Streaming I/O, vectored I/O (WIP)

2024-02-07 Thread Nazir Bilal Yavuz
Hi,

Thanks for working on this!

On Wed, 10 Jan 2024 at 07:14, Thomas Munro  wrote:

> Thanks!  I committed the patches up as far as smgr.c before the
> holidays.  The next thing to commit would be the bufmgr.c one for
> vectored ReadBuffer(), v5-0001.  Here's my response to your review of
> that, which triggered quite a few changes.
>
> See also new version of the streaming_read.c patch, with change list
> at end.  (I'll talk about v5-0002, the SMgrRelation lifetime one, over
> on the separate thread about that where Heikki posted a better
> version.)

I have a couple of comments / questions.

0001-Provide-vectored-variant-of-ReadBuffer:

- Do we need to pass the hit variable to ReadBuffer_common()? I think
it can just be declared in ReadBuffer_common() now.


0003-Provide-API-for-streaming-reads-of-relations:

- Do we need to rethink the logic for getting a victim buffer? The
StrategyGetBuffer() function errors out if it cannot find any unpinned
buffers; this can be more common in the async world since we pin
buffers before completing the read (while looking ahead).

- If the block returned from the callback is invalid,
pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
be cases where the returned block is invalid but we should continue to
read after this invalid block?

- The max_pinned_buffers and pinned_buffers_trigger variables are set in
the initialization part (in the
pg_streaming_read_buffer_alloc_internal() function) and then they do not
change. In some cases there may be no acquirable buffers to pin while
initializing the pgsr (LimitAdditionalPins() sets max_pinned_buffers to
1), but while the read is continuing there may be chances to create
larger reads (other consecutive reads finish while this read is
continuing). Do you think it makes sense to reset max_pinned_buffers
and pinned_buffers_trigger to higher values after the initialization, to
allow larger reads?

+/* Is there a head range that we can't extend? */
+head_range = &pgsr->ranges[pgsr->head];
+if (head_range->nblocks > 0 &&
+(!need_complete ||
+ !head_range->need_complete ||
+ head_range->blocknum + head_range->nblocks != blocknum))
+{
+/* Yes, time to start building a new one. */
+head_range = pg_streaming_read_new_range(pgsr);

- I think if both need_complete and head_range->need_complete are
false, we can extend the head range regardless of the consecutiveness
of the blocks.
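
A sketch of what I mean (a hypothetical rewrite of the test above; only
the no-I/O case is relaxed):

bool        both_no_io = !need_complete && !head_range->need_complete;
bool        io_consecutive = need_complete && head_range->need_complete &&
    head_range->blocknum + head_range->nblocks == blocknum;

/* Start a new range only if the head range can be extended neither as
 * a no-I/O range nor as a consecutive I/O range. */
if (head_range->nblocks > 0 && !both_no_io && !io_consecutive)
    head_range = pg_streaming_read_new_range(pgsr);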


0006-Allow-streaming-reads-to-ramp-up-in-size:

- The ramp_up_pin_limit variable is declared as an int but we do not
check for overflow while doubling it. This could be a problem in
longer reads.
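
Something like the following clamped doubling would avoid the
wrap-around (a sketch only; INT_MAX comes from <limits.h>, and the
real cap would presumably be tied to max_pinned_buffers):

/* Double the ramp-up limit without letting the int overflow. */
if (ramp_up_pin_limit <= INT_MAX / 2)
    ramp_up_pin_limit *= 2;
else
    ramp_up_pin_limit = INT_MAX;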

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-02 Thread Nazir Bilal Yavuz
Hi,

On Fri, 2 Feb 2024 at 12:11, Artur Zakirov  wrote:
>
> On Fri, 2 Feb 2024 at 09:41, Nazir Bilal Yavuz  wrote:
> > You seem right, nice catch. Also, this change makes the check in
> >
> > snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
> >  basedir,
> >  PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
> >  "pg_xlog" : "pg_wal");
> >
> > redundant. PQserverVersion(conn) will always be higher than
> > MINIMUM_VERSION_FOR_PG_WAL.
>
> Thank you both for the comments. Indeed, that part now looks redundant.
> I've attached a patch to remove checking MINIMUM_VERSION_FOR_PG_WAL.

Thanks for the update. The patch looks good to me.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Checking MINIMUM_VERSION_FOR_WAL_SUMMARIES

2024-02-02 Thread Nazir Bilal Yavuz
Hi,

On Fri, 2 Feb 2024 at 03:11, Artur Zakirov  wrote:
>
> Hi hackers,
>
> during reading the source code of new incremental backup functionality
> I noticed that the following condition can by unintentional:
>
> /*
>  * For newer server versions, likewise create pg_wal/summaries
>  */
> if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
> {
> ...
>
> if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
> errno != EEXIST)
> pg_fatal("could not create directory \"%s\": %m", summarydir);
> }
>
> This is from src/bin/pg_basebackup/pg_basebackup.c.
>
> Is the condition correct? Shouldn't it be ">=". Otherwise the function
> will create "/summaries" only for older PostgreSQL versions.

You seem right, nice catch. Also, this change makes the check in

snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
 basedir,
 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
     "pg_xlog" : "pg_wal");

redundant. PQserverVersion(conn) will always be higher than
MINIMUM_VERSION_FOR_PG_WAL.
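
So, inside that branch the ternary always picks "pg_wal", and the call
could be reduced to something like:

/* PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES here,
 * which is above MINIMUM_VERSION_FOR_PG_WAL, so it is always "pg_wal". */
snprintf(summarydir, sizeof(summarydir), "%s/pg_wal/summaries", basedir);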

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: 003_extrafiles.pl test fails on Windows with the newer Perl versions

2024-01-31 Thread Nazir Bilal Yavuz
Hi,

On Wed, 31 Jan 2024 at 01:18, Andrew Dunstan  wrote:
>
> Pushed to all live branches. Thanks for the patch.

Thanks for the push!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: compiling postgres on windows arm using meson

2024-01-30 Thread Nazir Bilal Yavuz
Hi,

On Thu, 18 Jan 2024 at 05:07, Dave Cramer  wrote:
>
> Greetings,
> Getting the following error
>
> [1146/2086] Generating src/backend/postgres.def with a custom command 
> (wrapped by meson to set PATH)
> FAILED: src/backend/postgres.def
> "C:\Program Files\Meson\meson.exe" "--internal" "exe" "--unpickle" 
> "C:\Users\davec\projects\postgresql\build\meson-private\meson_exe_perl.EXE_53b41ebc2e76cfc92dd6a2af212140770543faae.dat"
> while executing ['c:\\perl\\bin\\perl.EXE', 
> '../src/backend/../tools/msvc_gendef.pl', '--arch', 'aarch64', '--tempdir', 
> 'src/backend/postgres.def.p', '--deffile', 'src/backend/postgres.def', 
> 'src/backend/postgres_lib.a', 'src/common/libpgcommon_srv.a', 
> 'src/port/libpgport_srv.a']
> --- stdout ---
>
> --- stderr ---
> Usage: msvc_gendef.pl --arch  --deffile  --tempdir  
> files-or-directories
> arch: x86 | x86_64
> deffile: path of the generated file
> tempdir: directory for temporary files
> files or directories: object files or directory containing object files
>
> log attached

From the docs [1]: PostgreSQL will only build for the x64 architecture
on 64-bit Windows.

So, I think that is expected.

[1] 
https://www.postgresql.org/docs/current/install-windows-full.html#INSTALL-WINDOWS-FULL-64-BIT

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: 003_extrafiles.pl test fails on Windows with the newer Perl versions

2024-01-30 Thread Nazir Bilal Yavuz
Hi,

On Tue, 30 Jan 2024 at 14:21, Nazir Bilal Yavuz  wrote:
>
> It looks like File::Find converts backslashes to slashes in the newer
> Perl versions. I tried to find the related commit and found this:
> https://github.com/Perl/perl5/commit/414f14df98cb1c9a20f92c5c54948b67c09f072d

I forgot to mention that I used Strawberry Perl and these errors
started with 'Perl v5.36.1.1'.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




003_extrafiles.pl test fails on Windows with the newer Perl versions

2024-01-30 Thread Nazir Bilal Yavuz
Hi,

I was trying to install newer Perl versions to Windows CI images and
found that 003_extrafiles.pl test fails on Windows with:

(0.183s) not ok 2 - file lists match
(0.000s) #   Failed test 'file lists match'
#   at C:/cirrus/src/bin/pg_rewind/t/003_extrafiles.pl line 81.
(0.000s) # Structures begin differing at:
#  $got->[0] =
'C:/cirrus/build/testrun/pg_rewind/003_extrafiles/data/t_003_extrafiles_primary_local_data/pgdata/tst_both_dir'
# $expected->[0] =
'C:\cirrus\build/testrun/pg_rewind/003_extrafiles\data/t_003_extrafiles_primary_local_data/pgdata/tst_both_dir'

(0.263s) not ok 5 - file lists match
(0.000s) #   Failed test 'file lists match'
#   at C:/cirrus/src/bin/pg_rewind/t/003_extrafiles.pl line 81.
(0.000s) # Structures begin differing at:
#  $got->[0] =
'C:/cirrus/build/testrun/pg_rewind/003_extrafiles/data/t_003_extrafiles_primary_remote_data/pgdata/tst_both_dir'
# $expected->[0] =
'C:\cirrus\build/testrun/pg_rewind/003_extrafiles\data/t_003_extrafiles_primary_remote_data/pgdata/tst_both_dir'

It looks like File::Find converts backslashes to slashes in the newer
Perl versions. I tried to find the related commit and found this:
https://github.com/Perl/perl5/commit/414f14df98cb1c9a20f92c5c54948b67c09f072d

To solve this, I did the same conversion for Windows before comparing
the paths. And to support all Perl versions, I decided to always
convert them on Windows regardless of the Perl version. The fix is
attached.

I looked at other File::Find appearances in the code but they do not
compare the paths. So, I do not think there is any need to fix them.

Any kind of feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From b28f48fe7d98d3dd7d2fcf652bfa5c8a4cd1c2d6 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 30 Jan 2024 13:12:41 +0300
Subject: [PATCH v1] Fix 003_extrafiles.pl test for the Windows

File::Find converts backslashes to slashes in the newer Perl versions.
See: https://github.com/Perl/perl5/commit/414f14df98cb1c9a20f92c5c54948b67c09f072d

So, do the same conversion for Windows before comparing paths. To
support all Perl versions, always convert them on Windows regardless of
the Perl's version.
---
 src/bin/pg_rewind/t/003_extrafiles.pl | 13 +
 1 file changed, 13 insertions(+)

diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl
index d54dcddcbb6..7e4c2771ce7 100644
--- a/src/bin/pg_rewind/t/003_extrafiles.pl
+++ b/src/bin/pg_rewind/t/003_extrafiles.pl
@@ -78,6 +78,19 @@ sub run_test
 		},
 		$test_primary_datadir);
 	@paths = sort @paths;
+
+	# File::Find converts backslashes to slashes in the newer Perl
+	# versions. To support all Perl versions, do the same conversion
+	# for Windows before comparing the paths.
+	if ($PostgreSQL::Test::Utils::windows_os)
+	{
+		for my $filename (@paths)
+		{
+			$filename =~ s{\\}{/}g;
+		}
+		$test_primary_datadir =~ s{\\}{/}g;
+	}
+
 	is_deeply(
 		\@paths,
 		[
-- 
2.43.0



Re: Simplify documentation related to Windows builds

2024-01-19 Thread Nazir Bilal Yavuz
Hi,

On Sun, 31 Dec 2023 at 09:13, Michael Paquier  wrote:
>
> Hi all,
>
> As promised around thread [1] that has moved the docs related to
> Windows into a new sub-section for Visual, here is a follow-up to
> improve portions of its documentation, for discussion in the next CF.
>
> Some of my notes:
> - How much does command line editing work on Windows?  When it came to
> VS, I always got the impression that this never worked.  Andres in [2]
> mentioned otherwise because meson makes that easier?

I do not know that either.

> - ActiveState perl should not be recommended IMO, as being able to use
> a perl binary requires one to *register* into their service for tokens
> that can be used to kick perl sessions, last time I checked.  Two
> alternatives:
> -- MinGW perl binary.
> -- strawberry perl (?) and Chocolatey.
> -- MSYS

I agree. Also, its installation and usage steps are complicated IMO;
it is not a simple 'install it, add it to PATH, and forget' process.

> - http://www.mingw.org/ is a dead end.  This could be replaced by
> links to https://www.mingw-w64.org/ instead?

Correct.

> At the end, the main issue that I have with this part of the
> documentation is the lack of consistency leading to a confusing user
> experience in the builds of Windows.  My recent impressions were that
> Andrew has picked up Chocolatey in some (all?) of his buildfarm
> animals with Strawberry Perl.  I've had a good experience with it,
> FWIW, but Andres has also mentioned me a couple of weeks ago while in
> Prague that Strawberry could lead to unpredictible results (sorry I
> don't remember all the exact details).

Postgres CI uses Strawberry Perl [1] as well, but it is installed
directly from strawberryperl.com and its version is locked to
'5.26.3.1' for now.

> Between MSYS2, mingw-w64 and Chocolatey, there are a lot of options
> available to users.  So shouldn't we try to recommend only of them,
> then align the buildfarm and the CI to use one of them?  Supporting
> more than one, at most two, would be OK for me, my end goal would be
> to get rid entirely of the list of build dependencies in this "Visual"
> section, because that's just a duplicate of what meson lists, except
> that meson should do a better job at detecting dependencies than what
> the now-dead MSVC scripts did.  If we support two, the CI and the
> buildfarm should run them.

I agree.

> I am attaching a patch that's an embryon of work (little time for
> hacking as of life these days, still I wanted to get the discussion
> started), but let's discuss which direction we should take moving
> forward for 17~.

The current changes look good.

 Both Bison and Flex
 are included in the msys tool suite, available
-from http://www.mingw.org/wiki/MSYS as part of the
-MinGW compiler suite.
+from https://www.msys2.org/.

Since we are changing that part, I think we need to update the
sentence 'You will need to add the directory containing flex.exe and
bison.exe to the PATH environment variable. In the case of MinGW, the
directory is the \msys\1.0\bin subdirectory of your MinGW installation
directory.' to its msys2 equivalent. My initial testing showed that,
in my environment, the directory is the '\usr\bin' subdirectory of the
msys2 installation directory.

[1] 
https://github.com/anarazel/pg-vm-images/blob/main/scripts/windows_install_perl.ps1

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-17 Thread Nazir Bilal Yavuz
Hi,

On Mon, 15 Jan 2024 at 09:27, Michael Paquier  wrote:
>
> On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> > On Thu, 11 Jan 2024 at 17:28, Melanie Plageman  
> > wrote:
> >> Even if we made a separate view for WAL I/O stats, we would still have
> >> this issue of variable sized I/O vs block sized I/O and would probably
> >> end up solving it with separate columns for the number of bytes and
> >> number of operations.
> >
> > Yes, I think it is more about flexibility and not changing the already
> > published pg_stat_io view.
>
> I don't know.  Adding more columns or changing op_bytes with an extra
> mode that reflects on what the other columns mean is kind of the same
> thing to me: we want pg_stat_io to report more modes so as all I/O can
> be evaluated from a single view, but the complication is now that
> everything is tied to BLCKSZ.
>
> IMHO, perhaps we'd better put this patch aside until we are absolutely
> *sure* of what we want to achieve when it comes to WAL, and I am
> afraid that this cannot happen until we're happy with the way we
> handle WAL reads *and* writes, including WAL receiver or anything that
> has the idea of pulling its own page callback with
> XLogReaderAllocate() in the backend.  Well, writes should be
> relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
> reads are the unknown part.
>
> That also seems furiously related to the work happening with async I/O
> or the fact that we may want to have in the view a separate meaning
> for cached pages or pages read directly from disk.  The worst thing
> that we would do is rush something into the tree and then have to deal
> with the aftermath of what we'd need to deal with in terms of
> compatibility depending on the state of the other I/O related work
> when the new view is released.  That would not be fun for the users
> and any hackers who would have to deal with that (aka mainly me if I
> were to commit something), because pg_stat_io could mean something in
> version N, still mean something entirely different in version N+1.

I agree with your points. While the other I/O related work is
happening, we can discuss what we should do about the variable
op_bytes cases. Also, this currently affects only WAL, but if we try
to extend pg_stat_io in the future the problem will likely arise
again. So, it would be good to come up with a general solution, not
one only for WAL.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Create shorthand for including all extra tests

2024-01-15 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 23:48, Peter Eisentraut  wrote:
>
> On 05.09.23 19:26, Nazir Bilal Yavuz wrote:
> > Thanks for the feedback! I updated the patch, 'needs-private-lo'
> > option enables kerberos, ldap, load_balance and ssl extra tests now.
>
> As was discussed, I don't think "needs private lo" is the only condition
> for these tests.  At least kerberos and ldap also need extra software
> installed, and load_balance might need editing the system's hosts file.
> So someone would still need to familiarize themselves with these tests
> individually before setting a global option like this.
>
> Also, if we were to create test groupings like this, I think the
> implementation should be different.  The way you have it, there is a
> sort of central registry of all affected tests in
> src/test/perl/PostgreSQL/Test/Utils.pm and a mapping of groups to tests.
>   I would prefer a more decentralized approach where each test decides
> on its own whether to run, with pseudo-code conditionals like
>
> if (!(PG_TEST_EXTRA contains "ldap" or PG_TEST_EXTRA contains
> "needs-private-lo"))
>skip_all
>
> Anyway, at the moment, I don't see a sensible way to group these things
> beyond what we have now (effectively, "ldap" is already a group, because
> it affects more than one test suite).  Right now, we have six possible
> values, which is probably just about doable to keep track of manually.
> If we get a lot more, then we need to look into this again, but maybe
> then we'll also have more patterns to group things around.

I see your point. It looks like the best option is to reevaluate this
if there are more PG_TEST_EXTRA options.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: speed up a logical replica setup

2024-01-12 Thread Nazir Bilal Yavuz
Hi,

On Fri, 12 Jan 2024 at 09:32, Hayato Kuroda (Fujitsu)
 wrote:
>
> Good, all warnings were removed. However, the patch failed to pass tests on 
> FreeBSD twice.
>  I'm quite not sure the ERROR, but is it related with us?

The FreeBSD errors started after FreeBSD's CI image was updated [1].
I do not think the error is related to this patch.

[1] https://cirrus-ci.com/task/4700394639589376

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-12 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
 wrote:
>
> On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz  wrote:
> >
> > On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
> > >
> > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  
> > > > wrote:
> > > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  
> > > >> wrote:
> > > >> Oh, I understand it now. Yes, that makes sense.
> > > >> I thought removing op_bytes completely ( as you said "This patch
> > > >> extends it with two more operation sizes, and there are even cases
> > > >> where it may be a variable" ) from pg_stat_io view then adding
> > > >> something like {read | write | extend}_bytes and {read | write |
> > > >> extend}_calls could be better, so that we don't lose any information.
> > > >
> > > > Upthread, Michael says:
> > > >
> > > >> I find the use of 1 in this context a bit confusing, because when
> > > >> referring to a counter at N, then it can be understood as doing N
> > > >> times a operation,
> > > >
> > > > I didn't understand this argument, so I'm not sure if I agree or
> > > > disagree with it.
> > >
> > > Nazir has mentioned upthread one thing: what should we do for the case
> > > where a combination of (io_object,io_context) does I/O with a
> > > *variable* op_bytes, because that may be the case for the WAL
> > > receiver?  For this case, he has mentioned that we should set op_bytes
> > > to 1, but that's something I find confusing because it would mean that
> > > we are doing read, writes or extends 1 byte at a time.  My suggestion
> > > would be to use op_bytes = -1 or NULL for the variable case instead,
> > > with reads, writes and extends referring to a number of bytes rather
> > > than a number of operations.
> >
> > I agree but we can't do this only for the *variable* cases since
> > B_WAL_RECEIVER and other backends use the same
> > pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> > is, if two backends use the same pgstat_count_io_op_time() function
> > call in the code; they must count the same thing (number of calls,
> > bytes, etc.). It could be better to count the number of bytes for all
> > the IOOBJECT_WAL IOs.
>
> I'm a bit confused by this. pgstat_count_io_op_time() can check
> MyBackendType. In fact, you do this to ban the wal receiver already.
> It is true that you would need to count all wal receiver normal
> context wal object IOOps in the variable way, but I don't see how
> pgstat_count_io_op_time() is the limiting factor as long as the
> callsite is always doing either the number of bytes or the number of
> calls.

Apologies for not being clear. Let me try to explain this by giving examples:

Let's assume that there are 3 different pgstat_count_io_op_time()
calls in the code base and they are labeled as 1, 2 and 3.

And let's assume that: the WAL receiver uses the 1st and 2nd
pgstat_count_io_op_time(), autovacuum uses the 2nd and 3rd
pgstat_count_io_op_time(), and the checkpointer uses the 3rd
pgstat_count_io_op_time() to count IOs.

The 1st one is the only pgstat_count_io_op_time() call that must
count the number of bytes (because of the variable-size cases); the
others count the number of calls or blocks.

a) WAL receiver uses both 1st and 2nd => 1st and 2nd
pgstat_count_io_op_time() must count the same thing => 2nd
pgstat_count_io_op_time() must count the number of bytes as well.

b) The 2nd pgstat_count_io_op_time() started to count the number of
bytes => autovacuum will start to count the number of bytes => both
the 2nd and 3rd are used by autovacuum => the 3rd
pgstat_count_io_op_time() must count the number of bytes as well.

c) 3rd pgstat_count_io_op_time() started to count the number of bytes
=> Checkpointer will start to count the number of bytes.

The list goes on like this, and if we don't have [write | read |
extend]_bytes columns, this effect will be multiplied.

> > > > I think these are the three proposals for handling WAL reads:
> > > >
> > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > > number of calls to pg_pread() or similar
> > > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > > calls to pg_pread() or similar
> > >
> > > 3) could be a number of bytes, actually.
> >
> > One important point is that we can't change only reads, if we decide
> > to count the number of bytes for the reads; writes and extends should
> > be counted as a number of bytes as well.

Re: make pg_ctl more friendly

2024-01-11 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 06:33, Junwang Zhao  wrote:
>
> Hi Nazir,
>
> On Tue, Jan 9, 2024 at 9:23 PM Nazir Bilal Yavuz  wrote:
> >
> > v2-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch:
> >
> > -   "Examine the log output.\n"),
> > +   "Examine the log output\n"),
> >   progname);
> >
> > I don't think that this is needed.
> There seems to be no common sense for the ending dot when using
> write_stderr, so I will leave this not changed.
>
> >
> > Other than that, the patch looks good and I confirm that after PITR 
> > shutdown:
> >
> > "PITR shutdown"
> > "update configuration for startup again if needed"
> >
> > message shows up, instead of:
> >
> > "pg_ctl: could not start server"
> > "Examine the log output.".
> >
> > nitpick: It would be better if the order of the error message cases
> > and enums is the same ( i.e. 'POSTMASTER_RECOVERY_SHUTDOWN' before
> > 'POSTMASTER_FAILED' in enum )
> Agreed, fixed in V3

Thank you for the update. It looks good to me.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-11 Thread Nazir Bilal Yavuz
Hi,

On Thu, 11 Jan 2024 at 08:01, Michael Paquier  wrote:
>
> On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > I have code review feedback as well, but I've saved that for my next email.
>
> Ah, cool.
>
> > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz  wrote:
> >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
> >> Oh, I understand it now. Yes, that makes sense.
> >> I thought removing op_bytes completely ( as you said "This patch
> >> extends it with two more operation sizes, and there are even cases
> >> where it may be a variable" ) from pg_stat_io view then adding
> >> something like {read | write | extend}_bytes and {read | write |
> >> extend}_calls could be better, so that we don't lose any information.
> >
> > Upthread, Michael says:
> >
> >> I find the use of 1 in this context a bit confusing, because when
> >> referring to a counter at N, then it can be understood as doing N
> >> times a operation,
> >
> > I didn't understand this argument, so I'm not sure if I agree or
> > disagree with it.
>
> Nazir has mentioned upthread one thing: what should we do for the case
> where a combination of (io_object,io_context) does I/O with a
> *variable* op_bytes, because that may be the case for the WAL
> receiver?  For this case, he has mentioned that we should set op_bytes
> to 1, but that's something I find confusing because it would mean that
> we are doing read, writes or extends 1 byte at a time.  My suggestion
> would be to use op_bytes = -1 or NULL for the variable case instead,
> with reads, writes and extends referring to a number of bytes rather
> than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code, they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

> > I think these are the three proposals for handling WAL reads:
> >
> > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > number of calls to pg_pread() or similar
> > 3) setting op_bytes to NULL and the number of reads is the number of
> > calls to pg_pread() or similar
>
> 3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

> > Looking at the patch, I think it is still doing 2.
>
> The patch disables stats for the WAL receiver, while the startup
> process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> discard the variable case.
>
> > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > all those IOOps for which it would be relevant. It kind of stinks but
> > it would give us the freedom to document exactly what a single IOOp
> > means for each combination of BackendType, IOContext, IOObject, and
> > IOOp (as relevant) and still have an accurate number in the *bytes
> > columns. Everyone will probably hate us if we do that, though.
> > Especially because having bytes for the existing IOObjects is an
> > existing feature.
>
> An issue I have with this one is that having multiple tuples for
> each (object,context) if they have multiple op_bytes leads to
> potentially a lot of bloat in the view.  That would be up to 8k extra
> tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

> > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > directly). Now, WAL reads are not from permanent storage anymore. Are
> > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > question for some of the WAL receiver functions. Should we track any
> > I/O other than permanent storage I/O? Or did I miss this being
> > addressed upthread?
>
> That's a good point.  I guess that this should just be a different
> IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.

I think a different IOContext, rather than a different IOOp, suits
this better.

> > In terms of what I/O we should track in a streaming/asynchronous
> > world, the options would be:
> >
> > 1) track read/write syscalls
> > 2) track blocks of BLCKSZ submitted to the kernel
> > 3) track bytes submitted to the kernel
> > 4) track merged I/Os (after doing any merging in the a

Re: Show WAL write and fsync stats in pg_stat_io

2024-01-10 Thread Nazir Bilal Yavuz
Hi,

On Wed, 10 Jan 2024 at 08:25, Michael Paquier  wrote:
>
> On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> >
> > I thought removing op_bytes completely ( as you said "This patch
> > extends it with two more operation sizes, and there are even cases
> > where it may be a variable" ) from pg_stat_io view then adding
> > something like {read | write | extend}_bytes and {read | write |
> > extend}_calls could be better, so that we don't lose any information.
>
> But then you'd lose the possibility to analyze correlations between
> the size and the number of the operations, which is something that
> matters for more complex I/O scenarios.  This does not need to be
> tackled in this patch, which is useful on its own, though I am really
> wondering if this is required for the recent work done by Thomas.
> Perhaps Andres, Thomas or Melanie could comment on that?

Yes, you are right.

> >> Yeah, a limitation like that may be acceptable for now.  Tracking the
> >> WAL writer and WAL sender activities can be relevant in a lot of cases
> >> even if we don't have the full picture for the WAL receiver yet.
> >
> > I added that and disabled B_WAL_RECEIVER backend with comments
> > explaining why. v8 is attached.
>
> I can see that's what you have been adding here, which should be OK:
>
> > -if (track_io_timing)
> > +/*
> > + * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp 
> > IOs
> > + * but these IOs are not countable for now because IOOP_READ IOs' 
> > op_bytes
> > + * (number of bytes per unit of I/O) might not be the same all the 
> > time.
> > + * The current implementation requires that the op_bytes must be the 
> > same
> > + * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> > + * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> > + * now.
> > + */
> > +if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > +return;
>
> This could be worded better, but that's one of these nits from me I
> usually tweak when committing stuff.

Thanks for doing that! Do you have any specific comments that can help
improve it?

> > +/*
> > + * Decide if IO timings need to be tracked.  Timings associated to
> > + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> > + * else rely on track_io_timing.
> > + */
> > +static bool
> > +pgstat_should_track_io_time(IOObject io_object)
> > +{
> > +if (io_object == IOOBJECT_WAL)
> > +return track_wal_io_timing;
> > +
> > +return track_io_timing;
> > +}
>
> One thing I was also considering is if eliminating this routine would
> make pgstat_count_io_op_time() more readable the result, but I cannot
> get to that.

I could not think of a way to eliminate the
pgstat_should_track_io_time() routine without causing performance
regressions. What do you think about moving the body of the 'if
(pgstat_should_track_io_time(io_object))' check into another function
and calling that from pgstat_count_io_op_time()? This does not change
any behavior, but IMO it improves readability.
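
Roughly like this (the helper name pgstat_io_op_time_accum() is made
up for the sketch):

static void
pgstat_io_op_time_accum(IOObject io_object, IOContext io_context,
                        IOOp io_op, instr_time start_time, uint32 cnt)
{
    /* ...the body of the current "if" block: compute the elapsed
     * time and accumulate it into the matching counters... */
}

void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
                        IOOp io_op, instr_time start_time, uint32 cnt)
{
    if (pgstat_should_track_io_time(io_object))
        pgstat_io_op_time_accum(io_object, io_context, io_op,
                                start_time, cnt);

    pgstat_count_io_op_n(io_object, io_context, io_op, cnt);
}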

> >  if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> >  {
> > -
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +if (io_object != IOOBJECT_WAL)
> > +
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >  if (io_object == IOOBJECT_RELATION)
> >  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, 
> > io_time);
> >  else if (io_object == IOOBJECT_TEMP_RELATION)
> > @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext 
> > io_context, IOOp io_op,
> >  }
> >  else if (io_op == IOOP_READ)
> >  {
> > -
> > pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +if (io_object != IOOBJECT_WAL)
> > +
> > pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >  if (io_object == IOOBJECT_RELATION)
> >  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, 
> > io_time);
> >  else if (io_object == IOOBJECT_TEMP_RELATION)
>
> A second thing is if this would be better with more switch/cases, say:
> switch (io_op):
> {
> case IOOP_EXTEND:
> case IOOP_WRITE:
> switch (io_object):
> {
> case WAL:
> 

Re: make pg_ctl more friendly

2024-01-09 Thread Nazir Bilal Yavuz
Hi,

Thank you for working on this! I agree that the current message is not friendly.

On Thu, 9 Nov 2023 at 10:19, Junwang Zhao  wrote:
>
> On Thu, Nov 9, 2023 at 3:08 PM Junwang Zhao  wrote:
> >
> > After a PITR shutdown, the db state should be *shut down in recovery*, try 
> > the
> > patch attached.
> >
>
> previous patch has some format issues, V2 attached.

v2-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch:

-   "Examine the log output.\n"),
+   "Examine the log output\n"),
  progname);

I don't think that this is needed.

Other than that, the patch looks good and I confirm that after PITR shutdown:

"PITR shutdown"
"update configuration for startup again if needed"

message shows up, instead of:

"pg_ctl: could not start server"
"Examine the log output.".

nitpick: It would be better if the order of the error message cases
and enums is the same ( i.e. 'POSTMASTER_RECOVERY_SHUTDOWN' before
'POSTMASTER_FAILED' in enum )

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-05 Thread Nazir Bilal Yavuz
Hi,

On Fri, 5 Jan 2024 at 02:25, Jim Nasby  wrote:
>
> On 1/4/24 2:23 PM, Andres Freund wrote:
>
> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
>
> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
>  nskippable_blocks
>
> I think these may lead to worse code - the compiler has to reload
> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
> can't guarantee that they're not changed within one of the external functions
> called in the loop body.
>
> Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder 
> if the concept of skipping should go away in the context of vector IO? 
> Instead of thinking about "we can skip this range of blocks", why not 
> maintain a list of "here's the next X number of blocks that we need to 
> vacuum"?

Sorry if I misunderstood. AFAIU, with the help of vectored IO, "the
next X number of blocks that need to be vacuumed" will be prefetched
by calculating the unskippable blocks (using the lazy_scan_skip()
function), and X will be determined by Postgres itself. Do you have
something different in mind?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-03 Thread Nazir Bilal Yavuz
Hi,

On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
>
> On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
> >> I am not sure while the whole point of the exercise is to have all the
> >> I/O related data in a single view.  Something that I've also found a
> >> bit disturbing yesterday while looking at your patch is the fact that
> >> the operation size is guessed from the context and object type when
> >> querying the view because now everything is tied to BLCKSZ.  This
> >> patch extends it with two more operation sizes, and there are even
> >> cases where it may be a variable.  Could it be a better option to
> >> extend pgstat_count_io_op_time() so as callers can themselves give the
> >> size of the operation?
> >
> > Do you mean removing the op_bytes column and tracking the number of
> > bytes in reads, writes, and extends? If so, that makes sense to me but
> > I don't want to remove the number of operations; I believe that has a
> > value too. We can extend the pgstat_count_io_op_time() so it can both
> > track the number of bytes and the number of operations.
>
> Apologies if my previous wording sounded confusing.  The idea I had in
> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

> > Also, it is not directly related to this patch but vectored IO [1] is
> > coming soon; so the number of operations could be wrong since vectored
> > IO could merge a couple of operations.
>
> Hmm.  I have not checked this patch series so I cannot say for sure,
> but we'd likely just want to track the number of bytes if a single
> operation has a non-equal size rather than registering in pg_stat_io N
> rows with different op_bytes, no?

Yes, that is correct.

> I am looping in Thomas Munro in CC for comments.

Thanks for doing that.

> > If we want to entirely disable it, we can add
> >
> > if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > return;
> >
> > to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> > calls are done by this function, then we can disable it at
> > pgstat_tracks_io_bktype().
>
> Yeah, a limitation like that may be acceptable for now.  Tracking the
> WAL writer and WAL sender activities can be relevant in a lot of cases
> even if we don't have the full picture for the WAL receiver yet.

I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From dc258a5b168f15c9771f174924261b151193 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Jan 2024 15:36:19 +0300
Subject: [PATCH v8] Show WAL stats on pg_stat_io (except streaming
 replication)

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, pg_stat_io shows how many WAL operations each backend
makes, along with their context, types and total timings.

In this patch a new IOContext, IOCONTEXT_INIT, is introduced for IO
operations done while creating objects (currently, WAL segments). For
now, it is used only together with the IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

doesn't cover:
- Streaming replication WAL IO.
---
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/pgstat.h  |   6 +-
 src/backend/access/transam/xlog.c |  58 +--
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/catalog/system_views.sql  |  15 ++-
 src/backend/utils/activity/pgstat_io.c| 119 --
 src/backend/utils/adt/pgstatfuncs.c   |  24 ++---
 src/test/regress/expected/rules.out   |  27 +++--
 src/test/regress/expected/stats.out   |  53 ++
 src/test/regress/sql/stats.sql|  25 +
 doc/src/sgml/monitoring.sgml  |  29 --
 11 files changed, 2

Re: Show WAL write and fsync stats in pg_stat_io

2023-12-26 Thread Nazir Bilal Yavuz
Hi,

On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
>
> On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:
> > Maybe it is better to create a pg_stat_io_wal view like you said
> > before. We could remove unused columns and add op_bytes for each
> > writes and reads. Also, we can track both the number of bytes and the
> > number of the operations. This doesn't fully solve the problem but it
> > will be easier to modify it to meet our needs.
>
> I am not sure while the whole point of the exercise is to have all the
> I/O related data in a single view.  Something that I've also found a
> bit disturbing yesterday while looking at your patch is the fact that
> the operation size is guessed from the context and object type when
> querying the view because now everything is tied to BLCKSZ.  This
> patch extends it with two more operation sizes, and there are even
> cases where it may be a variable.  Could it be a better option to
> extend pgstat_count_io_op_time() so as callers can themselves give the
> size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.
Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

>
> The whole patch is kind of itself complicated enough, so I'd be OK to
> discard the case of the WAL receiver for now.  Now, if we do so, the
> code stack of pgstat_io.c should handle WAL receivers as something
> entirely disabled until all the known issues are solved.  There is
> still a lot of value in tracking WAL data associated to the WAL
> writer, normal backends and WAL senders.

Why can't we add comments and leave it as it is? Is it because this
could cause misunderstandings?

If we want to entirely disable it, we can add

if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
return;

to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
calls are done by this function, then we can disable it at
pgstat_tracks_io_bktype().
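
The pgstat_tracks_io_bktype() side would then be a one-case change,
roughly (a sketch; the comment wording is mine):

switch (bktype)
{
    /* ... existing cases ... */

    case B_WAL_RECEIVER:
        /*
         * Disabled for now: the WAL receiver's tracked IO is
         * IOOBJECT_WAL, and its reads have a variable size.
         */
        return false;

    /* ... */
}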

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-12-26 Thread Nazir Bilal Yavuz
Hi,

On Tue, 26 Dec 2023 at 03:06, Michael Paquier  wrote:
>
> On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:
> > On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  
> > wrote:
> >> If there is any combination of BackendType and IOContext which will
> >> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> >> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> >> tracking reads and write IOOps in number of bytes (instead of number of
> >> pages). I don't actually know if there is a clear separation by
> >> BackendType for these different cases.
> >
> > Using op_bytes as 1 solves this problem but since it will be different
> > from the rest of the pg_stat_io view it could be hard to understand.
> > There is no clear separation by backends as it can be seen from the 
> > walsender.
>
> I find the use of 1 in this context a bit confusing, because when
> referring to a counter at N, then it can be understood as doing N
> times a operation, but it would be much less than that.  Another
> solution would be to use NULL (as a synonym of "I don't know") and
> then document that in this case all the bigint counters of pg_stat_io
> track the number of bytes rather than the number of operations?

Yes, that makes sense.

Maybe it is better to create a pg_stat_io_wal view like you said
before. We could remove unused columns and add an op_bytes for each of
the writes and reads. Also, we can track both the number of bytes and
the number of operations. This doesn't fully solve the problem, but it
will be easier to modify it to meet our needs.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-12-25 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review and feedback on your previous reply!

On Mon, 25 Dec 2023 at 09:40, Michael Paquier  wrote:
>
> On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> > pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> > receiver considered as something correct in the list of backend types,
> > while the intention is to *not* add it to pg_stat_io?  I have tried to
> > switche to the correct behavior of returning false for a
> > B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> > freezes on its shutdown sequence.  Something weird is going on here.
> > Could you look at it?  See the XXX comment in the attached, which is
> > the same behavior as v6-0002.  It looks to me that the patch has
> > introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> > incorrect way to avoid the root issue.
>
> Ah, that's because it would trigger an assertion failure:
> TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
>  io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
> postgres: standby_local: walreceiver
> (ExceptionalCondition+0xa8)[0x560d1b4dd38a]
>
> And the backtrace just tells that this is the WAL receiver
> initializing a WAL segment:
> #5  0x560d1b3322c8 in pgstat_count_io_op_n
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> cnt=1) at pgstat_io.c:89
> #6  0x560d1b33254a in pgstat_count_io_op_time
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> start_time=..., cnt=1) at pgstat_io.c:181
> #7  0x560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
> added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/0001", '0'
> , "3") at xlog.c:3115
> #8  0x560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
> xlog.c:3215

Correct.

>
> Wouldn't it be simpler to just bite the bullet in this case and handle
> WAL receivers in the IO tracking?

There is one problem and I couldn't decide how to solve it. We need
to handle read IO in WALRead() in xlogreader.c. How many bytes the
WALRead() function will read is controlled by a variable, and it can
differ from XLOG_BLCKSZ. This is a problem because pg_stat_io's
op_bytes column is a constant.

Here are all WALRead() function calls:

1- read_local_xlog_page_guts() in xlogutils.c => WALRead(XLOG_BLCKSZ)
=> always reads XLOG_BLCKSZ.

2- summarizer_read_local_xlog_page() in walsummarizer.c =>
WALRead(XLOG_BLCKSZ) => always reads XLOG_BLCKSZ.

3- logical_read_xlog_page() in walsender.c => WALRead(XLOG_BLCKSZ) =>
always reads XLOG_BLCKSZ.

4- XLogSendPhysical() in walsender.c => WALRead(nbytes) =>  nbytes can
be different from XLOG_BLCKSZ.

5- WALDumpReadPage() in pg_waldump.c => WALRead(count) => count can be
different from XLOG_BLCKSZ.

4 and 5 are the problematic calls.
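
To make the mismatch concrete: if we counted bytes at the
variable-size call sites, call 4 could look roughly like this
(hypothetical use of the last argument, which today counts
operations):

/* XLogSendPhysical(): nbytes varies, so pass the byte count instead
 * of "1 operation" (hypothetical unit change). */
pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
                        io_start, nbytes);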

Melanie's answer to this problem on previous discussions:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman  wrote:
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

Using op_bytes as 1 solves this problem, but since it would be
different from the rest of the pg_stat_io view it could be hard to
understand. There is no clear separation by backend type, as the
walsender case shows.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
The pg_pread() call tries to read XLOG_BLCKSZ bytes, but that is not
guaranteed, and we do not count the IO if it could not read
XLOG_BLCKSZ bytes. IMO this is not as important as the previous
problem, but it is still a problem.

I would be glad to hear opinions on these problems.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Remove MSVC scripts from the tree

2023-12-19 Thread Nazir Bilal Yavuz
Hi,

On Tue, 19 Dec 2023 at 18:24, Peter Eisentraut  wrote:
>
> On 18.12.23 14:52, Peter Eisentraut wrote:
> >> 2) I had seen that if sed/gzip is not available meson build will fail:
> >> 2.a)
> >> Program gsed sed found: NO
> >> meson.build:334:6: ERROR: Program 'gsed sed' not found or not executable
> >
> > Yes, this would need to be improved.  Currently, sed is only required if
> > either selinux or dtrace is enabled, which isn't supported on Windows.
> > But we should adjust the build scripts to not fail the top-level setup
> > run unless those options are enabled.
> >
> >> 2.b)
> >> Program gzip found: NO
> >> meson.build:337:7: ERROR: Program 'gzip' not found or not executable
> >
> > gzip is only required for certain test suites, so again we should adjust
> > the build scripts to not fail the build but instead skip the tests as
> > appropriate.
>
> Here are patches for these two issues.  More testing would be appreciated.

0001-meson-Require-sed-only-when-needed:

+sed = find_program(get_option('SED'), 'sed', native: true,
+   required: get_option('dtrace').enabled() or get_option('selinux').enabled())

dtrace is disabled by default but selinux is set to auto. So, meson
could find selinux (because of the auto) while failing to find sed,
and then compilation will fail with:
contrib/sepgsql/meson.build:34:19: ERROR: Tried to use not-found
external program in "command"

I think we need to require sed when dtrace or selinux is actually
found, not based on the return value of get_option().enabled().

Second patch looks good to me.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs

2023-12-14 Thread Nazir Bilal Yavuz
Hi,

Sorry for the late reply.

On Fri, 6 Oct 2023 at 17:07, Tom Lane  wrote:
>
> As a quick cross-check, I searched our commit log to see how many
> README-only commits there were so far this year.  I found 11 since
> January.  (Several were triggered by the latest round of pgindent
> code and process changes, so maybe this is more than typical.)
>
> Not sure what that tells us about the value of changing the CI
> logic, but it does seem like it could be worth the one-liner
> change needed to teach buildfarm animals to ignore READMEs.

I agree that it could be worth implementing this logic on buildfarm animals.

In case we want to implement the same logic on the CI, I added a new
version of the patch; it skips CI completely if the changes are only
in the README files.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 6c268233d13262a31965baf2dbb42913d83dab1d Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 14 Dec 2023 16:23:28 +0300
Subject: [PATCH v3] If the changes are only in the README files, skip all the
 tasks.

If the changes are only in the README files, skip all the tasks to save
CI time.
---
 .cirrus.tasks.yml | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index e137769850..dfd12f8b04 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -4,6 +4,12 @@
 # further details, see src/tools/ci/README
 
 
+# If the changes are only in the README files, skip all the tasks.
+#
+# This skip is overridden in the 'SanityCheck' task, so the same check is
+# added to the 'SanityCheck' task too.
+skip: changesIncludeOnly('**/README')
+
 env:
   # The lower depth accelerates git clone. Use a bit of depth so that
   # concurrent tasks and retrying older jobs have a chance of working.
@@ -55,11 +61,12 @@ on_failure_meson: _failure_meson
 task:
   name: SanityCheck
 
-  # If a specific OS is requested, don't run the sanity check. This shortens
-  # push-wait-for-ci cycle time a bit when debugging operating system specific
-  # failures. Uses skip instead of only_if, as cirrus otherwise warns about
-  # only_if conditions not matching.
-  skip: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:.*'
+  # Don't run the sanity check if the changes are only in the README files or
+  # if a specific OS is requested. The latter shortens push-wait-for-ci cycle
+  # time a bit when debugging operating system specific failures. Uses skip
+  # instead of only_if, as cirrus otherwise warns about only_if conditions not
+  # matching.
+  skip: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:.*' || changesIncludeOnly('**/README')
 
   env:
 CPUS: 4
-- 
2.43.0



Re: Add code indentation check to cirrus-ci (was Re: Add BF member koel-like indentation checks to SanityCheck CI)

2023-12-14 Thread Nazir Bilal Yavuz
Hi,

You may want to check out the WIP patch [1] by Andres about adding
meson targets to run pgindent.

[1] 
https://www.postgresql.org/message-id/20231019044907.ph6dw637loqg3lqk%40awork3.anarazel.de

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-12-12 Thread Nazir Bilal Yavuz
Hi,

Thanks for the feedback! The new version of the patch is attached.

On Tue, 5 Dec 2023 at 09:16, Michael Paquier  wrote:
>
> -   if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> +   if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)
>
> Unrelated diff.

Done.

>
> +   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +   io_op == IOOP_FSYNC)
> +   PendingWalStats.wal_sync += cnt;
>
> Nah, I really don't think that adding this dependency within
> pg_stat_io is a good idea.
>
> -   PendingWalStats.wal_sync++;
> +   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
> +   io_start, 1);
>
> This is the only caller where this matters, and the count is always 1.

I reverted that; pgstat_count_io_op_n() doesn't count
PendingWalStats.wal_sync now.

>
> +   no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
> +   bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
> +   bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
> +   bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
> +   bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
> +
> +   if (no_wal_normal_read &&
> +   (io_object == IOOBJECT_WAL &&
> +io_op == IOOP_READ))
> +   return false;
>
> This may be more readable if an enum is applied, without a default
> clause so as it would not be forgotten if a new type is added, perhaps
> in its own little routine.

Done.

>
> -   if (track_io_timing)
> +   if (track_io_timing || track_wal_io_timing)
> INSTR_TIME_SET_CURRENT(io_start);
> else
>
> This interface from pgstat_prepare_io_time() is not really good,
> because we could finish by setting io_start in the existing code paths
> calling this routine even if track_io_timing is false when
> track_wal_io_timing is true.  Why not changing this interface a bit
> and pass down a GUC (track_io_timing or track_wal_io_timing) as an
> argument of the function depending on what we expect to trigger the
> timings?

Done in 0001.

>
> -   /* Convert counters from microsec to millisec for display */
> -   values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 
> 1000.0);
> -   values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 
> 1000.0);
> +   /*
> +* There is no need to calculate timings for both pg_stat_wal and
> +* pg_stat_io. So, fetch timings from pg_stat_io to make stats 
> gathering
> +* cheaper. Note that, since timings are fetched from pg_stat_io;
> +* pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
> +*
> +* Convert counters from microsec to millisec for display
> +*/
> +   values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
> + 
>  IOCONTEXT_NORMAL,
> + 
>  IOOP_WRITE));
> +   values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
> + 
>  IOCONTEXT_NORMAL,
> + 
>  IOOP_FSYNC));
>
> Perhaps it is simpler to remove these columns from pg_stat_get_wal()
> and plug an SQL upgrade to the view definition of pg_stat_wal?

Done in 0003 but I am not sure if that is what you expected.

> Finding a good balance between the subroutines, the two GUCs, the
> contexts, the I/O operation type and the objects is the tricky part of
> this patch.  If the dependency to PendingWalStats is removed and if
> the interface of pgstat_prepare_io_time is improved, things are a bit
> cleaner, but it feels like we could do more..  Nya.

I agree. The patch is not logically complicated, but it is hard to
choose the best approach.

Any kind of feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From b7bf7b92fa274775136314ecfde90fa32ed435cb Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 29 Nov 2023 15:30:03 +0300
Subject: [PATCH v6 6/6] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests

---
 src/test/regress/expected/stats.out | 12 
 src/test/regress/sql/stats.sql  |  8 
 2 files changed, 20 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4adda9e479..7f5340cd7e 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off'
  t

Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2023-12-06 Thread Nazir Bilal Yavuz
Hi,

There is an ongoing thread [1] for adding missing SQL error codes to
PANIC and FATAL error reports in xlogrecovery.c file. I did the same
but for xlog.c and relcache.c files.

I couldn't find a suitable error code for the "cache lookup failed for
relation" error in relcache.c and this error comes up in many places.
Would it be reasonable to create a new error code specifically for
this?
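
For context, here is the shape of the change (the new ERRCODE name
below is hypothetical):

/* today: a plain elog(), which implies ERRCODE_INTERNAL_ERROR */
elog(ERROR, "cache lookup failed for relation %u", relid);

/* sketch: an explicit, dedicated error code */
ereport(ERROR,
        (errcode(ERRCODE_CACHE_LOOKUP_FAILED),  /* hypothetical */
         errmsg("cache lookup failed for relation %u", relid)));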

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/CAPMWgZ8g17Myb5ZRE5aTNowUohafk4j48mZ_5_Zn9JnR5p2u0w%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft


v1-0001-xlog.c-Add-missing-error-codes-to-PANIC-FATAL-err.patch
Description: Binary data


v1-0002-relcache.c-Add-missing-error-codes-to-PANIC-FATAL.patch
Description: Binary data


Re: Show WAL write and fsync stats in pg_stat_io

2023-12-01 Thread Nazir Bilal Yavuz
Hi,

Thanks for all the feedback. I am sharing the new version of the patchset.

Current status of the patchset is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write, fsync stats and their
tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats and their tests are added.
- Documentation is updated.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

Updates & Discussion items:
- Try to set op_bytes for BackendType / IOContext: I think we don't
need this now; we will need it when we add streaming replication WAL
IOs.

- Decide which 'BackendType / IOContext / IOOp' should not be tracked:
-- IOOBJECT_WAL / IOCONTEXT_INIT + IOCONTEXT_NORMAL / write and fsync
IOs can be done on every backend that tracks IO statistics. Because of
that and since we have a pgstat_tracks_io_bktype(bktype) check, I
didn't add another check for this.
-- I found that only the standalone backend and the startup backend do
IOOBJECT_WAL / IOCONTEXT_NORMAL / read IOs. So, I added a check for
that, but I am not sure if there are more backends that do WAL reads
during WAL recovery.

- For the IOOBJECT_WAL / IOCONTEXT_INIT and IOOBJECT_WAL /
IOCONTEXT_NORMAL / read tests, I used initial WAL IOs to check these
stats. I am not sure if that is the correct way or enough to test
these stats.

- To avoid accounting the WAL timings twice in the pg_stat_wal and
pg_stat_io views, pg_stat_wal's WAL timings are fetched from
pg_stat_io (see the sketch after this list). Since these timings are
fetched from pg_stat_io, pg_stat_reset_shared('io') will reset
pg_stat_wal's timings too.

- I didn't move 'PendingWalStats.wal_sync' out of the
'pgstat_count_io_op_n' function because, for syncs, both views count
the same thing (unlike writes, where it is blocks vs. system calls),
but I agree that this doesn't look good.
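
Regarding the pg_stat_wal timings item above, the idea is roughly that
the view definition pulls its timing columns out of pg_stat_io instead
of keeping its own counters. A sketch only; the exact column list and
aggregation in the patch may differ:

SELECT w.wal_records, w.wal_fpi, w.wal_bytes,
       io.write_time AS wal_write_time,
       io.fsync_time AS wal_sync_time
  FROM pg_stat_get_wal() AS w,
       (SELECT sum(write_time) AS write_time,
               sum(fsync_time) AS fsync_time
          FROM pg_stat_io
         WHERE object = 'wal' AND context = 'normal') AS io;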

Any kind of feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 4ad85b11d418ae78237ed70eced6e3b46d086ef5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 1 Dec 2023 10:03:21 +0300
Subject: [PATCH v5 3/4] Add IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 19 +++
 src/test/regress/sql/stats.sql  | 10 ++
 2 files changed, 29 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 4d3a515bdd..4adda9e479 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,25 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the server starts, the initial WAL file must be created,
+-- so check these stats before stats get reset.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+--
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+--
+ t
+(1 row)
+
 -
 -- Test that resetting stats works for reset timestamp
 -
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index aa48e65dc8..72e864a0d2 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,16 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync.
+-- When the server starts, the initial WAL file must be created,
+-- so check these stats before stats get reset.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
 -
 -- Test that resetting stats works for reset timestamp
 -
-- 
2.43.0

From c7ae6c12cd02806d9d8201d738920179985cee7a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Fri, 10 Nov 2023 14:52:22 +0300
Subject: [PATCH v5 2/4] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / write and fsync
 tests

---
 src/test/regress/expected/stats.out | 26 ++
 src/test/regress/sql/stats.sql  | 11 +++
 2 files changed, 37 insertions(+)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..4d3a515bdd 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1255,6 +1255,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext 

Re: Show WAL write and fsync stats in pg_stat_io

2023-12-01 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 Nov 2023 at 10:34, Bharath Rupireddy
 wrote:
>
> Is there any plan to account for WAL read stats in the WALRead()
> function which will cover walsenders i.e. WAL read by logical and
> streaming replication, WAL read by pg_walinspect and so on? I see the
> patch already covers WAL read stats by recovery in XLogPageRead(), but
> not other page_read callbacks which will end up in WALRead()
> eventually. If added, the feature at
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> can then extend it to cover WAL read from WAL buffer stats.

Yes, I am planning to create a patch for that after this patch is
done. Thanks for letting me know!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




SSL tests fail on OpenSSL v3.2.0

2023-11-27 Thread Nazir Bilal Yavuz
Hi,

SSL tests fail on OpenSSL v3.2.0. I tested on both macOS (CI) and
Debian (my local machine), and both failed with the same errors. To
trigger these errors on CI, you may need to clear the repository
cache; otherwise macOS won't install v3.2.0 of OpenSSL.

001_ssltests:
psql exited with signal 6 (core dumped): 'psql: error: connection to
server at "127.0.0.1", port 56718 failed: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
SSL SYSCALL error: Connection reset by peer' while running 'psql -XAtq
-d sslkey=invalid sslcert=invalid sslrootcert=invalid sslcrl=invalid
sslcrldir=invalid user=ssltestuser dbname=trustdb hostaddr=127.0.0.1
host=common-name.pg-ssltest.test sslrootcert=invalid sslmode=require
-f - -w' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm


002_scram:
psql exited with signal 6 (core dumped): 'psql: error: connection to
server at "127.0.0.1", port 54531 failed: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
SSL SYSCALL error: Connection reset by peer' while running 'psql -XAtq
-d dbname=trustdb sslmode=require sslcert=invalid sslrootcert=invalid
hostaddr=127.0.0.1 host=localhost user=ssltestuser -f - -w' at
/Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 1997.


003_sslinfo:
psql exited with signal 6 (core dumped): 'psql: error: connection to
server at "127.0.0.1", port 59337 failed: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
SSL SYSCALL error: Connection reset by peer' while running 'psql -XAtq
-d sslkey=invalid sslcert=invalid sslrootcert=invalid sslcrl=invalid
sslcrldir=invalid sslrootcert=ssl/root+server_ca.crt sslmode=require
dbname=certdb hostaddr=127.0.0.1 host=localhost user=ssltestuser
sslcert=ssl/client_ext.crt
sslkey=/Users/admin/pgsql/build/testrun/ssl/003_sslinfo/data/tmp_test_q11O/client_ext.key
-f - -w' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
line 1997.

macOS CI run: https://cirrus-ci.com/task/5128008789393408

I haven't found the cause yet, but I wanted to report it.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-20 Thread Nazir Bilal Yavuz
Hi,

Thanks for the feedback.

On Mon, 20 Nov 2023 at 10:47, Michael Paquier  wrote:
>
> On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:
> > There are some differences between pg_stat_wal and pg_stat_io while
> > collecting WAL stats. For example in the XLogWrite() function in the
> > xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
> > is not something we want for pg_stat_io since pg_stat_io counts the
> > number of blocks rather than the system calls, so instead incremented
> > pg_stat_io by npages.
> >
> > Could that cause a problem since pg_stat_wal's behaviour will be
> > changed? Of course, as an alternative we could change pg_stat_io's
> > behaviour but in the end either pg_stat_wal's or pg_stat_io's
> > behaviour will be changed.
>
> Yep, that could be confusing for existing applications that track the
> information of pg_stat_wal.  The number of writes is not something
> that can be correctly shared between both.  The timings for the writes
> and the syncs could be shared at least, right?

Yes, the timings for the writes and the syncs should work. Another
question I have in mind is about the pg_stat_reset_shared() function:
when we call it with 'io' it will reset pg_stat_wal's timings, but
when we call it with 'wal' it won't reset them, right?
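
That is, assuming the patch as posted:

SELECT pg_stat_reset_shared('io');   -- also zeroes the timings shown in pg_stat_wal
SELECT pg_stat_reset_shared('wal');  -- leaves those timings untouched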

>
> This slightly relates to pgstat_count_io_op_n() in your latest patch,
> where it feels a bit weird to see an update of
> PendingWalStats.wal_sync sit in the middle of a routine dedicated to
> pg_stat_io..  I am not completely sure what's the right balance here,
> but I would try to implement things so as pg_stat_io paths does not
> need to know about PendingWalStats.

Writes have a blocks vs. system calls differentiation, but for syncs
the two counts are the same. Because of that I put
PendingWalStats.wal_sync into pg_stat_io, but I agree that it looks a
bit weird.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

On Thu, 9 Nov 2023 at 18:27, Tristan Partin  wrote:
>
> Can you try with Meson v1.2.3?

I tried with Meson v1.2.3 and upstream; both failed with the same error.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

Adding Kyotaro to CC because Kyotaro reported a similar issue before [1].

On Thu, 9 Nov 2023 at 11:59, Shlok Kyal  wrote:
>
> Hi,
> I am trying to build postgres with meson on Windows. And I am stuck in
> the process.
>
> Steps I followed:
>
> 1. I clone postgres repo
>
> 2.Installed meson and ninja
> pip install meson ninja
>
> 3. Then running following command:
> meson setup build --buildtype debug
>
> 4. Then I ran
> cd build
> ninja
>
> Got following error
> D:\project\repo\pg_meson\postgres\build>C:\Users\kyals\AppData\Roaming\Python\Python311\Scripts\ninja
> ninja: error: 'src/backend/postgres_lib.a.p/meson_pch-c.obj', needed
> by 'src/backend/postgres.exe', missing and no known rule to make it.

I am able to reproduce the error. It was introduced in meson v1.2.0;
v1.1.0 and earlier work successfully. It seems meson tries to use pch
files although Postgres is configured with b_pch=false.
The error occurs when the Developer PowerShell for VS is used and
Postgres is built with the b_pch=false option (which is the default
for Postgres). If the -Db_pch=true option or the default PowerShell is
used, Postgres builds successfully.
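
So, until meson is fixed, a workaround from a Developer PowerShell for
VS session is to enable pch explicitly:

meson setup build --buildtype debug -Db_pch=true
cd build
ninja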

[1] 
https://www.postgresql.org/message-id/20231018.113148.1275969479525954369.horikyota@gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

On Wed, 8 Nov 2023 at 04:19, Andres Freund  wrote:
>
> Hi,
>
> On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> > By the way, if the write/sync quantities and times begin to be tracked
> > by pg_stat_io, I'd see a pretty good argument in removing the
> > equivalent columns in pg_stat_wal.  It looks like this would reduce
> > the confusion related to the handling of PendingWalStats added in
> > pgstat_io.c, for one.
>
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

There are some differences between pg_stat_wal and pg_stat_io while
collecting WAL stats. For example, in the XLogWrite() function in
xlog.c, pg_stat_wal counts wal_writes as write system calls. This is
not something we want for pg_stat_io, since pg_stat_io counts the
number of blocks rather than system calls, so pg_stat_io is instead
incremented by npages.
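
To make the difference concrete, here is a simplified sketch of the
write path (not the literal XLogWrite() code):

written = pg_pwrite(openLogFile, from, nleft, startoffset);

PendingWalStats.wal_write++;     /* pg_stat_wal: one write system call */
pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_WRITE,
                     npages);    /* pg_stat_io: npages blocks */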

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-09 Thread Nazir Bilal Yavuz
Hi,

Thanks for all the feedback!

On Wed, 8 Nov 2023 at 08:59, Michael Paquier  wrote:
>
> By the way, note that the patch is failing to apply, and that I've
> switched it as waiting on author on 10/26.

Here is an updated patchset in attachment. It is rebased on the
latest HEAD, and 'pgstat_report_wal(false)' is changed to
'pgstat_flush_io(false)' in xlogrecovery.c. I will share another
version of the patchset once I address the rest of the feedback.

Regards,
Nazir Bilal Yavuz
Microsoft
From e5db5cd6d8c47cadde0539f06bbee22368d17a41 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 26 Oct 2023 12:12:32 +0300
Subject: [PATCH v4 1/2] Show WAL stats (except streaming replication WAL) on
 pg_stat_io

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, the pg_stat_io view shows how many WAL operations
each backend performs, along with their context, type and total
timings.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

It doesn't cover:
- Streaming replication WAL IO.
---
 src/backend/access/transam/xlog.c |  60 -
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/utils/activity/pgstat_io.c| 144 --
 src/backend/utils/adt/pgstatfuncs.c   |  10 +-
 src/include/pgstat.h  |   8 +-
 5 files changed, 178 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..d265b8c032 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2256,38 +2256,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time();
 			do
 			{
 errno = 0;
 
-/* Measure I/O timing to write WAL data */
-if (track_wal_io_timing)
-	INSTR_TIME_SET_CURRENT(start);
-else
-	INSTR_TIME_SET_ZERO(start);
-
 pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 written = pg_pwrite(openLogFile, from, nleft, startoffset);
 pgstat_report_wait_end();
 
-/*
- * Increment the I/O timing and the number of times WAL data
- * were written out to disk.
- */
-if (track_wal_io_timing)
-{
-	instr_time	end;
-
-	INSTR_TIME_SET_CURRENT(end);
-	INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-}
-
 PendingWalStats.wal_write++;
 
 if (written <= 0)
@@ -2312,6 +2296,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+	IOOP_WRITE, io_start, npages);
+
 			npages = 0;
 
 			/*
@@ -3005,6 +2992,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3048,6 +3036,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 (errcode_for_file_access(),
  errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* start timing writes for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3083,6 +3074,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+			io_start, 1);
+
 	if (save_errno)
 	{
 		/*
@@ -3099,6 +3093,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
  errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	/* start timing fsyncs for stats */
+	io_start = pgstat_prepare_io_time();
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3111,6 +3108,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+			IOOP_FSYNC, io_start, 1);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
@@ -8282,7 +8282,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8295,11 +8295,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time();
 
 	pgstat_report_wait_start(WA

Re: Remove unnecessary 'always:' from CompilerWarnings task

2023-11-08 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review.

On Wed, 8 Nov 2023 at 10:31, Peter Eisentraut  wrote:
>
> On 05.09.23 12:25, Nazir Bilal Yavuz wrote:
> > There are multiple 'always:' keywords under the CompilerWarnings task.
> > Instead of that, we can use one 'always:' and move the instructions
> > under this. So, I removed unnecessary ones and rearranged indents
> > according to that change.
>
> I'm not sure this change is beneficial.  The way the code is currently
> arranged, it's a bit easier to move or change individual blocks, and
> it's also easier to read the file, because the "always:" is next to each
> "script" and doesn't scroll off the screen.

That makes sense. I am planning to withdraw this soon if there are no
other objections.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-11-06 Thread Nazir Bilal Yavuz
Hi,

On Wed, 25 Oct 2023 at 07:13, Michael Paquier  wrote:
>
> Hi all,
>
> I don't remember how many times in the last few years when I've had to
> hack the backend to produce a test case that involves a weird race
> condition across multiple processes running in the backend, to be able
> to prove a point or just test a fix (one recent case: 2b8e5273e949).
> Usually, I come to hardcoding stuff for the following situations:
> - Trigger a PANIC, to force recovery.
> - A FATAL, to take down a session, or just an ERROR.
> - palloc() failure injection.
> - Sleep to slow down a code path.
> - Pause and release with condition variable.

I liked the idea; thanks for working on this!

What do you think about creating a function for updating an already
created injection point's callback or name (mostly the callback)? For
now, you need to drop and recreate the injection point to change the
callback or the name.
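
Something along these lines is what I have in mind; the name, the
signature and the create/drop calls below are hypothetical, not part
of the posted patch:

/*
 * Hypothetical convenience wrapper: swap the callback of an existing
 * injection point without the caller doing the drop/create dance.
 */
void
InjectionPointUpdate(const char *name, const char *library,
                     const char *function)
{
    InjectionPointDrop(name);
    InjectionPointCreate(name, library, function);
}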

Here is my code correctness review:

diff --git a/meson_options.txt b/meson_options.txt
+option('injection_points', type: 'boolean', value: true,
+  description: 'Enable injection points')
+

It is enabled by default while building with meson.


diff --git a/src/backend/utils/misc/injection_point.c
b/src/backend/utils/misc/injection_point.c
+LWLockRelease(InjectionPointLock);
+
+/* If not found, do nothing? */
+if (!found)
+return;

It would be good to log a warning message here.


I tried to compile that with -Dwerror=true -Dinjection_points=false
and got some errors (warnings):

injection_point.c: In function ‘InjectionPointShmemSize’:
injection_point.c:59:1: error: control reaches end of non-void
function [-Werror=return-type]

injection_point.c: At top level:
injection_point.c:32:14: error: ‘InjectionPointHashByName’ defined but
not used [-Werror=unused-variable]

test_injection_points.c: In function ‘test_injection_points_run’:
test_injection_points.c:69:21: error: unused variable ‘name’
[-Werror=unused-variable]


The test_injection_points test runs and passes although I set
-Dinjection_points=false. That could be misleading; IMO the test
should be skipped if Postgres is not compiled with injection
points.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Show WAL write and fsync stats in pg_stat_io

2023-11-06 Thread Nazir Bilal Yavuz
Hi,

On Tue, 31 Oct 2023 at 16:57, Nazir Bilal Yavuz  wrote:
> On Thu, 26 Oct 2023 at 09:28, Michael Paquier  wrote:
> >
> > And perhaps just putting that everything that calls
> > pgstat_count_io_op_time() under track_io_timing is just natural?
> > What's the performance regression you would expect if both WAL and
> > block I/O are controlled by that, still one would expect only one of
> > them?
>
> I will check these and I hope I will come back with something meaningful.

I applied the patches on upstream Postgres and then ran pgbench a
couple of times for each available clock source:
# Set fsync = off and track_io_timing = on
# pgbench -i -s 100 test
# pgbench -M prepared -c16 -j8 -f <( echo "SELECT
pg_logical_emit_message(true, \:client_id::text, '1234567890');") -T60 test

Results are:

 clock source | track_wal_io_timing = on | track_wal_io_timing = off | change
--------------+--------------------------+---------------------------+--------
 tsc          |            514814.459170 |             519826.284139 |     1%
 hpet         |            132116.272121 |             141820.548447 |     7%
 acpi_pm      |            394793.092255 |             403723.874719 |     2%

Regards,
Nazir Bilal Yavuz
Microsoft


Re: Show WAL write and fsync stats in pg_stat_io

2023-10-31 Thread Nazir Bilal Yavuz
Hi,

Thank you for the feedback!

On Thu, 26 Oct 2023 at 09:28, Michael Paquier  wrote:
>
> On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:
> > Any kind of feedback would be appreciated.
>
> This was registered in the CF, so I have given it a look.  Note that
> 0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
> applied.
>
> +pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
> +{
> +   /*
> +* io times of IOOBJECT_WAL IOObject needs to be tracked when
> +* 'track_wal_io_timing' is set regardless of 'track_io_timing'.
> +*/
> +   if (io_object == IOOBJECT_WAL)
> +   return track_wal_io_timing;
> +
> +   return track_io_timing;
>
> I can see the temptation to do that, but I have mixed feelings about
> the approach of mixing two GUCs in a code path dedicated to pg_stat_io
> where now we only rely on track_io_timing.  The result brings
> confusion, while making pg_stat_io, which is itself only used for
> block-based operations, harder to read.
>
> The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
> is quite tempting, actually, creating a neat separation between the
> existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
> that provides more details about the contexts and backend types for
> the WAL stats with its relevant fields:
> https://www.postgresql.org/message-id/caakru_bm55pj3pprw0nd_-pawhlrkou69r816aeztbba-n1...@mail.gmail.com
>
> And perhaps just putting that everything that calls
> pgstat_count_io_op_time() under track_io_timing is just natural?
> What's the performance regression you would expect if both WAL and
> block I/O are controlled by that, still one would expect only one of
> them?

I will check these and I hope I will come back with something meaningful.

>
> +  /*  Report pending statistics to the cumulative stats system */
> +  pgstat_report_wal(false);
>
> This is hidden in 0001, still would be better if handled as a patch on
> its own and optionally backpatch it as we did for the bgwriter with
> e64c733bb1?

I thought about it again and concluded that using
'pgstat_report_wal(false);' here is wrong. It was mainly there to
flush WAL stats after the WAL reads, but pg_stat_wal doesn't have WAL
read stats, so there is no need to flush WAL stats here. I think this
should be replaced with 'pgstat_flush_io(false);'.
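
That is, something like this in xlogrecovery.c (a sketch of the
intended change):

-   /* Report pending statistics to the cumulative stats system */
-   pgstat_report_wal(false);
+   /* Flush the pending pg_stat_io statistics for the WAL reads */
+   pgstat_flush_io(false);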

>
> Side note: I think that we should spend more efforts in documenting
> what IOContext and IOOp mean.  Not something directly related to this
> patch, still this patch or things similar make it a bit harder which
> part of it is used for what by reading pgstat.h.

I agree.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

2023-10-27 Thread Nazir Bilal Yavuz
Hi,

On Thu, 19 Oct 2023 at 08:26, Michael Paquier  wrote:
>
> On Wed, Oct 18, 2023 at 02:56:42PM +0900, Michael Paquier wrote:
> > Thanks for the new versions.  I have applied 0001 and backpatched it
> > for now.  0002 and 0003 look in much cleaner shape than previously.
>
> 0002 and 0003 have now been applied.  I have split 0003 into two parts
> at the end, mainly on clarity grounds: one for the counters with
> EXPLAIN and a second for pg_stat_statements.
>
> There were a few things in the patch set.  Per my notes:
> - Some incorrect indentation.
> - The additions of show_buffer_usage() did not handle correctly the
> addition of a comma before/after the local timing block.  The code
> area for has_local_timing needs to check for has_temp_timing, while
> the area of has_shared_timing needs to check for (has_local_timing ||
> has_temp_timing).
> - explain.sgml was missing an update for the information related to
> the read/write timings of the local blocks.

Thanks for the changes, the push and the feedback!

>
> Remains what we should do about the "shared/local" string in
> show_buffer_usage() for v16 and v15, as "local" is unrelated to that.
> Perhaps we should just switch to "shared" in this case or just remove
> the string entirely?  Still that implies changing the output of
> EXPLAIN on a stable branch in this case, so there could be an argument
> for leaving this stuff alone.

I think switching it to 'shared' makes sense. That shouldn't confuse
existing monitoring queries much, as the numbers won't change, right?
Also, if we keep 'shared/local' there could be complaints similar to
this thread in the future, so at least adding comments could be
helpful.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: gcc 12.1.0 warning

2023-10-27 Thread Nazir Bilal Yavuz
Hi,

On Fri, 27 Oct 2023 at 12:34, Erik Rijkers  wrote:
>
> Hi,
>
> Not sure if these compiler-mutterings are worth reporting but I guess
> we're trying to get a silent compile.
>
> System: Debian 4.9.303-1 (2022-03-07) x86_64 GNU/Linux
> Compiling with gcc 12.1.0 causes the below 'warning' and 'note'.
> Compiling with --enable-cassert --enable-debug  is silent, no warnings)
>
> In function ‘guc_var_compare’,
>  inlined from ‘bsearch’ at
> /usr/include/x86_64-linux-gnu/bits/stdlib-bsearch.h:33:23,
>  inlined from ‘find_option’ at guc.c:5649:35:
> guc.c:5736:38: warning: array subscript ‘const struct config_generic[0]’
> is partly outside array bounds of ‘const char[8]’ [-Warray-bounds]
>   5736 | return guc_name_compare(confa->name, confb->name);
>| ~^~
>
> guc.c: In function ‘find_option’:
> guc.c:5636:25: note: object ‘name’ of size 8
>   5636 | find_option(const char *name, bool create_placeholders, bool
> skip_errors,
>| ^~~~
>
> (Compiling with gcc 6.3.0 does not complain.)

I was testing upgrading the CI Debian images to bookworm. I tested
bookworm on REL_15, REL_16 and upstream; REL_16 and upstream finished
successfully, but the CompilerWarnings task failed on REL_15 with the
same error.

gcc version: 12.2.0

CI Run: https://cirrus-ci.com/task/6151742664998912

Regards,
Nazir Bilal Yavuz
Microsoft



