Re: Use COPY for populating all pgbench tables

2023-07-24 Thread Tristan Partin

Michael,

Once again I appreciate your patience with this patchset. Thanks for 
your help and reviews.


--
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-07-23 Thread Michael Paquier
On Sun, Jul 23, 2023 at 08:21:51PM +0900, Michael Paquier wrote:
> Cool.  I have applied the new tests for now to move on with this
> thread.

I have done a few more things on this patch today, including
measurements on a local host with large scale factors.  One of my
hosts, for instance, took around 36.8s with scale=500 using the
INSERTs and 36.5s with the COPY when loading data with -I dtg
(average of 4 runs).

Greg's point upthread is also true: with high latency, server-side
generation is the best-adapted option.  Still, as that option is
hidden behind -I, I don't see a reason not to switch to COPY, just
for the sake of improving the default option set.  So, in the end,
applied.
--
Michael




Re: Use COPY for populating all pgbench tables

2023-07-23 Thread Michael Paquier
On Fri, Jul 21, 2023 at 12:22:06PM -0500, Tristan Partin wrote:
> v7 looks good from my perspective. Thanks for working through this patch
> with me. Much appreciated.

Cool.  I have applied the new tests for now to move on with this
thread.
--
Michael




Re: Use COPY for populating all pgbench tables

2023-07-21 Thread Tristan Partin

On Thu Jul 20, 2023 at 9:14 PM CDT, Michael Paquier wrote:

Attached is a v7, with these tests (should be a patch on its own but
I'm lazy to split this morning) and some more adjustments that I have
done while going through the patch.  What do you think?


v7 looks good from my perspective. Thanks for working through this patch 
with me. Much appreciated.


--
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-07-20 Thread Michael Paquier
On Thu, Jul 20, 2023 at 02:22:51PM -0500, Tristan Partin wrote:
> Thanks for your testing Michael. I went ahead and added a test to make sure
> that this behavior doesn't regress accidentally, but I am struggling to get
> the test to fail using the previous version of this patch. Do you have any
> advice? This is my first time writing a test for Postgres. I can recreate
> the issue outside of the test script, but not within it for whatever reason.

We're all here to learn, and writing TAP tests is important these days
for a lot of patches.

+# Check that the pgbench_branches and pgbench_tellers filler columns are filled
+# with NULLs
+foreach my $table ('pgbench_branches', 'pgbench_tellers') {
+   my ($ret, $out, $err) = $node->psql(
+   'postgres',
+   "SELECT COUNT(1) FROM $table;
+SELECT COUNT(1) FROM $table WHERE filler is NULL;",
+   extra_params => ['--no-align', '--tuples-only']);
+
+   is($ret, 0, "psql $table counts status is 0");
+   is($err, '', "psql $table counts stderr is empty");
+   if ($out =~ m/^(\d+)\n(\d+)$/g) {
+   is($1, $2, "psql $table filler column filled with NULLs");
+   } else {
+   fail("psql $table stdout m/^(\\d+)\n(\\d+)$/g");
+   }
+}

This is IMO hard to parse, and I'd rather add some checks for the
accounts and history tables as well.  Let's use four simple SQL
queries like what I posted upthread (no data is inserted into history
after initialization), as in the attached.  I'd be tempted to apply
that first as a separate patch, because we've never added coverage for
this, and we have specific expectations in the code about the filler
column for TPC-B.  And this needs to cover both client-side and
server-side data generation.
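
For instance, one straightforward check per table, in the spirit of
the has_nulls queries quoted upthread (a sketch only; the expected
counts follow from pgbench's fixed per-scale row counts):

SELECT count(*) FROM pgbench_branches WHERE filler IS NULL;  -- expect 1 * scale
SELECT count(*) FROM pgbench_tellers  WHERE filler IS NULL;  -- expect 10 * scale
SELECT count(*) FROM pgbench_accounts WHERE filler IS NULL;  -- expect 0
SELECT count(*) FROM pgbench_history;                        -- expect 0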

Note that the indentation was a bit incorrect, so fixed while on it.

Attached is a v7, with these tests (should be a patch on its own but
I'm lazy to split this morning) and some more adjustments that I have
done while going through the patch.  What do you think?
--
Michael
From 211d66fc39338d522a72722fc3674e306826fd37 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 21 Jul 2023 11:07:31 +0900
Subject: [PATCH v7] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c| 155 +++
 src/bin/pgbench/t/001_pgbench_with_server.pl |  35 +
 doc/src/sgml/ref/pgbench.sgml|   9 +-
 3 files changed, 133 insertions(+), 66 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..539c2795e2 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -835,6 +835,8 @@ static void add_socket_to_set(socket_set *sa, int fd, int idx);
 static int	wait_on_socket_set(socket_set *sa, int64 usecs);
 static bool socket_has_input(socket_set *sa, int fd, int idx);
 
+/* callback used to build rows for COPY during data loading */
+typedef void (*initRowMethod) (PQExpBufferData *sql, int64 curr);
 
 /* callback functions for our flex lexer */
 static const PsqlScanCallbacks pgbench_callbacks = {
@@ -4859,17 +4861,45 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
 {
-	PQExpBufferData sql;
+	/* "filler" column uses NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column uses NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base,
+  initRowMethod init_row)
+{
+	int			n;
+	int			k;
+	int			chars = 0;
 	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	PQExpBufferData sql;
+	char		copy_statement[256];
+	const char *copy_statement_fmt = "copy %s from stdin";
+	int64		total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4908,24 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to 

Re: Use COPY for populating all pgbench tables

2023-07-20 Thread Tristan Partin

On Wed Jul 19, 2023 at 10:07 PM CDT, Michael Paquier wrote:

So this patch causes pgbench to not stick with its historical
behavior, and the change is incompatible with the comments because the
tellers and branches tables don't use NULL for their filler attribute
anymore.


Great find. This was simply a matter of me not understanding the COPY 
command properly. Relevant documentation snippet:


Specifies the string that represents a null value. The default is \N 
(backslash-N) in text format, and an unquoted empty string in CSV 
format. You might prefer an empty string even in text format for cases 
where you don't want to distinguish nulls from empty strings. This 
option is not allowed when using binary format.


This new revision populates the column with the NULL value.


psql (17devel)
Type "help" for help.

tristan957=# select count(1) from pgbench_branches;
 count 
-------
     1
(1 row)

tristan957=# select count(1) from pgbench_branches where filler is null;
 count 
-------
     1
(1 row)
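
Concretely, the COPY text-format rows now streamed for
pgbench_branches look something like this (a sketch; columns are
tab-separated and \N is the default NULL string, per the documentation
quoted above):

copy pgbench_branches from stdin;
1	0	\N
\.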


Thanks for your testing, Michael. I went ahead and added a test to make 
sure that this behavior doesn't regress accidentally, but I am 
struggling to get the test to fail using the previous version of this 
patch. Do you have any advice? This is my first time writing a test for 
Postgres. I can recreate the issue outside of the test script, but not 
within it for whatever reason.


--
Tristan Partin
Neon (https://neon.tech)
From cf84b3ea2d7b583aaee3f80807b26ef4521db0f6 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v6] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 doc/src/sgml/ref/pgbench.sgml|   8 +-
 src/bin/pgbench/pgbench.c| 151 +++
 src/bin/pgbench/t/001_pgbench_with_server.pl |  18 +++
 3 files changed, 110 insertions(+), 67 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 850028557d..4424595cc6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -231,10 +231,10 @@ pgbench  options  d
 extensively through a COPY.
 pgbench uses the FREEZE option with version 14 or later
 of PostgreSQL to speed up
-subsequent VACUUM, unless partitions are enabled.
-Using g causes logging to print one message
-every 100,000 rows while generating data for the
-pgbench_accounts table.
+subsequent VACUUM. If partitions are enabled for
+the pgbench_accounts table, the FREEZE option is
+not enabled. Using g causes logging to print one
+message every 100,000 rows while generating data for all tables.


 With G (server-side data generation),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..ef01d049a1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4859,17 +4859,44 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
+{
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\\N\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
 {
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
+{
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt = "copy %s from stdin";
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4905,24 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	

Re: Use COPY for populating all pgbench tables

2023-07-19 Thread Michael Paquier
On Wed, Jul 19, 2023 at 01:03:21PM -0500, Tristan Partin wrote:
> Didn't actually include the changes in the previous patch.

-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
 {
-   PQExpBufferData sql;
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t0\t\n",
+ curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+ curr + 1, curr / ntellers + 1);

Hmm.  Something's not right here, see:
=# select count(*) has_nulls from pgbench_accounts where filler is null;
 has_nulls 
-----------
         0
(1 row)
=# select count(*) > 0 has_nulls from pgbench_tellers where filler is null;
 has_nulls 
-----------
 f
(1 row)
=# select count(*) > 0 has_nulls from pgbench_branches where filler is null;
 has_nulls 
-----------
 f
(1 row)

Note as well this comment in initCreateTables():
/*
 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
 * fields in these table declarations were intended to comply with that.
 * The pgbench_accounts table complies with that because the "filler"
 * column is set to blank-padded empty string. But for all other tables
 * the columns default to NULL and so don't actually take any space.  We
 * could fix that by giving them non-null default values.  However, that
 * would completely break comparability of pgbench results with prior
 * versions. Since pgbench has never pretended to be fully TPC-B compliant
 * anyway, we stick with the historical behavior.
 */
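
For reference, the filler columns in question are the trailing char
columns of the standard tables. A sketch of the relevant DDL (the
column widths here are recalled from initCreateTables() and are
assumptions to double-check, not quoted from the patch):

CREATE TABLE pgbench_branches (bid int not null, bbalance int, filler char(88));
CREATE TABLE pgbench_tellers  (tid int not null, bid int, tbalance int, filler char(84));
CREATE TABLE pgbench_accounts (aid int not null, bid int, abalance int, filler char(84));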

So this patch causes pgbench to deviate from its historical behavior,
and the change contradicts the comments, because the tellers and
branches tables no longer use NULL for their filler attribute.
--
Michael




Re: Use COPY for populating all pgbench tables

2023-07-19 Thread Tristan Partin

Didn't actually include the changes in the previous patch.

--
Tristan Partin
Neon (https://neon.tech)
From 5b934691b88b3b2c5675bc778b0a10e9eeff3dbe Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v5] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 doc/src/sgml/ref/pgbench.sgml |   8 +-
 src/bin/pgbench/pgbench.c | 153 --
 2 files changed, 94 insertions(+), 67 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 850028557d..4424595cc6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -231,10 +231,10 @@ pgbench  options  d
 extensively through a COPY.
 pgbench uses the FREEZE option with version 14 or later
 of PostgreSQL to speed up
-subsequent VACUUM, unless partitions are enabled.
-Using g causes logging to print one message
-every 100,000 rows while generating data for the
-pgbench_accounts table.
+subsequent VACUUM. If partitions are enabled for
+the pgbench_accounts table, the FREEZE option is
+not enabled. Using g causes logging to print one
+message every 100,000 rows while generating data for all tables.


 With G (server-side data generation),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..8cc2e2bca4 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4859,17 +4859,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
 {
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
+{
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt = "copy %s from stdin";
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4907,24 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/* truncate away any old data */
-	initTruncateTables(con);
-
 	initPQExpBuffer();
 
 	/*
-	 * fill branches, tellers, accounts in that order in case foreign keys
-	 * already exist
+	 * Use COPY with FREEZE on v14 and later without partitioning.  Remember
+	 * that partitioning applies only to pgbench_accounts.
 	 */
-	for (i = 0; i < nbranches * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(,
-		  "insert into pgbench_branches(bid,bbalance) values(%d,0)",
-		  i + 1);
-		executeStatement(con, sql.data);
-	}
+	if (PQserverVersion(con) >= 14) {
+		const bool is_pgbench_accounts = strcmp(table, "pgbench_accounts") == 0;
 
-	for (i = 0; i < ntellers * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(,
-		  "insert into pgbench_tellers(tid,bid,tbalance) values (%d,%d,0)",
-		  i + 1, i / ntellers + 1);
-		executeStatement(con, sql.data);
+		if (!is_pgbench_accounts || partitions == 0)
+			copy_statement_fmt = "copy %s from stdin with (freeze on)";
 	}
 
-	/*
-	 * accounts is big enough to be worth using COPY and tracking runtime
-	 */
-
-	/* use COPY with FREEZE on v14 and later without partitioning */
-	if (partitions == 0 && PQserverVersion(con) >= 14)
-		copy_statement = "copy pgbench_accounts from stdin with 

Re: Use COPY for populating all pgbench tables

2023-07-19 Thread Tristan Partin

On Wed Jul 12, 2023 at 10:52 PM CDT, Michael Paquier wrote:

On Wed, Jul 12, 2023 at 09:29:35AM -0500, Tristan Partin wrote:
> On Wed Jul 12, 2023 at 1:06 AM CDT, Michael Paquier wrote:
>> This would use the freeze option only on pgbench_accounts when no
>> partitioning is defined, but my point was a bit different.  We could
>> use the FREEZE option on the teller and branch tables as well, no?
>> Okay, the impact is limited compared to accounts in terms of amount of
>> data loaded, but perhaps some people like playing with large scaling
>> factors where this could show a benefit in the initial data loading.
> 
> Perhaps, should they all be keyed off the same option? Seemed like in

> your previous comment you wanted multiple options. Sorry for not reading
> your comment correctly.

I would have thought that --partition should only apply to the
pgbench_accounts table, while FREEZE should apply where it is possible
to use it, aka all the COPY queries except when pgbench_accounts is a
partition.  Would you do something different, like not applying FREEZE
to pgbench_tellers and pgbench_branches, as these have far fewer
tuples than pgbench_accounts?


Michael,

I think I completely misunderstood you. From what I can tell, you want
to use FREEZE wherever possible. I think the new patch covers your
comment and fixes the documentation. I am hoping that I have finally
gotten what you are looking for.
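
Condensed, the resulting logic can be sketched roughly like this (a
simplified sketch of the patch below, not the patch verbatim; FREEZE
is requested on v14 and later for every table except a partitioned
pgbench_accounts):

	const char *copy_statement_fmt = "copy %s from stdin";

	if (PQserverVersion(con) >= 14)
	{
		/* FREEZE applies everywhere, except when pgbench_accounts
		 * is partitioned */
		if (strcmp(table, "pgbench_accounts") != 0 || partitions == 0)
			copy_statement_fmt = "copy %s from stdin with (freeze on)";
	}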

--
Tristan Partin
Neon (https://neon.tech)
From 6fcc685fad9640818da7c8c5ff8a10d3fcf1aaed Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v4] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c | 155 ++
 1 file changed, 90 insertions(+), 65 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 087baa7d57..6db1500151 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4859,17 +4859,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
 {
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt;
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4878,50 +4907,22 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/* truncate away any old data */
-	initTruncateTables(con);
-
 	initPQExpBuffer();
 
 	/*
-	 * fill branches, tellers, accounts in that order in case foreign keys
-	 * already exist
+	 * Use COPY with FREEZE on v14 and later without partitioning.  Remember
+	 * that partitioning applies only to pgbench_accounts.
 	 */
-	for (i = 0; i < nbranches * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(,
-		  "insert into pgbench_branches(bid,bbalance) values(%d,0)",
-		  i + 1);
-		executeStatement(con, sql.data);
-	}
-
-	for (i = 0; i < ntellers * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(,
-		  "insert into pgbench_tellers(tid,bid,tbalance) values (%d,%d,0)",
-		  i + 1, i / ntellers + 1);
-		executeStatement(con, sql.data);
-	}
-
-	/*
-	 * accounts is big enough to be worth using COPY and tracking runtime
-	 */
-
-	/* use COPY with FREEZE on v14 and later without 

Re: Use COPY for populating all pgbench tables

2023-07-12 Thread Michael Paquier
On Wed, Jul 12, 2023 at 09:29:35AM -0500, Tristan Partin wrote:
> On Wed Jul 12, 2023 at 1:06 AM CDT, Michael Paquier wrote:
>> This would use the freeze option only on pgbench_accounts when no
>> partitioning is defined, but my point was a bit different.  We could
>> use the FREEZE option on the teller and branch tables as well, no?
>> Okay, the impact is limited compared to accounts in terms of amount of
>> data loaded, but perhaps some people like playing with large scaling
>> factors where this could show a benefit in the initial data loading.
> 
> Perhaps, should they all be keyed off the same option? Seemed like in
> your previous comment you wanted multiple options. Sorry for not reading
> your comment correctly.

I would have thought that --partition should only apply to the
pgbench_accounts table, while FREEZE should apply where it is possible
to use it, aka all the COPY queries except when pgbench_accounts is a
partition.  Would you do something different, like not applying FREEZE
to pgbench_tellers and pgbench_branches, as these have far fewer
tuples than pgbench_accounts?
--
Michael




Re: Use COPY for populating all pgbench tables

2023-07-12 Thread Tristan Partin
On Wed Jul 12, 2023 at 1:06 AM CDT, Michael Paquier wrote:
> On Tue, Jul 11, 2023 at 09:46:43AM -0500, Tristan Partin wrote:
> > On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
> >> This seems a bit incorrect because partitioning only applies to
> >> pgbench_accounts, no?  This change means that the teller and branch
> >> tables would not benefit from FREEZE under a pgbench compiled with
> >> this patch and a backend version of 14 or newer if --partitions is
> >> used.  So, perhaps "partitions" should be an argument of
> >> initPopulateTable() specified for each table? 
> > 
> > Whoops, looks like I didn't read the comment for what the partitions
> > variable means. It only applies to pgbench_accounts as you said. I don't
> > think passing it in as an argument would make much sense. Let me know
> > what you think about the change in this new version, which only hits the
> > first portion of the `if` statement if the table is pgbench_accounts.
>
> -   /* use COPY with FREEZE on v14 and later without partitioning */
> -   if (partitions == 0 && PQserverVersion(con) >= 14)
> -   copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
> +   if (partitions == 0 && strcmp(table, "pgbench_accounts") == 0 && 
> PQserverVersion(con) >= 14)
> +   copy_statement_fmt = "copy %s from stdin with (freeze on)";
>
> This would use the freeze option only on pgbench_accounts when no
> partitioning is defined, but my point was a bit different.  We could
> use the FREEZE option on the teller and branch tables as well, no?
> Okay, the impact is limited compared to accounts in terms of amount of
> data loaded, but perhaps some people like playing with large scaling
> factors where this could show a benefit in the initial data loading.

Perhaps, should they all be keyed off the same option? Seemed like in
your previous comment you wanted multiple options. Sorry for not reading
your comment correctly.

> In passing, I have noticed the following sentence in pgbench.sgml:
>Using g causes logging to print one message
>every 100,000 rows while generating data for the   
>pgbench_accounts table.
> It would become incorrect as the same code path would be used for the
> teller and branch tables.

I will amend the documentation to mention all tables rather than being
specific to pgbench_accounts in the next patch revision, pending what
you want to do about the --partitions CLI argument.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-07-12 Thread Michael Paquier
On Tue, Jul 11, 2023 at 09:46:43AM -0500, Tristan Partin wrote:
> On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
>> This seems a bit incorrect because partitioning only applies to
>> pgbench_accounts, no?  This change means that the teller and branch
>> tables would not benefit from FREEZE under a pgbench compiled with
>> this patch and a backend version of 14 or newer if --partitions is
>> used.  So, perhaps "partitions" should be an argument of
>> initPopulateTable() specified for each table? 
> 
> Whoops, looks like I didn't read the comment for what the partitions
> variable means. It only applies to pgbench_accounts as you said. I don't
> think passing it in as an argument would make much sense. Let me know
> what you think about the change in this new version, which only hits the
> first portion of the `if` statement if the table is pgbench_accounts.

-   /* use COPY with FREEZE on v14 and later without partitioning */
-   if (partitions == 0 && PQserverVersion(con) >= 14)
-   copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+   if (partitions == 0 && strcmp(table, "pgbench_accounts") == 0 && 
PQserverVersion(con) >= 14)
+   copy_statement_fmt = "copy %s from stdin with (freeze on)";

This would use the freeze option only on pgbench_accounts when no
partitioning is defined, but my point was a bit different.  We could
use the FREEZE option on the teller and branch tables as well, no?
Okay, the impact is limited compared to accounts in terms of amount of
data loaded, but perhaps some people like playing with large scaling
factors where this could show a benefit in the initial data loading.

In passing, I have noticed the following sentence in pgbench.sgml:
   Using g causes logging to print one message
   every 100,000 rows while generating data for the   
   pgbench_accounts table.
It would become incorrect as the same code path would be used for the
teller and branch tables.
--
Michael




Re: Use COPY for populating all pgbench tables

2023-07-11 Thread Tristan Partin
On Tue Jul 11, 2023 at 12:03 AM CDT, Michael Paquier wrote:
> On Wed, Jun 14, 2023 at 10:58:06AM -0500, Tristan Partin wrote:
>  static void
> -initGenerateDataClientSide(PGconn *con)
> +initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +   /* "filler" column defaults to NULL */
> +   printfPQExpBuffer(sql,
> + INT64_FORMAT "\t0\t\n",
> + curr + 1);
> +}
> +
> +static void
> +initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +   /* "filler" column defaults to NULL */
> +   printfPQExpBuffer(sql,
> + INT64_FORMAT "\t" INT64_FORMAT 
> "\t0\t\n",
> + curr + 1, curr / ntellers + 1);
> +}
> +
> +static void
> +initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
> +{
> +   /* "filler" column defaults to blank padded empty string */
> +   printfPQExpBuffer(sql,
> + INT64_FORMAT "\t" INT64_FORMAT 
> "\t0\t\n",
> + curr + 1, curr / naccounts + 1);
> +}
>
> These routines don't need a PGconn argument, by the way.

Nice catch!

> /* use COPY with FREEZE on v14 and later without partitioning */
> if (partitions == 0 && PQserverVersion(con) >= 14)
> -   copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
> +   copy_statement_fmt = "copy %s from stdin with (freeze on)";
> else
> -   copy_statement = "copy pgbench_accounts from stdin";
> +   copy_statement_fmt = "copy %s from stdin";
>
> This seems a bit incorrect because partitioning only applies to
> pgbench_accounts, no?  This change means that the teller and branch
> tables would not benefit from FREEZE under a pgbench compiled with
> this patch and a backend version of 14 or newer if --partitions is
> used.  So, perhaps "partitions" should be an argument of
> initPopulateTable() specified for each table? 

Whoops, looks like I didn't read the comment for what the partitions
variable means. It only applies to pgbench_accounts as you said. I don't
think passing it in as an argument would make much sense. Let me know
what you think about the change in this new version, which only hits the
first portion of the `if` statement if the table is pgbench_accounts.

-- 
Tristan Partin
Neon (https://neon.tech)
From af242f39aefd4d9c77271ac897b53aaf17601f85 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v3] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c | 155 ++
 1 file changed, 90 insertions(+), 65 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 320d348a0f..b77b6c3704 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4869,17 +4869,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
 {
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt;
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4888,50 +4917,22 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, 

Re: Use COPY for populating all pgbench tables

2023-07-10 Thread Michael Paquier
On Wed, Jun 14, 2023 at 10:58:06AM -0500, Tristan Partin wrote:
> Again, I forgot to actually attach. Holy guacamole.

Looks rather OK seen from here, applied 0001 while browsing the
series.  I have a few comments about 0002.

 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t0\t\n",
+ curr + 1);
+}
+
+static void
+initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t" INT64_FORMAT 
"\t0\t\n",
+ curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to blank padded empty string */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t" INT64_FORMAT 
"\t0\t\n",
+ curr + 1, curr / naccounts + 1);
+}

I was looking at that, and while this strikes me as a duplication for
the second and third ones, I'm OK with the use of a callback to fill
in the data, even if naccounts and ntellers are equivalent to the
"base" number given to initPopulateTable(), because the "filler"
column has different expectations for each table.  These routines
don't need a PGconn argument, by the way.
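
To make that concrete: with this callback design, the client-side
generation path presumably reduces to three calls along these lines (a
sketch inferred from the diff, not the patch verbatim; nbranches,
ntellers and naccounts are pgbench's per-scale-unit row counts):

	/* fill branches, tellers, accounts in that order in case
	 * foreign keys already exist */
	initPopulateTable(con, "pgbench_branches", nbranches, initBranch);
	initPopulateTable(con, "pgbench_tellers", ntellers, initTeller);
	initPopulateTable(con, "pgbench_accounts", naccounts, initAccount);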

/* use COPY with FREEZE on v14 and later without partitioning */
if (partitions == 0 && PQserverVersion(con) >= 14)
-   copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+   copy_statement_fmt = "copy %s from stdin with (freeze on)";
else
-   copy_statement = "copy pgbench_accounts from stdin";
+   copy_statement_fmt = "copy %s from stdin";

This seems a bit incorrect because partitioning only applies to
pgbench_accounts, no?  This change means that the teller and branch
tables would not benefit from FREEZE under a pgbench compiled with
this patch and a backend version of 14 or newer if --partitions is
used.  So, perhaps "partitions" should be an argument of
initPopulateTable() specified for each table? 
--
Michael




Re: Use COPY for populating all pgbench tables

2023-06-14 Thread Tristan Partin
Again, I forgot to actually attach. Holy guacamole.

-- 
Tristan Partin
Neon (https://neon.tech)
From 8c4eb4849b1282f1a0947ddcf3f599e384a5a428 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 09:21:55 -0500
Subject: [PATCH v2 1/2] Move constant into format string

If we are always going to write a 0, just put the 0 in the format
string.
---
 src/bin/pgbench/pgbench.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1d1670d4c2..320d348a0f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4947,8 +4947,8 @@ initGenerateDataClientSide(PGconn *con)
 
 		/* "filler" column defaults to blank padded empty string */
 		printfPQExpBuffer(,
-		  INT64_FORMAT "\t" INT64_FORMAT "\t%d\t\n",
-		  j, k / naccounts + 1, 0);
+		  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+		  j, k / naccounts + 1);
 		if (PQputline(con, sql.data))
 			pg_fatal("PQputline failed");
 
-- 
Tristan Partin
Neon (https://neon.tech)

From 8c573cbede228a5b5570d3ca37055bd40d4a4025 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH v2 2/2] Use COPY instead of INSERT for populating all tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c | 152 ++
 1 file changed, 87 insertions(+), 65 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 320d348a0f..9c57f7e4a7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4869,17 +4869,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PGconn *con, PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
 {
+	int			n;
+	int			k;
+	int			chars = 0;
+	PGresult	*res;
 	PQExpBufferData sql;
-	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt;
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4888,50 +4917,19 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/* truncate away any old data */
-	initTruncateTables(con);
-
 	initPQExpBuffer();
 
-	/*
-	 * fill branches, tellers, accounts in that order in case foreign keys
-	 * already exist
-	 */
-	for (i = 0; i < nbranches * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(,
-		  "insert into pgbench_branches(bid,bbalance) values(%d,0)",
-		  i + 1);
-		executeStatement(con, sql.data);
-	}
-
-	for (i = 0; i < ntellers * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(,
-		  "insert into pgbench_tellers(tid,bid,tbalance) values (%d,%d,0)",
-		  i + 1, i / ntellers + 1);
-		executeStatement(con, sql.data);
-	}
-
-	/*
-	 * accounts is big enough to be worth using COPY and tracking runtime
-	 */
-
 	/* use COPY with FREEZE on v14 and later without partitioning */
 	if (partitions == 0 && PQserverVersion(con) >= 14)
-		copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+		copy_statement_fmt = "copy %s from stdin with (freeze on)";
 	else
-		copy_statement = "copy pgbench_accounts from stdin";
+		copy_statement_fmt = "copy %s from stdin";
+
+	n = pg_snprintf(copy_statement, sizeof(copy_statement), copy_statement_fmt, table);
+	if (n >= sizeof(copy_statement))
+		pg_fatal("invalid buffer size: must be 

Re: Use COPY for populating all pgbench tables

2023-06-14 Thread Tristan Partin
Here is a v2. It cleans up the output when printing to a tty. The
last "x of y tuples" line gets overwritten now, so the final output
looks like:

dropping old tables...
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 0.14 s (drop tables 0.01 s, create tables 0.01 s, client-side generate 
0.05 s, vacuum 0.03 s, primary keys 0.03 s).
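
The overwriting works by ending each progress report with a carriage
return instead of a newline whenever stderr is a terminal, the same
eol trick visible in the diffs elsewhere in this thread. A minimal
self-contained sketch of the idea:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* '\r' rewinds to the start of the line on a tty, so each report
	 * overwrites the previous one; '\n' is kept when output is
	 * redirected, so logs stay readable line by line. */
	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';

	for (int pct = 0; pct <= 100; pct += 10)
		fprintf(stderr, "%3d%% of tuples done%c", pct, eol);
	fputc('\n', stderr);
	return 0;
}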

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-13 Thread Tristan Partin
On Thu Jun 8, 2023 at 11:38 AM CDT, Tristan Partin wrote:
> On Thu Jun 8, 2023 at 12:33 AM CDT, David Rowley wrote:
> > On Thu, 8 Jun 2023 at 07:16, Tristan Partin  wrote:
> > >
> > > master:
> > >
> > > 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 
> > > s))
> > > vacuuming...
> > > creating primary keys...
> > > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side 
> > > generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> > >
> > > patchset:
> > >
> > > 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 
> > > 243.82 s, remaining 0.00 s))
> > > vacuuming...
> > > creating primary keys...
> > > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side 
> > > generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
> >
> > I've also previously found pgbench -i to be slow.  It was a while ago,
> > and IIRC, it was due to the printfPQExpBuffer() being a bottleneck
> > inside pgbench.
> >
> > On seeing your email, it makes me wonder if PG16's hex integer
> > literals might help here.  These should be much faster to generate in
> > pgbench and also parse on the postgres side.
> >
> > I wrote a quick and dirty patch to try that and I'm not really getting
> > the same performance increases as I'd have expected. I also tested
> > with your patch too and it does not look that impressive either when
> > running pgbench on the same machine as postgres.
>
> I didn't expect my patch to increase performance in all workloads. I was
> mainly aiming to fix high-latency connections. Based on your results
> that looks like a 4% reduction in performance of client-side data
> generation. I had thought maybe it is worth having a flag to keep the
> old way too, but I am not sure a 4% hit is really that big of a deal.
>
> > pgbench copy speedup
> >
> > ** master
> > drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> > 1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 
> > s)
> > vacuuming...
> > creating primary keys...
> > done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> > generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
> >
> > ** David's Patched
> > drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> > 1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 
> > s)
> > vacuuming...
> > creating primary keys...
> > done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> > generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
> >
> > ** Tristan's patch
> > drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> > 1 of 1 tuples (100%) of pgbench_accounts done (elapsed
> > 77.44 s, remaining 0.00 s)
> > vacuuming...
> > creating primary keys...
> > done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> > generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
> >
> > I'm interested to see what numbers you get.  You'd need to test on
> > PG16 however. I left the old code in place to generate the decimal
> > numbers for versions < 16.
>
> I will try to test this soon and follow up on the thread. I definitely
> see no problems with your patch as is though. I would be more than happy
> to rebase my patches on yours.

Finally got around to doing more benchmarking. Using an EC2 instance
hosted in Ireland, and my client laptop in Austin, Texas.

Workload: pgbench -i -s 500

master (9aee26a491)
done in 1369.41 s (drop tables 0.21 s, create tables 0.72 s, client-side 
generate 1336.44 s, vacuum 1.02 s, primary keys 31.03 s).
done in 1318.31 s (drop tables 0.21 s, create tables 0.72 s, client-side 
generate 1282.67 s, vacuum 1.02 s, primary keys 33.69 s).

copy
done in 307.42 s (drop tables 0.21 s, create tables 0.82 s, client-side 
generate 270.95 s, vacuum 1.02 s, primary keys 34.42 s).

david
done in 1311.14 s (drop tables 0.72 s, create tables 0.72 s, client-side 
generate 1274.98 s, vacuum 0.94 s, primary keys 33.79 s).
done in 1340.18 s (drop tables 0.14 s, create tables 0.59 s, client-side 
generate 1304.78 s, vacuum 0.92 s, primary keys 33.75 s).

copy + david
done in 348.70 s (drop tables 0.23 s, create tables 0.72 s, client-side 
generate 312.94 s, vacuum 0.92 s, primary keys 33.90 s).

I ran two tests for master and your patch, David. For the last test, I
adapted your patch onto mine. I am still seeing the huge performance
gains on my branch.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-13 Thread Tristan Partin
I think I am partial to number 2. Removing a context switch always leads
to more productivity.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Gurjeet Singh
On Fri, Jun 9, 2023 at 6:20 PM Gurjeet Singh  wrote:
> On Fri, Jun 9, 2023 at 5:42 PM Gregory Smith  wrote:
> > On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh  wrote:
> >>
> >> > $ pgbench -i -I dtGvp -s 500
> >>
> >> The steps are severely under-documented in pgbench --help output.
> >
> > I agree it's not easy to find information.  I just went through double 
> > checking I had the order recently enough to remember what I did.  The man 
> > pages have this:
> >
> > > Each step is invoked in the specified order. The default is dtgvp.
> >
> > Which was what I wanted to read.  Meanwhile the --help output says:
> >
> > >  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
> >
> Which has the information without a lot of context for what it's used 
> for.  I'd welcome some text expansion that added a minimal but functional 
> improvement to the existing help output; I don't have such text.
>
> Please see attached 2 variants of the patch. Variant 1 simply tells
> the reader to consult pgbench documentation. The second variant
> provides a description for each of the letters, as the documentation
does. The committer can pick the one they find suitable.

(I was short on time, so had to keep it short in the above email. To
justify this additional email, I have added 2 more options to choose
from. :)

The text ", in the specified order" is an important detail, that
should be included irrespective of the rest of the patch.

My preference would be to use the first variant, since the second one
feels too wordy for --help output. I believe we'll have to choose
between these two; the alternatives will not make anyone happy.

These two variants show the two extremes; bare minimum vs. everything
but the kitchen sink. So one may feel the need to find a middle ground
and provide a "sufficient, but not too much" variant. I have attempted
that in variants 3 and 4; see attached.

The third variant does away with the list of steps, and uses a
paragraph to describe the letters. And the fourth variant makes that
paragraph terse.

In the order of preference I'd choose variant 1, then 2. Variants 3
and 4 feel like a significant degradation from variant 2.

Attached samples.txt shows the snippets of --help output of each of
the variants/patches, for comparison.

Best regards,
Gurjeet
http://Gurje.et
 variant-1-brief-pointer-to-docs.patch
  -i, --initialize invokes initialization mode
  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
   run selected initialization steps, in the specified order
   see pgbench documentation for a description of these steps
  -F, --fillfactor=NUM set fill factor

 variant-2-detailed-description.patch
  -i, --initialize invokes initialization mode
  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
   run selected initialization steps, in the specified order
   d: Drop any existing pgbench tables
   t: Create the tables used by the standard pgbench scenario
   g: Generate data, client-side
   G: Generate data, server-side
   v: Invoke VACUUM on the standard tables
   p: Create primary key indexes on the standard tables
   f: Create foreign key constraints between the standard tables
  -F, --fillfactor=NUM set fill factor

 variant-3-details-not-list.patch
  -i, --initialize invokes initialization mode
  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
   run selected initialization steps, in the specified order.
   These steps operate on standard tables used by pgbench:
   'd' drops the tables, 't' creates the tables, 'g' generates
   the data on client-side, 'G' generates data on server-side,
   'v' vacuums the tables, 'p' creates primary key constraints,
   and 'f' creates foreign key constraints
  -F, --fillfactor=NUM set fill factor

 variant-4-terse-details-not-list.patch
  -i, --initialize invokes initialization mode
  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
   run selected initialization steps, in the specified order.
   These steps operate on standard tables used by pgbench:
   'd' drop tables, 't' create tables, 'g' generate data
   client-side, 'G' generate data server-side, 'v' vacuum
   tables, 'p' create primary keys, 'f' create foreign keys
  -F, --fillfactor=NUM set fill factor
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 508ed218e8..c74a596b86 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -876,7 +876,8 @@ usage(void)
 		   "\nInitialization 

Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Gurjeet Singh
On Fri, Jun 9, 2023 at 5:42 PM Gregory Smith  wrote:
>
> On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh  wrote:
>>
>> > $ pgbench -i -I dtGvp -s 500
>>
>> The steps are severely under-documented in pgbench --help output.
>
>
> I agree it's not easy to find information.  I just went through double 
> checking I had the order recently enough to remember what I did.  The man 
> pages have this:
>
> > Each step is invoked in the specified order. The default is dtgvp.
>
> Which was what I wanted to read.  Meanwhile the --help output says:
>
> >  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")
>
> Which has the information without a lot of context for what it's used 
> for.  I'd welcome some text expansion that added a minimal but functional 
> improvement to the existing help output; I don't have such text.

Please see attached 2 variants of the patch. Variant 1 simply tells
the reader to consult pgbench documentation. The second variant
provides a description for each of the letters, as the documentation
does. The committer can pick the one they find suitable.

Best regards,
Gurjeet
http://Gurje.et


variant-2-detailed-description.patch
Description: Binary data


variant-1-brief-pointer-to-docs.patch
Description: Binary data


Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Gregory Smith
On Fri, Jun 9, 2023 at 1:25 PM Gurjeet Singh  wrote:

> > $ pgbench -i -I dtGvp -s 500
>
> The steps are severely under-documented in pgbench --help output.
>

I agree it's not easy to find information.  I just went through double
checking I had the order recently enough to remember what I did.  The man
pages have this:

> Each step is invoked in the specified order. The default is dtgvp.

Which was what I wanted to read.  Meanwhile the --help output says:

>  -I, --init-steps=[dtgGvpf]+ (default "dtgvp")

Which has the information without a lot of context for what it's used
for.  I'd welcome some text expansion that added a minimal but functional
improvement to the existing help output; I don't have such text.


Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Tristan Partin
On Fri Jun 9, 2023 at 12:25 PM CDT, Gurjeet Singh wrote:
> On Fri, Jun 9, 2023 at 6:24 AM Gregory Smith  wrote:
> >
> > Unfortunately there's no simple command line option to change just that one 
> > thing about how pgbench runs.  You have to construct a command line that 
> > documents each and every step you want instead.  You probably just want 
> > this form:
> >
> > $ pgbench -i -I dtGvp -s 500
>
> The steps are severely under-documented in pgbench --help output.
> Grepping that output I could not find any explanation of these steps,
> so I dug through the code and found them in runInitSteps(). Just as I
> was thinking of writing a patch to remedy that, just to be sure, I
> checked online docs and sure enough, they are well-documented under
> pgbench [1].
>
> I think at least a pointer to the pgbench docs should be mentioned
> in the pgbench --help output; an average user may not rush to read the
> code to find the explanation, but a hint about where to find more
> details on what the letters in --init-steps mean would save them a lot
> of time.
>
> [1]: https://www.postgresql.org/docs/15/pgbench.html
>
> Best regards,
> Gurjeet
> http://Gurje.et

I think this would be nice to have if you wanted to submit a patch.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Gurjeet Singh
On Fri, Jun 9, 2023 at 6:24 AM Gregory Smith  wrote:
>
> Unfortunately there's no simple command line option to change just that one 
> thing about how pgbench runs.  You have to construct a command line that 
> documents each and every step you want instead.  You probably just want this 
> form:
>
> $ pgbench -i -I dtGvp -s 500

The steps are severely under-documented in pgbench --help output.
Grepping that output I could not find any explanation of these steps,
so I dug through the code and found them in runInitSteps(). Just as I
was thinking of writing a patch to remedy that, just to be sure, I
checked online docs and sure enough, they are well-documented under
pgbench [1].

I think at least a pointer to the pgbench docs should be mentioned
in the pgbench --help output; an average user may not rush to read the
code to find the explanation, but a hint about where to find more
details on what the letters in --init-steps mean would save them a lot
of time.

[1]: https://www.postgresql.org/docs/15/pgbench.html

Best regards,
Gurjeet
http://Gurje.et




Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Tristan Partin
David,

I think you should submit this patch standalone. I don't see any reason
this shouldn't be reviewed and committed when fully fleshed out.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Tristan Partin
On Fri Jun 9, 2023 at 8:24 AM CDT, Gregory Smith wrote:
> On Tue, May 23, 2023 at 1:33 PM Tristan Partin  wrote:
>
> > We (Neon) have noticed that pgbench can be quite slow to populate data
> > over higher-latency connections. Higher scale factors exacerbate
> > this problem. Some employees work on a different continent than the
> > databases they might be benchmarking. By moving pgbench to use COPY for
> > populating all tables, we can reduce some of the time pgbench takes for
> > this particular step.
> >
>
> When latency is continent-sized, pgbench should be run with server-side
> table generation instead of using COPY at all, for any table.  The default
> COPY based pgbench generation is only intended for testing where the client
> and server are very close on the network.
>
> Unfortunately there's no simple command line option to change just that one
> thing about how pgbench runs.  You have to construct a command line that
> documents each and every step you want instead.  You probably just want
> this form:
>
> $ pgbench -i -I dtGvp -s 500
>
> That's server-side table generation with all the usual steps.  I use this
> instead of COPY in pgbench-tools so much now, basically whenever I'm
> talking to a cloud system, that I have a simple 0/1 config option to switch
> between the modes, and this long weird one is the default now.
>
> Try that out, and once you see the numbers my bet is you'll see extending
> which tables get COPY isn't needed by your use case anymore.  Basically, if
> you are close enough to use COPY instead of server-side generation, you are
> close enough that every table besides accounts will not add up to enough
> time to worry about optimizing the little ones.

Thanks for your input, Greg. I'm sure you're correct that server-side data
generation would probably fix the problem. I guess I am trying to
understand if there are any downsides to just committing this anyway. I
haven't done any more testing, but David's email only showed a 4%
performance drop in the workload he tested. Combine this with his hex
patch and we would see an overall performance improvement when
everything is said and done.

It seems like this patch would still be good for client-side high scale
factor data generation (when server and client are close), but I would
need to do testing to confirm.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-09 Thread Gregory Smith
On Tue, May 23, 2023 at 1:33 PM Tristan Partin  wrote:

> We (Neon) have noticed that pgbench can be quite slow to populate data
> over higher-latency connections. Higher scale factors exacerbate
> this problem. Some employees work on a different continent than the
> databases they might be benchmarking. By moving pgbench to use COPY for
> populating all tables, we can reduce some of the time pgbench takes for
> this particular step.
>

When the latency is continent-sized, pgbench should be run with server-side
table generation instead of using COPY at all, for any table.  The default
COPY based pgbench generation is only intended for testing where the client
and server are very close on the network.

Unfortunately there's no simple command line option to change just that one
thing about how pgbench runs.  You have to construct a command line that
documents each and every step you want instead.  You probably just want
this form:

$ pgbench -i -I dtGvp -s 500

That's server-side table generation with all the usual steps.  I use this
instead of COPY in pgbench-tools so much now, basically whenever I'm
talking to a cloud system, that I have a simple 0/1 config option to switch
between the modes, and this long weird one is the default now.
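
For reference, the letters map to the documented init steps like so
(this is just a summary of the pgbench docs):

    d - drop any existing pgbench tables
    t - create the tables
    g - generate data client-side (the path this thread is about)
    G - generate data server-side
    v - vacuum the tables
    p - create primary keys
    f - create foreign keys (not part of the default "dtgvp")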

Try that out, and once you see the numbers my bet is you'll see extending
which tables get COPY isn't needed by your use case anymore.  Basically, if
you are close enough to use COPY instead of server-side generation, you are
close enough that every table besides accounts will not add up to enough
time to worry about optimizing the little ones.

--
Greg Smith  greg.sm...@crunchydata.com
Director of Open Source Strategy


Re: Use COPY for populating all pgbench tables

2023-06-08 Thread Tristan Partin
On Thu Jun 8, 2023 at 12:33 AM CDT, David Rowley wrote:
> On Thu, 8 Jun 2023 at 07:16, Tristan Partin  wrote:
> >
> > master:
> >
> > 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 
> > s))
> > vacuuming...
> > creating primary keys...
> > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side 
> > generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> >
> > patchset:
> >
> > 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 
> > s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side 
> > generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
>
> I've also previously found pgbench -i to be slow.  It was a while ago,
> and IIRC, it was due to printfPQExpBuffer() being a bottleneck
> inside pgbench.
>
> On seeing your email, it makes me wonder if PG16's hex integer
> literals might help here.  These should be much faster to generate in
> pgbench and also parse on the postgres side.

Do you have a link to some docs? I have not heard of the feature.
Definitely feels like a worthy cause.

> I wrote a quick and dirty patch to try that and I'm not really getting
> the same performance increases as I'd have expected. I also tested
> with your patch too and it does not look that impressive either when
> running pgbench on the same machine as postgres.

I didn't expect my patch to increase performance in all workloads. I was
mainly aiming to fix high-latency connections. Based on your results
that looks like a 4% reduction in performance of client-side data
generation. I had thought maybe it is worth having a flag to keep the
old way too, but I am not sure a 4% hit is really that big of a deal.

> pgbench copy speedup
>
> ** master
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
>
> ** David's Patched
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
>
> ** Tristan's patch
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) of pgbench_accounts done (elapsed
> 77.44 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
>
> I'm interested to see what numbers you get.  You'd need to test on
> PG16 however. I left the old code in place to generate the decimal
> numbers for versions < 16.

I will try to test this soon and follow up on the thread. I definitely
see no problems with your patch as is though. I would be more than happy
to rebase my patches on yours.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Use COPY for populating all pgbench tables

2023-06-08 Thread Hannu Krosing
I guess that COPY will still be slower than generating the data
server-side (--init-steps=...G...)?

What I'd really like to see is all the pgbench functions provided on
the server as well, specifically the various random(...) functions -
random_exponential(...), random_gaussian(...), random_zipfian(...) - so
that custom data generation with matching distributions could also be
done easily server-side.
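
As a rough illustration, here is a minimal SQL sketch of an
inverse-transform exponential draw along the lines of what pgbench does
client-side. The function name is hypothetical, and edge cases
(param <= 0, lb > ub) are ignored:

CREATE FUNCTION pgbench_random_exponential(lb bigint, ub bigint,
                                           param double precision)
RETURNS bigint LANGUAGE sql VOLATILE AS $$
  -- random() is uniform in [0, 1), so 1.0 - random() is in (0, 1];
  -- the truncated inverse transform maps that into [0, 1), which is
  -- then scaled into [lb, ub] with a bias toward lb.
  SELECT lb + floor((ub - lb + 1) *
           (-ln(exp(-param) + (1.0 - exp(-param)) * (1.0 - random()))
            / param))::bigint
$$;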

On Thu, Jun 8, 2023 at 7:34 AM David Rowley  wrote:
>
> On Thu, 8 Jun 2023 at 07:16, Tristan Partin  wrote:
> >
> > master:
> >
> > 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 
> > s))
> > vacuuming...
> > creating primary keys...
> > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side 
> > generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> >
> > patchset:
> >
> > 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 
> > s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side 
> > generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
>
> I've also previously found pgbench -i to be slow.  It was a while ago,
> and IIRC, it was due to printfPQExpBuffer() being a bottleneck
> inside pgbench.
>
> On seeing your email, it makes me wonder if PG16's hex integer
> literals might help here.  These should be much faster to generate in
> pgbench and also parse on the postgres side.
>
> I wrote a quick and dirty patch to try that and I'm not really getting
> the same performance increases as I'd have expected. I also tested
> with your patch too and it does not look that impressive either when
> running pgbench on the same machine as postgres.
>
> pgbench copy speedup
>
> ** master
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
>
> ** David's Patched
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
>
> ** Tristan's patch
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) of pgbench_accounts done (elapsed
> 77.44 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
>
> I'm interested to see what numbers you get.  You'd need to test on
> PG16 however. I left the old code in place to generate the decimal
> numbers for versions < 16.
>
> David




Re: Use COPY for populating all pgbench tables

2023-06-07 Thread David Rowley
On Thu, 8 Jun 2023 at 07:16, Tristan Partin  wrote:
>
> master:
>
> 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
> vacuuming...
> creating primary keys...
> done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side 
> generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
>
> patchset:
>
> 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 
> s, remaining 0.00 s))
> vacuuming...
> creating primary keys...
> done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side 
> generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).

I've also previously found pgbench -i to be slow.  It was a while ago,
and IIRC, it was due to printfPQExpBuffer() being a bottleneck
inside pgbench.

On seeing your email, it makes me wonder if PG16's hex integer
literals might help here.  These should be much faster to generate in
pgbench and also parse on the postgres side.
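
(For anyone who hasn't seen them yet, the new literals look like this:

    SELECT 0x2A, 0o52, 0b101010;   -- all three parse as 42

and the PG16 integer input functions accept the same notation, which is
what matters for data sent via COPY.)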

I wrote a quick and dirty patch to try that and I'm not really getting
the same performance increases as I'd have expected. I also tested
with your patch too and it does not look that impressive either when
running pgbench on the same machine as postgres.

pgbench copy speedup

** master
drowley@amd3990x:~$ pgbench -i -s 1000 postgres
1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).

** David's Patched
drowley@amd3990x:~$ pgbench -i -s 1000 postgres
1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).

** Tristan's patch
drowley@amd3990x:~$ pgbench -i -s 1000 postgres
1 of 1 tuples (100%) of pgbench_accounts done (elapsed
77.44 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).

I'm interested to see what numbers you get.  You'd need to test on
PG16 however. I left the old code in place to generate the decimal
numbers for versions < 16.

David
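
For readers skimming the attached patch: uint64_to_hex() writes a
"0x"-prefixed hex string into a caller-supplied buffer, without a
trailing NUL, and returns a pointer just past the last digit. A minimal
usage sketch (illustrative only, not part of the patch):

    char  buf[2 + 16 + 1];   /* "0x", up to 16 hex digits, NUL */
    char *end = uint64_to_hex(buf, (uint64) 48879);

    *end = '\0';             /* the helper does not NUL-terminate */
    /* buf now holds "0xBEEF" */
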
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1d1670d4c2..189670ffb8 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4869,6 +4869,54 @@ initTruncateTables(PGconn *con)
 "pgbench_tellers");
 }
 
+static inline char *
+uint64_to_hex(char *buffer, uint64 num)
+{
+   char   *p = buffer;
+   char   *hexnum = "0123456789ABCDEF";
+
+   *p++ = '0';
+   *p++ = 'x';
+
+   if (num != 0)
+   {
+   unsigned long result;
+   char   *end;
+
+#if defined(_MSC_VER) && (defined(_M_AMD64) || defined(_M_ARM64))
+   _BitScanReverse64(&result, num);
+#elif defined(HAVE__BUILTIN_CLZ)
+
+#if defined(HAVE_LONG_INT_64)
+   result = 63 - __builtin_clzl(num);
+#elif defined(HAVE_LONG_LONG_INT_64)
+   result = 63 - __builtin_clzll(num);
+#else
+#error must have a working 64-bit integer datatype
+#endif
+
+#else
+   /* XXX what to do? */
+#error unable to find method to find left-most 1 bit
+#endif
+   end = &p[(result / 4) + 1];
+   p = &end[-1];
+
+   do {
+   *p-- = hexnum[num % 16];
+   num /= 16;
+   } while (num != 0);
+
+   return end;
+   }
+   else
+   {
+   *p++ = '0';
+   return p;
+   }
+
+}
+
 /*
  * Fill the standard tables with some data generated and sent from the client
  */
@@ -4880,6 +4928,7 @@ initGenerateDataClientSide(PGconn *con)
int i;
int64   k;
char   *copy_statement;
+   int server_version = PQserverVersion(con);
 
/* used to track elapsed time and estimate of the remaining time */
pg_time_usec_t start;
@@ -4928,7 +4977,7 @@ initGenerateDataClientSide(PGconn *con)
 */
 
/* use COPY with FREEZE on v14 and later without partitioning */
-   if (partitions == 0 && PQserverVersion(con) >= 14)
+   if (partitions == 0 && server_version >= 14)
		copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
else
copy_statement = "copy pgbench_accounts from stdin";
@@ -4945,13 +4994,39 @@ initGenerateDataClientSide(PGconn *con)
{
int64   j = k + 1;
 
-   /* "filler" column defaults to blank padded empty string */
-   printfPQExpBuffer(&sql,
- INT64_FORMAT "\t" INT64_FORMAT "\t%d\t\n",
- 

Re: Use COPY for populating all pgbench tables

2023-06-07 Thread Tristan Partin
On Tue May 23, 2023 at 12:33 PM CDT, Tristan Partin wrote:
> I wanted to come with benchmarks, but unfortunately I won't have them
> until next month. I can follow up in a future email.

I finally got around to benchmarking.

master:

$ ./build/src/bin/pgbench/pgbench -i -s 500 CONNECTION_STRING
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 s))
vacuuming...
creating primary keys...
done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side 
generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).

patchset:

$ ./build/src/bin/pgbench/pgbench -i -s 500 CONNECTION_STRING
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 s, 
remaining 0.00 s))
vacuuming...
creating primary keys...
done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side 
generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).

I am located in Austin, Texas. The server was located in Ireland. Just
wanted to put some distance between the client and server. To summarize,
that is about an 80% reduction in client-side data generation times.
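
(That is 1280.43 s down to 246.27 s for the client-side generate step:
(1280.43 - 246.27) / 1280.43 ~= 80.8%.)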

Note that I used Neon to collect these results.

-- 
Tristan Partin
Neon (https://neon.tech)




Use COPY for populating all pgbench tables

2023-05-23 Thread Tristan Partin
Hello,

We (Neon) have noticed that pgbench can be quite slow to populate data
over higher-latency connections. Higher scale factors exacerbate
this problem. Some employees work on a different continent than the
databases they might be benchmarking. By moving pgbench to use COPY for
populating all tables, we can reduce some of the time pgbench takes for
this particular step. 

I wanted to come with benchmarks, but unfortunately I won't have them
until next month. I can follow up in a future email.

-- 
Tristan Partin
Neon (https://neon.tech)
From 498f5c7fddceafed3f831963dfca12256746390c Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 09:21:55 -0500
Subject: [PATCH postgres v1 1/2] Move constant into format string

If we are always going to write a 0, just put the 0 in the format
string.
---
 src/bin/pgbench/pgbench.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 70ed034e70..0fbb9a2616 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4939,8 +4939,8 @@ initGenerateDataClientSide(PGconn *con)
 
 		/* "filler" column defaults to blank padded empty string */
		printfPQExpBuffer(&sql,
-		  INT64_FORMAT "\t" INT64_FORMAT "\t%d\t\n",
-		  j, k / naccounts + 1, 0);
+		  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+		  j, k / naccounts + 1);
 		if (PQputline(con, sql.data))
 			pg_fatal("PQputline failed");
 
-- 
-- 
Tristan Partin
Neon (https://neon.tech)

From 6a326d6e3197142ee03140679ebedc978ee1485c Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH postgres v1 2/2] Use COPY instead of INSERT for populating all
 tables

COPY is a better interface for bulk loading and/or high-latency
connections. Previously, COPY was only used for the pgbench_accounts
table because its data volume is so much larger than that of the other
tables. However, there are also gains to be had in the high-latency
case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c | 152 ++
 1 file changed, 88 insertions(+), 64 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 0fbb9a2616..6c7f143e86 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4861,17 +4861,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PGconn *con, PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
 {
-	PQExpBufferData sql;
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
+{
+	int 		n;
+	int64 		k;
 	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	PQExpBufferData sql;
+	bool		printed = false;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt;
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4880,50 +4909,19 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/* truncate away any old data */
-	initTruncateTables(con);
-
 	initPQExpBuffer(&sql);
 
-	/*
-	 * fill branches, tellers, accounts in that order in case foreign keys
-	 * already exist
-	 */
-	for (i = 0; i < nbranches * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(&sql,
-		  "insert into pgbench_branches(bid,bbalance) values(%d,0)",
-		  i + 1);
-		executeStatement(con, sql.data);
-	}
-
-	for (i = 0; i < ntellers * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(&sql,
-		  "insert into pgbench_tellers(tid,bid,tbalance) va