Hi,

On Mon, 2 Mar 2026 at 22:55, Nathan Bossart <[email protected]> wrote:
>
> On Wed, Feb 25, 2026 at 05:24:27PM +0300, Nazir Bilal Yavuz wrote:
> > If anyone has any suggestions/ideas, please let me know!

I was able to fix the problem. My first assumption was that the
branching in the SIMD code caused it, so I moved the SIMD code into a
CopyReadLineTextSIMDHelper() function and called that function at the
top of CopyReadLineText(), so that there is no SIMD-related branching
in the non-SIMD (scalar) code path. This didn't solve the problem.
Then I realized that the regression remains even when I disable the
SIMD code path with 'if (false)', but it disappears completely when I
comment out the whole 'if (cstate->simd_enabled)' branch.

To find out more, I compared the assembly output of both versions and
found the likely reason. As far as I understand, the compiler can't
promote some variables to registers; instead they live on the stack,
which is slower. Please see the two different assembly outputs:

Slow code:

        c = copy_input_buf[input_buf_ptr++];
     db0:    48 8b 55 b8              mov    -0x48(%rbp),%rdx
     db4:    48 63 c6                 movslq %esi,%rax
     db7:    44 8d 66 01              lea    0x1(%rsi),%r12d
     dbb:    44 89 65 cc              mov    %r12d,-0x34(%rbp)
     dbf:    0f be 14 02              movsbl (%rdx,%rax,1),%edx

Fast code:

        c = copy_input_buf[input_buf_ptr++];
     d80:    49 63 c4                 movslq %r12d,%rax
     d83:    45 8d 5c 24 01           lea    0x1(%r12),%r11d
     d88:    41 0f be 04 06           movsbl (%r14,%rax,1),%eax

The reason is that the address of input_buf_ptr is passed to the
helper, i.e. CopyReadLineTextSIMDHelper(..., &input_buf_ptr). If I
change it to this:

int            temp_input_buf_ptr = input_buf_ptr;
CopyReadLineTextSIMDHelper(..., &temp_input_buf_ptr);

Then there is no regression. However, I am still not completely sure
whether this is the same problem as in v10; I am planning to spend
more time debugging this.
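
A minimal, standalone sketch of the workaround pattern described above.
It is not the patch code: skip_spaces_helper() and count_until_newline()
are hypothetical names made up for illustration. The idea, as I
understand it, is that once the address of a local escapes to another
function, the compiler may have to keep that local in memory; taking
the address of a short-lived copy instead leaves the hot-loop variable
free to stay in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the SIMD helper: it may advance *pos. */
static void
skip_spaces_helper(const char *buf, int len, int *pos)
{
	assert(buf != NULL && pos != NULL);
	while (*pos < len && buf[*pos] == ' ')
		(*pos)++;
}

/*
 * Skip leading spaces via the helper, then count the bytes up to the
 * first newline (or up to len if there is no newline).
 */
static int
count_until_newline(const char *buf, int len)
{
	int			pos = 0;
	int			count = 0;

	{
		/* Only this short-lived copy has its address taken. */
		int			tmp_pos = pos;

		skip_spaces_helper(buf, len, &tmp_pos);
		pos = tmp_pos;
	}

	/* Hot loop: the address of 'pos' never escaped this function. */
	while (pos < len && buf[pos] != '\n')
	{
		pos++;
		count++;
	}
	return count;
}
```

Whether a given compiler really keeps 'pos' in a register depends on
the compiler, the flags and inlining decisions, so comparing the
generated assembly (for example with 'gcc -O2 -S') is the only way to
be sure.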

> A couple of random ideas:
>
> * Additional inlining for callers.  I looked around a little bit and didn't
> see any great candidates, so I don't have much faith in this, but maybe
> you'll see something I don't.

I agree with you. CopyReadLineText() is already quite a big function.

> * Disable SIMD if we are consistently getting small rows.  That won't help
> your "wide & CSV 1/3" case in all likelihood, but perhaps it'll help with
> the regression for narrow rows described elsewhere.

I implemented this: two consecutive small rows disable SIMD.
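
As a rough sketch of this heuristic, assuming a "small row" means that
the scan stopped before a single full-sized vector could be skipped:
the struct and function names below are hypothetical, only the two
flags mirror simd_enabled and simd_failed_first_vector from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-COPY state; only the two flags mirror the patch. */
typedef struct ScanState
{
	bool		simd_enabled;
	bool		failed_first_vector;
} ScanState;

static void
scan_state_init(ScanState *st)
{
	assert(st != NULL);
	st->simd_enabled = true;
	st->failed_first_vector = false;
}

/*
 * Record the outcome of one line. 'stopped_in_first_vector' is true when
 * the scan had to stop before skipping even one full vector.
 */
static void
scan_state_note_line(ScanState *st, bool stopped_in_first_vector)
{
	assert(st != NULL);

	if (!st->simd_enabled)
		return;

	/* Two short lines in a row: give up on the vector path. */
	if (stopped_in_first_vector && st->failed_first_vector)
		st->simd_enabled = false;

	/* A long line in between resets the streak. */
	st->failed_first_vector = stopped_in_first_vector;
}
```

In this sketch a single short row never disables the vector path; only
two in a row do, and once disabled it stays disabled for the rest of
the input.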

> * Surround the variable initializations with "if (simd_enabled)".
> Presumably compilers are smart enough to remove those in the non-SIMD paths
> already, but it could be worth a try.

Done.

> * Add simd_enabled function parameter to CopyReadLine(),
> NextCopyFromRawFieldsInternal(), and CopyFromTextLikeOneRow(), and do the
> bool literal trick in CopyFrom{Text,CSV}OneRow().  That could encourage the
> compiler to do some additional optimizations to reduce branching.

I think we don't need this. At least the implementation with
CopyReadLineTextSIMDHelper() doesn't need it, since the branch is at
the top of the function and is evaluated only once per line.
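
For readers not familiar with the "bool literal trick" mentioned in
the quoted suggestion, here is a generic sketch of it. All names are
hypothetical, and plain 'static inline' is used here where PostgreSQL
would use pg_attribute_always_inline. The shared implementation takes
a bool parameter, and thin wrappers pass a literal so that, once the
call is inlined, the compiler can drop the branches that cannot be
taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Shared implementation; 'is_csv' is meant to be a compile-time literal. */
static inline size_t
count_specials_impl(const char *buf, size_t len, bool is_csv)
{
	size_t		n = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < len; i++)
	{
		char		c = buf[i];

		if (c == '\n' || c == '\r')
			n++;
		else if (is_csv && c == '"')
			n++;
		else if (!is_csv && c == '\\')
			n++;
	}
	return n;
}

/* Text-format wrapper: the literal lets the CSV branch be removed. */
static size_t
count_specials_text(const char *buf, size_t len)
{
	return count_specials_impl(buf, len, false);
}

/* CSV wrapper: the literal lets the backslash branch be removed. */
static size_t
count_specials_csv(const char *buf, size_t len)
{
	return count_specials_impl(buf, len, true);
}
```

Whether the specialization actually happens is up to the compiler,
which is why the suggestion above is phrased as something that "could
encourage" additional optimizations.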

I think v11 looks better than v10. I like the
CopyReadLineTextSIMDHelper() helper function, and I also like that it
is called at the top of CopyReadLineText() rather than inside the
scalar path. This gives us more optimization options without
affecting the scalar path.
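
The "helper first, scalar loop afterwards" structure can be sketched
in standalone C as follows. This is only an illustration under stated
assumptions: CHUNK_SIZE, fast_skip() and find_line_end() are
hypothetical names, the per-chunk check is a plain byte loop standing
in for the vector comparison, and escape handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE 16			/* assumed vector width in bytes */

static bool
is_special(char c)
{
	return c == '\n' || c == '\r' || c == '\\';
}

/* Stand-in for the vector compare: offset of first special byte, or CHUNK_SIZE. */
static size_t
first_special_in_chunk(const char *chunk)
{
	for (size_t i = 0; i < CHUNK_SIZE; i++)
	{
		if (is_special(chunk[i]))
			return i;
	}
	return CHUNK_SIZE;
}

/*
 * Fast path: skip whole chunks that contain no special byte. Stops at the
 * first special byte or when less than a full chunk remains.
 */
static size_t
fast_skip(const char *buf, size_t len, size_t pos)
{
	assert(buf != NULL && pos <= len);
	while (len - pos >= CHUNK_SIZE)
	{
		size_t		off = first_special_in_chunk(buf + pos);

		pos += off;
		if (off < CHUNK_SIZE)
			break;
	}
	return pos;
}

/* Index of the line terminator, or len if there is none. */
static size_t
find_line_end(const char *buf, size_t len, bool fast_enabled)
{
	size_t		pos = 0;

	/* One branch per line, before the scalar loop starts. */
	if (fast_enabled)
		pos = fast_skip(buf, len, pos);

	/* Scalar loop: unchanged whether or not the fast path ran. */
	while (pos < len && buf[pos] != '\n' && buf[pos] != '\r')
		pos++;
	return pos;
}
```

The point of this layout is that the scalar loop is identical in both
cases; the fast path only moves its starting position forward.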

Here are the new benchmark results. I benchmarked the changes with
both -O2 and -O3, and both with and without the 'changing
default_toast_compression to lz4' commit (65def42b1d5). The results
show no regression, and the performance improvement is much bigger
with 65def42b1d5: in the wide 'None' case it is close to 2x for the
text format and more than 2x for the CSV format.

------------------------------

Benchmark results:

With 65def42b1d5:

+---------------------------------------------------------+
|                    Optimization: -O2                    |
+--------------------------+--------------+---------------+
|                          |     Text     |      CSV      |
+--------------------------+------+-------+-------+-------+
|           WIDE           | None |  1/3  |  None |  1/3  |
+--------------------------+------+-------+-------+-------+
|        Old Master        | 4220 |  4780 |  5930 |  8250 |
+--------------------------+------+-------+-------+-------+
| Old Master + 0001 + 0002 | 2520 |  4500 |  2520 |  7800 |
+--------------------------+------+-------+-------+-------+
|                          |      |       |       |       |
+--------------------------+------+-------+-------+-------+
|                          |     Text     |      CSV      |
+--------------------------+------+-------+-------+-------+
|          NARROW          | None |  1/3  |  None |  1/3  |
+--------------------------+------+-------+-------+-------+
|        Old Master        | 9920 | 10100 | 10200 | 10470 |
+--------------------------+------+-------+-------+-------+
| Old Master + 0001 + 0002 | 9970 | 10000 | 10180 | 10350 |
+--------------------------+------+-------+-------+-------+
|                                                         |
+---------------------------------------------------------+
|                                                         |
+---------------------------------------------------------+
|                    Optimization: -O3                    |
+--------------------------+--------------+---------------+
|                          |     Text     |      CSV      |
+--------------------------+------+-------+-------+-------+
|           WIDE           | None |  1/3  |  None |  1/3  |
+--------------------------+------+-------+-------+-------+
|        Old Master        | 4100 |  4900 |  6200 |  8300 |
+--------------------------+------+-------+-------+-------+
| Old Master + 0001 + 0002 | 2470 |  4440 |  2570 |  7700 |
+--------------------------+------+-------+-------+-------+
|                          |      |       |       |       |
+--------------------------+------+-------+-------+-------+
|                          |     Text     |      CSV      |
+--------------------------+------+-------+-------+-------+
|          NARROW          | None |  1/3  |  None |  1/3  |
+--------------------------+------+-------+-------+-------+
|        Old Master        | 9530 |  9690 |  9800 | 10080 |
+--------------------------+------+-------+-------+-------+
| Old Master + 0001 + 0002 | 9350 |  9450 |  9700 | 10000 |
+--------------------------+------+-------+-------+-------+

------------------------------

Without 65def42b1d5:

+----------------------------------------------------------+
|                     Optimization: -O2                    |
+--------------------------+---------------+---------------+
|                          |      Text     |      CSV      |
+--------------------------+-------+-------+-------+-------+
|           WIDE           |  None |  1/3  |  None |  1/3  |
+--------------------------+-------+-------+-------+-------+
|        Old Master        | 10550 | 11030 | 12250 | 14400 |
+--------------------------+-------+-------+-------+-------+
| Old Master + 0001 + 0002 |  8890 | 10700 |  8870 | 14070 |
+--------------------------+-------+-------+-------+-------+
|                          |       |       |       |       |
+--------------------------+-------+-------+-------+-------+
|                          |      Text     |      CSV      |
+--------------------------+-------+-------+-------+-------+
|          NARROW          |  None |  1/3  |  None |  1/3  |
+--------------------------+-------+-------+-------+-------+
|        Old Master        |  9921 | 10205 | 10123 | 10420 |
+--------------------------+-------+-------+-------+-------+
| Old Master + 0001 + 0002 |  9880 | 10070 | 10150 | 10400 |
+--------------------------+-------+-------+-------+-------+
|                                                          |
+----------------------------------------------------------+
|                                                          |
+----------------------------------------------------------+
|                     Optimization: -O3                    |
+--------------------------+---------------+---------------+
|                          |      Text     |      CSV      |
+--------------------------+-------+-------+-------+-------+
|           WIDE           |  None |  1/3  |  None |  1/3  |
+--------------------------+-------+-------+-------+-------+
|        Old Master        | 10500 | 11100 | 12600 | 14580 |
+--------------------------+-------+-------+-------+-------+
| Old Master + 0001 + 0002 |  8900 | 10660 |  8860 | 13990 |
+--------------------------+-------+-------+-------+-------+
|                          |       |       |       |       |
+--------------------------+-------+-------+-------+-------+
|                          |      Text     |      CSV      |
+--------------------------+-------+-------+-------+-------+
|          NARROW          |  None |  1/3  |  None |  1/3  |
+--------------------------+-------+-------+-------+-------+
|        Old Master        |  9600 |  9700 |  9800 | 10150 |
+--------------------------+-------+-------+-------+-------+
| Old Master + 0001 + 0002 |  9300 |  9470 |  9600 |  9880 |
+--------------------------+-------+-------+-------+-------+

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 7acaeb3201ae4ae279bf8b25641bea7f8cb92cbe Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Wed, 4 Mar 2026 17:28:54 +0300
Subject: [PATCH v11] Speed up COPY FROM text/CSV parsing using SIMD

This patch disables SIMD when SIMD encounters a special character which
is neither EOF nor EOL.

Author: Shinya Kato <[email protected]>
Author: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Kazar Ayoub <[email protected]>
Reviewed-by: Nathan Bossart <[email protected]>
Reviewed-by: Neil Conway <[email protected]>
Reviewed-by: Andrew Dunstan <[email protected]>
Reviewed-by: Manni Wood <[email protected]>
Reviewed-by: Mark Wong <[email protected]>
Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com
---
 src/backend/commands/copyfrom.c          |   4 +
 src/backend/commands/copyfromparse.c     | 222 ++++++++++++++++++++++-
 src/include/commands/copyfrom_internal.h |   4 +
 3 files changed, 223 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 2f42f55e229..2aa52810ff1 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1747,6 +1747,10 @@ BeginCopyFrom(ParseState *pstate,
 	cstate->cur_attval = NULL;
 	cstate->relname_only = false;
 
+	/* Initialize SIMD */
+	cstate->simd_enabled = true;
+	cstate->simd_failed_first_vector = false;
+
 	/*
 	 * Allocate buffers for the input pipeline.
 	 *
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index fbd13353efc..70e1a5a0410 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -72,6 +72,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -158,6 +159,12 @@ static pg_attribute_always_inline bool NextCopyFromRawFieldsInternal(CopyFromSta
 																	 int *nfields,
 																	 bool is_csv);
 
+/* SIMD functions */
+#ifndef USE_NO_SIMD
+static bool CopyReadLineTextSIMDHelper(CopyFromState cstate, bool is_csv,
+									   bool *temp_hit_eof, int *temp_input_buf_ptr);
+#endif
+
 
 /* Low-level communications functions */
 static int	CopyGetData(CopyFromState cstate, void *databuf,
@@ -1310,6 +1317,182 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
 	return result;
 }
 
+#ifndef USE_NO_SIMD
+/*
+ * Use SIMD instructions to efficiently scan the input buffer for special
+ * characters (e.g., newline, carriage return, quote, and escape). This is
+ * faster than byte-by-byte iteration, especially on large buffers.
+ *
+ * Note that SIMD may become slower when the input contains many special
+ * characters. To avoid this regression, we disable SIMD for the rest of the
+ * input once we encounter a special character which is neither EOF nor EOL.
+ * SIMD is also disabled after two consecutive lines that are too short for
+ * SIMD to skip even one full-sized vector.
+ */
+static bool
+CopyReadLineTextSIMDHelper(CopyFromState cstate, bool is_csv, bool *temp_hit_eof, int *temp_input_buf_ptr)
+{
+	char		quotec = '\0';
+	char		escapec = '\0';
+	char	   *copy_input_buf;
+	int			input_buf_ptr;
+	int			copy_buf_len;
+	bool		result = false;
+	bool		unique_escapec = false;
+	bool		first_vector = true;
+	Vector8		nl = vector8_broadcast('\n');
+	Vector8		cr = vector8_broadcast('\r');
+	Vector8		bs = vector8_broadcast('\\');
+	Vector8		quote = vector8_broadcast(0);
+	Vector8		escape = vector8_broadcast(0);
+
+	if (is_csv)
+	{
+		quotec = cstate->opts.quote[0];
+		escapec = cstate->opts.escape[0];
+
+		quote = vector8_broadcast(quotec);
+		if (quotec != escapec)
+		{
+			unique_escapec = true;
+			escape = vector8_broadcast(escapec);
+		}
+	}
+
+	/* For a little extra speed we copy these into local variables */
+	copy_input_buf = cstate->input_buf;
+	input_buf_ptr = cstate->input_buf_index;
+	copy_buf_len = cstate->input_buf_len;
+
+	while (true)
+	{
+		/* Load more data if needed */
+		if (sizeof(Vector8) >= copy_buf_len - input_buf_ptr)
+		{
+			REFILL_LINEBUF;
+
+			CopyLoadInputBuf(cstate);
+			/* update our local variables */
+			*temp_hit_eof = cstate->input_reached_eof;
+			input_buf_ptr = cstate->input_buf_index;
+			copy_buf_len = cstate->input_buf_len;
+
+			/*
+			 * If we are completely out of data, break out of the loop,
+			 * reporting EOF.
+			 */
+			if (INPUT_BUF_BYTES(cstate) <= 0)
+			{
+				result = true;
+				break;
+			}
+		}
+
+		if (copy_buf_len - input_buf_ptr > sizeof(Vector8))
+		{
+			Vector8		chunk;
+			Vector8		match = vector8_broadcast(0);
+
+			/* Load a chunk of data into a vector register */
+			vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+			if (is_csv)
+			{
+				match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+				match = vector8_or(match, vector8_eq(chunk, quote));
+				if (unique_escapec)
+					match = vector8_or(match, vector8_eq(chunk, escape));
+			}
+			else
+			{
+				match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+				match = vector8_or(match, vector8_eq(chunk, bs));
+			}
+
+			/* Check if we found any special characters */
+			if (vector8_is_highbit_set(match))
+			{
+				/*
+				 * Found a special character. Advance up to that point and let
+				 * the scalar code handle it.
+				 */
+				uint32		mask;
+				int			advance;
+				char		c1,
+							c2;
+				bool		simd_hit_eol,
+							simd_hit_eof;
+
+				mask = vector8_highbit_mask(match);
+				advance = pg_rightmost_one_pos32(mask);
+
+				input_buf_ptr += advance;
+				c1 = copy_input_buf[input_buf_ptr];
+
+				/*
+				 * Since we stopped within the chunk and ((copy_buf_len -
+				 * input_buf_ptr) > sizeof(Vector8)) is true,
+				 * copy_input_buf[input_buf_ptr + 1] is guaranteed to be
+				 * readable.
+				 */
+				c2 = copy_input_buf[input_buf_ptr + 1];
+
+				simd_hit_eof = (c1 == '\\' && c2 == '.' && !is_csv);
+				simd_hit_eol = (c1 == '\r' || c1 == '\n');
+
+				/*
+				 * Do not disable SIMD when we hit EOL or EOF characters. In
+				 * practice, it does not matter for EOF because parsing ends
+				 * there, but we keep the behavior consistent.
+				 */
+				if (!(simd_hit_eof || simd_hit_eol))
+					cstate->simd_enabled = false;
+
+				/*
+				 * We encountered an EOL or EOF on the first vector. This means
+				 * the line is not long enough to skip a full-sized vector. If
+				 * this happens two times consecutively, then disable
+				 * SIMD.
+				 */
+				if (first_vector)
+				{
+					if (cstate->simd_failed_first_vector)
+						cstate->simd_enabled = false;
+
+					cstate->simd_failed_first_vector = true;
+				}
+
+				break;
+			}
+			else
+			{
+				/* No special characters found, so skip the entire chunk */
+				input_buf_ptr += sizeof(Vector8);
+				first_vector = false;
+			}
+		}
+
+		/*
+		 * Even after the refill, there are not enough characters left to
+		 * fill a full-sized vector. This doesn't mean that the line itself
+		 * is too short to fill a full-sized vector.
+		 *
+		 * Scalar code will handle the rest for this line. Then, SIMD will
+		 * continue from the next line.
+		 */
+		else
+		{
+			first_vector = false;
+			break;
+		}
+	}
+
+	cstate->simd_failed_first_vector = first_vector;
+	*temp_input_buf_ptr = input_buf_ptr;
+	return result;
+}
+#endif
+
 /*
  * CopyReadLineText - inner loop of CopyReadLine for text mode
  */
@@ -1338,6 +1521,38 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 			escapec = '\0';
 	}
 
+	/* input_buf_ptr will be used in the SIMD Helper function */
+	input_buf_ptr = cstate->input_buf_index;
+
+#ifndef USE_NO_SIMD
+	/* First try to run SIMD, then continue with the scalar path */
+	if (cstate->simd_enabled)
+	{
+		int			temp_input_buf_ptr = input_buf_ptr;
+		bool		temp_hit_eof = false;
+
+		result = CopyReadLineTextSIMDHelper(cstate, is_csv, &temp_hit_eof,
+											&temp_input_buf_ptr);
+		input_buf_ptr = temp_input_buf_ptr;
+		hit_eof = temp_hit_eof;
+
+		/* Short exit from SIMD */
+		if (result)
+		{
+			/*
+			 * Transfer any still-uncopied data to line_buf.
+			 */
+			REFILL_LINEBUF;
+
+			return result;
+		}
+	}
+#endif
+
+	/* For a little extra speed we copy these into local variables */
+	copy_input_buf = cstate->input_buf;
+	copy_buf_len = cstate->input_buf_len;
+
 	/*
 	 * The objective of this loop is to transfer the entire next input line
 	 * into line_buf.  Hence, we only care for detecting newlines (\r and/or
@@ -1359,14 +1574,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 	 * character to examine; any characters from input_buf_index to
 	 * input_buf_ptr have been determined to be part of the line, but not yet
 	 * transferred to line_buf.
-	 *
-	 * For a little extra speed within the loop, we copy input_buf and
-	 * input_buf_len into local variables.
 	 */
-	copy_input_buf = cstate->input_buf;
-	input_buf_ptr = cstate->input_buf_index;
-	copy_buf_len = cstate->input_buf_len;
-
 	for (;;)
 	{
 		int			prev_raw_ptr;
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index f892c343157..4a748df8ac8 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -89,6 +89,10 @@ typedef struct CopyFromStateData
 	const char *cur_attval;		/* current att value for error messages */
 	bool		relname_only;	/* don't output line number, att, etc. */
 
+	/* SIMD variables */
+	bool		simd_enabled;
+	bool		simd_failed_first_vector;
+
 	/*
 	 * Working state
 	 */
-- 
2.47.3
