Hi, Thank you for looking into this!
On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <[email protected]> wrote: > > On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote: > > Thanks, done. > > I took a look at the v3 patches. Here are my high-level thoughts: > > + /* > + * Parse data and transfer into line_buf. To get benefit from inlining, > + * call CopyReadLineText() with the constant boolean variables. > + */ > + if (cstate->simd_continue) > + result = CopyReadLineText(cstate, is_csv, true); > + else > + result = CopyReadLineText(cstate, is_csv, false); > > I'm curious whether this actually generates different code, and if it does, > if it's actually faster. We're already branching on cstate->simd_continue > here. I had the same doubts before but my benchmark shows nice speedup. I used a test which is full of delimiters. The current code gives 2700 ms but when I changed these lines with the 'result = CopyReadLineText(cstate, is_csv, cstate->simd_continue);', the result was 2920 ms. I compiled code with both -O3 and -O2 and the results were similar. > > + /* Load a chunk of data into a vector register */ > + vector8_load(&chunk, (const uint8 *) > ©_input_buf[input_buf_ptr]); > > In other places, processing 2 or 4 vectors of data at a time has proven > faster. Have you tried that here? Sorry, I could not find the related code piece. I only saw the vector8_load() inside of hex_decode_safe() function and its comment says: /* * We must process 2 vectors at a time since the output will be half the * length of the input. */ But this does not mention any speedup from using 2 vectors at a time. Could you please show the related code? > > + /* \n and \r are not special inside quotes */ > + if (!in_quote) > + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, > cr)); > + > + if (is_csv) > + { > + match = vector8_or(match, vector8_eq(chunk, quote)); > + if (escapec != '\0') > + match = vector8_or(match, vector8_eq(chunk, escape)); > + } > + else > + match = vector8_or(match, vector8_eq(chunk, bs)); > > The amount of branching here catches my eye. Some branching might be > unavoidable, but in general we want to keep these SIMD paths as branch-free > as possible. You are right, I will check these branches and will try to remove as many branches as possible. > > + /* > + * Found a special character. Advance up to that point and > let > + * the scalar code handle it. > + */ > + int advance = pg_rightmost_one_pos32(mask); > + > + input_buf_ptr += advance; > + simd_total_advance += advance; > > Do we actually need to advance here? Or could we just fall through to the > scalar path? My suspicion is that this extra code doesn't gain us much. My testing shows that if we advance more than ~5 characters then SIMD is worth it, but if we advance less than ~5; then code causes a regression. I used this information while writing a heuristic. > > + if (simd_last_sleep_cycle == 0) > + simd_last_sleep_cycle = 1; > + else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2) > + simd_last_sleep_cycle = SIMD_SLEEP_MAX; > + else > + simd_last_sleep_cycle <<= 1; > + cstate->simd_current_sleep_cycle = simd_last_sleep_cycle; > + cstate->simd_last_sleep_cycle = simd_last_sleep_cycle; > > IMHO we should be looking for ways to simplify this should-we-use-SIMD > code. For example, perhaps we could just disable the SIMD path for 10K or > 100K lines any time a special character is found. I'm dubious that a lot > of complexity is warranted. I think this is a bit too harsh since SIMD is still worth it if SIMD can advance more than ~5 character average. I am trying to use SIMD as much as possible when it is worth it but what you said can remove the regression completely, perhaps that is the correct way. -- Regards, Nazir Bilal Yavuz Microsoft
