I was using COPY recently and was wondering why BINARY format is not much (if any) faster than the default format. Once I switched from mostly exporting ints to mostly exporting double precisions (7e6 rows of 100 columns, randomly generated), it was faster, but not by as much as I intuitively thought it should be.
I ran 'perf top' to profile a "COPY BINARY .. TO '/dev/null'" on an AWS m5.large machine running Ubuntu 18.04, with a self-compiled PostgreSQL:

PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

The hotspot was pq_begintypsend at 20%, which was twice the percentage of the runner-up (AllocSetAlloc). If I drill down into the function, I see something like the below.

I don't really speak assembly, but usually when I see an assembly instruction being especially hot and not being the innermost instruction in a loop, I blame it on CPU cache misses. But everything being touched here should already be well cached, since initStringInfo has just finished setting it up. And even if it were not, by the 2nd invocation of appendStringInfoCharMacro it certainly should be in the cache, yet that one is even slower than the 1st appendStringInfoCharMacro.

Why is this such a bottleneck?

pq_begintypsend  /usr/local/pgsql/bin/postgres

  0.15 |       push   %rbx
  0.09 |       mov    %rdi,%rbx
       |       initStringInfo(buf);
  3.03 |       callq  initStringInfo
       |       /* Reserve four bytes for the bytea length word */
       |       appendStringInfoCharMacro(buf, '\0');
       |       movslq 0x8(%rbx),%rax
  1.05 |       lea    0x1(%rax),%edx
  0.72 |       cmp    0xc(%rbx),%edx
       |       jge    b0
  2.92 |       mov    (%rbx),%rdx
       |       movb   $0x0,(%rdx,%rax,1)
 13.76 |       mov    0x8(%rbx),%eax
  0.81 |       mov    (%rbx),%rdx
  0.52 |       add    $0x1,%eax
  0.12 |       mov    %eax,0x8(%rbx)
  2.85 |       cltq
  0.01 |       movb   $0x0,(%rdx,%rax,1)
       |       appendStringInfoCharMacro(buf, '\0');
 10.65 |       movslq 0x8(%rbx),%rax
       |       lea    0x1(%rax),%edx
  0.90 |       cmp    0xc(%rbx),%edx
       |       jge    ca
  0.54 | 42:   mov    (%rbx),%rdx
  1.84 |       movb   $0x0,(%rdx,%rax,1)
 13.88 |       mov    0x8(%rbx),%eax
  0.03 |       mov    (%rbx),%rdx
       |       add    $0x1,%eax
  0.33 |       mov    %eax,0x8(%rbx)
  2.60 |       cltq
  0.06 |       movb   $0x0,(%rdx,%rax,1)
       |       appendStringInfoCharMacro(buf, '\0');
  3.21 |       movslq 0x8(%rbx),%rax
  0.23 |       lea    0x1(%rax),%edx
  1.74 |       cmp    0xc(%rbx),%edx
       |       jge    e0
  0.21 | 67:   mov    (%rbx),%rdx
  1.18 |       movb   $0x0,(%rdx,%rax,1)
  9.29 |       mov    0x8(%rbx),%eax
  0.18 |       mov    (%rbx),%rdx
       |       add    $0x1,%eax
  0.19 |       mov    %eax,0x8(%rbx)
  3.14 |       cltq
  0.12 |       movb   $0x0,(%rdx,%rax,1)
       |       appendStringInfoCharMacro(buf, '\0');
  5.29 |       movslq 0x8(%rbx),%rax
  0.03 |       lea    0x1(%rax),%edx
  1.45 |       cmp    0xc(%rbx),%edx
       |       jge    f6
  0.41 | 8c:   mov    (%rbx),%rdx

Cheers,

Jeff
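For anyone mapping the annotation back to C, here is a minimal, self-contained sketch of the pattern being profiled. It is not the PostgreSQL source: SketchBuf, sketch_init, sketch_append_char, SKETCH_APPEND_CHAR and sketch_begintypsend are hypothetical stand-ins modeled on StringInfoData, initStringInfo, appendStringInfoChar, appendStringInfoCharMacro and pq_begintypsend, and the 1024-byte initial size is just the sketch's choice. The struct layout is picked so that the fields land at the offsets seen in the disassembly (data at 0x0, len at 0x8, maxlen at 0xc) on a typical 64-bit build.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for StringInfoData. */
typedef struct SketchBuf
{
    char *data;     /* buffer pointer, read as (%rbx) in the annotation */
    int   len;      /* current length, read as 0x8(%rbx) */
    int   maxlen;   /* allocated size, read as 0xc(%rbx) */
} SketchBuf;

/* Stand-in for initStringInfo(): allocate an empty, NUL-terminated buffer. */
static void
sketch_init(SketchBuf *buf)
{
    buf->maxlen = 1024;             /* size chosen for the sketch */
    buf->data = malloc(buf->maxlen);
    assert(buf->data != NULL);
    buf->len = 0;
    buf->data[0] = '\0';
}

/* Stand-in for the out-of-line slow path: grow if needed, then append. */
static void
sketch_append_char(SketchBuf *buf, char ch)
{
    if (buf->len + 1 >= buf->maxlen)
    {
        buf->maxlen *= 2;
        buf->data = realloc(buf->data, buf->maxlen);
        assert(buf->data != NULL);
    }
    buf->data[buf->len] = ch;
    buf->data[++buf->len] = '\0';
}

/*
 * Inline fast path. Each use stores two bytes: the character itself and a
 * fresh trailing NUL. Between the two stores, len and data are read again
 * from the struct, presumably because a store through a char pointer is
 * allowed to alias the struct's own fields.
 */
#define SKETCH_APPEND_CHAR(buf, ch) \
    (((buf)->len + 1 >= (buf)->maxlen) ? \
     sketch_append_char(buf, ch) : \
     (void) ((buf)->data[(buf)->len] = (ch), \
             (buf)->data[++(buf)->len] = '\0'))

/* Stand-in for pq_begintypsend(): reserve four bytes for a length word. */
void
sketch_begintypsend(SketchBuf *buf)
{
    sketch_init(buf);
    SKETCH_APPEND_CHAR(buf, '\0');
    SKETCH_APPEND_CHAR(buf, '\0');
    SKETCH_APPEND_CHAR(buf, '\0');
    SKETCH_APPEND_CHAR(buf, '\0');
}
```

Each macro use lines up with one cmp/jge bounds check followed by two movb stores, with the length and data pointer reloaded in between, which is the shape repeated four times in the annotation. The sketch only shows where the instructions come from; it does not by itself explain why the reloads account for so many samples.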