Re: Proposal: PqSendBuffer removal

2020-03-10 Thread Craig Ringer
faifaiOn Sat, 7 Mar 2020 at 02:45, Andres Freund  wrote:
>
> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov  wrote:
> >Thank you for your reply!
> >
> >Yes, you are right there will be a separate call to send the data, but
> >is
> >copying data each time more costly operation than just one syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot of 
> cases. They've gotten a lot more expensive with all the security issues.

Good to know.

> >Besides, if we already have a ready message packet to be sent why
> >should we
> >wait?
>
> In a lot of cases we'll send a number of small messages after each other. We 
> don't want to send those out separately, that'd just increase overhead.

Right. We presently set `TCP_NODELAY` to disable Nagle's algorithm, so
we'd tend to generate these messages as separate packets as well as
seperate syscalls, making it doubly undesirable.

We can dynamically twiddle TCP_NODELAY like many other applications do
to optimise the wire packets, but not the syscalls. It'd actually cost
extra syscalls.

> But in some paths/workloads the copy is quite noticable. I've mused before 
> whether we could extend StringInfo to handle cases like this. E.g. by having 
> StringInfo have two lengths. One that is the offset to the start of the 
> allocated memory (0 for plain StringInfos), and one for the length of the 
> string being built.
>
> Then we could get a StringInfo pointing directly to the current insertion 
> point in the send buffer.  To support growing it, enlargeStringInfo would 
> first subtract the offset to the start of the allocation, and then reallocate 
> that.
>
> I can imagine that bring useful in a number of places. And because there only 
> would be additional overhead when actually growing the StringInfo, I don't 
> think the cost would be measurable.

That sounds pretty sensible as it'd be minimally intrusive.

I've wondered whether we can further optimise some uses by having
libpq-be manage an iovec for us instead, much like we support iovec
for shm_mq. Instead of a StringInfo we'd use an iovec wrapped by a
libpq-managed struct. libpq would reserve the first few entries in the
managed iovec for the message header. Variants of pq_sendint64(...)
etc would add entries to the iovec and could be inlined since they'd
just be convenience routines. The message would be flushed by having
libpq call writev() on the iovec container.

We'd want a wrapper struct for the iovec so we could have libpq keep a
cursor for the next entry in the iovec. For libpq-fe it'd also contain
a map of which iovec entries need to be free()d; for libpq-be we'd
probably palloc(), maybe with a child memory context. To support a
stack-allocated iovec for when we know all the message fields in
advance we'd have an init function that takes the address  of the
preallocated iovec and its length limit.

We could support  We could also support a libpq-wrapped iovec where
libpq can realloc it if it fills.
stack-allocated iovec - so the caller would be responsible for
managing the max size


That way we can do zero-copy scatter-gather I/O for messages that
don't require binary-to-text-format transformations etc.

BTW, if we change StringInfo, I'd like to also official bless the
usage pattern where we wrap a buffer in a StringInfo so we can use
pq_getmsgint64 etc on it. Add a initConstStringInfo(StringInfo si,
const char * buf, size_t buflen) or something that assigns the
StringInfo values and sets maxlen = -1.  The only in-core usage I see
for this so far is in src/backend/replication/logical/worker.c but
it's used extremely heavily in pglogical etc. It'd just be a
convenience function that blesses and documents existing usage.

But like Tom I really want to first come back to the evidence. Why
should we bother? Are we solving an actual problem here? PostgreSQL is
immensely memory-allocation-happy and copy-happy. Shouldn't we be more
interested in things like reducing the cost of multiple copies and
transform passes of Datum values? Especially since that's an actual
operational pain point when you're working with multi-hundred-megabyte
bytea or text fields.

Can you come up with some profiling/performance numbers that track
time spent on memory copying in the areas you propose to target, plus
malloc overheads? With a tool like systemtap or perf it should not be
overly difficult to do so by making targeted probes that filter based
on callstack, or on file / line-range or function.




Re: Proposal: PqSendBuffer removal

2020-03-09 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-07 13:54:57 -0500, Tom Lane wrote:
>> This seems way overcomplicated compared to what I suggested (ie,
>> just let internal_putbytes bypass the buffer in cases where the
>> data would get flushed next time through its loop anyway).

> Well, we quite frequently send out multiple messages in a row, without a
> flush inbetween. It'd be nice if we could avoid both copying buffers for
> each message, as well as allocating a new stringinfo.

That gets you right into the situation where trouble adding more messages
could corrupt/destroy messages that were supposedly already sent (but in
reality aren't flushed to the client quite yet).  I really think that
there is not enough win available here to justify introducing that kind
of fragility.

To be blunt, no actual evidence has been offered in this thread that
it's worth changing anything at all in this area.  All of the bytes
in question eventually have to be delivered to the client, which is
going to involve two kernel-space/user-space copy steps along with
who-knows-what network transmission overhead.  The idea that an
extra process-local memcpy or two is significant compared to that
seems like mostly wishful thinking to me.

regards, tom lane




Re: Proposal: PqSendBuffer removal

2020-03-09 Thread Andres Freund
Hi,

On 2020-03-07 13:54:57 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > What I'm thinking is that we'd have pg_beginmessage() (potentially a
> > different named version) initialize the relevant StringInfo basically as
> 
> > (StringInfoData){
> > .data = PqSendPointer,
> > .len = 0,
> > .alloc_offset = PqSendBuffer - PqSendBuffer
> > }
> 
> This seems way overcomplicated compared to what I suggested (ie,
> just let internal_putbytes bypass the buffer in cases where the
> data would get flushed next time through its loop anyway).

Well, we quite frequently send out multiple messages in a row, without a
flush inbetween. It'd be nice if we could avoid both copying buffers for
each message, as well as allocating a new stringinfo.

We've reduced the number of wholesale stringinfo reallocations with
pq_beginmessage_reuse(), which is e.g. significant when actually
returning tuples, and that was a noticable performance improvement.

I don't believe that the copy is a performance relevant factor solely
for messages that are individually too large to fit in the send
buffer. For one, there'll often be some pending send data from a
previous "round", which'd mean we'd need to call send() more often, or
use vectorized IO (i.e. switch to writev()). But also,


> What you're suggesting would be a lot more invasive and restrictive
> --- for example, why is it a good idea to have a hard-wired
> assumption that we can't build more than one message at once?

Well, we don't seem to have many (any?) places where that's not the
case. And having to use only one layer of buffering for outgoing data
does seem advantageous to me.  It'd not be hard to fall back to a
separate buffer just for the cases where there are multiple messages
built concurrently, if we want to support that.


> I'm also concerned that trying to do too much optimization here will
> break one of the goals of the existing code, which is to not get into
> a situation where an OOM failure causes a wire protocol violation
> because we've already sent part of a message but are no longer able to
> send the rest of it.  To ensure that doesn't happen, we must construct
> the whole message before we start to send it, and we can't let
> buffering of the last message be too entwined with construction of the
> next one.  Between that and the (desirable) arms-length separation
> between datatype I/O functions and the actual I/O, a certain amount of
> data copying seems unavoidable.

Sure. But I don't see why that requires two levels of buffering for
messages? If we were to build the message in the output buffer, resizing
as needed, we can send the data once the message is complete, or not at
all.

I don't think anything on the datatype I/O level would be affected?

While I think it'd be quite desirable to avoid e.g. the separate
stringinfo allocation for send functions, I think that's quite a
separate project. One which I have no really good idea to tackle.

Greetings,

Andres Freund


[1] Since I had looked it up:

We do a separate message for each of:
1) result description
2) each result row
3) ReadyForQuery

And we separately call through PQcommMethods for things like
pq_putemptymessage() and uses of pq_putmessage() not going through
pq_endmessage. The former is called a lot, especially when using the
extended query protocol (which we want clients to use!).


For a SELECT 1 in the simple protocol we end up calling putmessage via:
1) SendRowDescriptionMessage
2) printtup()
3) EndCommand()
4) ReadyForQuery()

For extended:
1) exec_parse_message()
2) exec_bind_message()
3) exec_describe_portal_message()
4) printtup()
5) EndCommand()
6) ReadyForQuery()




Re: Proposal: PqSendBuffer removal

2020-03-07 Thread Tom Lane
Andres Freund  writes:
> What I'm thinking is that we'd have pg_beginmessage() (potentially a
> different named version) initialize the relevant StringInfo basically as

> (StringInfoData){
> .data = PqSendPointer,
> .len = 0,
> .alloc_offset = PqSendBuffer - PqSendBuffer
> }

This seems way overcomplicated compared to what I suggested (ie,
just let internal_putbytes bypass the buffer in cases where the
data would get flushed next time through its loop anyway).
What you're suggesting would be a lot more invasive and restrictive
--- for example, why is it a good idea to have a hard-wired
assumption that we can't build more than one message at once?

I'm also concerned that trying to do too much optimization here will
break one of the goals of the existing code, which is to not get into
a situation where an OOM failure causes a wire protocol violation
because we've already sent part of a message but are no longer able to
send the rest of it.  To ensure that doesn't happen, we must construct
the whole message before we start to send it, and we can't let
buffering of the last message be too entwined with construction of the
next one.  Between that and the (desirable) arms-length separation
between datatype I/O functions and the actual I/O, a certain amount of
data copying seems unavoidable.

regards, tom lane




Re: Proposal: PqSendBuffer removal

2020-03-07 Thread Andres Freund
Hi,

On 2020-03-06 11:09:23 -0800, Aleksei Ivanov wrote:
> *>Then we could get a StringInfo pointing directly to the current insertion
> point in the send buffer.  To support growing it, enlargeStringInfo would
> first subtract the offset to the start of the allocation, and then
> reallocate that*.
>
> I thought about it yesterday and one issue with this approach is how would
> you known the length of the packet to be sent. As we can’t returned back in
> PqSendBuffer. Also realloc is quite expensive operation.

Could you expand a bit on what you see as the problem? Because I'm not
following?

What does any of this have to do with realloc performance? I mean, the
buffer would just scale up once, so the cost of that would be very
quickly amortized?

What I'm thinking is that we'd have pg_beginmessage() (potentially a
different named version) initialize the relevant StringInfo basically as

(StringInfoData){
.data = PqSendPointer,
.len = 0,
.alloc_offset = PqSendBuffer - PqSendBuffer
}

and that pq_endmessage would then advance the equivalent (see below [1]) of
what today is PqSendPointer to be PqSendPointer += StringInfo->len;

That'd mean that we'd never need to copy data in/out the send buffer
anymore, because we'd directly construct the message in the send
buffer. Pretty much all important FE/BE communication goes through
pq_beginmessage[_reuse()].

We'd have to add some defenses against building multiple messages at the
same time. But neither do I think that is common, nor does it seem hard
to defend againt: A simple counter should do the trick?

Regards,
Andres


[1] Obviously the above sketch doesn't quite work that way. We can't
just have stringinfo reallocate the buffer. Feels quite solvable though.




Re: Proposal: PqSendBuffer removal

2020-03-06 Thread Aleksei Ivanov
*>Then we could get a StringInfo pointing directly to the current insertion
point in the send buffer.  To support growing it, enlargeStringInfo would
first subtract the offset to the start of the allocation, and then
reallocate that*.

I thought about it yesterday and one issue with this approach is how would
you known the length of the packet to be sent. As we can’t returned back in
PqSendBuffer. Also realloc is quite expensive operation.

Previously I suggested to include offset into stringinfo and if it is large
enough we will have an opportunity to send it directly and it will not
required a lot of changes.


On Fri, Mar 6, 2020 at 10:45 Andres Freund  wrote:

> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov 
> wrote:
> >Thank you for your reply!
> >
> >Yes, you are right there will be a separate call to send the data, but
> >is
> >copying data each time more costly operation than just one syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot
> of cases. They've gotten a lot more expensive with all the security issues.
>
> >Besides, if we already have a ready message packet to be sent why
> >should we
> >wait?
>
> In a lot of cases we'll send a number of small messages after each other.
> We don't want to send those out separately, that'd just increase overhead.
>
>
> But in some paths/workloads the copy is quite noticable. I've mused before
> whether we could extend StringInfo to handle cases like this. E.g. by
> having StringInfo have two lengths. One that is the offset to the start of
> the allocated memory (0 for plain StringInfos), and one for the length of
> the string being built.
>
> Then we could get a StringInfo pointing directly to the current insertion
> point in the send buffer.  To support growing it, enlargeStringInfo would
> first subtract the offset to the start of the allocation, and then
> reallocate that.
>
> I can imagine that bring useful in a number of places. And because there
> only would be additional overhead when actually growing the StringInfo, I
> don't think the cost would be measurable.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>


Re: Proposal: PqSendBuffer removal

2020-03-06 Thread Andres Freund
Hi, 

On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov  wrote:
>Thank you for your reply!
>
>Yes, you are right there will be a separate call to send the data, but
>is
>copying data each time more costly operation than just one syscall?

Yes, it's very likely to be more expensive to execute a syscall in a lot of 
cases. They've gotten a lot more expensive with all the security issues. 

>Besides, if we already have a ready message packet to be sent why
>should we
>wait?

In a lot of cases we'll send a number of small messages after each other. We 
don't want to send those out separately, that'd just increase overhead.


But in some paths/workloads the copy is quite noticable. I've mused before 
whether we could extend StringInfo to handle cases like this. E.g. by having 
StringInfo have two lengths. One that is the offset to the start of the 
allocated memory (0 for plain StringInfos), and one for the length of the 
string being built.

Then we could get a StringInfo pointing directly to the current insertion point 
in the send buffer.  To support growing it, enlargeStringInfo would first 
subtract the offset to the start of the allocation, and then reallocate that. 

I can imagine that bring useful in a number of places. And because there only 
would be additional overhead when actually growing the StringInfo, I don't 
think the cost would be measurable.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Proposal: PqSendBuffer removal

2020-03-05 Thread Craig Ringer
On Fri, 6 Mar 2020 at 07:27, Aleksei Ivanov  wrote:
>
> > What do you mean "just one syscall"?  The entire point here is that it'd 
> > take more syscalls to send the same amount of data.
>
> I mean that it messages are large enough more than 2K we will need 4 syscalls 
> without copy it to the internal buffer, but currently we will copy 8K of 
> messages and send it using 1 call. I think that under some threshold of 
> packet length it is redundant to copy it to internal buffer and the data can 
> be sent directly.

I think what you're suggesting is more complex than you may expect.
PostgreSQL is single threaded and relies pretty heavily on the ability
to buffer internally. It also expects its network I/O to always
succeed. Just switching to directly doing nonblocking I/O is not very
feasible. Changing the network I/O paths may expose a lot more
opportunities for send vs receive deadlocks.

It also complicates the protocol's handling of message boundaries,
since failures and interruptions can occur at more points.

Have you measured anything that suggests that our admittedly
inefficient multiple handling of send buffers is
performance-significant compared to the vast amount of memory
allocation and copying we do all over the place elsewhere? Do you have
a concrete reason to want to remove this?

If I had to change this model I'd probably be looking at an
iovector-style approach, like we use with shm_mq. Assemble an array of
buffer descriptors pointing to short, usually statically allocated
buffers and populate one with each pqformat step. Then when the
message is assembled, use writev(2) or similar to dispatch it. Maybe
do some automatic early flushing if the buffer space overflows. But
that might need a protocol extension so we had a way to recover after
interrupted sending of a partial message...

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Proposal: PqSendBuffer removal

2020-03-05 Thread Aleksei Ivanov
*> What do you mean "just one syscall"?  The entire point here is that
it'd take more syscalls to send the same amount of data.*

I mean that it messages are large enough more than 2K we will need 4
syscalls without copy it to the internal buffer, but currently we will copy
8K of messages and send it using 1 call. I think that under some threshold
of packet length it is redundant to copy it to internal buffer and the data
can be sent directly.








*> It does strike me that with the v3 protocol, we do sometimes have
caseswhere internal_putbytes is reached with a fairly large "len".  If
we'veflushed out what's in PqSendBuffer to start with, and there's more
thana bufferload remaining in the source data, we could send the sourcedata
directly to output without copying it to the buffer first.That could
actually result in *fewer* kernel calls not more, if "len"is large enough.
But I strongly doubt that a code path that netsout to more kernel calls
will win.*

Yes, internal_putbytes can be updated to send data directly if the length
is more than internal buffer size.

On Thu, Mar 5, 2020 at 15:04 Tom Lane  wrote:

> Aleksei Ivanov  writes:
> > Yes, you are right there will be a separate call to send the data, but is
> > copying data each time more costly operation than just one syscall?
>
> What do you mean "just one syscall"?  The entire point here is that it'd
> take more syscalls to send the same amount of data.
>
> It does strike me that with the v3 protocol, we do sometimes have cases
> where internal_putbytes is reached with a fairly large "len".  If we've
> flushed out what's in PqSendBuffer to start with, and there's more than
> a bufferload remaining in the source data, we could send the source
> data directly to output without copying it to the buffer first.
> That could actually result in *fewer* kernel calls not more, if "len"
> is large enough.  But I strongly doubt that a code path that nets
> out to more kernel calls will win.
>
> regards, tom lane
>


Re: Proposal: PqSendBuffer removal

2020-03-05 Thread Tom Lane
Aleksei Ivanov  writes:
> Yes, you are right there will be a separate call to send the data, but is
> copying data each time more costly operation than just one syscall?

What do you mean "just one syscall"?  The entire point here is that it'd
take more syscalls to send the same amount of data.

It does strike me that with the v3 protocol, we do sometimes have cases
where internal_putbytes is reached with a fairly large "len".  If we've
flushed out what's in PqSendBuffer to start with, and there's more than
a bufferload remaining in the source data, we could send the source
data directly to output without copying it to the buffer first.
That could actually result in *fewer* kernel calls not more, if "len"
is large enough.  But I strongly doubt that a code path that nets
out to more kernel calls will win.

regards, tom lane




Re: Proposal: PqSendBuffer removal

2020-03-05 Thread Aleksei Ivanov
Thank you for your reply!

Yes, you are right there will be a separate call to send the data, but is
copying data each time more costly operation than just one syscall?

Besides, if we already have a ready message packet to be sent why should we
wait?

Waiting for your reply,
Best regards!



On Thu, Mar 5, 2020 at 13:10 Tom Lane  wrote:

> Aleksei Ivanov  writes:
> > I am really curious what was the original intention of using the
> > PqSendBuffer and is it possible to remove it now.
>
> > Currently all messages are copied from StringInfo to this buffer and
> sent,
> > which from my point of view is redundant operation.
>
> That would mean doing a separate send() kernel call for every few bytes,
> no?  I think the point of that buffer is to be sure we accumulate a
> reasonable number of bytes to pass to the kernel for each send().
>
> regards, tom lane
>


Re: Proposal: PqSendBuffer removal

2020-03-05 Thread Tom Lane
Aleksei Ivanov  writes:
> I am really curious what was the original intention of using the
> PqSendBuffer and is it possible to remove it now.

> Currently all messages are copied from StringInfo to this buffer and sent,
> which from my point of view is redundant operation.

That would mean doing a separate send() kernel call for every few bytes,
no?  I think the point of that buffer is to be sure we accumulate a
reasonable number of bytes to pass to the kernel for each send().

regards, tom lane