Re: [HACKERS] Libpq async issues

2001-01-24 Thread Alfred Perlstein

* Tom Lane <[EMAIL PROTECTED]> [010124 10:27] wrote:
> Alfred Perlstein <[EMAIL PROTECTED]> writes:
> > * Bruce Momjian <[EMAIL PROTECTED]> [010124 07:58] wrote:
> >> I have added this email to TODO.detail and a mention in the TODO list.
> 
> > The bug mentioned here is long gone,
> 
> Au contraire, the misdesign is still there.  The nonblock-mode code
> will *never* be reliable under stress until something is done about
> that, and that means fairly extensive code and API changes.

The "bug" is the one mentioned in the first paragraph of the email
where I broke _blocking_ connections for a short period.

I still need to fix async connections for myself (and of course
contribute it back), but I just haven't had the time.  If anyone
else wants it fixed sooner, they can do it themselves, contract me
to do it, or hope someone else comes along to fix it; otherwise
they can wait for me to get to it.

I'm thinking that I'll do what you said and have separate paths
for writing to and reading from the socket, with APIs that give
the user the option of a boundary, basically:

 buffer this, but don't allow me to write until it's flushed

which would allow COPY rows larger than 8k to go into the
backend.
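That boundary idea could be sketched roughly like this (hypothetical code, not the actual libpq implementation; `OutBuffer`, `out_put`, and friends are invented names for illustration):

```c
#include <string.h>

#define OUTBUF_SIZE 8192    /* the 8k output buffer under discussion */

/*
 * Sketch: a write path with an explicit boundary.  After the caller
 * marks a boundary, no new data is accepted until the queued data has
 * been flushed, so a COPY row larger than the buffer can be streamed
 * out in flushed chunks rather than overflowing the buffer.
 */
typedef struct
{
    char buf[OUTBUF_SIZE];
    int  len;       /* bytes currently queued */
    int  boundary;  /* nonzero: refuse new data until a flush drains buf */
} OutBuffer;

/* returns 0 if queued, 1 if refused (caller should flush and retry) */
int out_put(OutBuffer *o, const char *data, int n)
{
    if (o->boundary && o->len > 0)
        return 1;                       /* boundary pending: flush first */
    if (o->len + n > OUTBUF_SIZE)
        return 1;                       /* would overflow: flush first */
    memcpy(o->buf + o->len, data, n);
    o->len += n;
    return 0;
}

void out_mark_boundary(OutBuffer *o)
{
    o->boundary = 1;
}

/* stand-in for a flush that drained everything; a real one may be partial */
void out_flush_all(OutBuffer *o)
{
    o->len = 0;
    o->boundary = 0;
}
```

The point of the boundary flag is that the refusal is explicit and stateless from the caller's view: a refused put queues nothing, so the caller always knows exactly what is in flight.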

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: [HACKERS] Libpq async issues

2001-01-24 Thread Alfred Perlstein

* Bruce Momjian <[EMAIL PROTECTED]> [010124 07:58] wrote:
> 
> I have added this email to TODO.detail and a mention in the TODO list.

The bug mentioned here is long gone; however, the problem with
issuing non-blocking COPY commands is still present (the 8k limit
on the buffer size).  I hope to fix this sometime soon, but you
shouldn't worry about the "normal" path.

There's also a bug with PQendcopy where it can still block, because
it automagically calls into a routine that select()s waiting for data.

It's on my TODO list as well, but a little behind the
several-thousand-line server I'm almost finished with.

-Alfred

> 
> > >> Um, I didn't have any trouble at all reproducing Patrick's complaint.
> > >> pg_dump any moderately large table (I used tenk1 from the regress
> > >> database) and try to load the script with psql.  Kaboom.
> > 
> > > This is after or before my latest patch?
> > 
> > Before.  I haven't updated since yesterday...
> > 
> > > I can't seem to reproduce this problem,
> > 
> > Odd.  Maybe there is something different about the kernel's timing of
> > message sending on your platform.  I see it very easily on HPUX 10.20,
> > and Patrick sees it very easily on whatever he's using (netbsd I think).
> > You might try varying the situation a little, say
> > 	psql mydb <dumpfile
> > 	psql -f dumpfile mydb
> > 	psql mydb
> > 		\i dumpfile
> > and the same with -h localhost (to get a TCP/IP connection instead of
> > Unix domain).  At the moment (pre-patch) I see failures with the
> > first two of these, but not with the \i method.  -h doesn't seem to
> > matter for me, but it might for you.
> > 
> > > Telling me something is wrong without giving suggestions on how
> > > to fix it, nor direct pointers to where it fails doesn't help me
> > > one bit.  You're not offering constructive criticism, you're not
> > > even offering valid criticism, you're just waving your finger at
> > > "problems" that you say exist but don't pin down to anything specific.
> > 
> > I have been explaining it as clearly as I could.  Let's try it
> > one more time.
> > 
> > > I spent hours looking over what I did to pqFlush and pqPutnBytes
> > > because of what you said earlier when all the bug seems to have
> > > come down to is that I missed that the socket is set to non-blocking
> > > in all cases now.
> > 
> > Letting the socket mode default to blocking will hide the problems from
> > existing clients that don't care about non-block mode.  But people who
> > try to actually use the nonblock mode are going to see the same kinds of
> > problems that psql is exhibiting.
> > 
> > > The old sequence of events that happened was as follows:
> > 
> > >   user sends data almost filling the output buffer...
> > >   user sends another line of text overflowing the buffer...
> > >   pqFlush is invoked blocking the user until the output pipe clears...
> > >   and repeat.
> > 
> > Right.
> > 
> > > The nonblocking code allows sends to fail so the user can abort
> > > sending stuff to the backend in order to process other work:
> > 
> > >   user sends data almost filling the output buffer...
> > >   user sends another line of text that may overflow the buffer...
> > >   pqFlush is invoked, 
> > > if the pipe can't be cleared an error is returned allowing the user to
> > >   retry the send later.
> > > if the flush succeeds then more data is queued and success is returned
> > 
> > But you haven't thought through the mechanics of the "error is returned
> > allowing the user to retry" code path clearly enough.  Let's take
> > pqPutBytes for an example.  If it returns EOF, is that a hard error or
> > does it just mean that the application needs to wait a while?  The
> > application *must* distinguish these cases, or it will do the wrong
> > thing: for example, if it mistakes a hard error for "wait a while",
> > then it will wait forever without making any progress or producing
> > an error report.
> > 
> > You need to provide a different return convention that indicates
> > what happened, say
> > 	EOF (-1) => hard error (same as old code)
> > 	0        => OK
> > 	1        => no data was queued due to risk of blocking
> > And you need to guarantee that the application knows what the state is
> > when the can't-do-it-yet return is made; note that I specified "no data
> > was queued" above.  If pqPutBytes might queue some of the data before
> > returning 1, the application is in trouble again.  While you apparently
> > foresaw that in recoding pqPutBytes, your code doesn't actually work.
> > There is the minor code bug that you fail to update "avail" after the
> > first pqFlush call, and the much more fundamental problem that you
> > cannot guarantee to have queued all or none of the data.  Think about
> > what happens if the passed nbytes is larger than the output buffer size.
> > You may pass the first pqFlush successfully, then get into the loop and
> > get a won't-block return from pqFlush

Re: [HACKERS] Libpq async issues

2001-01-24 Thread Tom Lane

Alfred Perlstein <[EMAIL PROTECTED]> writes:
> * Bruce Momjian <[EMAIL PROTECTED]> [010124 07:58] wrote:
>> I have added this email to TODO.detail and a mention in the TODO list.

> The bug mentioned here is long gone,

Au contraire, the misdesign is still there.  The nonblock-mode code
will *never* be reliable under stress until something is done about
that, and that means fairly extensive code and API changes.

regards, tom lane



[HACKERS] Libpq async issues

2001-01-24 Thread Bruce Momjian


I have added this email to TODO.detail and a mention in the TODO list.

> >> Um, I didn't have any trouble at all reproducing Patrick's complaint.
> >> pg_dump any moderately large table (I used tenk1 from the regress
> >> database) and try to load the script with psql.  Kaboom.
> 
> > This is after or before my latest patch?
> 
> Before.  I haven't updated since yesterday...
> 
> > I can't seem to reproduce this problem,
> 
> Odd.  Maybe there is something different about the kernel's timing of
> message sending on your platform.  I see it very easily on HPUX 10.20,
> and Patrick sees it very easily on whatever he's using (netbsd I think).
> You might try varying the situation a little, say
> 	psql mydb <dumpfile
> 	psql -f dumpfile mydb
> 	psql mydb
> 		\i dumpfile
> and the same with -h localhost (to get a TCP/IP connection instead of
> Unix domain).  At the moment (pre-patch) I see failures with the
> first two of these, but not with the \i method.  -h doesn't seem to
> matter for me, but it might for you.
> 
> > Telling me something is wrong without giving suggestions on how
> > to fix it, nor direct pointers to where it fails doesn't help me
> > one bit.  You're not offering constructive criticism, you're not
> > even offering valid criticism, you're just waving your finger at
> > "problems" that you say exist but don't pin down to anything specific.
> 
> I have been explaining it as clearly as I could.  Let's try it
> one more time.
> 
> > I spent hours looking over what I did to pqFlush and pqPutnBytes
> > because of what you said earlier when all the bug seems to have
> > come down to is that I missed that the socket is set to non-blocking
> > in all cases now.
> 
> Letting the socket mode default to blocking will hide the problems from
> existing clients that don't care about non-block mode.  But people who
> try to actually use the nonblock mode are going to see the same kinds of
> problems that psql is exhibiting.
> 
> > The old sequence of events that happened was as follows:
> 
> >   user sends data almost filling the output buffer...
> >   user sends another line of text overflowing the buffer...
> >   pqFlush is invoked blocking the user until the output pipe clears...
> >   and repeat.
> 
> Right.
> 
> > The nonblocking code allows sends to fail so the user can abort
> > sending stuff to the backend in order to process other work:
> 
> >   user sends data almost filling the output buffer...
> >   user sends another line of text that may overflow the buffer...
> >   pqFlush is invoked, 
> > if the pipe can't be cleared an error is returned allowing the user to
> >   retry the send later.
> > if the flush succeeds then more data is queued and success is returned
> 
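The retry flow described above can be simulated in miniature.  This is a hypothetical sketch with no real socket: try_flush() merely stands in for pqFlush, returning 0 when the pipe clears and 1 when the write would block:

```c
/*
 * Simulated nonblocking flush: the first attempt "would block",
 * after which the pretend kernel buffer clears.
 */
int pipe_busy = 1;          /* pretend the kernel buffer is full at first */

int try_flush(void)
{
    if (pipe_busy)
    {
        pipe_busy = 0;      /* pretend the pipe clears before the retry */
        return 1;           /* would block: caller should retry later */
    }
    return 0;               /* buffer drained */
}

/* returns the number of attempts it took to flush successfully */
int flush_with_retry(void)
{
    int attempts = 0;

    while (try_flush() != 0)
    {
        attempts++;
        /* ... application does other work here, then retries ... */
    }
    return attempts + 1;
}
```

The essential property is that a would-block return leaves the caller free to do other work instead of stalling, which is the whole motivation for the nonblocking path.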
> But you haven't thought through the mechanics of the "error is returned
> allowing the user to retry" code path clearly enough.  Let's take
> pqPutBytes for an example.  If it returns EOF, is that a hard error or
> does it just mean that the application needs to wait a while?  The
> application *must* distinguish these cases, or it will do the wrong
> thing: for example, if it mistakes a hard error for "wait a while",
> then it will wait forever without making any progress or producing
> an error report.
> 
> You need to provide a different return convention that indicates
> what happened, say
> 	EOF (-1) => hard error (same as old code)
> 	0        => OK
> 	1        => no data was queued due to risk of blocking
> And you need to guarantee that the application knows what the state is
> when the can't-do-it-yet return is made; note that I specified "no data
> was queued" above.  If pqPutBytes might queue some of the data before
> returning 1, the application is in trouble again.  While you apparently
> foresaw that in recoding pqPutBytes, your code doesn't actually work.
> There is the minor code bug that you fail to update "avail" after the
> first pqFlush call, and the much more fundamental problem that you
> cannot guarantee to have queued all or none of the data.  Think about
> what happens if the passed nbytes is larger than the output buffer size.
> You may pass the first pqFlush successfully, then get into the loop and
> get a won't-block return from pqFlush in the loop.  What then?
> You can't simply refuse to support the case nbytes > bufsize at all,
> because that will cause application failures as well (too long query
> sends it into an infinite loop trying to queue data, most likely).
> 
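The three-way, all-or-none convention described above could look roughly like this (a hypothetical sketch, not the actual patch; `SendBuffer` and `put_bytes` are invented names):

```c
#include <string.h>

#define OUTBUF_SIZE 8192

/*
 * All-or-none queuing with a three-way return:
 *   -1 (EOF) => hard error
 *    0       => all nbytes queued
 *    1       => nothing queued; drain the buffer and retry
 * All-or-none is only achievable when nbytes fits in the buffer,
 * which is exactly why nbytes > bufsize is the hard case.
 */
typedef struct
{
    char buf[OUTBUF_SIZE];
    int  len;
} SendBuffer;

int put_bytes(SendBuffer *o, const char *s, int nbytes)
{
    if (nbytes < 0 || nbytes > OUTBUF_SIZE)
        return -1;      /* cannot promise all-or-none: hard error here */
    if (o->len + nbytes > OUTBUF_SIZE)
        return 1;       /* would need a blocking flush: queue nothing */
    memcpy(o->buf + o->len, s, nbytes);
    o->len += nbytes;
    return 0;
}
```

Note that this sketch simply rejects oversized requests, which is the very failure mode the next paragraph objects to: a too-long query would loop forever against that rejection.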
> A possible answer is to specify that a return of +N means "N bytes
> remain unqueued due to risk of blocking" (after having queued as much
> as you could).  This would put the onus on the caller to update his
> pointers/counts properly; propagating that into all the internal uses
> of pqPutBytes would be no fun.  (Of course, so far you haven't updated
> *any* of the internal callers to behave reasonably in case of a
> won't-block return; PQfn is just one example.)
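The "+N bytes remain unqueued" convention could be sketched like this (hypothetical code, names invented for illustration):

```c
#include <string.h>

#define OUTBUF_SIZE 8192

/*
 * Queue as much as fits and return the number of bytes left unqueued;
 * 0 means everything was queued.  The caller must advance its pointer
 * by the amount accepted and retry with the remainder once the buffer
 * drains.
 */
typedef struct
{
    char buf[OUTBUF_SIZE];
    int  len;
} SendBuffer;

int put_bytes_partial(SendBuffer *o, const char *s, int nbytes)
{
    int room = OUTBUF_SIZE - o->len;
    int take = nbytes < room ? nbytes : room;

    memcpy(o->buf + o->len, s, take);
    o->len += take;
    return nbytes - take;       /* +N: bytes the caller still holds */
}
```

The cost is exactly the one named above: every caller, internal or external, must track the unsent remainder itself.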
> 
> An