Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-08-24 Thread johnlumby

Thanks for the replies and thoughts.

On 08/19/14 18:27, Heikki Linnakangas wrote:

On 08/20/2014 12:17 AM, John Lumby wrote:
I am attaching a new version of the patch for consideration in the 
current commit fest.


Thanks for working on this!

Relative to the one I submitted on 25 June in 
bay175-w412ff89303686022a9f16aa3...@phx.gbl
the method for handling aio completion using sigevent has been re-written
to use signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.
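
(For reference, the general shape of the signal-based completion is
something like the sketch below - the signal number, handler body and
function names are just illustrative, not exactly what the patch does.)

#include <aio.h>
#include <signal.h>
#include <string.h>

static void
aio_completion_handler(int signo, siginfo_t *info, void *ucontext)
{
    /*
     * sival_ptr carries back whatever pointer was stored when the read was
     * issued (e.g. the BufferAiocb).  A real handler may only do
     * async-signal-safe work here.
     */
    void   *completed = info->si_value.sival_ptr;

    (void) completed;
    (void) signo;
    (void) ucontext;
}

static int
start_read_with_signal_completion(struct aiocb *cb, void *tag)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = aio_completion_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);   /* SIGUSR2 chosen only for illustration */

    cb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb->aio_sigevent.sigev_signo = SIGUSR2;
    cb->aio_sigevent.sigev_value.sival_ptr = tag;

    /* caller has already filled in aio_fildes, aio_offset, aio_buf, aio_nbytes */
    return aio_read(cb);
}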


ISTM the patch is still allocating stuff in shared memory that really 
doesn't belong there. Namely, the BufferAiocb structs. Or at least 
parts of it; there's also a waiter queue there which probably needs to 
live in shared memory, but the rest of it does not.


Actually the reason the BufferAiocb (the postgresql control block corresponding
to the aio's aiocb) must be located in shared memory is that, as you said,
it acts as the anchor for the waiter list.

See further comment below.



At least BufAWaitAioCompletion is still calling aio_error() on an AIO 
request that might've been initiated by another backend. That's not OK.


Yes, you are right, and I agree with this one -
I will add an aio_error_return_code field in the BufferAiocb
and only the originator will set this from the real aiocb.
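
Roughly like this, where the field and function names are just my
illustration of the idea, not necessarily what the final patch will use:

#include <aio.h>

typedef struct BufferAiocb
{
    struct aiocb    aiocb;                  /* usable only by the originator */
    int             aio_error_return_code;  /* completion status published here */
    struct BufferAiocb *next;               /* free-list / queue link */
    /* ... waiter list anchor, flags, etc ... */
} BufferAiocb;

static void
BufPublishAioResult(BufferAiocb *bap)
{
    /* run only in the originating backend, once its aio has completed */
    bap->aio_error_return_code = aio_error(&bap->aiocb);
}

Waiters in other backends would then read aio_error_return_code from the
shared BufferAiocb and never touch the aiocb itself.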



Please write the patch without atomic CAS operation. Just use a spinlock. 


Umm, this is a new criticism I think.  I use CAS for things other than
locking, such as add/remove from a shared queue.  I suppose a spinlock on
the entire queue could be used equivalently, but with more code (extra
confusion) and worse performance (coarser serialization).  What is your
objection to using gcc's atomic ops?  Portability?
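
To be concrete, the kind of thing I use CAS for is a lock-free push onto
the shared free list, roughly like the sketch below (re-using the
BufferAiocb sketched earlier, with its next link; names are illustrative):

static void
BufFreeListPush(BufferAiocb **head, BufferAiocb *bap)
{
    BufferAiocb *old;

    do
    {
        old = *head;            /* snapshot the current head */
        bap->next = old;        /* link the new element in front of it */
        /* retry if another backend changed the head in the meantime */
    } while (!__sync_bool_compare_and_swap(head, old, bap));
}

Doing the same with a spinlock serializes every push/pop behind one lock
word, whereas the CAS loop only retries on an actual collision.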




There's a patch in the commitfest to add support for that,


Sorry, support for what?  There are already spinlocks in postgresql -
do you mean some new kind?  Please point me at it with a hackers msgid or
something.


but it's not committed yet, and all those 
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult 
to read. Same with all the other #ifdefs; please just pick a method 
that works.


Ok, yes, the ifdefs are unpleasant.  I will do something about that.
Ideally they would be confined entirely to header files, with the C files
calling only macro functions - maybe I can do that.  And eventually, when
the dust has settled, eliminate obsolete ifdef blocks altogether.
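
Something along these lines is what I have in mind, so that the .c files
always call the one name and the header hides the choice (these names are
hypothetical):

#ifdef USE_AIO_ATOMIC_BUILTIN_COMP_SWAP
#define BufQueueRemove(head, bap)   BufQueueRemoveCas((head), (bap))
#else
#define BufQueueRemove(head, bap)   BufQueueRemoveSpin((head), (bap))
#endif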

Also, please split prefetching of regular index scans into a separate 
patch. It's orthogonal to doing async I/O;


actually not completely orthogonal, see next

we could prefetch regular index scans with posix_fadvise too, and AIO 
should be useful for prefetching other stuff.


On 08/19/14 19:10, Claudio Freire wrote:

On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Also, please split prefetching of regular index scans into a separate patch. ...

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.
Several people have asked to split this patch into several smaller ones and I
have thought about it.  It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate patch
from the aio code would mean that the scanner patch becomes dependent
on the aio patch.  They are not quite orthogonal.

The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use.  We can get around it by providing a stub for that
function in the scanner patch (sketched below), but then there is some risk
of someone getting the wrong version of that function in their build.  It
just adds yet more complexity and breakage opportunities.
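
The stub would be roughly like this - the guard macro and the signature are
assumptions for illustration; the real function takes whatever identifies
the prefetched block:

#include "storage/block.h"

#ifndef USE_AIO_PREFETCH
static inline void
DiscardBuffer(BlockNumber blocknum)
{
    /* no aio in this build, so nothing was prefetched and nothing to discard */
    (void) blocknum;
}
#endif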



- Heikki


One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory:

I explained above that the BufferAiocb must be in shared memory for
wait/post.

The aiocb does not necessarily have to be in shared memory,
but since there is a one-to-one relationship between BufferAiocb and aiocb,
it makes the code much simpler, and the operation much more efficient,
if the aiocb is embedded in the BufferAiocb as I have it now.
E.g. if the aiocb is in private-process memory, then an additional
allocation scheme is needed (a fixed number?  palloc()'ing extra ones as
needed?  ...), which adds complexity, and probably some inefficiency, since
a shared pool is usually more efficient (allows a higher maximum per
process etc), and there would have to be some pointer de-referencing from
BufferAiocb to aiocb, adding some (small) overhead.


I understood your objection to use of shared memory as being that you 
don't want
a non-originator to access the originator's aiocb using 

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-05-31 Thread johnlumby

On 05/31/14 20:44, johnlumby wrote:

On 05/30/14 09:36, Claudio Freire wrote:

Good point.  I have included the guts of your little test program
(modified to do polling) in the existing autoconf test program
that decides on the #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP.
See config/c-library.m4.
I hope this goes some way toward answering your concern about robustness,
as at least now, if the implementation changes in some way that
renders the polling ineffective, it will be caught in configure.



I meant to add that by including this test, which involves a fork(),
in the autoconf tester, USE_AIO_ATOMIC_BUILTIN_COMP_SWAP would always be
left undefined on Windows.
(But it could then be defined manually if someone wanted to give it a try.)
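
For anyone curious, the shape of the check is roughly like the program
below - this is only an illustration of the idea, not the exact test in
config/c-library.m4 (file name, loop counts etc are made up):

/*
 * Parent starts an aio_read on an aiocb placed in shared memory; a forked
 * child polls that same aiocb with aio_error() until it stops returning
 * EINPROGRESS.  POSIX does not promise this works across processes, which
 * is exactly what the test probes.  Exit non-zero means configure should
 * leave USE_AIO_ATOMIC_BUILTIN_COMP_SWAP undefined.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    static char buf[8192];
    struct aiocb *cb;
    int         fd;
    int         i;
    int         status;

    cb = mmap(NULL, sizeof(*cb), PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    fd = open("/etc/hosts", O_RDONLY);    /* any readable file will do */
    if (cb == MAP_FAILED || fd < 0)
        return 1;

    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = sizeof(buf);

    if (aio_read(cb) != 0)                /* parent issues the read */
        return 1;

    if (fork() == 0)
    {
        /* child: poll the parent's aiocb through the shared mapping */
        for (i = 0; i < 10000; i++)
        {
            if (aio_error(cb) != EINPROGRESS)
                _exit(0);                 /* completion visible across processes */
            usleep(1000);
        }
        _exit(1);                         /* never resolved */
    }
    wait(&status);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}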



[HACKERS] Re: proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule

2012-09-22 Thread johnlumby

On 09/20/12 16:34, Tom Lane wrote:

John Lumby johnlu...@hotmail.com writes:

On Fri, 22 Jun 2012 09:55:13, Robert Haas wrote:

I do notice that the RETURNING clause of the INSERT can't reference
NEW, which seems like a restriction that we probably ought to lift,
but it doesn't seem to have much to do with your patch.

The main use of my proposal is to be able to return the value of the
sequence assigned to the NEW.id column, so yes that is a serious
restriction.

I think both of you are confused.  What the RETURNING clause can see is
the inserted row's actual values.  You can certainly get the assigned
sequence ID out of that.  I would argue that being able to see the NEW.*
expressions is at best secondary, because that data doesn't necessarily
have anything to do with what went into the table (consider the
possibility that a BEFORE trigger changed it).


I think this part of the discussion was a bit of a
(probably confused) red herring going off on a tangent.


However, even if that restriction is lifted, it will not help with the
case where the rule is an invocation of a function, which is the case
I need.

What you're requesting seems pretty much nonsensical to me.  The point
of being able to write a RETURNING clause in a rule is to emulate what
would happen with RETURNING on a regular table.  As an example, suppose
that I have

create table t (id serial, data1 text, data2 text);

and for whatever reason I write

insert into t(data1, data2) values('foo', 'bar') returning id, data2;

I should get back the generated sequence value and the data2 value, but
*not* the data1 value.  Anything else is just wrong.  Now, if t has a
rule ON INSERT DO INSTEAD SELECT somefunction(), how is that going to
happen?  The function doesn't know what the RETURNING clause looks like.
If we had a notional inserted-row-value then the executor could do the
RETURNING computation based on that, but there's no way to make a
connection between whatever the function does internally and the data
for RETURNING to chew on.


Well, since you raise the question -- surely the function could return
a tuple of the correct row type, and the executor could then pick out
whatever the actual statement requested.  This actually seems to
make my proposal more general and useful.  And it answers the point
you make in your next paragraph about not playing nice with RETURNING.



The whole concept of ON INSERT DO [INSTEAD/ALSO] SELECT seems pretty
shaky to me, as it *necessarily* involves a command substitution that
causes an INSERT to act in a strange fashion that the client application
will need special code to cope with.  I won't argue to take the feature
out, because people do use it in custom applications --- but it doesn't
play nice with RETURNING, and I don't think it can be made to.  It's
pretty much a legacy method of doing business IMO.

It seems to me that instead of lobbying to throw another kluge on top
of that pile, you'd be better off looking for alternative solutions.
Have you tried implementing this as an INSTEAD OF trigger, and not using
rules at all?  That mechanism works just fine with RETURNING, and it
seems to me that it would let you do whatever you could do inside a
custom function.  It would certainly be enough for the
dynamic-partition-redirection problem.


It took me a little while to realize your implicit suggestion that
I should rename my inheritance-parent (true name 'history')
as 'something_else' and then
   CREATE VIEW history as select * from something_else
and then create the INSTEAD OF trigger on the view.
(This *is* what you are suggesting, right?)
I tried it, and yes indeed it does exactly what I want -
for the INSERT.  Now I also have to define INSTEAD OF triggers
for update and delete.  And are there any other considerations
for changing the table into a view?  I mean, any other ways in which
SQL or client interfaces could perceive some difference?

Anyhow, yes, this does indeed serve as a solution to the problem
without needing any kluges or hacks, so thank you.
But it gives me (and anyone else who tries it) more work than
one simple RULE on the table (which would not need the added view).
By the way - what is the reason for the restriction
that INSTEAD OF triggers cannot be defined on real tables,
only on views?  Could this be lifted?

John Lumby



regards, tom lane








[HACKERS] prefetching and asynchronous io

2012-09-10 Thread johnlumby

On 08/18/2012 10:11 AM, John Lumby wrote:


I've recently tried extending the postgresql prefetch mechanism on linux
to use the posix (i.e. librt) aio_read and friends where possible.
In other words, in PrefetchBuffer(), try getting a buffer and issuing
aio_read before falling back to posix_fadvise().  It gives me about 8%
improvement in throughput relative to the posix_fadvise variety, for a
workload of 16 highly-disk-read-intensive applications running to 16
backends.  For my test each application runs a query chosen to have
plenty of bitmap heap scans.

I can provide more details on my changes if interested.
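
A very rough sketch of that fallback (not the patch itself - everything
except aio_read and posix_fadvise is an assumption, including the block
size and function name):

#include <aio.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

#define BLCKSZ 8192                     /* stands in for postgresql's block size */

static bool
try_prefetch_block(struct aiocb *cb, int fd, off_t offset, void *buf)
{
    if (buf != NULL)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_offset = offset;
        cb->aio_buf = buf;
        cb->aio_nbytes = BLCKSZ;
        if (aio_read(cb) == 0)
            return true;                /* read in flight into our own buffer */
    }
    /* no free buffer, or aio_read failed: fall back to the advisory hint */
    (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
    return false;
}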

On whether this technique might improve sort performance:

First, the disk access pattern for sorting is mostly sequential
(although I think the sort module does some tricky work with reuse of
pages in its logtape files, which may be random-like), and there are
several claims on the net that linux buffered file handling already does
a pretty good job of read-ahead for a sequential access pattern without
any need for the application to help it.
I can half-confirm that: I tried adding calls to PrefetchBuffer in the
regular heap scan and did not see much improvement.  But I am still
pursuing that area.

But second, it would be easy enough to add some posix_fadvise calls to
sort and see whether that helps.  (Sort can't make use of PrefetchBuffer
since it does not use the regular relation buffer pool.)



I have also added prefetching calls to regular index scans
(non-bitmap, non-index-only, btree only) and see a 25% reduction in
total elapsed time for a heavy index-scan workload.  That 25% is with
just the basic posix_fadvise, and extending that with asynchronous io
gives a further small improvement.

John




[HACKERS] asynchronous disk io (was : tuplesort memory usage)

2012-08-17 Thread johnlumby


Date: Fri, 17 Aug 2012 00:26:37 +0100
From: Peter Geoghegan pe...@2ndquadrant.com
To: Jeff Janes jeff.ja...@gmail.com
Cc: pgsql-hackers pgsql-hackers@postgresql.org
Subject: Re: tuplesort memory usage: grow_memtuples
Message-ID: caeylb_vezpkdx54vex3x30oy_uoth89xoejjw6aucjjiujs...@mail.gmail.com

On 27 July 2012 16:39, Jeff Janes jeff.ja...@gmail.com wrote:

  Can you suggest a benchmark that will usefully exercise this patch?


  I think the given sizes below work on most 64 bit machines.

  [...]

I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: Why are we not using asynchronous I/O in our implementation?
There are anecdotal reports of other RDBMS implementations doing far
better than we do here, and I believe asynchronous I/O, pipelining,
and other such optimisations have a lot to do with that. It's
something I'd hoped to find the time to look at in detail, but
probably won't in the 9.3 cycle. One of the more obvious ways of
optimising an external sort is to use asynchronous I/O so that one run
of data can be sorted or merged while other runs are being read from
or written to disk. Our current implementation seems naive about this.
There are some interesting details about how this is exposed by POSIX
here:

http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html


I've recently tried extending the postgresql prefetch mechanism on linux
to use the posix (i.e. librt) aio_read and friends where possible.
In other words, in PrefetchBuffer(), try getting a buffer and issuing
aio_read before falling back to posix_fadvise().  It gives me about 8%
improvement in throughput relative to the posix_fadvise variety, for a
workload of 16 highly-disk-read-intensive applications running to 16
backends.  For my test each application runs a query chosen to have
plenty of bitmap heap scans.


I can provide more details on my changes if interested.

On whether this technique might improve sort performance:

First, the disk access pattern for sorting is mostly sequential
(although I think the sort module does some tricky work with reuse of
pages in its logtape files, which may be random-like), and there are
several claims on the net that linux buffered file handling already does
a pretty good job of read-ahead for a sequential access pattern without
any need for the application to help it.
I can half-confirm that: I tried adding calls to PrefetchBuffer in the
regular heap scan and did not see much improvement.  But I am still
pursuing that area.

But second, it would be easy enough to add some posix_fadvise calls to
sort and see whether that helps.  (Sort can't make use of PrefetchBuffer
since it does not use the regular relation buffer pool.)






It's already anticipated that we might take advantage of libaio for
the benefit of FilePrefetch() (see its accompanying comments - it uses
posix_fadvise itself - effective_io_concurrency must be > 0 for this
to ever be called). It perhaps could be considered parallel
low-hanging fruit in that it allows us to offer limited though
useful backend parallelism without first resolving thorny issues
around what abstraction we might use, or how we might eventually make
backends thread-safe. AIO supports registering signal callbacks (a
SIGPOLL handler can be called), which seems relatively
uncontroversial.


I believe libaio is dead, as it depended on the old linux kernel
asynchronous file io, which was problematic and imposed various
restrictions on the application.
librt aio has no restrictions and does a good enough job, but uses pthreads
and synchronous io, which can make the CPU overhead a bit heavy and also,
I believe, causes more context switching than plain synchronous io,
whereas one of the benefits of kernel async io (in theory) is reduced
context switching.


From what I've seen, pthreads aio can give a benefit when there is high
IO wait from mostly-read activity, the disk access pattern is not
sequential (so kernel readahead can't predict it) but postgresql can
predict it, and there is enough spare idle CPU to run the pthreads.  So it
does seem that bitmap heap scan is a good choice for prefetching.






Platform support for AIO might be a bit lacking, but then you can say
the same about posix_fadvise. We don't assume that poll(2) is
available, but we already use it where it is within the latch code.
Besides, in-kernel support can be emulated if POSIX threads is
available, which I believe would make this broadly useful on unix-like
platforms.

-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 
24x7 Support, Training and Services



