Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
Thanks for the replies and thoughts. On 08/19/14 18:27, Heikki Linnakangas wrote: On 08/20/2014 12:17 AM, John Lumby wrote: I am attaching a new version of the patch for consideration in the current commit fest. Thanks for working on this! Relative to the one I submitted on 25 June in bay175-w412ff89303686022a9f16aa3...@phx.gbl the method for handling aio completion using sigevent has been re-written to use signals exclusively rather than a composite of signals and LWlocks, and this has fixed the problem I mentioned before with the LWlock method. ISTM the patch is still allocating stuff in shared memory that really doesn't belong there. Namely, the BufferAiocb structs. Or at least parts of it; there's also a waiter queue there which probably needs to live in shared memory, but the rest of it does not. Actually the reason the BufferAiocb ( the postgresql block corresponding to the aio's aiocb) must be located in shared memory is that, as you said, it acts as anchor for the waiter list. See further comment below. At least BufAWaitAioCompletion is still calling aio_error() on an AIO request that might've been initiated by another backend. That's not OK. Yes, you are right, and I agree with this one - I will add a aio_error_return_code field in the BufferAiocb and only the originator will set this from the real aiocb. Please write the patch without atomic CAS operation. Just use a spinlock. Umm, this is a new criticism I think. I use CAS for things other than locking, such as add/remove from shared queue. I suppose maybe a spinlock on the entire queue can be used equivalently, but with more code (extra confusion) and worse performance (coarser serialization). What is your objection to using gcc's atomic ops? Portability? There's a patch in the commitfest to add support for that, sorry, support for what? There are already spinlocks in postgresql, you mean some new kind?please point me at it with hacker msgid or something. but it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult to read. Same with all the other #ifdefs; please just pick a method that works. Ok, yes, the ifdefs are unpleasant.I will do something about that. Ideally they would be entirely confined to header files and only macro functions in C files - maybe I can do that. And eventually when the dust has settled eliminate obsolete ifdef blocks altogether. Also, please split prefetching of regular index scans into a separate patch. It's orthogonal to doing async I/O; actually not completely orthogonal, see next we could prefetch regular index scans with posix_fadvise too, and AIO should be useful for prefetching other stuff. On 08/19/14 19:10, Claudio Freire wrote: On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Also, please split prefetching of regular index scans into a separate patch. ... That patch already happened on the list, and it wasn't a win in many cases. I'm not sure it should be proposed independently of this one. Maybe a separate patch, but it should be considered dependent on this. I don't have an archive link at hand atm, but I could produce one later. Several people have asked to split this patch into several smaller ones and I have thought about it. It would introduce some awkward dependencies. E.g. splitting the scanner code (index, relation-heap) into separate patch from aio code would mean that the scanner patch becomes dependent on the aio patch. They are not quite orthogonal. The reason is that the scanners call a new function, DiscardBuffer(blockid) when aio is in use. We can get around it by providing a stub for that function in the scanner patch, but then there is some risk of someone getting the wrong version of that function in their build. It just adds yet more complexity and breakage opportunities. - Heikki One further comment concerning these BufferAiocb and aiocb control blocks being in shared memory : I explained above that the BufferAiocb must be in shared memory for wait/post. The aiocb does not necessarily have to be in shared memory, but since there is a one-to-one relationship between BufferAiocb and aiocb, it makes the code much simpler , and the operation much more efficient, if the aiocb is embedded in the BufferAiocb as I have it now. E.g. if the aiocb is in private-process memory, then an additional allocation scheme is needed (fixed number? palloc()in'g extra ones as needed? ...) which adds complexity, and probably some inefficiency since a shared pool is usually more efficient (allows higher maximum per process etc), and there would have to be some pointer de-referencing from BufferAiocb to aiocb adding some (small) overhead. I understood your objection to use of shared memory as being that you don't want a non-originator to access the originator's aiocb using
Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
On 05/31/14 20:44, johnlumby wrote: On 05/30/14 09:36, Claudio Freire wrote: Good point. I have included the guts of your little test program (modified to do polling) into the existing autoconf test program that decides on the #define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP. See config/c-library.m4. I hope this goes some way to answer your concern about robustness, as at least now if the implementation changes in some way that renders the polling ineffective, it will be caught in configure. I meant to add that by including this test, which involves a fork(), in the autoconf tester, on Windows USE_AIO_ATOMIC_BUILTIN_COMP_SWAP would always by un-defined. (But could then be defined manually if someone wanted to give it a try)
[HACKERS] Re: proposal and patch : support INSERT INTO...RETURNING with partitioned table using rule
On 09/20/12 16:34, Tom Lane wrote: John Lumby johnlu...@hotmail.com writes: On Fri, 22 Jun 2012 09:55:13, Robert Haas wrote: I do notice that the RETURNING clause of the INSERT can't reference NEW, which seems like a restriction that we probably ought to lift, but it doesn't seem to have much to do with your patch. The main use of my proposal is to be able to return the value of the sequence assigned to the NEW.id column, so yes that is a serious restriction. I think both of you are confused. What the RETURNING clause can see is the inserted row's actual values. You can certainly get the assigned sequence ID out of that. I would argue that being able to see the NEW.* expressions is at best secondary, because that data doesn't necessarily have anything to do with what went into the table (consider the possibility that a BEFORE trigger changed it). I think this part of the discussion was a bit of a (probably confused) red herring going off on a tangent. However, even if that restriction is lifted, it will not help with the case where the rule is an invocation of a function, which is the case I need. What you're requesting seems pretty much nonsensical to me. The point of being able to write a RETURNING clause in a rule is to emulate what would happen with RETURNING on a regular table. As an example, suppose that I have create table t (id serial, data1 text, data2 text); and for whatever reason I write insert into t(data1, data2) values('foo', 'bar') returning id, data2; I should get back the generated sequence value and the data2 value, but *not* the data1 value. Anything else is just wrong. Now, if t has a rule ON INSERT DO INSTEAD SELECT somefunction(), how is that going to happen? The function doesn't know what the RETURNING clause looks like. If we had a notional inserted-row-value then the executor could do the RETURNING computation based on that, but there's no way to make a connection between whatever the function does internally and the data for RETURNING to chew on. Well since you raise the question -- surely the function could return a tuple of the correct row type and the executor could then pick out whatever the actual statement requested. This actually seems to make my proposal more general and useful. And answers the point you make about doesn't play nice with RETURNING in your next para. The whole concept of ON INSERT DO [INSTEAD/ALSO] SELECT seems pretty shaky to me, as it *necessarily* involves a command substitution that causes an INSERT to act in a strange fashion that the client application will need special code to cope with. I won't argue to take the feature out, because people do use it in custom applications --- but it doesn't play nice with RETURNING, and I don't think it can be made to. It's pretty much a legacy method of doing business IMO. It seems to me that instead of lobbying to throw another kluge on top of that pile, you'd be better off looking for alternative solutions. Have you tried implementing this as an INSTEAD OF trigger, and not using rules at all? That mechanism works just fine with RETURNING, and it seems to me that it would let you do whatever you could do inside a custom function. It would certainly be enough for the dynamic-partition-redirection problem. It took me a little while to realize your implicit suggestion that I should rename my inheritance-parent (true name 'history') as 'something_else' and then CREATE VIEW history as select * from something_else amd then create the instead trigger on the view. (This *is* what you are suggesting, right?) I tried t and yes indeed it does exactly what I want - for the INSERT.Now I also have to define instead triggers for update and delete. And are there any other considerations for changing the table into a view?I mean, any other ways in which SQL or client interfaces could perceive some difference? Anyhow, yes, this does indeed serve as a solution to the problem without needing any kluges or hacks, so thank you. But it gives me (and anyone else who tries it) more work than one simple RULE on the table without needing to add the view. By the way - what is the reason for the restiction that INSTEAD OF triggers cannot be defined on real tables, only on views? Could this be lifted? John Lumby regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] prefetching and asynchronous io
On 08/18/2012 10:11 AM, John Lumby wrote: I've recently tried extending the postgresql prefetch mechanism on linux to use the posix (i.e. librt) aio_read and friends where possible. In other words, in PrefetchBuffer(), try getting a buffer and issuing aio_read before falling back to fposix_advise(). It gives me about 8% improvement in throughput relative to the fposix-advise variety, for a workload of 16 highly-disk-read-intensive applications running to 16 backends. For my test each application runs a query chosen to have plenty of bitmap heap scans. I can provide more details on my changes if interested. On whether this technique might improve sort performance : First, the disk access pattern for sorting is mostly sequential (although I think the sort module does some tricky work with reuse of pages in its logtape files which maybe is random-like), and there are several claims on the net that linux buffered file handling already does a pretty good job of read-ahead for a sequential access pattern without any need for the application to help it. I can half-confirm that in that I tried adding calls to PrefetchBuffer in regular heap scan and did not see much improvement.But I am still pursuing that area. But second, it would be easy enough to add some fposix_advise calls to sort and see whether that helps.(Can't make use of PrefetchBuffer since sort does not use the regular relation buffer pool) I have also added prefetching calls to regular index scans (non-bitmap, non-index-only, for btree only) and see a 25% reduction in total elapsed time for a heavy index-scan workload. That 25% is with just the basic posix_fadvise, and then extending that with asynch io gives a small extra improvement again. John -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] asynchronous disk io (was : tuplesort memory usage)
Date: Fri, 17 Aug 2012 00:26:37 +0100 From: Peter Geoghegan pe...@2ndquadrant.com To: Jeff Janes jeff.ja...@gmail.com Cc: pgsql-hackers pgsql-hackers@postgresql.org Subject: Re: tuplesort memory usage: grow_memtuples Message-ID: caeylb_vezpkdx54vex3x30oy_uoth89xoejjw6aucjjiujs...@mail.gmail.com On 27 July 2012 16:39, Jeff Janes jeff.ja...@gmail.com wrote: Can you suggest a benchmark that will usefully exercise this patch? I think the given sizes below work on most 64 bit machines. [...] I think this patch (or at least your observation about I/O waits within vmstat) may point to a more fundamental issue with our sort code: Why are we not using asynchronous I/O in our implementation? There are anecdotal reports of other RDBMS implementations doing far better than we do here, and I believe asynchronous I/O, pipelining, and other such optimisations have a lot to do with that. It's something I'd hoped to find the time to look at in detail, but probably won't in the 9.3 cycle. One of the more obvious ways of optimising an external sort is to use asynchronous I/O so that one run of data can be sorted or merged while other runs are being read from or written to disk. Our current implementation seems naive about this. There are some interesting details about how this is exposed by POSIX here: http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html I've recently tried extending the postgresql prefetch mechanism on linux to use the posix (i.e. librt) aio_read and friends where possible. In other words, in PrefetchBuffer(), try getting a buffer and issuing aio_read before falling back to fposix_advise(). It gives me about 8% improvement in throughput relative to the fposix-advise variety, for a workload of 16 highly-disk-read-intensive applications running to 16 backends. For my test each application runs a query chosen to have plenty of bitmap heap scans. I can provide more details on my changes if interested. On whether this technique might improve sort performance : First, the disk access pattern for sorting is mostly sequential (although I think the sort module does some tricky work with reuse of pages in its logtape files which maybe is random-like), and there are several claims on the net that linux buffered file handling already does a pretty good job of read-ahead for a sequential access pattern without any need for the application to help it. I can half-confirm that in that I tried adding calls to PrefetchBuffer in regular heap scan and did not see much improvement.But I am still pursuing that area. But second, it would be easy enough to add some fposix_advise calls to sort and see whether that helps.(Can't make use of PrefetchBuffer since sort does not use the regular relation buffer pool) It's already anticipated that we might take advantage of libaio for the benefit of FilePrefetch() (see its accompanying comments - it uses posix_fadvise itself - effective_io_concurrency must be 0 for this to ever be called). It perhaps could be considered parallel low-hanging fruit in that it allows us to offer limited though useful backend parallelism without first resolving thorny issues around what abstraction we might use, or how we might eventually make backends thread-safe. AIO supports registering signal callbacks (a SIGPOLL handler can be called), which seems relatively uncontroversial. I believe libaio is dead, as it depended on the old linux kernel asynchronous file io, which was problematic and imposed various restrictions on the application. librt aio has no restrictions and does a good enough job but uses pthreads and synchronous io, which can make CPU overhead a bit heavy and also I believe results in causing more context switching than with synchronous io, whereas one of the benefits of kernel async io (in theory) is reduce context switching. From what I've seen, pthreads aio can give a benefit when there is high IO wait from mostly-read activity, the disk access pattern is not sequential (so kernel readahead cant predict it) but postgresql can predict it, and there's enough spare idle CPU to run the pthreads.So it does seem that bitmap heap scan is a good choice for prefetching. Platform support for AIO might be a bit lacking, but then you can say the same about posix_fadvise. We don't assume that poll(2) is available, but we already use it where it is within the latch code. Besides, in-kernel support can be emulated if POSIX threads is available, which I believe would make this broadly useful on unix-like platforms. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers