Asynchronous and "direct" IO support for PostgreSQL.

2021-02-23 Thread Andres Freund
Hi,

over the last ~year I spent a lot of time trying to figure out how we could
add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
there's still a *lot* of open questions, I think I now have a decent handle on
most of the bigger architectural questions.  Thus this long email.


Just to be clear: I don't expect the current design to survive as-is. If
a few sentences below sound a bit like they're describing the new world,
that's because they're from the README.md in the patch series...


## Why Direct / unbuffered IO?

The main reasons to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage,
  buffered writes are bottlenecked by the operating system having to
  copy data from the kernel's page cache to postgres' buffer pool using
  the CPU. Direct IO, in contrast, can often move the data directly
  between the storage device and postgres' buffer pool using DMA. While
  that transfer is ongoing, the CPU is free to perform other work,
  with little impact.
- Avoiding double buffering between operating system cache and
  postgres' shared_buffers.
- Better control over the timing and pace of dirty data writeback.
- Potential for concurrent WAL writes (via O_DIRECT | O_DSYNC writes)


The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
- In situations where shared_buffers cannot be set appropriately
  large, e.g. because there are many different postgres instances
  hosted on shared hardware, performance will often be worse than when
  using buffered IO.


## Why Asynchronous IO?

- Without AIO we cannot use DIO

- Without asynchronous IO (AIO) PG has to rely on the operating system
  to hide the cost of synchronous IO from Postgres. While this works
  surprisingly well in a lot of workloads, it does not do as good a job
  on prefetching and controlled writeback as we would like.
- There are important expensive operations like fdatasync() where the
  operating system cannot hide the storage latency. This is particularly
  important for WAL writes, where the ability to asynchronously issue
  fdatasync() or O_DSYNC writes can yield significantly higher
  throughput.
- Fetching data into shared buffers asynchronously and concurrently with query
  execution means there is more CPU time for query execution.


## High level difficulties adding AIO/DIO support

- Optionally using AIO leads to convoluted and / or duplicated code.

- Platform dependency: The common AIO APIs are typically specific to one
  platform (linux AIO, linux io_uring, windows IOCP, windows overlapped IO) or
  a few platforms (posix AIO, but there's many differences).

- There are a lot of separate places doing IO in PG. Moving all of these to
  use AIO efficiently is an, um, large undertaking.

- Nothing in the buffer management APIs expects more than one IO to be in
  progress at the same time - which AIO requires.


## Portability & Duplication

To avoid needing non-AIO codepaths for platforms without native AIO
support, a worker-process-based AIO implementation exists (and is
currently the default). This is also convenient for checking whether a
problem is related to the native IO implementation or not.

Thanks to Thomas Munro for helping a *lot* around this area. He wrote
the worker mode, the posix aio mode, added CI, did a lot of other
testing, listened to me...


## Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where
the source/target of AIO are shared resources, like pages in postgres'
shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a
number of buffers ahead of the current "scan position". If that backend then
performs some operation that blocks, or even just is slow, the IO completion
for the asynchronously initiated read may not be processed.

This AIO implementation solves this problem by requiring that AIO methods
either allow AIO completions to be processed by any backend in the system
(e.g. io_uring, and indirectly posix, via signal handlers), or to guarantee
that AIO processing will happen even when the issuing backend is blocked
(e.g. worker mode, which offloads completion processing to the AIO workers).


## AIO API overview

The main steps to use AIO (without higher level helpers) are:

1) acquire an "unused" AIO: pgaio_io_get()

2) start some IO, this is done by functions like
   pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)

   The (read|write|fsync|flush_range) indicates the operation, whereas
   (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.

   (see below for more details about this design choice - it might or might
   not be the right one)

3) optionally: assign a backend-local completion callback to the IO
   (pgaio_io_on_completion_loc

Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-23 Thread Greg Stark
On Tue, 23 Feb 2021 at 05:04, Andres Freund  wrote:
>
> ## Callbacks
>
> In the core AIO pieces there are two different types of callbacks at the
> moment:
>
> Shared callbacks, which can be invoked by any backend (normally the issuing
> backend / the AIO workers, but can be other backends if they are waiting for
> the IO to complete). For operations on shared resources (e.g. shared buffer
> reads/writes, or WAL writes) these shared callbacks need to transition the
> state of the object the IO is being done for to completion. E.g. for a shared
> buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.
>
> The main reason these callbacks exist is that they make it safe for a backend
> to issue non-blocking IO on buffers (see the deadlock section above). As any
> blocked backend can cause the IO to complete, the deadlock danger is gone.

So firstly this is all just awesome work and I have questions but I
don't want them to come across in any way as criticism or as a demand
for more work. This is really great stuff, thank you so much!

The callbacks make me curious about two questions:

1) Is there a chance that a backend issues i/o, the i/o completes in
some other backend and by the time this backend gets around to looking
at the buffer it's already been overwritten again? Do we have to
initiate I/O again or have you found a way to arrange that this
backend has the buffer pinned from the time the i/o starts even though
it doesn't handle the completion?

2) Have you made (or considered making) things like sequential scans
(or more likely bitmap index scans) asynchronous at a higher level.
That is, issue a bunch of asynchronous i/o and then handle the pages
and return the tuples as the pages arrive. Since sequential scans and
bitmap scans don't guarantee to read the pages in order they're
generally free to return tuples from any page in any order. I'm not
sure how much of a win that would actually be since all the same i/o
would be getting executed and the savings in shared buffers would be
small but if there are mostly hot pages you could imagine interleaving
a lot of in-memory pages with the few i/os instead of sitting idle
waiting for the async i/o to return.



> ## Stats
>
> There are two new views: pg_stat_aios showing AIOs that are currently
> in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

This is impressive. How easy is it to correlate with system aio stats?


-- 
greg




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-23 Thread Andres Freund
Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).


> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin for the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with
more details.

In short: I made it so that shared buffer IO holds a separate
refcount for the duration of the IO - that way the issuer can release
its own pin without causing a problem (consider e.g. an error while an
IO is in flight). The pin held by the IO gets released in the completion
callback.  There's similar trickery with locks - but more about that later.


> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level.
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscans exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user) I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we
could also switch the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses because the hardware prefetcher doesn't predict that
the tuples are laid out in "decreasing pointer order".


> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.

┌─[ RECORD 1 ]─┬────────────────────────────────────────────────────────────┐
│ backend_type │ client backend                                             │
│ id           │ 98                                                         │
│ gen          │ 13736                                                      │
│ op           │ write                                                      │
│ scb          │ sb                                                         │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                     │
│ ring         │ 5                                                          │
│ owner_pid    │ 1866051                                                    │
│ merge_with   │ (null)                                                     │
│ result       │ 0                                                          │
│ desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0,  │
│              │ release_lock: 1, buffid

Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-23 Thread Thomas Munro
On Tue, Feb 23, 2021 at 11:03 PM Andres Freund  wrote:
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions.  Thus this long email.

Hello,

Very cool to see this project escaping onto -hackers!

I have done some work on a couple of low level parts of it, and I
wanted to show a quick "hey, where'd my system calls go?" demo, which
might help illustrate some very simple things about this stuff.  Even
though io_uring is the new hotness in systems programming, I'm going
to use io_mode=worker here.  It's the default in the current patch
set, it works on all our supported OSes and is easier to understand
without knowledge of shiny new or obscure old AIO system interfaces.
I'll also use io_workers=1, an artificially low setting to make it
easy to spy on (pseudo) async I/O with strace/truss/dtruss on a single
process, and max_parallel_workers_per_gather=0 to keep executor
parallelism from confusing matters.

The first thing to notice is that there's an "io worker" process, and
while filling up a big table with "insert into t select
generate_series(1, 1)", it's doing a constant stream of 128KB
pwritev() calls.  These are writing out 16 blocks from shared buffers
at a time:

  pwritev(44, [{iov_base=..., iov_len=73728},
   {iov_base=..., iov_len=24576},
   {iov_base=..., iov_len=32768}], 3, 228032512) = 131072

The reason there are 3 vectors there rather than 16 is just that some
of the buffers happened to be adjacent in memory and we might as well
use the smallest number of vectors.  Just after we've started up and
the buffer pool is empty, it's easy to find big single vector I/Os,
but things soon get more fragmented (blocks adjacent on disk become
less likely to be adjacent in shared buffers) and that number goes up,
but that shouldn't make much difference to the OS or hardware assuming
decent scatter/gather support through the stack.  If io_data_direct=on
(not the default) and the blocks are in one physical extent on the
file system, that might even go all the way down to the disk as a
single multi-segment write command for the storage hardware DMA engine
to beam directly in/out of our buffer pool without CPU involvement.

Mixed into that stream of I/O worker system calls, you'll also see WAL
going out to disk:

  pwritev(15, [{iov_base=..., iov_len=1048576}], 1, 4194304) = 1048576

Meanwhile, the user session process running the big INSERT can be seen
signalling the I/O worker to wake it up.  The same thing happens for
bgwriter, checkpointer, autovacuum and walwriter: you can see them all
handing off most of their I/O work to the pool of I/O workers, with a
bit of new signalling going on (which we try to minimise, and can
probably minimise much more).  (You might be able to see some evidence
of Andres's new buffer cleaning scheme too, which avoids some bad
patterns of interleaving small reads and writes, but I'm skipping
right over here...)

Working through a very simple example of how the I/O comes to be
consolidated and parallelised, let's look at a simple non-parallel
SELECT COUNT(*) query on a large table.  The I/O worker does a stream
of scattered reads into our buffer pool:

  preadv(51, [{iov_base=..., iov_len=24576},
  {iov_base=..., iov_len=8192},
  {iov_base=..., iov_len=16384},
  {iov_base=..., iov_len=16384},
  {iov_base=..., iov_len=16384},
  {iov_base=..., iov_len=49152}], 6, 190808064) = 131072

Meanwhile our user session backend can be seen waking it up whenever
it's trying to start I/O and finds it snoozing:

  kill(1803835, SIGURG)   = 0
  kill(1803835, SIGURG)   = 0
  kill(1803835, SIGURG)   = 0
  kill(1803835, SIGURG)   = 0
  kill(1803835, SIGURG)   = 0

Notice that there are no sleeping system calls in the query backend,
meaning the I/O in this example is always finished by the time the
executor gets around to accessing the page it requested, so we're
staying far enough ahead and we can be 100% CPU bound.  In unpatched
PostgreSQL we'd hope to have no actual sleeping in such a simple case
anyway, thanks to the OS's readahead heuristics; but (1) we'd still do
individual pread(8KB) calls, meaning that the user's query is at least
having to pay the CPU cost of a return trip into the kernel and a
copyout of 8KB from kernel space to user space, here avoided, (2) in
io_data_direct=on mode, there's no page cache and thus no kernel read
ahead, so we need to replace that mechanism with something anyway, (3)
it's needed for non-sequential access like btree scans.

Sometimes I/Os are still run in user backends, for example because (1)
existing non-AIO code paths are still reached, (2) in worker mode,
some kinds o

Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Alexey Lesovsky

Hi,

Thank you for the amazing and great work.

On 23.02.2021 15:03, Andres Freund wrote:

## Stats

There are two new views: pg_stat_aios showing AIOs that are currently
in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.


As a DBA I would like to propose a few amendments that might help with 
practical usage of the stats when the feature is finally implemented. My 
suggestions aren’t related to the central idea of the proposed changes, 
but rather to the stats part.


A quick side note, there are two terms in Prometheus 
(https://prometheus.io/docs/concepts/metric_types/):
1. Counter. A counter is a cumulative metric that represents a single 
monotonically increasing counter whose value can only increase or be 
reset to zero on restart.
2. Gauge. A gauge is a metric that represents a single numerical value 
that can arbitrarily go up and down.


For the purposes of long-term stats collection, COUNTERs are preferred 
over GAUGEs, because COUNTERs let us understand how metrics change 
over time without missing potential spikes in activity. As a 
result, we get a much better historical perspective.


Measuring and collecting GAUGEs is limited to the moments in time when 
the stats are taken (snapshots), so the changes that took place between 
the snapshots remain unmeasured. In systems with a high rate of 
transactions per second, measuring GAUGEs won't provide the full 
picture even with a 1 second interval between snapshots. In addition, 
most monitoring systems like Prometheus, Zabbix, etc. use longer 
intervals (from 10-15 to 60 seconds).


The main idea is to try to expose almost all numeric stats as COUNTERs - 
this increases the overall observability of the implemented feature.


pg_stat_aios.
In general, this stat is a set of text values, and at the same time it 
looks GAUGE-like (similar to pg_stat_activity or pg_locks), and is only 
relevant for the moment when the user is looking at it. I think it would 
be better to rename this view to pg_stat_progress_aios. And keep 
pg_stat_aios for other AIO stats with global COUNTERs (like stuff in 
pg_stat_user_tables or pg_stat_statements, or system-wide /proc/stat, 
/proc/diskstats).


pg_stat_aio_backends.
This stat is based on COUNTERs, which is great, but the issue here is 
that its lifespan is limited by the lifespan of the backend processes - 
once the backend exits the stat will no longer be available - which 
could be inappropriate in workloads with short-lived backends.


I think there might be a few existing examples in the current code that 
could be repurposed to implement the suggestions above (such as 
pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think 
incorporating these changes shouldn't take significant effort 
considering the benefit they will bring to the final user.


Once again, huge respect for your work on these changes, and good luck.

Regards, Alexey





Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Greg Stark
I guess what I would be looking for in stats would be a way to tell
what the bandwidth, latency, and queue depth is. Ideally one day
broken down by relation/index and pg_stat_statement record.

I think seeing the actual in-flight async requests in a connection is
probably not going to be very useful in production. It's very low
level, and in production the user is just going to find that
overwhelming detail. It is kind of cool to see the progress of
sequential operations, but I think that should be solved in a higher
level way than this anyway.

What we need to calculate these values would be the kinds of per-op
stats nfsiostat uses from /proc/self/mountstats:
https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsNFSOps

So: the number of async reads we've initiated, how many callbacks have been
called, total cumulative elapsed time between i/o issued and i/o
completed, total bytes of i/o initiated, total bytes of i/o completed.
As well as a counter of requests which returned errors (eof? i/o error?).
If there are other locks or queues internal to postgres, total time
spent in those states.

I have some vague idea that we should have a generic infrastructure
for stats that automatically counts things associated with plan nodes
and automatically bubbles that data up to the per-transaction,
per-backend, per-relation, and pg_stat_statements stats. But that's a
whole other ball of wax :)




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Dmitry Dolgov
> On Tue, Feb 23, 2021 at 02:03:44AM -0800, Andres Freund wrote:
>
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions.  Thus this long email.
>
> Just to be clear: I don't expect the current to design to survive as-is. If
> there's a few sentences below that sound a bit like describing the new world,
> that's because they're from the README.md in the patch series...

Thanks!

> Comments? Questions?
>
> I plan to send separate emails about smaller chunks of this seperately -
> the whole topic is just too big. In particular I plan to send something
> around buffer locking / state management - it's a one of the core issues
> around this imo.

I'm curious about the control knobs for this feature; it's somewhat related
to the stats questions also discussed in this thread. I guess the most
important of those are max_aio_in_flight, io_max_concurrency etc., and
they're going to be hard limits, right? I'm curious if it makes sense
to explore the possibility of having some sort of "backpressure", e.g. if the
number of inflight requests is too large, calculate inflight_limit a bit
lower than possible (to avoid hard performance deterioration when the db
is trying to do too much IO, and rather do it smoothly). From what I
remember io_uring does have something similar, only for SQPOLL. Another
similar question is whether this could be used for throttling of some
overloaded workers in case of misconfigured clients or such?




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Andres Freund
Hi,

On 2021-02-24 21:15:14 +0500, Alexey Lesovsky wrote:
> pg_stat_aios.
> In general, this stat is a set of text values, and at the same time it looks
> GAUGE-like (similar to pg_stat_activity or pg_locks), and is only relevant
> for the moment when the user is looking at it. I think it would be better to
> rename this view to pg_stat_progress_aios. And keep pg_stat_aios for other
> AIO stats with global COUNTERs (like stuff in pg_stat_user_tables or
> pg_stat_statements, or system-wide /proc/stat, /proc/diskstats).

Right - arguably it really shouldn't even have _stat_ in the name... I
don't particularly like the idea of adding _progress_ as that seems it'd
lead to confusing it with pg_stat_progress_vacuum etc - and it's quite a
different beast.


> pg_stat_aio_backends.
> This stat is based on COUNTERs, which is great, but the issue here is that
> its lifespan is limited by the lifespan of the backend processes - once the
> backend exits the stat will no longer be available - which could be
> inappropriate in workloads with short-lived backends.

There's a todo somewhere to roll over the per-connection stats into a
global stats piece on disconnect. In addition I was thinking of adding a
view that sums up the value of "already disconnected backends" and the
currently connected ones.  Would that mostly address your concerns?


> I think there might be few existing examples in the current code that could
> be repurposed to implement the suggestions above (such as
> pg_stat_user_tables, pg_stat_database, etc). With this in mind, I think
> having these changes incorporated shouldn’t take significant effort
> considering the benefit it will bring to the final user.

Yea - I kind of was planning to go somewhere roughly in the direction
you suggest, but took a few shortcuts due to the size of the
project. Having the views made it a lot easier to debug / develop, but
supporting longer lived stuff wasn't yet crucial. But I agree, we really
should have it...

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Andres Freund
Hi,

On 2021-02-24 14:59:19 -0500, Greg Stark wrote:
> I guess what I would be looking for in stats would be a way to tell
> what the bandwidth, latency, and queue depth is. Ideally one day
> broken down by relation/index and pg_stat_statement record.

I think doing it at that granularity will likely be too expensive...


> I think seeing the actual in flight async requests in a connection is
> probably not going to be very useful in production.

I think it's good for analyzing concrete performance issues, but
probably not that much more. Although, it's not too hard to build
sampling based on top of it with a tiny bit of work (should display the
relfilenode etc).


> So number of async reads we've initiated, how many callbacks have been
> called, total cumulative elapsed time between i/o issued and i/o
> completed, total bytes of i/o initiated, total bytes of i/o completed.

Much of that is already in pg_stat_aio_backends - but is lost after
disconnect (easy to solve). We don't track bytes of IO currently, but
that'd not be hard.

However, it's surprisingly hard to do the measurement between "issued"
and "completed" in a meaningful way. It's obviously not hard to measure
the time at which the request was issued, but there's no real way to
determine the time at which it was completed. If a backend is busy doing
other things (e.g. invoke aggregate transition functions), we'll not see
the completion immediately, and therefore not have an accurate
timestamp.

With several methods of doing AIO we can set up signals that fire on
completion, but that's pretty darn expensive. And it's painful to write
such signal handlers in a safe way.


> I have some vague idea that we should have a generic infrastructure
> for stats that automatically counts things associated with plan nodes
> and automatically bubbles that data up to the per-transaction,
> per-backend, per-relation, and pg_stat_statements stats. But that's a
> whole other ball of wax :)

Heh, yea, let's tackle that separately ;)

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-24 Thread Andres Freund
Hi,

On 2021-02-24 21:41:16 +0100, Dmitry Dolgov wrote:
> I'm curious about control knobs for this feature, it's somewhat related
> to the stats questions also discussed in this thread. I guess most
> important of those are max_aio_in_flight, io_max_concurrency etc, and
> they're going to be a hard limits, right?

Yea - there's a lot more work needed in that area.

io_max_concurrency especially really should be a GUC, I was just too
lazy for that so far.


> I'm curious if it makes sense
> to explore possibility to have these sort of "backpressure", e.g. if
> number of inflight requests is too large calculate inflight_limit a bit
> lower than possible (to avoid hard performance deterioration when the db
> is trying to do too much IO, and rather do it smooth).

It's decidedly nontrivial to compute "too large" - and pretty workload
dependent (e.g. lower QDs are better for latency-sensitive OLTP, higher QDs
are better for bulk r/w heavy analytics). So I don't really want to go
there for now - the project is already very large.

What I do think is needed and feasible (there's a bunch of TODOs in the
code about it already) is to be better at only utilizing deeper queues
when lower queues don't suffice. So we e.g. don't read ahead more than a
few blocks for a scan where the query is spending most of the time
"elsewhere.

There's definitely also some need for a bit better global, instead of
per-backend, control over the number of IOs in flight. That's not too
hard to implement - the hardest probably is to avoid it becoming a
scalability issue.

I think the area with the most need for improvement is figuring out how
we determine the queue depths for different things using IO. Don't
really want to end up with 30 parameters influencing what queue depth to
use for (vacuum, index builds, sequential scans, index scans, bitmap
heap scans, ...) - but how much they benefit from a deeper queue will
differ between places.


> From what I remember io_uring does have something similar only for
> SQPOLL. Another similar question if this could be used for throttling
> of some overloaded workers in case of misconfigured clients or such?

You mean dynamically? Or just by setting the concurrency lower for
certain users? I think doing so dynamically is way too complicated for
now. But I'd expect configuring it on a per-user basis or such to be a
reasonable thing. That might require splitting it into two GUCs - one
SUSET one and a second one that's settable by any user, but can only
lower the depth.

I think it'll be pretty useful to e.g. configure autovacuum to have a
low queue depth instead of using the current cost limiting. That way the
impact on the overall system is limited, but it's not slowed down
unnecessarily as much.

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-25 Thread Dmitry Dolgov
> On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
>
> > I'm curious if it makes sense
> > to explore possibility to have these sort of "backpressure", e.g. if
> > number of inflight requests is too large calculate inflight_limit a bit
> > lower than possible (to avoid hard performance deterioration when the db
> > is trying to do too much IO, and rather do it smooth).
>
> What I do think is needed and feasible (there's a bunch of TODOs in the
> code about it already) is to be better at only utilizing deeper queues
> when lower queues don't suffice. So we e.g. don't read ahead more than a
> few blocks for a scan where the query is spending most of the time
> "elsewhere.
>
> There's definitely also some need for a bit better global, instead of
> per-backend, control over the number of IOs in flight. That's not too
> hard to implement - the hardest probably is to avoid it becoming a
> scalability issue.
>
> I think the area with the most need for improvement is figuring out how
> we determine the queue depths for different things using IO. Don't
> really want to end up with 30 parameters influencing what queue depth to
> use for (vacuum, index builds, sequential scans, index scans, bitmap
> heap scans, ...) - but how much they benefit from a deeper queue will differ
> between places.

Yeah, sounds like an interesting opportunity for improvements. I'm
preparing a few benchmarks to understand better how this all works, so
I will keep this in mind.

> > From what I remember io_uring does have something similar only for
> > SQPOLL. Another similar question if this could be used for throttling
> > of some overloaded workers in case of misconfigured clients or such?
>
> You mean dynamically? Or just by setting the concurrency lower for
> certain users? I think doing so dynamically is way too complicated for
> now. But I'd expect configuring it on a per-user basis or such to be a
> reasonable thing. That might require splitting it into two GUCs - one
> SUSET one and a second one that's settable by any user, but can only
> lower the depth.
>
> I think it'll be pretty useful to e.g. configure autovacuum to have a
> low queue depth instead of using the current cost limiting. That way the
> impact on the overall system is limitted, but it's not slowed down
> unnecessarily as much.

Yes, you got it right, not dynamically, but rather expose this to be
configured on e.g. per-user basis.




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-02-25 Thread Alexey Lesovsky

Hi,

On 25.02.2021 02:03, Andres Freund wrote:

> > pg_stat_aio_backends.
> > This stat is based on COUNTERs, which is great, but the issue here is that
> > its lifespan is limited by the lifespan of the backend processes - once the
> > backend exits the stat will no longer be available - which could be
> > inappropriate in workloads with short-lived backends.
>
> There's a todo somewhere to roll over the per-connection stats into a
> global stats piece on disconnect. In addition I was thinking of adding a
> view that sums up the values of "already disconnected backends" and the
> currently connected ones.  Would that mostly address your concerns?


Yes, the approach with separate stats for live and disconnected backends
looks good and solves the problem of "stats loss".


Or it could be done like the stats for shared objects in pg_stat_database,
where a special NULL database is used.


Regards, Alexey




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-04-02 Thread Dmitry Dolgov
Sorry for another late reply, I finally found some time to formulate a couple
of thoughts.

> On Thu, Feb 25, 2021 at 09:22:43AM +0100, Dmitry Dolgov wrote:
> > On Wed, Feb 24, 2021 at 01:45:10PM -0800, Andres Freund wrote:
> >
> > > I'm curious if it makes sense
> > > to explore possibility to have these sort of "backpressure", e.g. if
> > > number of inflight requests is too large calculate inflight_limit a bit
> > > lower than possible (to avoid hard performance deterioration when the db
> > > is trying to do too much IO, and rather do it smooth).
> >
> > What I do think is needed and feasible (there's a bunch of TODOs in the
> > code about it already) is to be better at only utilizing deeper queues
> > when lower queues don't suffice. So we e.g. don't read ahead more than a
> > few blocks for a scan where the query is spending most of the time
> > "elsewhere".
> >
> > There's definitely also some need for a bit better global, instead of
> > per-backend, control over the number of IOs in flight. That's not too
> > hard to implement - the hardest probably is to avoid it becoming a
> > scalability issue.
> >
> > I think the area with the most need for improvement is figuring out how
> > we determine the queue depths for different things using IO. Don't
> > really want to end up with 30 parameters influencing what queue depth to
> > use for (vacuum, index builds, sequential scans, index scans, bitmap
> > heap scans, ...) - but how much they benefit from a deeper queue will differ
> > between places.

Talking about parameters, from what I understand the actual number of queues
(e.g. io_uring) created is specified by PGAIO_NUM_CONTEXTS - shouldn't it be
configurable? Maybe in fact there should not be that many knobs after all - if
the model assumes the storage has:

* Some number of hardware queues, then the number of queues the AIO
  implementation needs depends on it. For example, lowering the number of
  contexts between different benchmark runs, I could see that some of the
  hardware queues were significantly underutilized. Potentially there could
  also be such a thing as too many contexts.

* Certain bandwidth, then the submit batch size (io_max_concurrency or
  PGAIO_SUBMIT_BATCH_SIZE) depends on it. This would allow distinguishing
  attached storage with high bandwidth and high latency from local storage.

From what I see, max_aio_in_flight is used as the queue depth for contexts,
which is workload dependent and not easy to figure out, as you mentioned. To
avoid having 30 different parameters, maybe it's more feasible to introduce
"shallow" and "deep" queues, where the particular depth for those could be
derived from the depth of the hardware queues. The question of which activity
should use which queue is not easy, but if I get it right from queueing theory
(assuming IO producers are stationary processes and fixed IO latency from the
storage) it depends on the IO arrival distribution in every particular case,
and this in turn could be roughly estimated for each type of activity. One can
expect different IO arrival distributions for e.g. a normal point-query backend
and a checkpoint or vacuum process, no matter what the other conditions are
(collecting those for a few benchmark runs indeed gives pretty distinct
distributions).

If I understand correctly, those contexts defined by PGAIO_NUM_CONTEXTS are the
main workhorse, right? I'm asking because there is also something called
local_ring, but it seems no IOs are submitted into those. Assuming that
contexts are the main way of submitting IO, it would also be interesting to
explore contexts isolated for different purposes. I haven't finished my changes
here yet to give any results, but at least some tests with fio show different
latencies when two io_urings are processing mixed reads/writes vs isolated
reads or writes. On a side note, at the end of the day there are so many
queues - application queue, io_uring, mq software queue, hardware queue - I'm
really curious whether they would amplify tail latencies.

Another thing I've noticed is that the AIO implementation is much more
significantly affected by side IO activity than the synchronous one. E.g. the
AIO version's tps drops from tens of thousands to a couple of hundred just
because some kworker started to flush dirty buffers (especially with disabled
writeback throttling), while the synchronous version doesn't suffer that much.
Not sure what to make of it. Btw, overall I've managed to get better numbers
from the AIO implementation on IO-bound test cases with a local NVMe device,
but non-IO-bound ones were mostly a bit slower - is that expected, or am I
missing something?

An interesting thing to note is that the io_uring implementation apparently
relaxed the requirements for polling operations: now one needs only the
CAP_SYS_NICE capability, not CAP_SYS_ADMIN. I guess theoretically there are no
issues using it within the current design?




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-09-22 Thread Thomas Munro
On Wed, Sep 1, 2021 at 5:57 PM Andres Freund  wrote:
> - lots of progress on posix AIO backend (Thomas)

A quick note on this piece:  Though it's still a work in progress with
a few things that need to be improved, I've tested this on a whole lot
of different OSes now.  I originally tried to use realtime signals
(big mistake), but after a couple of reworks I think it's starting to
look plausible and quite portable.  Of the ~10 or so OSes we support
and test in the build farm, ~8 of them have this API, and of those I
have only one unknown: HPUX (I have no access and I am beginning to
suspect it is an ex-parrot), and one mysteriously-doesn't-work: NetBSD
(I'd be grateful for any clues from NetBSD gurus and happy to provide
build/test instructions if anyone would like to take a look).




Re: Asynchronous and "direct" IO support for PostgreSQL.

2023-01-17 Thread Wenjing Zeng


> 2021年9月1日 13:56,Andres Freund  写道:
> 
> Hi,
> 
> Attached is an updated patch AIO series. The major changes are:
> - rebased onto master (Andres)
> - lots of progress on posix AIO backend (Thomas)
> - lots of progress towards a windows native AIO implementation - not yet quite
>  merged (Thomas & David)
> - considerably improved "worker" io_method (Thomas)
> - some preliminary patches merged (Thomas) and thus dropped
> - error handling overhaul, AIO references now use resource owners
> - quite a few more localized bugfixes
> - further CI improvements
> 
> Unfortunately there's a few tests that don't pass on windows. At least some of
> those failures also happen on master - hence the alternative output file added
> in the last commit.
> 
> Thanks to Thomas there's now a new wiki page for AIO support:
> https://wiki.postgresql.org/wiki/AIO
> It's currently mostly a shared todo list
> 
> My own next steps are to try to get some of the preliminary patches merged
> into master, and to address some of the higher level things that aren't yet
> quite right with the AIO interface, and to split the "main" AIO patch into
> smaller patches.
> 
> I hope that we soon send in a new version with native AIO support for
> windows. I'm mostly interested in that to make sure that we get the shared
> infrastructure right.
> 
> Melanie has some work improving bitmap heap scan AIO support and some IO stats
> / explain improvements.
> 
> I think a decent and reasonably simple example for the way the AIO interface
> can be used to do faster IO is
> v3-0028-aio-Use-AIO-in-nbtree-vacuum-scan.patch.gz which adds AIO for nbtree
> vacuum. It's not perfectly polished, but I think it shows that it's not too
> hard to add AIO usage to individual places once the general infrastructure is in
> place.
> 
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often... The code is at
> https://github.com/anarazel/postgres/tree/aio

HI Andres:

I noticed this feature and did some testing.
code in GitHub's aio branch:
The function

static void
pgaio_write_smgr_retry(PgAioInProgress *io)
{
    uint32 off;
    AioBufferTag *tag = &io->scb_data.write_smgr.tag;
    SMgrRelation reln = smgropen(tag->rlocator.locator, tag->rlocator.backend);

    io->op_data.read.fd = smgrfd(reln, tag->forkNum, tag->blockNum, &off);
    Assert(off == io->op_data.read.offset);
}

seems like it should be:

    io->op_data.write.fd = smgrfd(reln, tag->forkNum, tag->blockNum, &off);
    Assert(off == io->op_data.write.offset);


Best regards,
Wenjing

> 
> Greetings,
> 
> Andres Freund
> 



Re: Asynchronous and "direct" IO support for PostgreSQL.

2022-10-11 Thread Michael Paquier
On Tue, Aug 31, 2021 at 10:56:59PM -0700, Andres Freund wrote:
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often... The code is at
> https://github.com/anarazel/postgres/tree/aio

I don't know what's the exact status here, but as there has been no
activity for the past five months, I have just marked the entry as RwF
for now.
--
Michael


signature.asc
Description: PGP signature


Re: Asynchronous and "direct" IO support for PostgreSQL.

2022-11-08 Thread Andres Freund
Hi,

On 2022-10-12 14:45:26 +0900, Michael Paquier wrote:
> On Tue, Aug 31, 2021 at 10:56:59PM -0700, Andres Freund wrote:
> > I've attached the code for posterity, but the series is large enough that I
> > don't think it makes sense to do that all that often... The code is at
> > https://github.com/anarazel/postgres/tree/aio
> 
> I don't know what's the exact status here, but as there has been no
> activity for the past five months, I have just marked the entry as RwF
> for now.

We're trying to get a number of smaller prerequisite patches merged this CF
(aligned alloc, direct IO, dclist, bulk relation extension, ...). Once
that's done I'm planning to send out a new version of the (large) remainder of
the changes.

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2022-01-11 Thread Julien Rouhaud
Hi,

On Wed, Sep 1, 2021 at 1:57 PM Andres Freund  wrote:
>
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often...

Agreed.

> The code is at
> https://github.com/anarazel/postgres/tree/aio

Just FYI the cfbot says that this version of the patchset doesn't
apply anymore, and it seems that your branch was only rebased to
43c1c4f (Sept. 21st) which doesn't rebase cleanly:

error: could not apply 8a20594f2f... lwlock, xlog: Report caller wait
event for LWLockWaitForVar.

Since it's still a WIP and a huge patchset I'm not sure if I should
switch the cf entry to Waiting on Author or not as it's probably going
to rot quite fast anyway.  Just to be safe I'll go ahead and change
the status.  If that's unhelpful just let me know and I'll switch it
back to needs review, as people motivated enough to review the patch
can still work with 43c1c4f as a starting point.




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-28 Thread Melanie Plageman
On Tue, Feb 23, 2021 at 5:04 AM Andres Freund  wrote:
>
> ## AIO API overview
>
> The main steps to use AIO (without higher level helpers) are:
>
> 1) acquire an "unused" AIO: pgaio_io_get()
>
> 2) start some IO, this is done by functions like
>pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
>
>The (read|write|fsync|flush_range) indicates the operation, whereas
>(smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
>
>    (see below for more details about this design choice - it might or might
>    not be
>right)
>
> 3) optionally: assign a backend-local completion callback to the IO
>(pgaio_io_on_completion_local())
>
> 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
>    put on a per-backend list of pending IOs. The pending IOs can be explicitly
>    flushed with pgaio_submit_pending(), but will also be submitted if the
>    pending list gets too large, or if the current backend waits for the
>    IO.
>
>    There are two main reasons not to submit the IO immediately:
>- If adjacent, we can merge several IOs into one "kernel level" IO during
>  submission. Larger IOs are considerably more efficient.
>    - Several AIO APIs allow submitting a batch of IOs in one system call.
>
> 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
>backend. When other backends may need to wait for an IO to finish,
>pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
>BufferDesc), which can be waited for using pgaio_io_wait_ref().
>
> 6) Process the results of the request. If a callback was registered in 3),
>this isn't always necessary. The results of AIO can be accessed using
>pgaio_io_result() which returns an integer where negative numbers are
>-errno, and positive numbers are the [partial] success conditions
>(e.g. potentially indicating a short read).
>
> 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
>another operation (pgaio_io_recycle())
>
>
> Most places that want to use AIO shouldn't themselves need to care about
> managing the number of writes in flight, or the readahead distance. To help
> with that there are two helper utilities, a "streaming read" and a "streaming
> write".
>
> The "streaming read" helper uses a callback to determine which blocks to
> prefetch - that allows doing readahead in a sequential fashion but importantly
> also allows to asynchronously "read ahead" non-sequential blocks.
>
> E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
> uses the tids in LVDeadTuples to figure out which blocks are going to be
> needed. Here's the latter as an example:
> https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
>
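Strung together, the quoted steps 1)-7) look roughly like the following
pseudocode. The pgaio_* names are those from the overview above; the argument
lists are illustrative only, since the patch series' exact signatures are still
in flux:

```
/* 1) acquire an "unused" AIO */
PgAioInProgress *io = pgaio_io_get();

/* 2) queue a read; the _smgr suffix picks how completions/errors are handled.
 *    This does NOT submit to the kernel yet - the IO goes onto the
 *    backend-local pending list so adjacent IOs can be merged and batched. */
pgaio_io_start_read_smgr(io, /* ...tag, buffer, offset... */);

/* 3) optionally register a backend-local completion callback */
pgaio_io_on_completion_local(io, /* ...callback... */);

/* 4) submit the pending list explicitly (also happens automatically when
 *    the list grows too large or the backend waits on one of the IOs) */
pgaio_submit_pending();

/* 5) wait; other backends would use pgaio_io_ref() / pgaio_io_wait_ref() */
pgaio_io_wait(io);

/* 6) negative result => -errno, positive => (possibly short) success */
int result = pgaio_io_result(io);

/* 7) release, or pgaio_io_recycle() to reuse for another operation */
pgaio_io_release(io);
```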

Attached is a patch on top of the AIO branch which does bitmapheapscan
prefetching using the PgStreamingRead helper already used by sequential
scan and vacuum on the AIO branch.

The prefetch iterator is removed and the main iterator in the
BitmapHeapScanState node is now used by the PgStreamingRead helper.

Some notes about the code:

Each IO will have its own TBMIterateResult allocated and returned by the
PgStreamingRead helper and freed later by
heapam_scan_bitmap_next_block() before requesting the next block.
Previously it was allocated once and saved in the TBMIterator in the
BitmapHeapScanState node and reused. Because of this, the table AM API
routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
as an output parameter.

The PgStreamingRead helper pgsr_private parameter for BitmapHeapScan is
now the actual BitmapHeapScanState node. It needed access to the
iterator, the heap scan descriptor, and a few fields in the
BitmapHeapScanState node that could be moved elsewhere or duplicated
(visibility map buffer and can_skip_fetch, for example). So, it is
possible to either create a new struct or move fields around to avoid
this--but, I'm not sure if that would actually be better.

Because the PgStreamingReadHelper needs to be set up with the
BitmapHeapScanState node but also needs some table AM specific
functions, I thought it made more sense to initialize it using a new
table AM API routine. Instead of fully implementing that I just wrote a
wrapper function, table_bitmap_scan_setup() which just calls
bitmapheap_pgsr_alloc() to socialize the idea before implementing it.

I haven't made the GIN code reasonable yet either (it uses the TID
bitmap functions that I've changed).

There are various TODOs in the code posing questions both to the
reviewer and myself for future versions of the patch.

Oh, also, I haven't updated the failing partition_prune regression test
because I haven't had a chance to look at the EXPLAIN code which adds
the text which is not being produced to see if it is 

Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-28 Thread Andres Freund
Hi,

On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.

Cool! I'm heartened to see "12 files changed, 272 insertions(+), 495 deletions(-)"


It's worth calling out that this fixes some abstraction leakyness around
tableam too...


> Each IO will have its own TBMIterateResult allocated and returned by the
> PgStreamingRead helper and freed later by
> heapam_scan_bitmap_next_block() before requesting the next block.
> Previously it was allocated once and saved in the TBMIterator in the
> BitmapHeapScanState node and reused. Because of this, the table AM API
> routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> as an output parameter.
>
> I haven't made the GIN code reasonable yet either (it uses the TID
> bitmap functions that I've changed).

I don't quite understand the need to change the tidbitmap interface, or
maybe rather I'm not convinced that pessimistically preallocating space
is a good idea?


> Because the PgStreamingReadHelper needs to be set up with the
> BitmapHeapScanState node but also needs some table AM specific
> functions, I thought it made more sense to initialize it using a new
> table AM API routine. Instead of fully implementing that I just wrote a
> wrapper function, table_bitmap_scan_setup() which just calls
> bitmapheap_pgsr_alloc() to socialize the idea before implementing it.

That makes sense.


>  static bool
>  heapam_scan_bitmap_next_block(TableScanDesc scan,
> -  TBMIterateResult *tbmres)
> +  TBMIterateResult **tbmres)

ISTM that we possibly shouldn't expose the TBMIterateResult outside of
the AM after this change? It feels somewhat like an implementation
detail now. It seems somewhat odd to expose a ** to set a pointer that
nodeBitmapHeapscan.c then doesn't really deal with itself.


> @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
>   * Create the TBMIterator struct, with enough trailing space to serve the
>   * needs of the TBMIterateResult sub-struct.
>   */
> - iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
> -   MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
> + iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
>   iterator->tbm = tbm;

Hm?


> diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
> index 9a07f06b9f..8e1aa48827 100644
> --- a/src/include/storage/aio.h
> +++ b/src/include/storage/aio.h
> @@ -39,7 +39,7 @@ typedef enum IoMethod
>  } IoMethod;
>
>  /* We'll default to bgworker. */
> -#define DEFAULT_IO_METHOD IOMETHOD_WORKER
> +#define DEFAULT_IO_METHOD IOMETHOD_IO_URING

I agree with the sentiment, but ... :)

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Melanie Plageman
On Wed, Jul 28, 2021 at 2:10 PM Andres Freund  wrote:
> On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
>
> > Each IO will have its own TBMIterateResult allocated and returned by the
> > PgStreamingRead helper and freed later by
> > heapam_scan_bitmap_next_block() before requesting the next block.
> > Previously it was allocated once and saved in the TBMIterator in the
> > BitmapHeapScanState node and reused. Because of this, the table AM API
> > routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> > as an output parameter.
> >
> > I haven't made the GIN code reasonable yet either (it uses the TID
> > bitmap functions that I've changed).
>
> I don't quite understand the need to change the tidbitmap interface, or
> maybe rather I'm not convinced that pessimistically preallocating space
> is a good idea?
>

TBMIterator cannot contain a TBMIterateResult because it prefetches
blocks and calls tbm_iterate() for each one, which would overwrite the
relevant information in the TBMIterateResult before it has been returned
to heapam_scan_bitmap_next_block().*

Thus, we need at least as many TBMIterateResults as the size of the
prefetch window at its largest.

We could save some memory if we separated the data in TBMIterateResult
and made a new struct, let's call it BitmapBlockState, with just the
block number, buffer number, and recheck to be used and returned by
bitmapheapscan_pgsr_next_single().

We need both block and buffer because we need to distinguish between
hit_end, skip_fetch, and invalid block number conditions in the caller.
We need recheck before initiating IO to determine if we should
skip_fetch.

Then a separate struct which is much the same as the existing
TBMIterateResult could be maintained in the BitmapHeapScanState node and
passed into heapam_scan_bitmap_next_block() along with the bitmap (a new
parameter).

In heapam_scan_bitmap_next_block(), after getting the BitmapBlockState
from pg_streaming_read_get_next(), we could call tbm_find_pageentry()
with the block number and bitmap.
For a non-lossy page, we could then scrape the offsets and ntuples using
the PageTableEntry. If it is lossy, we would set recheck and ntuples
accordingly. (I do wonder if that allows us to distinguish between a
lossy page and a block number that is erroneous and isn't in the
bitmap--but maybe that can't happen.)

However, we would still have as many palloc() calls (one for every block, to
create the BitmapBlockState). We would have less outstanding memory by
limiting the number of offsets arrays created.
We would still need to pass the recheck flag, ntuples, and buffer back
up to BitmapHeapNext(), so, at that point we would still need a data
structure that is basically the same as the existing TBMIterateResult.

Alternatively, we could keep an array of TBMIterateResults the size of
the prefetch window and reuse them -- though I'm not sure where to keep
it and how to manage it when the window gets resized.

In my current patch, I allocate and free one TBMIterateResult for each
block. The amount of outstanding memory will be #ios in prefetch window
* sizeof(TBMIterateResult).

We don't want to always palloc() memory for the TBMIterateResult inside
of tbm_iterate(), since other users (like GIN) still only need one
TBMIterateResult.

So, if the TBMIterateResult is not inside of the TBMIterator and
tbm_iterate() does not allocate the memory, we need to pass it in as an
output parameter, and, if we do that, it felt odd to also return it --
hence the function signature change.

One alternative I tried was having the TBMIterator have a pointer to the
TBMIterateResult and then users of it can allocate the TBMIterateResult
and set it in the TBMIterator before calling tbm_iterate(). But, then we
need to expose the TBMIterator outside of the TID bitmap API. Also, it
felt weird to have a member of the iterator which must not be NULL when
tbm_iterate() is called but which isn't set up in tbm_begin_iterate().

>
>
> >  static bool
> >  heapam_scan_bitmap_next_block(TableScanDesc scan,
> > -  TBMIterateResult *tbmres)
> > +  TBMIterateResult **tbmres)
>
> ISTM that we possibly shouldn't expose the TBMIterateResult outside of
> the AM after this change? It feels somewhat like an implementation
> detail now. It seems somewhat odd to expose a ** to set a pointer that
> nodeBitmapHeapscan.c then doesn't really deal with itself.
>

All the members of the TBMIterateResult are populated in
bitmapheapscan_pgsr_next_single() and then
most of it is used by heapam_scan_bitmap_next_block() to
  - detect error conditions and done-ness
  - fill in the HeapScanDesc with the information needed by
heapam_scan_bitmap_next_tuple() (rs_cbuf [which is basically
redundant with TBMIterateResult->buffer] and rs_vistuples)

However, some of the information is used up in BitmapHeapNext() and in
heapam_scan_bitmap_next_tuple() and doesn't go in the HeapScanDesc:
  - BitmapHeapNext() uses the state of the

Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 15:35:30 -0400, Melanie Plageman wrote:
> * I think that having TBMIterateResult inside of TBMIterator is not
>   well-defined C language behavior. In [1], it says
> 
>   "Structures with flexible array members (or unions who have a
>   recursive-possibly structure member with flexible array member) cannot
>   appear as array elements or as members of other structures."

> [1] https://en.cppreference.com/w/c/language/struct

I think it is ok as long as the struct with the flexible array member is
at the end of the struct it is embedded in. I think even by the letter
of the standard, but it's as always hard to parse...

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Thomas Munro
On Sat, Jul 31, 2021 at 7:52 AM Andres Freund  wrote:
> On 2021-07-30 15:35:30 -0400, Melanie Plageman wrote:
> > * I think that having TBMIterateResult inside of TBMIterator is not
> >   well-defined C language behavior. In [1], it says
> >
> >   "Structures with flexible array members (or unions who have a
> >   recursive-possibly structure member with flexible array member) cannot
> >   appear as array elements or as members of other structures."
>
> > [1] https://en.cppreference.com/w/c/language/struct
>
> I think it is ok as long as the struct with the flexible array member is
> at the end of the struct it is embedded in. I think even by the letter
> of the standard, but it's as always hard to parse...

That's clearly the de facto situation (I think that was the case on
the most popular compilers long before flexible array members were
even standardised), but I think it might technically still be not
allowed since this change has not yet been accepted AFAICS:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2083.htm

In any case, we already do it which is why wrasse (Sun Studio
compiler) warns about indkey in pg_index.h.  Curiously, indkey is not
always the final member of the containing struct, depending on
CATALOG_VARLEN...




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Tom Lane
Thomas Munro  writes:
> In any case, we already do it which is why wrasse (Sun Studio
> compiler) warns about indkey in pg_index.h.  Curiously, indkey is not
> always the final member of the containing struct, depending on
> CATALOG_VARLEN...

Hm?  CATALOG_VARLEN is never to be defined, see genbki.h.

regards, tom lane




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-08-09 Thread Melanie Plageman
On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
 wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund  wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO, this is done by functions like
> >pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> >The (read|write|fsync|flush_range) indicates the operation, whereas
> >(smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> >    (see below for more details about this design choice - it might or
> >    might not be
> >right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> >(pgaio_io_on_completion_local())
> >
> > 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
> >    put on a per-backend list of pending IOs. The pending IOs can be
> >    explicitly flushed with pgaio_submit_pending(), but will also be
> >    submitted if the pending list gets too large, or if the current backend
> >    waits for the IO.
> >
> >    There are two main reasons not to submit the IO immediately:
> >- If adjacent, we can merge several IOs into one "kernel level" IO during
> >  submission. Larger IOs are considerably more efficient.
> >    - Several AIO APIs allow submitting a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> >backend. When other backends may need to wait for an IO to finish,
> >pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> >BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in 3),
> >this isn't always necessary. The results of AIO can be accessed using
> >pgaio_io_result() which returns an integer where negative numbers are
> >-errno, and positive numbers are the [partial] success conditions
> >(e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> >another operation (pgaio_io_recycle())
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To help
> > with that there are two helper utilities, a "streaming read" and a
> > "streaming write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows doing readahead in a sequential fashion but
> > importantly
> > also allows to asynchronously "read ahead" non-sequential blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility
> > map to figure out which block needs to be read next. Similarly
> > lazy_vacuum_heap() uses the tids in LVDeadTuples to figure out which blocks
> > are going to be needed. Here's the latter as an example:
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
> >
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>

I have done some testing on how effective it is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.
I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:
  EXPLAIN (buffers, analyze) SELECT ...

The same commit also introduces a temporary dev GUC
io_bitmap_prefetch_depth which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working.

The short version of my results is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:

On Andres' AIO branch [1] with my bitmap heapscan prefetching patch set
applied built with the following build flags:
-O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:
  io_data_direct=1
  io_data_force_async=off
  io_method=io_uring
  log_min_duration_statement=0
  log_duration=on
  set track_io_timing to on;

  set max_parallel_workers_per_gather to 0;
  set enable_seqscan to off;
  set enable_indexscan to off;
  set enable_bitmapscan to on;

  set effective_io_concurren

Re: Asynchronous and "direct" IO support for PostgreSQL.

2022-05-10 Thread Aleksander Alekseev
Hi Andres,

> > The code is at
> > https://github.com/anarazel/postgres/tree/aio
>
> Just FYI the cfbot says that this version of the patchset doesn't
> apply anymore, and it seems that your branch was only rebased to
> 43c1c4f (Sept. 21st) which doesn't rebase cleanly:

After watching your recent talk "IO in PostgreSQL: Past, Present,
Future" [1] I decided to invest some of my time into this patchset. It
looks like at the very least it could use a reviewer, or maybe two :)
Unfortunately, it's a bit difficult to work with the patchset at the
moment. Any chance we may expect a rebased version for the July CF?

> Comments? Questions?

Personally, I'm very enthusiastic about this patchset. However, a set
of 39 patches seems to be unrealistic to test and/or review and/or
keep up to date. The 64 bit XIDs patchset [2] is much less
complicated, but still it got the feedback that it should be splitted
to more patches and CF entries. Any chance we could decompose this
effort?

For instance, I doubt that we need all the backends in the first
implementation. The fallback "worker" one and the io_uring one will
suffice. Other backends can be added as separate features. Considering
that in any case the "worker" backend shouldn't cause any significant
performance degradation, maybe we could start even without io_uring.
BTW, do we need Posix AIO at all, given your feedback on this API?

Also, what if we migrate to AIO/DIO one part of the system at a time?
As I understood from your talk, sequential scans will benefit most
from AIO/DIO. Will it be possible to improve them first, while part of
the system will continue using buffered IO?

[1]: https://www.youtube.com/watch?v=3Oj7fBAqVTw
[2]: https://commitfest.postgresql.org/38/3594/


-- 
Best regards,
Aleksander Alekseev




Re: Asynchronous and "direct" IO support for PostgreSQL.

2022-05-15 Thread Bharath Rupireddy
On Wed, Sep 1, 2021 at 11:27 AM Andres Freund  wrote:
>
> Hi,
>
> Attached is an updated patch AIO series. The major changes are:

Hi Andres, is there a plan to get the fallocate changes in alone first? I
think the fallocate API can help parallel inserts work (bulk relation
extension currently writes zero-filled pages) and make pre-padding while
allocating WAL files faster.

Regards,
Bharath Rupireddy.