Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

Adam Hooper Wed, 16 Sep 2020 08:49:46 -0700

On Tue, Sep 15, 2020 at 1:00 PM Wes McKinney <wesmck...@gmail.com> wrote:


> We have additional problems in that some file-loading related tasks do
> a mixture of CPU work and IO work, and once a thread has been
> dispatched to execute one of these tasks, when IO takes place, a CPU
> core may sit underutilized while the IO is waiting.
>

Maybe it's best to say what *won't* be solved....

For instance: in Workbench [https://www.workbenchdata.com], we use Arrow to
transfer data between processes and support larger-than-RAM tables. So we
use mmap() and expect the kernel to swap mmapped data out of RAM. It
wouldn't be fair to call calculations "CPU-bound" because we expect them to
page-fault -- cause blocking IO.

Is mmap() the norm? Maybe there's not much point in distinguishing between
IO and CPU. Maybe Parquet-reading is best treated as an exception....

For example, it strikes me as similar to concurrency issues
> inside an analytic database. How are they preventing concurrent
> workload starvation problems or handling CPU/IO task scheduling to
> avoid CPU underutilization?
>

RDBMSs don't prevent concurrent workload starvation. They use process
pools; they provide knobs; and they train an army of specialists to find
compromises.

For instance, Postgres supports parallel SELECT. The knobs are
max_parallel_workers (default 8) and max_parallel_workers_per_gather
(default 2): the former sets the process-pool size and the latter is a nod
to fairness. The kernel handles scheduling. When all worker processes are
busy, Postgres evades its own parallel-query logic and runs the entire
query on the client-connection process itself. (Each connection is a
separate process.) That knob is max_connections (default 100). So in
effect, Postgres by default limits itself to 108 concurrent processes --
and they can be all IO-bound or all CPU-bound.

It is dead-easy to cripple an RDBMS. Heck, it's dead-easy to cripple an
entire *computer* using an RDBMS. Also, Postgres doesn't use all processes
by default, even when they're all available.

Enter DB administrators. A DB administrator monitors queries, finds
overly-expensive ones, and *solves the problem*. Sometimes, the solution is
to tweak knobs. But there are so many other solutions: optimize the query;
add an index; call the query less often (or at night); move it to a
read-only replica; take it out of the RDBMS entirely; or even *change the
project requirements* so the query won't be needed.

The *team* solves the problem. The DB administrator can't do any of these
things in a vacuum.

Point being: it's about people. A good RDBMS provides tools. Those tools
help a team discover which of the RDBMS's multitude of weaknesses are most
urgent. The RDBMS provides few knobs and much documentation. The team
selects compromises.

I think a good bullet for your list of requirements is: "simple enough to
explain to a non-programmer." A disaster *will* happen; and someone will
need to explain it to the boss....

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

Reply via email to