On Tue, Sep 15, 2020 at 1:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
> We have additional problems in that some file-loading related tasks do > a mixture of CPU work and IO work, and once a thread has been > dispatched to execute one of these tasks, when IO takes place, a CPU > core may sit underutilized while the IO is waiting. > Maybe it's best to say what *won't* be solved.... For instance: in Workbench [https://www.workbenchdata.com], we use Arrow to transfer data between processes and support larger-than-RAM tables. So we use mmap() and expect the kernel to swap mmapped data out of RAM. It wouldn't be fair to call calculations "CPU-bound" because we expect them to page-fault -- cause blocking IO. Is mmap() the norm? Maybe there's not much point in distinguishing between IO and CPU. Maybe Parquet-reading is best treated as an exception.... For example, it strikes me as similar to concurrency issues > inside an analytic database. How are they preventing concurrent > workload starvation problems or handling CPU/IO task scheduling to > avoid CPU underutilization? > RDBMSs don't prevent concurrent workload starvation. They use process pools; they provide knobs; and they train an army of specialists to find compromises. For instance, Postgres supports parallel SELECT. The knobs are max_parallel_workers (default 8) and max_parallel_workers_per_gather (default 2): the former sets the process-pool size and the latter is a nod to fairness. The kernel handles scheduling. When all worker processes are busy, Postgres evades its own parallel-query logic and runs the entire query on the client-connection process itself. (Each connection is a separate process.) That knob is max_connections (default 100). So in effect, Postgres by default limits itself to 108 concurrent processes -- and they can be all IO-bound or all CPU-bound. It is dead-easy to cripple an RDBMS. Heck, it's dead-easy to cripple an entire *computer* using an RDBMS. Also, Postgres doesn't use all processes by default, even when they're all available. Enter DB administrators. A DB administrator monitors queries, finds overly-expensive ones, and *solves the problem*. Sometimes, the solution is to tweak knobs. But there are so many other solutions: optimize the query; add an index; call the query less often (or at night); move it to a read-only replica; take it out of the RDBMS entirely; or even *change the project requirements* so the query won't be needed. The *team* solves the problem. The DB administrator can't do any of these things in a vacuum. Point being: it's about people. A good RDBMS provides tools. Those tools help a team discover which of the RDBMS's multitude of weaknesses are most urgent. The RDBMS provides few knobs and much documentation. The team selects compromises. I think a good bullet for your list of requirements is: "simple enough to explain to a non-programmer." A disaster *will* happen; and someone will need to explain it to the boss.... Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com