On 30/4/20 6:22, Thomas Munro wrote:
On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
Yeah, I think the question is what are the expected benefits of using
raw devices. It might be an interesting exercise / experiment, but my
understanding is that most of the benefits can be achieved by using file
systems but with direct I/O and async I/O, which would allow us to
continue reusing the existing filesystem code with much less disruption
to our code base.
Agreed.

[snip] That's probably the main work
required to make this work, and might be a valuable thing to have
independently of whether you stick it on a raw device, a big data
file, NV RAM
   ^^^^^^  THIS, with NV DIMMs / PMEM (persistent memory) possibly becoming a hot topic in the not-too-distant future
or some other kind of storage system -- but it's a really
difficult project.

Indeed... But you might have already pointed out the *only* feature required for this to work: a "database" mapping each relfilenode to a "set of extents" that the current I/O primitives would be able to write to. The key is actually an int, or rather a tuple (relfilenode, segment) whose components are both 32-bit currently: that is, a 64-bit "objectID" of sorts. And yes, extents, not blocks: sequential I/O is still faster in all known storage/persistent (vs RAM) systems.
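
To make that mapping a bit more concrete, here is a minimal sketch in Python (not PostgreSQL code). It packs (relfilenode, segment) into a 64-bit key and maps that key to a list of extents. The names object_id, Extent and ExtentMap, the fixed 4 MB extent size, and the naive "bump" allocation are all assumptions made for illustration; freeing, persistence and crash safety are ignored entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List

EXTENT_SIZE = 4 * 1024 * 1024  # assumed extent size (4 MB), illustrative only


def object_id(relfilenode: int, segment: int) -> int:
    """Pack two 32-bit values into a single 64-bit key."""
    if not (0 <= relfilenode < 2**32 and 0 <= segment < 2**32):
        raise ValueError("both components must fit in 32 bits")
    return (relfilenode << 32) | segment


@dataclass(frozen=True)
class Extent:
    start: int   # byte offset within the tablespace (the raw device)
    length: int  # length in bytes


@dataclass
class ExtentMap:
    """Hypothetical in-memory map: 64-bit object ID -> ordered list of extents."""
    next_free: int = 0
    extents: Dict[int, List[Extent]] = field(default_factory=dict)

    def extend(self, relfilenode: int, segment: int) -> Extent:
        """Append one fixed-size extent to the object (naive bump allocation)."""
        ext = Extent(self.next_free, EXTENT_SIZE)
        self.next_free += EXTENT_SIZE
        key = object_id(relfilenode, segment)
        self.extents.setdefault(key, []).append(ext)
        return ext

    def lookup(self, relfilenode: int, segment: int) -> List[Extent]:
        """Return the object's extents in file order (empty if unknown)."""
        return self.extents.get(object_id(relfilenode, segment), [])
```

Each extent is a contiguous run on the device, so I/O within one extent stays sequential even when an object's extents are scattered across the tablespace.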

Some conversion from "absolute" (within the "file") to "relative" (within the "tablespace") offsets would need to happen before delegating to the kernel (or even before dereferencing a pointer to an mmap'd region!), but not much more, ISTM (though I'm far from an expert in this area).
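
The offset conversion could look roughly like the following Python sketch. It assumes the "file" is described by an ordered list of (start, length) extents in bytes; the function name to_device_offset and that representation are hypothetical, chosen only to illustrate the idea.

```python
from typing import List, Tuple

# Each extent is (start_on_device, length), both in bytes, listed in file order.
Extent = Tuple[int, int]


def to_device_offset(extents: List[Extent], file_offset: int) -> int:
    """Translate an offset within the "file" into an offset within the tablespace."""
    if file_offset < 0:
        raise ValueError("negative offset")
    remaining = file_offset
    for start, length in extents:
        if remaining < length:
            # The requested byte lives inside this extent.
            return start + remaining
        remaining -= length
    raise ValueError("offset is past the last extent")
```

The result is what would then be handed to the kernel as the offset of a positioned read or write on the device, or added to the base address of a mapped region. A read spanning an extent boundary would have to be split into one request per extent.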

Off the top of my head:

CREATE TABLESPACE tblspcname [other_options] LOCATION '/dev/nvme1n2' WITH (kind=raw, extent_min=4MB);

  or something similar to that approach might do it.

    Please note that I have purposefully specified "namespace 2" on an "enterprise" NVMe device, to show the possibility.

OR

  use some filesystem (e.g. XFS) with DAX[1] (mount -o dax) where available, along with something equivalent to WITH (kind=mmaped)


... though the locking we currently get "for free" from the kernel would need to be replaced by something else.


Indeed, it seems like an enormous amount of work... but it may well pay off. I can't fully assess the effort, though.


Just my .02€

[1] https://www.kernel.org/doc/Documentation/filesystems/dax.txt


Thanks,

    / J.L.



