Okay, I did some more research into this area. It looks like it will
be feasible to use large TLB pages for PostgreSQL.

Tom Lane <[EMAIL PROTECTED]> writes:
> It wasn't clear from your description whether large-TLB shmem segments
> even have IDs that one could use to determine whether "the segment still
> exists".

There are two types of hugepages:

        (a) private: Not shared on fork(), not accessible to processes
            other than the one that allocates the pages.

        (b) shared: Shared across a fork(), accessible to other
            processes: different processes can access the same segment
            if they call sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably
worth using private hugepages rather than malloc, although I doubt it
matters much either way).

> > Another possibility might be to still allocate a small SysV shmem
> > area, and use that to provide the interlock, while we allocate the
> > buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
> > I think it would resolve the interlock problem, at least.
> 
> Not a bad idea ... I have not got a better one offhand ... but watch
> out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages
are an in-kernel data structure, the kernel takes care of ensuring
that dying processes don't orphan any unused hugepage segments. For
shared hugepages, the logic works like this:

        (a) sys_alloc_hugepages() without IPC_EXCL will return a
            pointer to an existing segment, if there is one that
            matches the key. If an existing segment is found, the
            usage counter for that segment is incremented. If no
            matching segment exists, an error is returned. (I'm pretty
            sure the usage counter is also incremented after a fork(),
            but I'll double-check that.)

        (b) sys_free_hugepages() decrements the usage counter.

        (c) when a process that has allocated a shared hugepage dies
            for *any reason* (even kill -9), the usage counter is
            decremented

        (d) if the usage counter for a given segment ever reaches
            zero, the segment is deleted and the memory is freed.
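
To make (a) through (d) concrete, here is a toy user-space model of the
usage counting. It is only a sketch of the behaviour described above,
not kernel code: the names model_attach() and model_detach(), the fixed
table size, and the "create" flag (standing in for a call with
IPC_EXCL) are all made up for illustration.

```c
#include <stddef.h>

#define MAX_SEGMENTS 8              /* arbitrary size for the toy table */

typedef struct {
    int key;                        /* key the segment was created with */
    int usage;                      /* attached processes; 0 = free slot */
} HugeSegment;

static HugeSegment segments[MAX_SEGMENTS];

/* Return the live segment matching 'key', or NULL if there is none. */
HugeSegment *find_segment(int key)
{
    for (size_t i = 0; i < MAX_SEGMENTS; i++)
        if (segments[i].usage > 0 && segments[i].key == key)
            return &segments[i];
    return NULL;
}

/*
 * (a) Attach to a segment. With create == 0, an existing segment gets
 * its usage counter bumped and a missing one is an error. With
 * create != 0 (modelling IPC_EXCL), a new segment is made and an
 * existing one is an error. Returns the new usage count, or -1.
 */
int model_attach(int key, int create)
{
    HugeSegment *seg = find_segment(key);

    if (seg != NULL)
        return create ? -1 : ++seg->usage;
    if (!create)
        return -1;
    for (size_t i = 0; i < MAX_SEGMENTS; i++) {
        if (segments[i].usage == 0) {
            segments[i].key = key;
            segments[i].usage = 1;
            return 1;
        }
    }
    return -1;                      /* table full */
}

/*
 * (b) and (c): an explicit free and a process death both drop one
 * reference. (d): the slot becomes free again once the count is zero.
 * Returns the remaining usage count, or -1 if no such segment exists.
 */
int model_detach(int key)
{
    HugeSegment *seg = find_segment(key);

    if (seg == NULL)
        return -1;
    return --seg->usage;
}
```

In this model a segment created once and attached once more survives
the first detach and disappears on the second, which is the property
the interlock relies on.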

If we used a key that would remain the same between runs of the
postmaster, this should ensure that there isn't a possibility of two
independent sets of backends operating on the same data dir. The most
logical way to do this IMHO would be to just hash the data dir path,
but I suppose the current method of using the port number should work
as well.
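
As a rough illustration of the hashing idea, something like the
following would do. The function name datadir_to_key() and the choice
of 32-bit FNV-1a are my own assumptions, not anything PostgreSQL does
today; avoiding a key of 0 is just a precaution in case the kernel
treats that value specially.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive a positive integer key from the data
 * directory path using the 32-bit FNV-1a hash.
 */
int datadir_to_key(const char *path)
{
    uint32_t hash = 2166136261u;            /* FNV-1a offset basis */
    int key;

    for (const unsigned char *p = (const unsigned char *) path; *p; p++) {
        hash ^= *p;
        hash *= 16777619u;                  /* FNV-1a prime */
    }
    key = (int) (hash & 0x7fffffffu);       /* clear the sign bit */
    return key != 0 ? key : 1;              /* never hand back 0 */
}
```

The path would need to be canonicalized first (for instance with
realpath()) so that two spellings of the same directory map to the
same key, and two different directories can still collide, so this
narrows the problem rather than eliminating it.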

To elaborate on (a) a bit, we'd want to use this logic when allocating
a new set of hugepages on postmaster startup:

        (1) call sys_alloc_hugepages() without IPC_EXCL. If it returns
            an error, we're in the clear: there's no page matching
            that key. If it returns a pointer to a previously existing
            segment, panic: it is very likely that there are some
            orphaned backends still active.

        (2) If the previous call didn't find anything, call
            sys_alloc_hugepages() again, specifying IPC_EXCL to create
            a new segment.
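
The two steps above can be sketched as follows. The allocator is
passed in as a function pointer with a simplified, hypothetical
signature (the real sys_alloc_hugepages() takes more arguments), and
"exclusive" stands in for IPC_EXCL. fake_alloc() is only a test double
that pretends to be the kernel; none of these names exist anywhere
yet.

```c
#include <stddef.h>

/* Simplified stand-in for sys_alloc_hugepages(): NULL means error. */
typedef void *(*alloc_fn)(int key, size_t len, int exclusive);

typedef enum {
    STARTUP_OK,                 /* a fresh segment was created */
    STARTUP_ORPHANS,            /* segment already exists: refuse to start */
    STARTUP_ALLOC_FAILED        /* could not create a new segment */
} StartupResult;

StartupResult acquire_buffer_segment(alloc_fn alloc, int key,
                                     size_t len, void **out)
{
    *out = NULL;

    /* Step 1: probe without IPC_EXCL; success means old backends live on. */
    if (alloc(key, len, 0) != NULL)
        return STARTUP_ORPHANS;

    /* Step 2: nothing found, so create the segment with IPC_EXCL. */
    *out = alloc(key, len, 1);
    return *out != NULL ? STARTUP_OK : STARTUP_ALLOC_FAILED;
}

/* Test double: a single segment that exists once it has been created. */
static int fake_exists = 0;
static char fake_memory[16];

void *fake_alloc(int key, size_t len, int exclusive)
{
    (void) key;
    (void) len;
    if (fake_exists)
        return exclusive ? NULL : fake_memory;
    if (!exclusive)
        return NULL;
    fake_exists = 1;
    return fake_memory;
}
```

Note that a successful probe in step 1 has already bumped the usage
counter, so the panic path must exit (or free the segment) rather than
carry on. If two postmasters race between the steps, I'd expect the
IPC_EXCL call to fail for the loser, but that is an assumption worth
checking.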

Now, the question is: how should this be implemented? You recently
did some of the legwork toward supporting different APIs for shared
memory / semaphores, which makes this work easier -- unfortunately,
some additional stuff is still needed. Specifically, support for
hugepages is a kernel configuration option that may or may not be
enabled (if it's disabled, the syscall returns a specific error). So I
believe the logic is something like:

        - if compiling on a Linux system, enable support for hugepages
          (the regular SysV stuff is still needed as a backup)

        - if we're compiling on a Linux system but the kernel headers
          don't define the syscalls we need, use some reasonable
          defaults (e.g. the syscall numbers for the current hugepage
          syscalls in Linux 2.5)

        - at runtime, try to make one of these syscalls. If it fails,
          fall back to the SysV stuff.
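
The runtime part of that could look roughly like this. It is only a
sketch: both allocators are passed in as function pointers with a
made-up signature, the ShmemKind enum and all function names are
hypothetical, and the two small functions at the bottom are test
doubles rather than real allocators.

```c
#include <stddef.h>

/* Hypothetical common signature for both backends: NULL means failure. */
typedef void *(*shmem_create_fn)(size_t len);

typedef enum { SHMEM_NONE, SHMEM_HUGEPAGES, SHMEM_SYSV } ShmemKind;

/*
 * Try hugepages first; if that fails for any reason (for example the
 * kernel was built without hugepage support), fall back to SysV.
 * '*kind' reports which backend, if any, supplied the memory.
 */
void *create_shared_memory(shmem_create_fn hugepage_create,
                           shmem_create_fn sysv_create,
                           size_t len, ShmemKind *kind)
{
    void *addr = hugepage_create(len);

    if (addr != NULL) {
        *kind = SHMEM_HUGEPAGES;
        return addr;
    }
    addr = sysv_create(len);
    *kind = (addr != NULL) ? SHMEM_SYSV : SHMEM_NONE;
    return addr;
}

/* Test doubles: one backend that always fails, one that always works. */
static char fake_region[16];

void *backend_unavailable(size_t len)
{
    (void) len;
    return NULL;
}

void *backend_ok(size_t len)
{
    (void) len;
    return fake_region;
}
```

Recording which backend won matters because the matching free or
detach call has to be used at shutdown.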

Does that sound reasonable?

Any other comments would be appreciated.

Cheers,

Neil

-- 
Neil Conway <[EMAIL PROTECTED]> || PGP Key ID: DB3C29FC
