Okay, I did some more research into this area. It looks like it will be feasible to use large TLB pages for PostgreSQL.

Tom Lane <[EMAIL PROTECTED]> writes:
> It wasn't clear from your description whether large-TLB shmem segments
> even have IDs that one could use to determine whether "the segment still
> exists".

There are two types of hugepages:

(a) private: not shared on fork(), not accessible to processes other
    than the one that allocates the pages.

(b) shared: shared across a fork(), accessible to other processes:
    different processes can access the same segment if they call
    sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably
worth using private hugepages rather than malloc, although I doubt it
matters much either way).

> > Another possibility might be to still allocate a small SysV shmem
> > area, and use that to provide the interlock, while we allocate the
> > buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
> > I think it would resolve the interlock problem, at least.
>
> Not a bad idea ... I have not got a better one offhand ... but watch
> out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages
are an in-kernel data structure, the kernel takes care of ensuring
that dying processes don't orphan any unused hugepage segments. The
logic works like this (for shared hugepages):

(a) sys_alloc_hugepages() without IPC_EXCL will return a pointer to an
    existing segment, if there is one that matches the key. If an
    existing segment is found, the usage counter for that segment is
    incremented. If no matching segment exists, an error is returned.
    (I'm pretty sure the usage counter is also incremented after a
    fork(), but I'll double-check that.)

(b) sys_free_hugepages() decrements the usage counter.

(c) When a process that has allocated a shared hugepage dies for *any
    reason* (even kill -9), the usage counter is decremented.

(d) If the usage counter for a given segment ever reaches zero, the
    segment is deleted and the memory is freed.

If we use a key that remains the same between runs of the postmaster,
this should ensure that two independent sets of backends can't operate
on the same data dir. The most logical way to do this IMHO would be to
just hash the data dir, but I suppose the current method of using the
port number should work as well.

To elaborate on (a) a bit, we'd want to use this logic when allocating
a new set of hugepages on postmaster startup (there's a sketch of this
check below):

(1) Call sys_alloc_hugepages() without IPC_EXCL. If it returns an
    error, we're in the clear: there's no segment matching that key.
    If it returns a pointer to a previously existing segment, panic:
    it is very likely that there are some orphaned backends still
    active.

(2) If the previous call didn't find anything, call
    sys_alloc_hugepages() again, specifying IPC_EXCL to create a new
    segment.

Now, the question is: how should this be implemented? You recently did
some of the legwork toward supporting different APIs for shared memory
/ semaphores, which makes this work easier -- unfortunately, some
additional stuff is still needed. Specifically, support for hugepages
is a kernel configuration option that may or may not be enabled (if
it's disabled, the syscall returns a specific error). So I believe the
logic is something like:

- If we're compiling on a Linux system, enable support for hugepages
  (the regular SysV stuff is still needed as a backup).

- If we're compiling on a Linux system but the kernel headers don't
  define the syscalls we need, use some reasonable defaults (e.g. the
  syscall numbers for the current hugepage syscalls in Linux 2.5), as
  in the first sketch below.

- At runtime, try to make one of these syscalls. If it fails, fall
  back to the SysV stuff (see the last sketch below).
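
To make the second point concrete, here is a rough, untested sketch of
the wrapper I have in mind. The fallback syscall numbers are what I
believe the i386 assignments in the current 2.5 tree to be (they need
to be verified before we rely on them), and pg_alloc_hugepages() /
pg_free_hugepages() are just placeholder names:

    #include <errno.h>
    #include <stddef.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /*
     * Fallback syscall numbers, for kernel headers that predate the
     * hugepage syscalls. These are my reading of the i386 numbers in
     * the current 2.5 tree; double-check before using.
     */
    #ifndef __NR_alloc_hugepages
    #define __NR_alloc_hugepages 250
    #endif
    #ifndef __NR_free_hugepages
    #define __NR_free_hugepages 251
    #endif

    /* Returns the segment address, or NULL with errno set. */
    static void *
    pg_alloc_hugepages(int key, unsigned long addr, size_t len,
                       int prot, int flag)
    {
        long ret = syscall(__NR_alloc_hugepages, key, addr, len,
                           prot, flag);

        return (ret == -1) ? NULL : (void *) ret;
    }

    /* Returns 0 on success, or -1 with errno set. */
    static int
    pg_free_hugepages(void *addr)
    {
        return syscall(__NR_free_hugepages, (unsigned long) addr);
    }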
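
Here is the startup interlock from steps (1) and (2), using the
wrappers above. I'm taking the flag semantics exactly as described
earlier in this mail (if IPC_CREAT turns out to be required as well,
it's a one-line change); create_hugepage_segment() and the key
derivation are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>        /* IPC_EXCL */
    #include <sys/mman.h>       /* PROT_READ, PROT_WRITE */

    /*
     * Allocate the shared-buffer segment at postmaster startup.
     * "key" would be derived from the data dir (or the port number,
     * as now).
     */
    static void *
    create_hugepage_segment(int key, size_t size)
    {
        void *seg;

        /* Step (1): probe for an existing segment matching our key. */
        seg = pg_alloc_hugepages(key, 0, size,
                                 PROT_READ | PROT_WRITE, 0);
        if (seg != NULL)
        {
            /*
             * A segment with this key already exists, so there are
             * very likely orphaned backends still active: refuse to
             * start up.
             */
            fprintf(stderr, "pre-existing hugepage segment found; is "
                    "another postmaster or an orphaned backend "
                    "running?\n");
            pg_free_hugepages(seg);   /* release the reference we took */
            exit(1);
        }

        /* Step (2): nothing matched, so create a fresh segment. */
        seg = pg_alloc_hugepages(key, 0, size,
                                 PROT_READ | PROT_WRITE, IPC_EXCL);
        if (seg == NULL)
        {
            perror("sys_alloc_hugepages");
            exit(1);
        }

        return seg;
    }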
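
And the runtime fallback from the last point. I'm assuming a kernel
built without hugepage support fails the syscall with ENOSYS (the
exact errno needs to be confirmed); sysv_shmem_create() stands in for
the existing SysV code path, and the interlock probe from the previous
sketch is omitted for brevity:

    #include <errno.h>
    #include <stddef.h>
    #include <sys/ipc.h>
    #include <sys/mman.h>

    /* The existing SysV shmem path (stand-in name). */
    extern void *sysv_shmem_create(int key, size_t size);

    static void *
    create_shared_segment(int key, size_t size)
    {
        void *seg;

        /* Try the hugepage path first... */
        seg = pg_alloc_hugepages(key, 0, size,
                                 PROT_READ | PROT_WRITE, IPC_EXCL);
        if (seg == NULL && errno == ENOSYS)
        {
            /* ...but fall back to SysV on a hugepage-less kernel. */
            return sysv_shmem_create(key, size);
        }

        /* Any other failure is reported to the caller as usual. */
        return seg;
    }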

Does that sound reasonable? Any other comments would be appreciated.

Cheers,

Neil

--
Neil Conway <[EMAIL PROTECTED]> || PGP Key ID: DB3C29FC