Mark Hahn wrote:

so the question is, if you had a magic wand, what would you change in the kernel (or perhaps libc or other support libs, etc)? most of the things I can think of are not clear-cut. I'd like to be able to give better info from perf counters to our users (but I don't think Linux is really in the way). I suspect we lose some performance due to jitter injected by the OS (and/or our own monitoring) and would like to improve, but again, it's hard to blame Linux. I'd love to have better options for cluster-aware filesystems. kernel-assisted network shared memory?
There's a good rant to be written for Usenix or the Ottawa Linux Symposium, I suspect.

VM

4096 bytes is a small page now. In 1976 a page was 512 bytes; it moved to 4096 in the mid '90s? I forget. Since then, memories and memory bandwidths have gotten much bigger and faster. The telling point for me was looking at a running system and finding only a couple of hundred VM areas in service, so the page breakage (internal fragmentation) from larger pages amounts to almost nothing. We run with 64K pages and plan to experiment with much larger ones.
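For the curious, here is a minimal sketch of asking for bigger pages from user space, assuming a kernel with MAP_HUGETLB support and an admin-reserved huge-page pool (older kernels need a hugetlbfs mount instead); it falls back to base pages if the request fails:

    /* Minimal sketch: request a 2 MB huge-page mapping via
     * mmap(MAP_HUGETLB).  Assumes the admin has reserved huge pages,
     * e.g. via /proc/sys/vm/nr_hugepages; falls back to normal pages
     * if the huge-page request fails. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;           /* one 2 MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");          /* no pool reserved? */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }
        }
        printf("base page size: %ld bytes, mapping at %p\n",
               sysconf(_SC_PAGESIZE), p);
        munmap(p, len);
        return 0;
    }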

One could argue that larger pages waste memory on thread stacks, but I think threads and HPC don't mix well, so there won't be that many of them. I am aware of the great debate about the right way to program high core-count nodes, but I doubt that more threads than processors is the right answer.

Linux also has pretty poor mechanisms for keeping physical memory contiguous: the slabs tend to get fragmented, which is why the big-page pools and things like bigphysarea get preallocated. There's no good reason why you couldn't compact memory on the fly.
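You can at least watch the fragmentation from user space. Here is a small sketch that parses /proc/buddyinfo (per-zone counts of free blocks at each power-of-two order) and reports the largest free contiguous block, assuming 4K base pages for the size arithmetic:

    /* Report the largest free physically-contiguous block per zone,
     * from /proc/buddyinfo.  Lots of order-0 blocks and nothing at
     * high orders means the machine is badly fragmented. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];
        if (!f) { perror("/proc/buddyinfo"); return 1; }
        while (fgets(line, sizeof line, f)) {
            /* format: "Node N, zone NAME  c0 c1 ... cK" where cK is
             * the count of free blocks of 2^K pages */
            char *p = strstr(line, "zone");
            char zone[32];
            int n = 0, order = 0, highest = -1;
            if (!p) continue;
            sscanf(p, "zone %31s%n", zone, &n);
            p += n;
            for (long c; sscanf(p, "%ld%n", &c, &n) == 1; p += n, order++)
                if (c > 0) highest = order;
            printf("zone %-8s largest free block: order %d (%ld KB)\n",
                   zone, highest, highest < 0 ? 0L : (4L << highest));
        }
        fclose(f);
        return 0;
    }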

The VM system is also in the way of OS-bypass RDMA NICs: you either get large kernel patches, like the Quadrics ones, to make virtual-address RDMA work, or you get pinning and registration and other performance-sapping cruft. The new external-pager stuff may help a lot here; I haven't looked at it yet.
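To put a rough number on the registration cruft, here is a sketch that times mlock() on a 64 MB buffer, which is approximately the page-wiring work a registration call has to do. The buffer size is arbitrary, you may need to raise RLIMIT_MEMLOCK (ulimit -l), and older glibc needs -lrt for clock_gettime:

    /* Time how long it takes to wire a buffer into physical memory
     * with mlock() -- roughly what RDMA memory registration must do
     * before a NIC can DMA into user pages. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        size_t len = 64UL * 1024 * 1024;          /* 64 MB buffer */
        char *buf = malloc(len);
        double t0, t1;
        if (!buf) return 1;
        memset(buf, 0, len);                      /* fault pages in */

        t0 = now();
        if (mlock(buf, len) != 0) { perror("mlock"); return 1; }
        t1 = now();

        printf("pinned %zu MB in %.3f ms\n", len >> 20, (t1 - t0) * 1e3);
        munlock(buf, len);
        free(buf);
        return 0;
    }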

I/O system

The block device layer has 512-byte sectors wired in, and it is only really useful for devices that you own exclusively. You've got queueing going on at multiple levels, I think because the architecture has assumptions about CPU/disk performance ratios baked in. And the segments of a bio have to complete in order; what's that about? A little one we ran into here: the block I/O system doesn't know whether an I/O is satisfying an I-stream (instruction) page fault or a D-stream (data) page fault. Consequently, if your L1 I-cache is not coherent (and few are), you have to flush it on every read completion. A little bookkeeping would solve that. (I hope I am wrong about this one!)
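You can at least ask a device node what the block layer believes its sector size to be; a small sketch using the BLKSSZGET ioctl, with /dev/sda as a placeholder device:

    /* Query the logical sector size the block layer reports for a
     * device, via the BLKSSZGET ioctl.  Needs read access to the
     * device node, so typically run as root. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>        /* BLKSSZGET */

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sda";
        int fd = open(dev, O_RDONLY);
        int ssz = 0;
        if (fd < 0) { perror(dev); return 1; }
        if (ioctl(fd, BLKSSZGET, &ssz) < 0) { perror("BLKSSZGET"); return 1; }
        printf("%s: logical sector size %d bytes\n", dev, ssz);
        close(fd);
        return 0;
    }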

File systems

Agree completely about cluster-aware filesystems. We struggle with the Lustre patch sets, which may be an extreme case.

Performance stuff

We are big users of the PAPI infrastructure, which is pretty good, but once you step off that train you have to deal with things like sysfs. We're trying to read hardware counters without unduly disturbing running HPC applications, and Linux's advice is to make a system call for each value, with the result converted to ASCII. That makes sense for slow admin stuff but not for performance data. At least it isn't XML.
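For contrast, the PAPI path hands back raw 64-bit values in one call. A minimal sketch counting cycles and instructions around a dummy loop (link with -lpapi; the loop is just filler):

    /* Read two hardware counters through PAPI: set up an event set
     * once, then get raw 64-bit values back in a single call -- no
     * per-value syscall, no ascii conversion. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long vals[2];
        volatile double x = 0;
        int i;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        if (PAPI_create_eventset(&evset) != PAPI_OK ||
            PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK ||
            PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK)
            return 1;

        PAPI_start(evset);
        for (i = 0; i < 1000000; i++)   /* region of interest */
            x += i * 0.5;
        PAPI_stop(evset, vals);

        printf("cycles=%lld instructions=%lld\n", vals[0], vals[1]);
        return 0;
    }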

Runtime system

I tend towards thinking we would be better off without shared libraries. Memory is big, programs are generally small, and there is a lot of complexity here, to which I am allergic. To the extent that shared libraries make the program slower (due to separate segments for library data, for example), let's get rid of them. Two arguments in their favor: the library may implement a system service chosen by the admin rather than the programmer (PAM modules), and there is this talk of MPI ABIs, so applications can switch MPI implementations without relinking. I think that last one is a bad idea too, but it is off-topic.

OS noise

This becomes a big issue in large systems. There is way too much stuff running in Linux, each piece separately designed, each thread with its own notions of timing and periodic wakeups. Maybe the OS should run on a separate node altogether, and you communicate with it via RDMA, so that all that's left behind on the compute node is maybe memory management.
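Measuring the noise is at least straightforward; a crude sketch that spins on the clock and records the gaps where something else got the CPU (the 10-microsecond threshold and iteration count are arbitrary):

    /* Crude noise detector: spin reading the clock and record the
     * largest gaps between consecutive readings.  On an idle core
     * the gaps are tens of nanoseconds; spikes of many microseconds
     * are the OS (or daemons) stealing the CPU.  Runs a few seconds. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>

    static long long ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void)
    {
        long long prev = ns(), worst = 0, thresh = 10000;  /* 10 us */
        long i, spikes = 0;

        for (i = 0; i < 100 * 1000 * 1000; i++) {
            long long t = ns(), gap = t - prev;
            prev = t;
            if (gap > worst) worst = gap;
            if (gap > thresh) spikes++;
        }
        printf("worst gap %lld ns, %ld gaps over %lld ns\n",
               worst, spikes, thresh);
        return 0;
    }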

-L



