On Fri, 2015-10-23 at 11:52 +0200, casper....@oracle.com wrote:
> > Ho-hum...  It could even be made lockless in fast path; the problems I see
> > are
> > 	* descriptor-to-file lookup becomes unsafe in a lot of locking
> > conditions.  Sure, most of that happens on the entry to some syscall, with
> > very light locking environment, but... auditing every sodding ioctl that
> > might be doing such lookups is an interesting exercise, and then there are
> > ->mount() instances doing the same thing.  And procfs accesses.  Probably
> > nothing impossible to deal with, but nothing pleasant either.
>
> In the Solaris kernel code, the ioctl code is generally not handed a file
> descriptor but instead a file pointer (i.e., the lookup is done early in
> the system call).
>
> In those specific cases where a system call needs to convert a file
> descriptor to a file pointer, there is only one routine which can be used.
>
> > 	* memory footprint.  In case of Linux on amd64 or sparc64,
> > main()
> > {
> > 	int i;
> >
> > 	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
> > 		;
> > }
> > will chew 132Mb of kernel data (16M pointers + 32M bits, assuming
> > sufficient ulimit -n, of course).  How much will Solaris eat on the same?
>
> Yeah, that is a large amount of memory.  Of course, the table is only
> sized when it is extended, and there is a reason why there is a limit on
> file descriptors.  But we're using more data per file descriptor entry.
>
> > 	* related to the above - how much cacheline sharing will that involve?
> > These per-descriptor use counts are a bitch to pack, and giving each a
> > cacheline of its own... <shudder>
>
> As I said, we do actually use a lock, and yes, that means that you really
> want to have a single cache line for each and every entry.  It does make
> it easy to have non-racy file description updates.  You certainly do not
> want false sharing when there is a lot of contention.
>
> Other data is used to make sure that it only takes O(log(n)) to find the
> lowest available file descriptor entry.  (Where n, I think, is the
> returned descriptor.)

Yet another POSIX deficiency.  When a server deals with 10,000,000+
sockets, we absolutely do not care about this requirement.  O(log(n)) is
still crazy if it involves O(log(n)) cache misses.
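To make the cost concrete, here is a rough user-space sketch of the kind of
hierarchical bitmap search the "return the lowest available descriptor" rule
forces.  The two-level layout, the sizes and the names are illustrative only,
not taken from either kernel:

/*
 * Illustrative two-level bitmap search for the lowest free descriptor.
 * Names and sizes are made up; the point is only that every level that
 * has to be read is a potential cache miss once the table is cold.
 * __builtin_ctzll() is the GCC/Clang count-trailing-zeros builtin.
 */
#include <stdint.h>

#define WORD_BITS  64
#define MAX_FDS    (1 << 24)                    /* 16M descriptors       */
#define NWORDS     (MAX_FDS / WORD_BITS)        /* 256K leaf words       */

static uint64_t fds_in_use[NWORDS];             /* bit set = fd allocated      */
static uint64_t word_full[NWORDS / WORD_BITS];  /* bit set = leaf word is full */

static int lowest_free_fd(void)
{
	unsigned int i, w;
	int fd;

	for (i = 0; i < NWORDS / WORD_BITS; i++) {
		/* reading each summary word is one potential cache miss */
		if (word_full[i] == ~0ULL)
			continue;               /* the 4096 fds below are all taken */
		w = i * WORD_BITS + __builtin_ctzll(~word_full[i]);
		/* reading the leaf word is another one */
		fd = w * WORD_BITS + __builtin_ctzll(~fds_in_use[w]);
		fds_in_use[w] |= 1ULL << (fd % WORD_BITS);
		if (fds_in_use[w] == ~0ULL)
			word_full[i] |= 1ULL << (w % WORD_BITS);
		return fd;
	}
	return -1;                              /* table full */
}

Even with the summary level, a cold allocation touches at least two unrelated
cache lines, and the bitmaps alone are a couple of megabytes once you allow
16M descriptors.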
> Not contended locks aren't expensive.  And all is done on a single cache
> line.
>
> One question about the Linux implementation: what happens when a socket in
> select is closed?  I'm assuming that the kernel waits until "shutdown" is
> given or when a connection comes in?
>
> Is it a problem that you can "hide" your listening socket with a thread in
> accept()?  I would think so.  (It would be visible in netstat, but you
> can't easily find out who has it.)

Again, netstat -p on a server with 10,000,000 sockets never completes.
Never try it unless you are desperate and perhaps trying to avoid a reboot.

If you absolutely want to nuke a listener because of untrusted applications,
we had better implement a proper syscall.  Android has such a facility.

An alternative would be to extend netlink (the ss command from the iproute2
package) to carry one pid per socket.

	ss -atnp state listening

-> would not have to readlink() every /proc/*/fd/* entry.
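For reference, what netstat -p / ss -p have to do today is roughly the walk
below (a simplified user-space sketch, one inode at a time; the function name
and the shortcuts are mine).  It is O(processes * descriptors) worth of
readlink() calls, which is why it never finishes with millions of sockets:

/*
 * Simplified sketch of the /proc scan that netstat -p and ss -p rely on:
 * walk every /proc/<pid>/fd/<fd> symlink and readlink() it, looking for
 * "socket:[<inode>]".  Real tools hash all inodes in one pass, but the
 * readlink() walk over every descriptor of every process is the same.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the pid owning the socket with this inode, or -1 if not found. */
static long pid_of_socket_inode(unsigned long ino)
{
	char want[64], fddir[64], path[512], link[64];
	struct dirent *p, *f;
	DIR *proc, *fds;

	snprintf(want, sizeof(want), "socket:[%lu]", ino);

	proc = opendir("/proc");
	if (!proc)
		return -1;

	while ((p = readdir(proc)) != NULL) {
		if (!isdigit((unsigned char)p->d_name[0]))
			continue;               /* not a pid directory */
		snprintf(fddir, sizeof(fddir), "/proc/%s/fd", p->d_name);
		fds = opendir(fddir);
		if (!fds)
			continue;
		while ((f = readdir(fds)) != NULL) {
			ssize_t n;

			if (f->d_name[0] == '.')
				continue;
			snprintf(path, sizeof(path), "%s/%s", fddir, f->d_name);
			n = readlink(path, link, sizeof(link) - 1);
			if (n <= 0)
				continue;
			link[n] = '\0';
			if (strcmp(link, want) == 0) {  /* found the owner */
				long pid = atol(p->d_name);

				closedir(fds);
				closedir(proc);
				return pid;
			}
		}
		closedir(fds);
	}
	closedir(proc);
	return -1;
}

Carrying the owning pid in the netlink reply would let ss answer the
listening-socket question without doing this walk at all.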