Re: mmap() question
On Fri, Oct 11, 2013 at 09:57:24AM +0400, Dmitry Sivachenko wrote: On 11.10.2013, at 9:17, Konstantin Belousov kostik...@gmail.com wrote: On Wed, Oct 09, 2013 at 03:42:27PM +0400, Dmitry Sivachenko wrote:

Hello! I have a program which mmap()s a lot of large files (their total size is more than RAM, and I have no swap), but it needs only small parts of those files at a time. My understanding is that with mmap, when I access some memory region, the OS reads the relevant portion of the file from disk and caches the result in memory. If there is no free memory, the OS will purge previously read parts of the mmap'ed files to free memory for the new chunk. But this is not the case. I use the following simple program, which takes a list of files as command line arguments, mmap()s them all, and then repeatedly selects a random file and a random 1K part of that file and computes the XOR of the bytes from that region. After some time the program dies:

pid 63251 (a.out), uid 1232, was killed: out of swap space

It seems I incorrectly understand how mmap() works; can you please clarify what is going wrong? I expect the program to run indefinitely, purging some regions out of RAM and reading the relevant parts of the files back in.

You did not specify several very important parameters for your test:

1. Total amount of RAM installed: 24GB.
2. Count of the test files and size of the files: to be precise, I used 57 files with sizes varying from 74MB to 19GB; the total size of these files is 270GB.
3. Which filesystem the files are located on: UFS on an SSD drive.
4. Version of the system: FreeBSD 9.2-PRERELEASE #0 r254880M: Wed Aug 28 11:07:54 MSK 2013.

I was not able to reproduce the situation locally. I even tried to start a lot of threads accessing the mapped regions, to try to outrun the pagedaemon. The user threads sleep on the disk read, while the pagedaemon has a lot of time to rebalance the queues. It might be a case where the SSD indeed makes a difference. Still, I see how this situation could appear.
The code which triggers OOM never fires if there is free space in the swap file, so the absence of swap is a necessary condition to trigger the bug. Next, the OOM calculation does not account for the possibility that almost all pages on the queues can be reused. It just fires if free pages are depleted too much or the free target cannot be reached.

IMO one possible solution is to account for the queued pages in addition to the swap space. This is not entirely accurate, since some pages on the queues cannot be reused, at least transiently. The most precise algorithm would count the held and busy pages globally, and subtract this count from the queue lengths, but it is probably too costly. Instead, I think we could rely on the numbers which are counted by the pagedaemon threads during their passes. Due to the transient nature of the pagedaemon failures, this should be fine.

Below is the prototype patch, against HEAD. It is not applicable to stable; please use a HEAD kernel for the test.

diff --git a/sys/sys/vmmeter.h b/sys/sys/vmmeter.h
index d2ad920..ee5159a 100644
--- a/sys/sys/vmmeter.h
+++ b/sys/sys/vmmeter.h
@@ -93,9 +93,10 @@ struct vmmeter {
 	u_int v_free_min;	/* (c) pages desired free */
 	u_int v_free_count;	/* (f) pages free */
 	u_int v_wire_count;	/* (a) pages wired down */
-	u_int v_active_count;	/* (q) pages active */
+	u_int v_active_count;	/* (a) pages active */
 	u_int v_inactive_target; /* (c) pages desired inactive */
-	u_int v_inactive_count;	/* (q) pages inactive */
+	u_int v_inactive_count;	/* (a) pages inactive */
+	u_int v_queue_sticky;	/* (a) pages on queues but cannot process */
 	u_int v_cache_count;	/* (f) pages on cache queue */
 	u_int v_cache_min;	/* (c) min pages desired on cache queue */
 	u_int v_cache_max;	/* (c) max pages in cached obj (unused) */
diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c
index 713a2be..4bb1f1f 100644
--- a/sys/vm/vm_meter.c
+++ b/sys/vm/vm_meter.c
@@ -316,6 +316,7 @@ VM_STATS_VM(v_active_count, "Active pages");
 VM_STATS_VM(v_inactive_target, "Desired inactive pages");
 VM_STATS_VM(v_inactive_count, "Inactive pages");
 VM_STATS_VM(v_cache_count, "Pages on cache queue");
+VM_STATS_VM(v_queue_sticky, "Pages which cannot be moved from queues");
 VM_STATS_VM(v_cache_min, "Min pages on cache queue");
 VM_STATS_VM(v_cache_max, "Max pages on cached queue");
 VM_STATS_VM(v_pageout_free_min, "Min pages reserved for kernel");
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 7846702..6943a0e 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -226,6 +226,7 @@ struct vm_domain {
 	long vmd_segs;	/* bitmask of the segments */
 	boolean_t vmd_oom;
 	int vmd_pass;	/* local pagedaemon pass */
+	int vmd_queue_sticky;	/* pages on queues which cannot be processed */
 	struct vm_page vmd_marker; /* marker for pagedaemon private
Re: mmap() question
On Sat, Oct 12, 2013 at 04:04:31PM +0400, Dmitry Sivachenko wrote: On 12.10.2013, at 13:59, Konstantin Belousov kostik...@gmail.com wrote:

I was not able to reproduce the situation locally. I even tried to start a lot of threads accessing the mapped regions, to try to outrun the pagedaemon. The user threads sleep on the disk read, while the pagedaemon has a lot of time to rebalance the queues. It might be a case where the SSD indeed makes a difference.

With an ordinary SATA drive it would take hours just to read 20GB of data from disk: because of the random access it would do a lot of seeks, and the reading speed would be extremely low. The SSD dramatically improves reading speed.

Still, I see how this situation could appear. The code which triggers OOM never fires if there is free space in the swap file, so the absence of swap is a necessary condition to trigger the bug. Next, the OOM calculation does not account for the possibility that almost all pages on the queues can be reused. It just fires if free pages are depleted too much or the free target cannot be reached.

First I tried with some swap space configured. The OS started to swap out my process after it reached about 20GB, which is also not what I expected: what is the reason to swap out regions of read-only mmap()ed files? Is this the expected behaviour?

How did you conclude that the pages from your r/o mappings were paged out? The VM never does this. Only anonymous memory can be written to the swap file, including the shadow pages for writeable COW mappings. I suspect that you had another 20GB of something else in use on the machine in the meantime.

Below is the prototype patch, against HEAD. It is not applicable to stable; please use a HEAD kernel for the test.

Thanks, I will test the patch soon and report the results.
Re: mmap() question
On Wed, Oct 09, 2013 at 03:42:27PM +0400, Dmitry Sivachenko wrote:

Hello! I have a program which mmap()s a lot of large files (their total size is more than RAM, and I have no swap), but it needs only small parts of those files at a time. My understanding is that with mmap, when I access some memory region, the OS reads the relevant portion of the file from disk and caches the result in memory. If there is no free memory, the OS will purge previously read parts of the mmap'ed files to free memory for the new chunk. But this is not the case. I use the following simple program, which takes a list of files as command line arguments, mmap()s them all, and then repeatedly selects a random file and a random 1K part of that file and computes the XOR of the bytes from that region. After some time the program dies:

pid 63251 (a.out), uid 1232, was killed: out of swap space

It seems I incorrectly understand how mmap() works; can you please clarify what is going wrong? I expect the program to run indefinitely, purging some regions out of RAM and reading the relevant parts of the files back in.

You did not specify several very important parameters for your test:

1. Total amount of RAM installed.
2. Count of the test files and size of the files.
3. Which filesystem the files are located on.
4. Version of the system.
Re: Mixing amd64 kernel with i386 world
On Sat, Sep 28, 2013 at 01:25:39PM +0200, Dimitry Andric wrote: On Sep 28, 2013, at 12:37, Peter Jeremy pe...@rulingia.com wrote:

I have a system with 4GB RAM and hence need to use an amd64 kernel to use all the RAM (I can only access 3GB RAM with an i386 kernel). OTOH, amd64 processes are significantly (50-100%) larger than equivalent i386 processes, and none of the applications I'll be running on the system need to be 64-bit. This implies that the optimal approach is an amd64 kernel with an i386 userland (I'm ignoring PAE as a usable approach). I've successfully run i386 jails on amd64 systems, so I know this mostly works. I also know that there are some gotchas:
- kdump needs to match the kernel;
- anything accessing /dev/mem or /dev/kmem (which implies anything that uses libkvm) probably needs to match the kernel.
Has anyone investigated this approach?

I have only run i386 jails on amd64, just like you. Though for the use case you are describing, the really optimal approach would be to have real x32 support. Unfortunately, that is quite a lot of work... :)

The management interfaces usually do not work for compat32 on amd64. The biggest exception I am aware of is nmount(2). But you would be without network. The same is true for almost all sysctls; again, the kern.proc.* hierarchy is an exception and works. The syscalls required for normal usermode operation do work, at least most developers care about this, except for capsicum. Using a 32-bit jail or, even easier, a chroot is, for now, the easiest proposition.
Re: mmap on emulated i386
On Tue, Sep 17, 2013 at 07:01:16PM -0700, Rui Paulo wrote:

Hi, I'm trying to figure out why the following fails when compiled on amd64 with -m32:

mmap(NULL, 0x7, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_ANON|MAP_PRIVATE, -1, 0);

It returns EINVAL. I looked around everywhere, but I couldn't find where the EINVAL is coming from. The length argument doesn't really make any difference. At this point I'm thinking it's a bug...

What is the version of your system, including the userspace? -m32 only works on HEAD. Do you have r255657/r255658 in your kernel? Please provide a kdump of the ktraced execution of your test program, as well as the test program itself. N.B. The following worked fine for me, both as a 64-bit and as a 32-bit binary:

#include <sys/mman.h>
#include <err.h>
#include <stdlib.h>

int
main(void)
{
	char *p;

	p = mmap(NULL, 0x7, PROT_READ | PROT_WRITE | PROT_EXEC,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	return (0);
}
Re: ps_strings
On Mon, Aug 19, 2013 at 06:49:55PM -0600, Chris Torek wrote:

Yes, p_args caches the arguments, but not always. Right now, the kernel does not cache the arguments if the string is longer than 256 bytes. Look for ps_arg_cache_limit in kern_exec.c. setproctitle() always informs the kernel with a sysctl and sets the pointers in ps_strings. The kern.proc.args sysctl first tries p_args, and falls back to reading ps_strings and following the pointers if p_args is NULL.

Ah, that's what I get for scanning through years of updates too fast. :-) This seems a bit of a worst of both worlds: there's now some extra kernel code for poking through the ps_strings and the pointer vectors (this code is no longer in libkvm at all -- that was where I looked first and found the sysctl), for the "no p_args" case. It seems like perhaps there could just be a sysctl to return the ps_strings address, and leave the "follow argv pointers" code in libkvm, if there is to be code for that.

There is demand for other things besides the arguments, e.g. the environment and, most importantly for debuggers, the ELF aux vector. Also, moving this code to libkvm would mean that mismatched ABIs could not be easily supported, like a 64-bit binary trying to get a 32-bit binary's information. I would say that the current placement fulfills its goals.

(The kernel saves a bit of time for the presumably-usual "p_args not NULL" case, and finding the location of the ps_strings structure when it *is* used is automatically correct. So that part is a straight-up improvement, at least.) Not that big a deal either way, but it does seem as though there should be documentation for ps_strings.

Chris
Re: ps_strings
On Sun, Aug 18, 2013 at 04:05:11PM -0600, Chris Torek wrote:

FreeBSD now, however, uses a per-process p_args field in the proc structure, with sysctl()s to set and get p_args. (I had nothing to do with this new code, but I approve, as if anyone cares. :-) ) This removes the fixed-virtual-address limitation. The cost is a bit more kernel code (for the sysctl()s) and this per-process data, but there is no more messing about with "where is ps_strings" in this memory layout / emulation, etc. (Meanwhile libkvm still retrieves the arguments. It just does it now with sysctl().)

Yes, p_args caches the arguments, but not always. Right now, the kernel does not cache the arguments if the string is longer than 256 bytes. Look for ps_arg_cache_limit in kern_exec.c. setproctitle() always informs the kernel with a sysctl and sets the pointers in ps_strings. The kern.proc.args sysctl first tries p_args, and falls back to reading ps_strings and following the pointers if p_args is NULL.
Re: kldload ipfw, with IPFIREWALL_DEFAULT_TO_ACCEPT
On Mon, Jul 29, 2013 at 12:27:40PM +0100, Karl Pielorz wrote: --On 29 July 2013 13:02 +0200 Stefan Esser s...@freebsd.org wrote:

I guess you were looking for: net.inet.ip.fw.default_to_accept=1 which is a tunable to be set in /boot/loader.conf ...

Very probably - but that's at boot time :( Is there nothing I can do at kldload time to have the initial kldload give me an 'allow ip from any to any' rule as it loads (thus not affecting traffic on the machine or, more importantly, the CARP interfaces)?

kenv net.inet.ip.fw.default_to_accept=1 should have the same effect after usermode is booted. The kenv must be set before the module is loaded.
Re: b_freelist TAILQ/SLIST
On Sat, Jun 29, 2013 at 10:06:11AM +0300, Alexander Motin wrote:

I understand that a lock attempt will steal the cache line from the lock owner. What I don't quite understand is why avoiding it helps performance in this case. Indeed, having the mutex on its own cache line will not let other cores steal bswlist as well, but it also means that bswlist has to be prefetched separately (and profiling shows resource stalls there). Or is a separate speculative prefetch better in this case than a forced one which could be stolen? Are there cases when it is not, or is the only reason not to pad all global mutexes just saving memory?

I can speculate that it is a case where speculative execution helps. If the mutex and the list head are on different cache lines, then the CPU can speculatively read the head, and then prove that executing the read before the lock acquisition does not break the ordering rules (because the lock protects the head, another core indeed cannot modify the head if the lock acquisition was successful). I think it is a very similar reason to why locked instructions used as barriers are faster than explicit barriers: the CPU can still do speculative execution after the lock prefix if the ordering is provably consistent. Please see the Intel IA-32 architecture optimization manual, section 8.4.5, for the recommendations (but not much explanation). Yes, I think putting all locks on dedicated cache lines would be a waste; only hot locks need this.
Re: b_freelist TAILQ/SLIST
On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:

Hi. While doing some profiles of GEOM/CAM IOPS scalability, on some test patterns I've noticed serious congestion with spinning on the global pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is already very simple, I've tried to optimize probably the only thing possible there: switch bswlist from a TAILQ to an SLIST. As I can see, the b_freelist field of struct buf is really used as a TAILQ in some other places, so I've just added another SLIST_ENTRY field. And the result appeared to be surprising -- I can no longer reproduce the issue at all. Maybe it was just unlucky synchronization of the specific test, but I've seen it on two different systems and rechecked the results with/without the patch three times.

This is too unbelievable. Could it be, e.g., some cache line conflicts which cause the thrashing, in fact? Does it help if you add void *b_pad before b_freelist instead of adding b_freeslist?

The present patch is here: http://people.freebsd.org/~mav/buf_slist.patch

The question is how to do it better? What is the KPI/KBI policy for struct buf? I could replace b_freelist by a union and keep the KBI, but partially break the KPI. Or I could add another field, probably breaking the KBI, but keeping the KPI. Or I could do something handmade with no breakage. Or is this change just a bad idea?

The same question about using a union for b_freelist/b_freeslist: is the effect of magically fixing the contention still there if b_freeslist is at the same offset as b_freelist? There is no K{B,P}I policy for struct buf in HEAD, just change it as it fits.
Re: expanding amd64 past the 1TB limit
On Thu, Jun 27, 2013 at 03:23:36PM -0600, Chris Torek wrote:

OK, I wasted :-) way too much time, but here's a text file that can be comment-ified or stored somewhere alongside the code or whatever...

I think it would be a nice addition to the VM article in the doc collection. The content is correct and useful, IMO. See some notes below.

The reason for having those empty (all-zero) PTE pages is that whenever new processes are created, in pmap_pinit(), they have their (new) L4 PTE page set up to point to the *same* physical pages that the kernel is using. Thus, if the kernel creates or destroys any level-3-or-below mapping by writing into any of the above four pages, that mapping is also created/destroyed in all processes. Similarly, the NDMPML4 pages starting at DMPDPphys are mapped identically in all processes. The kernel can therefore borrow a user pmap at any time, i.e., there's no need to adjust the CPU's CR4 on entry to the kernel.

It is %cr3.

(If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E entries into the new pmap, the L3 pages would not have to be physically contiguous, but the KVA ones would still all have to exist. It's free to allocate physically contiguous pages here anyway, though.)

I do not see how the physical contiguity of the allocated page table pages is relevant there. Copying the L4 or L3 PTEs would cause serious complications. In fact, this is how i386 operates: due to the limited VA, the pmap does not have the luxury of allocating entire L(top-1) page table pages to the kernel. As a result, pmap_kenter_pde() has to iterate over the whole pmap list, which is maintained just to be able to update the KVA mappings.

So, the last NKPML4E slots in KPML4phys point to the following page tables, which use all of L3, L2, and L1 style PTEs. (Note that we did not need any L1 PTEs for the direct map, which always uses 2MB or 1GB superpages.)

This is not quite true.
In the initial state, indeed all PTEs for the direct map are superpages, either 1G or 2M. But Intel states that a situation in which a physical page has mappings with different caching modes causes undefined behaviour. As a result, if a page is remapped with a non-write-back caching attribute, the direct map has to demote the superpage and adjust the mapping attribute of the page frame for that page. Consequently, it is possible to have small mappings in the direct map. See pmap_page_set_memattr(), which delegates the work to pmap_change_attr(_locked).
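The level-by-level split discussed in this thread can be made concrete with the standard amd64 virtual-address arithmetic. A small sketch follows; the helper names are mine, not the kernel's, but the 9-bits-per-level constants and the 2MB/1GB superpage spans are architectural.

```c
#include <assert.h>
#include <stdint.h>

/* Each amd64 page-table level resolves 9 bits of the virtual address:
 * L4 (PML4) bits 47-39, L3 (PDP) bits 38-30, L2 (PD) bits 29-21,
 * L1 (PT) bits 20-12; the low 12 bits are the byte offset in the page. */
static unsigned pml4_index(uint64_t va) { return ((va >> 39) & 0x1ff); }
static unsigned pdp_index(uint64_t va)  { return ((va >> 30) & 0x1ff); }
static unsigned pd_index(uint64_t va)   { return ((va >> 21) & 0x1ff); }
static unsigned pt_index(uint64_t va)   { return ((va >> 12) & 0x1ff); }

/* Span of one entry at each level: a single L3 entry maps 1GB and a
 * single L2 entry maps 2MB -- exactly the two superpage sizes the
 * direct map uses until a caching-attribute change forces a demotion
 * down to 4KB L1 mappings. */
#define L3_SPAN	(1ULL << 30)	/* 1GB */
#define L2_SPAN	(1ULL << 21)	/* 2MB */
```

Demoting one 1GB direct-map superpage therefore replaces a single L3 entry with a 512-entry L2 page, and likewise one level further down for 2MB pages.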
Re: b_freelist TAILQ/SLIST
On Fri, Jun 28, 2013 at 08:14:42AM -0700, Adrian Chadd wrote:

I'd rather you narrow down _why_ it's performing better before committing it. Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel. Or penalize some other set of machines where this is currently not a problem. The cause should be identified before any change is committed. You're using instructions-retired; how about using L1/L2 cache loads, stores, etc.? There are a lot more CPU counters available. You have a very cool problem to solve. If I could reproduce it locally I'd give you a hand.

Thanks,
-adrian
Re: expanding amd64 past the 1TB limit
On Wed, Jun 26, 2013 at 10:11:44AM -0600, Chris Torek wrote:

diff --git a/conf/options.amd64 b/conf/options.amd64
index 90348b7..f3ce505 100644
--- a/conf/options.amd64
+++ b/conf/options.amd64
@@ -1,6 +1,7 @@
 # $FreeBSD$
 # Options specific to AMD64 platform kernels
+AMD64_HUGE		opt_global.h

Is this option needed? The SMAP is already parsed at the time of the pmap_bootstrap() call, so you could determine the amount of physical memory and size the KVA map accordingly?
Re: b_freelist TAILQ/SLIST
On Sat, Jun 29, 2013 at 01:15:19AM +0300, Alexander Motin wrote: On 28.06.2013 09:57, Konstantin Belousov wrote: On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:

While doing some profiles of GEOM/CAM IOPS scalability, on some test patterns I've noticed serious congestion with spinning on the global pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is already very simple, I've tried to optimize probably the only thing possible there: switch bswlist from a TAILQ to an SLIST. As I can see, the b_freelist field of struct buf is really used as a TAILQ in some other places, so I've just added another SLIST_ENTRY field. And the result appeared to be surprising -- I can no longer reproduce the issue at all. Maybe it was just unlucky synchronization of the specific test, but I've seen it on two different systems and rechecked the results with/without the patch three times.

This is too unbelievable. Could it be, e.g., some cache line conflicts which cause the thrashing, in fact?

I think it indeed may be cache thrashing. I've made some profiling for getpbuf()/relpbuf() and found interesting results. With the patched kernel using SLIST, profiling shows mostly one point of RESOURCE_STALLS.ANY in relpbuf() -- the first lock acquisition causes 78% of them. Later memory accesses, including the lock release, hit the same cache line and are almost free. With the clean kernel using TAILQ, I see RESOURCE_STALLS.ANY spread almost equally between the lock acquisition, the bswlist access, and the lock release. It looks like the cache line is constantly evicted by something. My guess was that the patch somehow changed cache line sharing. But several checks with nm have shown that, while the memory allocation indeed changed slightly, in both cases the content of the cache line in question is absolutely the same, just shifted in memory by 128 bytes. I guess the cache line could be thrashed by threads doing adaptive spinning on the lock after a collision has happened.
That thrashing increases the lock hold time and further increases the chance of additional collisions. Maybe the switch from TAILQ to SLIST slightly reduces the lock hold time, reducing the chance of a cumulative effect. The difference is not big, but in this test this global lock is acquired 1.5M times per second by 256 threads on 24 CPUs (12xL2 and 2xL3 caches). Another guess was that we have some bad case of false cache line sharing, but I don't know how that can be either checked or avoided. At the last moment, mostly for luck, I've tried to switch pbuf_mtx from mtx to mtx_padalign on the clean kernel. To my surprise, that also seems to have fixed the congestion problem, but I can't explain why. RESOURCE_STALLS.ANY still shows there is cache thrashing, but the lock spinning has gone. Any ideas about what is going on there?

FWIW, Jeff just changed the pbuf_mtx allocation to use padalign; it is a somewhat unrelated change, in r252330. Are pbuf_mtx and bswlist located next to each other in your kernel? If yes, then I would expect that the explanation is how the MESI protocol and atomics work. When performing a locked op, the CPU takes the whole cache line into exclusive ownership. Since our locks try the cmpset as the first operation, and then the 'adaptive' loop interleaves cmpset and the check for ownership, false cache line sharing between pbuf_mtx and bswlist should result in exactly such effects. Different cores would bounce the ownership of the cache line, slowing down the accesses. AFAIR Intel exports some MESI events as performance counters, but I was not able to find a reference among the architecturally defined events. They seem to be present in the model-specific part.
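The mtx_padalign idea discussed above, giving a hot lock its own cache line so that spinning waiters cannot evict the data it protects, can be sketched in portable C11. The struct names and the stand-in lock type are illustrative, not the kernel's; the kernel uses its CACHE_LINE_SIZE constant rather than a hard-coded 64.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE	64	/* common x86 line size; illustrative */

/* Unpadded layout: the lock and the list head it protects can share a
 * cache line, so cores doing locked cmpset on the lock bounce the very
 * line that also holds the head -- the false-sharing scenario the
 * thread describes for pbuf_mtx and bswlist. */
struct unpadded {
	int	lock;		/* stand-in for struct mtx */
	void	*list_head;	/* stand-in for bswlist */
};

/* Padded layout, in the spirit of mtx_padalign: each member starts on
 * its own cache line, so spinning on the lock cannot steal the line
 * holding the list head. */
struct padded {
	alignas(CACHE_LINE) int		lock;
	alignas(CACHE_LINE) void	*list_head;
};
```

The cost is memory (each padded lock occupies a full line), which is why the thread concludes that only hot locks deserve this treatment.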
Re: please review, patch for lost camisr
On Wed, May 29, 2013 at 10:16:45AM +0300, Konstantin Belousov wrote:

Well, if you and I are right, the minimal patch should be:

diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
index 8d63c9b..7c21015 100644
--- a/sys/kern/kern_intr.c
+++ b/sys/kern/kern_intr.c
@@ -1349,7 +1349,7 @@ ithread_loop(void *arg)
 		 * we are running, it will set it_need to note that we
 		 * should make another pass.
 		 */
-		while (ithd->it_need) {
+		while (atomic_load_acq_int(&ithd->it_need)) {
 			/*
 			 * This might need a full read and write barrier
 			 * to make sure that this write posts before any

So, was it tested? If yes, did the patch help?
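The handshake that the acquire load in the patch above protects can be sketched as a userspace C11 analogue. The names mirror the kernel's it_need/ih_need, but this is an illustrative model, not kernel code: the scheduling side publishes ih_need and then stores it_need with release semantics, and the ithread side must load it_need with acquire semantics so that its subsequent read of ih_need cannot be reordered before that load.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Published by the "scheduler" side, consumed by the "ithread" side. */
static atomic_int it_need;
static int ih_need;	/* made visible by the release/acquire pair */
static int handled;

static void *
scheduler(void *arg)
{
	(void)arg;
	ih_need = 1;	/* plain store, like setting the handler's flag */
	/* Equivalent in spirit to atomic_store_rel_int(&it_need, 1). */
	atomic_store_explicit(&it_need, 1, memory_order_release);
	return (NULL);
}

static void *
ithread(void *arg)
{
	(void)arg;
	/* Equivalent of: while (atomic_load_acq_int(&ithd->it_need)). */
	while (!atomic_load_explicit(&it_need, memory_order_acquire))
		;	/* spin */
	if (ih_need)	/* guaranteed visible after the acquire load */
		handled = 1;
	return (NULL);
}

int
run_once(void)
{
	pthread_t s, i;

	it_need = 0;
	ih_need = 0;
	handled = 0;
	pthread_create(&i, NULL, ithread, NULL);
	pthread_create(&s, NULL, scheduler, NULL);
	pthread_join(s, NULL);
	pthread_join(i, NULL);
	return (handled);
}
```

With a plain (non-acquire) load of it_need, the read of ih_need could in principle be satisfied early and return 0, which is the lost-camisr scenario the thread diagnoses; the acquire/release pairing rules that out.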
Re: Where is stack of program?
On Mon, Jun 10, 2013 at 08:44:25PM +0400, Vagner wrote:

Hi! I need a method of finding, in struct vm_map or vm_object, the segments of memory which belong to the stack. Can you help me please?

Note that the stack is per-thread. The concept is somewhat machine-specific: some architectures utilize two stacks, and on some the stack is purely a software convention. You did not specify in what context your code is to be run, e.g. kernel or user space? Assuming it is the kernel, since you mentioned vm_something. The least error-prone route is to take the stack pointer from the thread context (frame), tf_esp on i386 (or tf_rsp on amd64), and look up the map entry in the process p_vmspace. In reality, the stack is often fragmented, since the stack grow code does not coalesce adjacent grow-down map entries.
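The lookup step suggested above, mapping a saved stack pointer to the map entry containing it, can be illustrated with a toy userspace analogue. The struct and function names here are invented for illustration; the kernel performs this lookup on its vm_map structure (vm_map_lookup_entry()), not with a linear scan over an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Miniature stand-in for a list of vm_map entries: sorted, disjoint
 * [start, end) address ranges. */
struct map_entry {
	uintptr_t	start;
	uintptr_t	end;	/* exclusive */
};

/* Find the entry containing addr (e.g. the thread's tf_rsp), or NULL
 * if the address is unmapped.  Because grow-down stack entries are not
 * coalesced, a real stack may span several consecutive entries. */
static const struct map_entry *
lookup_entry(const struct map_entry *map, size_t n, uintptr_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= map[i].start && addr < map[i].end)
			return (&map[i]);
	return (NULL);
}
```

Once the entry containing tf_rsp is found, neighbouring grow-down entries would have to be walked as well to recover the whole (fragmented) stack region.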
Re: please review, patch for lost camisr
On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote: On 5/28/13 10:08 PM, Konstantin Belousov wrote: On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote: [[ moved to hackers@ from private mail. ]] On 5/28/13 1:13 PM, John Baldwin wrote: On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote: On 5/28/13 9:04 AM, John Baldwin wrote: On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:

Hey folks, I had a talk with Nathan Whitehorn about the camisr issue. The issue we are seeing we mostly know, but to summarize: we are losing the camisr signal and the camisr is not being run. I gave him a summary of what we have been seeing and pointed him to the code I am concerned about here: http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c). What I think is happening is that the setting of it_need to 0 inside of sys/kern/kern_intr.c:ithread_loop() is not being ordered correctly, and it is being delayed until AFTER the call to ithread_execute_handlers() right below the atomic_store_rel_int().

This seems highly unlikely, to the extent that if this were true all our locking primitives would be broken. The store_rel is actually a release barrier which acts more like a read/write fence. No memory accesses (read or write) from before the release can be scheduled after the associated store, either by the compiler or the CPU. That is what Konstantin is referring to in his commit when he says "release semantics".

Yes, that makes sense; however, does it specify that the writes *must* occur at that *point*? If it only enforces ordering then we may have some issue, specifically because the setting of it to '1' inside of intr_event_schedule_thread has no barrier other than the acq semantics of the thread lock. I am wondering what is forcing out the '1' there.

Nothing ever forces writes. You would have to invalidate the cache to do that, and that is horribly expensive.
It is always only about ordering and knowing that, if you can complete another operation on the same cookie variable with acquire semantics, the earlier writes will be visible.

By "cookie", you mean a specific memory address, basically a lock? This is starting to reinforce my suspicions, as the setting of it_need is done with release semantics, while the acq on the other CPU is done on the thread lock. Maybe that is irrelevant. We will find out shortly. See below, as I think we have proof that this is somehow happening.

Having ih_need of 1 and it_need of 0 is certainly busted. The simplest fix is probably to stop using atomics on it_need and just grab the thread lock in the main ithread loop and hold it while checking and clearing it_need.

OK, we have some code that will either prove this, or perturb the memory ordering enough to make the bug go away, or prove this assertion wrong. We will update on our findings, hopefully in the next few days.

IMO the read of it_need in the 'while (ithd->it_need)' should have acquire semantics; otherwise the future reads in ithread_execute_handlers(), in particular of ih_need, could pass the read of it_need and cause the situation you reported. I do not see any acquire barrier between the condition in the while() statement and the read of ih_need in the execute_handlers().

It is probably true that the issue you see was caused by r236456, in the sense that the implicitly locked xchgl instruction on x86 has full barrier semantics. As a result, the store_rel() was actually an acquire too, making this reordering impossible. I argue that this is not a bug in r236456, but an issue in kern_intr.c.

If I remember the code correctly, that would probably explain why we see it only on the 9.1 system. On the other hand, John's suggestion to move the manipulations of it_need under the lock is probably the best anyway.
I was wondering if it would be lower latency to maintain it_need, but to keep another variable, it_needlocked, under the thread lock. This would result in potentially superfluous interrupts; however, under load you would allow the ithread to loop without taking the thread lock some number of times. I am not really sure this optimization is really worth it (especially since it can result in superfluous interrupts), but it may reduce latency, and that might be important. Are there some people I can pass the patch to for help with performance once we confirm that this is the actual bug? We can do internal testing, but I am worried about regressing the performance of any form of IO for the kernel. I'll show the patch soon.

Thank you for the information. This is promising.

Well, if you and I are right, the minimal patch should be:

diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
index 8d63c9b..7c21015 100644
--- a/sys/kern/kern_intr.c
+++ b/sys/kern/kern_intr.c
@@ -1349,7
Re: please review, patch for lost camisr
On Wed, May 29, 2013 at 12:27:46AM -0700, Alfred Perlstein wrote: On 5/29/13 12:16 AM, Konstantin Belousov wrote: On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote: On 5/28/13 10:08 PM, Konstantin Belousov wrote: On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote: [[ moved to hackers@ from private mail. ]] On 5/28/13 1:13 PM, John Baldwin wrote: On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote: On 5/28/13 9:04 AM, John Baldwin wrote: On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:

Hey folks, I had a talk with Nathan Whitehorn about the camisr issue. The issue we are seeing we mostly know, but to summarize: we are losing the camisr signal and the camisr is not being run. I gave him a summary of what we have been seeing and pointed him to the code I am concerned about here: http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c). What I think is happening is that the setting of it_need to 0 inside of sys/kern/kern_intr.c:ithread_loop() is not being ordered correctly, and it is being delayed until AFTER the call to ithread_execute_handlers() right below the atomic_store_rel_int().

This seems highly unlikely, to the extent that if this were true all our locking primitives would be broken. The store_rel is actually a release barrier which acts more like a read/write fence. No memory accesses (read or write) from before the release can be scheduled after the associated store, either by the compiler or the CPU. That is what Konstantin is referring to in his commit when he says "release semantics".

Yes, that makes sense; however, does it specify that the writes *must* occur at that *point*? If it only enforces ordering then we may have some issue, specifically because the setting of it to '1' inside of intr_event_schedule_thread has no barrier other than the acq semantics of the thread lock. I am wondering what is forcing out the '1' there.

Nothing ever forces writes.
You would have to invalidate the cache to do that and that is horribly expensive. It is always only about ordering and knowing that if you can complete another operation on the same cookie variable with acquire semantics that earlier writes will be visible. By cookie, you mean a specific memory address, basically a lock? This is starting to reinforce my suspicions, as the setting of it_need is done with release semantics, however the acquire on the other CPU is done on the thread lock. Maybe that is irrelevant. We will find out shortly. See below as I think we have proof that this is somehow happening. Having ih_need of 1 and it_need of 0 is certainly busted. The simplest fix is probably to stop using atomics on it_need and just grab the thread lock in the main ithread loop and hold it while checking and clearing it_need. OK, we have some code that will either prove this, or perturb the memory ordering enough to make the bug go away, or prove this assertion wrong. We will update on our findings hopefully in the next few days. IMO the read of it_need in the 'while (ithd->it_need)' should have acquire semantics, otherwise the future reads in the ithread_execute_handlers(), in particular, of the ih_need, could pass the read of it_need and cause the situation you reported. I do not see any acquire barrier between a condition in the while() statement and the read of ih_need in the execute_handlers(). It is probably true that the issue you see was caused by r236456, in the sense that the implicitly locked xchgl instruction on x86 has full barrier semantics. As a result, the store_rel() was actually an acquire too, making this reordering impossible. I argue that this is not a bug in r236456, but an issue in kern_intr.c. If I remember the code correctly, that would probably explain why we see it only on the 9.1 system. On the other hand, John's suggestion to move the manipulations of it_need under the lock is probably the best anyway.
I was wondering if it would be lower latency to keep it_need as it is, but maintain another variable, it_needlocked, under the thread lock. This would result in potentially superfluous interrupts, but under load it would allow the ithread to loop without taking the thread lock some number of times. I am not sure this optimization is really worth it (especially since it can result in superfluous interrupts), but it may reduce latency, and that might be important. Are there people I can pass the patch on to for help with performance once we confirm that this is the actual bug? We can do internal testing, but I am worried about regressing performance of any form of IO for the kernel. I'll show the patch soon. Thank you for the information. This is promising. Well, if you and I are right, the minimal patch should be diff --git
Re: please review, patch for lost camisr
On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote: [[ moved to hackers@ from private mail. ]] On 5/28/13 1:13 PM, John Baldwin wrote: On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote: On 5/28/13 9:04 AM, John Baldwin wrote: On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote: Hey folks, I had a talk with Nathan Whitehorn about the camisr issue. The issue we are seeing we mostly know, but to summarize, we are losing the camisr signal and the camisr is not being run. I gave him a summary of what we have been seeing and pointed him to the code I am concerned about here: http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c). What I think is happening is that the setting of it_need to 0 inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled correctly and it is being delayed until AFTER the call to ithread_execute_handlers() right below the atomic_store_rel_int(). This seems highly unlikely, to the extent that if this were true all our locking primitives would be broken. The store_rel is actually a release barrier which acts more like a read/write fence. No memory accesses (read or write) from before the release can be scheduled after the associated store, either by the compiler or CPU. That is what Konstantin is referring to in his commit when he says release semantics. Yes, that makes sense, however does it specify that the writes *must* occur at that *point*? If it only enforces ordering then we may have some issue, specifically because the setting of it to '1' inside of intr_event_schedule_thread has no barrier other than the acq semantics of the thread lock. I am wondering what is forcing out the '1' there. Nothing ever forces writes. You would have to invalidate the cache to do that and that is horribly expensive. It is always only about ordering and knowing that if you can complete another operation on the same cookie variable with acquire semantics that earlier writes will be visible.
By cookie, you mean a specific memory address, basically a lock? This is starting to reinforce my suspicions, as the setting of it_need is done with release semantics, however the acquire on the other CPU is done on the thread lock. Maybe that is irrelevant. We will find out shortly. See below as I think we have proof that this is somehow happening. Having ih_need of 1 and it_need of 0 is certainly busted. The simplest fix is probably to stop using atomics on it_need and just grab the thread lock in the main ithread loop and hold it while checking and clearing it_need. OK, we have some code that will either prove this, or perturb the memory ordering enough to make the bug go away, or prove this assertion wrong. We will update on our findings hopefully in the next few days. IMO the read of it_need in the 'while (ithd->it_need)' should have acquire semantics, otherwise the future reads in the ithread_execute_handlers(), in particular, of the ih_need, could pass the read of it_need and cause the situation you reported. I do not see any acquire barrier between a condition in the while() statement and the read of ih_need in the execute_handlers(). It is probably true that the issue you see was caused by r236456, in the sense that the implicitly locked xchgl instruction on x86 has full barrier semantics. As a result, the store_rel() was actually an acquire too, making this reordering impossible. I argue that this is not a bug in r236456, but an issue in kern_intr.c. On the other hand, John's suggestion to move the manipulations of it_need under the lock is probably the best anyway.
Re: preemptive kernel
On Mon, May 27, 2013 at 05:00:13AM +, Orit Moskovich wrote: Just to be more specific - on x86, during a filter routine, are all interrupts disabled on the CPU currently handling the filter routine, or only interrupts on the IRQ that generated the interrupt? The CPU enters the handler through the interrupt gate, which clears PSL_I. Interrupts are not re-enabled until the handler returns. Is it possible that on a different CPU an interrupt (filter or ithread) of the same IRQ will run? Should I worry about locking data? Can an ithread preempt the same ithread, or will they run concurrently on different CPUs? The APIC or PCI MSI registers are programmed in the physical destination mode. This, together with the fact that interrupts are disabled on the target CPU, prevents parallel or nested execution of the filter. Having both a filter and an ithread for the same interrupt is apparently possible but weird. I do not see anything which would prevent the interrupt filter from being executed while the ithread is running. But again, this is a very unusual setup. An ithread has a single stack; it cannot run simultaneously on several CPUs. -Original Message- From: Konstantin Belousov [mailto:kostik...@gmail.com] Sent: Sunday, May 26, 2013 06:48 PM To: Orit Moskovich Cc: freebsd-hackers@freebsd.org Subject: Re: preemptive kernel On Sun, May 26, 2013 at 11:09:03AM +, Orit Moskovich wrote: Can a filter routine preempt another filter routine? And can an interrupt thread (or a filter routine) preempt another ithread? The filter handler borrows the context from the thread executing at the time of the interrupt. At least on x86 (i.e. i386 and amd64), interrupt handlers are executed with interrupts disabled. So a filter cannot interrupt another filter. Interrupt threads are executed with interrupts enabled. So if an interrupt is delivered to the CPU while the CPU is executing the ithread, the filter 'preempts' the ithread by executing in its context.
For the same reason, if the delivered interrupt causes a higher-priority thread to become ready to run, then the newly runnable thread could preempt the interrupt thread, according to the priority, affinity, and other factors considered by the scheduler.
Re: preemptive kernel
On Sun, May 26, 2013 at 11:09:03AM +, Orit Moskovich wrote: Can a filter routine preempt another filter routine? And can an interrupt thread (or a filter routine) preempt another ithread? The filter handler borrows the context from the thread executing at the time of the interrupt. At least on x86 (i.e. i386 and amd64), interrupt handlers are executed with interrupts disabled. So a filter cannot interrupt another filter. Interrupt threads are executed with interrupts enabled. So if an interrupt is delivered to the CPU while the CPU is executing the ithread, the filter 'preempts' the ithread by executing in its context. For the same reason, if the delivered interrupt causes a higher-priority thread to become ready to run, then the newly runnable thread could preempt the interrupt thread, according to the priority, affinity, and other factors considered by the scheduler.
Re: How to get ucred/xucred in user space?
On Tue, May 14, 2013 at 10:00:21AM +0200, Mike Ma wrote: Hi folks, Can I ask if there's any way to get the ucred/xucred of a process in user space? As I'm trying to port glusterfs, which is a userland filesystem, I need to get the secondary groups of a process. AFAIK, Linux gets them from /proc and NetBSD gets them this way: int name[] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, frame->root->pid }; size_t namelen = sizeof name / sizeof name[0]; struct kinfo_proc kp; size_t kplen = sizeof(kp); int i, ngroups; if (sysctl(name, namelen, &kp, &kplen, NULL, 0) != 0) return; ngroups = MIN(kp.kp_eproc.e_ucred.cr_ngroups, GF_REQUEST_MAXGROUPS); I realized none of them would work in FreeBSD. I'm wondering if there's any alternative way to get group information? The sysctl to retrieve the list of the groups the process belongs to is CTL_KERN/KERN_PROC/KERN_PROC_GROUPS. On HEAD, libprocstat(3) contains the helper procstat_getgroups(), which would be more convenient to use, or you could borrow code from it.
Re: [patch] export CPU physical and virtual address sizes in sysctl oids using do_cpuid
On Mon, May 06, 2013 at 01:37:01AM +0200, Sofian Brabez wrote: This patch uses the do_cpuid function to fetch the CPU physical and virtual address sizes and adds 2 new machine-dependent sysctl OIDs (machdep.cpu_physical_address_bits and machdep.cpu_virtual_address_bits). In order to retrieve the information, it calls do_cpuid with the eax register set to 0x80000008, as referenced in chapter 5.2.7 of the CPUID specification [1] The patch itself is not quite right. The cpuid with the high %eax value should only be called when the high value is supported by the processor. If you look at the function you are changing more closely, you would see the already present conditional call to do_cpuid(0x80000008), done in the correct way. Also, I think that the patch breaks the i386 build. Apple, Inc. in the xnu kernel does the same thing, but they created a specific node called 'machdep.cpu' to store CPU vendor information; the sysctls are machdep.cpu.address_bits.physical and machdep.cpu.address_bits.virtual. I really would like to see this patch in our operating system because it's valuable information nowadays and should be provided like the others. And, the real question about your patch is why this information is of _any_ interest for usermode. The width of the physical address space only affects the BIOS, I believe, since the reported available memory cannot be bigger than the width. The width of the virtual address space is more interesting for the OS, but the only layer where the value makes any sense is pmap, and all existing CPUs only support the 4-level pages, which makes the reported width rather theoretical. Pmap hardwires the 4-level pages and the 48-bit linear address, which is also encoded in the memory map of 64-bit processes. Also, note that cpuid is not a privileged instruction, so any code which calls sysctl(2) could as well perform cpuid and get all the information it needs, instead of interpreting second-hand bits. For example, you can take a look at sysutils/x86info.
Thus, I would like advice on whether, before being imported, it needs modification to match Apple's layout (i.e. using a new sysctl node) or can stay as it is. Also, I took advantage of these changes to patch sys/modules/linprocfs in order to display the address size values in the output of /usr/compat/linux/proc/cpuinfo, like it's done in Linux procfs [2]. For the Linux cpuinfo, I agree that the information needs to be provided if native Linux exports it. I believe it is better to call do_cpuid() in linprocfs_docpuinfo() instead. I filed a PR referenced as amd64/178357 [3]. My kernel was rebuilt without any problems at r250287 and it works as expected.
Re: What is the correct way to declare assembler global variable ?
On Fri, May 03, 2013 at 03:26:36PM -0700, Yuri wrote: On 05/03/2013 13:30, Konstantin Belousov wrote: The formal answer is for you to read about the .type directive in the GNU as manual. Also, you need to read about either common symbols, or about the .size directive. But, note that you cannot access hidden libc symbols from code which links to libc (dynamically). You probably need to reconsider the higher-level approach to your issue. I didn't write this code. This is google-perftools-2.0, which I am trying to port. They added the procedure do_sbrk there (see below) that attempts to emulate what libc does for sbrk. It fails; they probably didn't test it. The idea is that they need to have hooks before and after the sbrk call. I am not sure what the best approach would be here. How can the memory allocation library override sbrk in libc and still be able to attach to the original libc version of sbrk? On Linux there is another symbol, __sbrk, and they just use it to connect to the original sbrk. But there is no such thing on FreeBSD. Yuri static inline void* do_sbrk(intptr_t increment) { void* curbrk = 0; #if defined(__x86_64__) || defined(__amd64__) # ifdef PIC __asm__ __volatile__( "movq .curbrk@GOTPCREL(%%rip), %%rdx; movq (%%rdx), %%rax; movq %%rax, %0;" : "=r" (curbrk) :: "%rdx", "%rax"); # else __asm__ __volatile__( "movq .curbrk(%%rip), %%rax; movq %%rax, %0;" : "=r" (curbrk) :: "%rax"); # endif #else __asm__ __volatile__( "movl .curbrk, %%eax; movl %%eax, %0;" : "=r" (curbrk) :: "%eax"); #endif if (increment == 0) { return curbrk; } char* prevbrk = static_cast<char*>(curbrk); void* newbrk = prevbrk + increment; if (brk(newbrk) == -1) { return reinterpret_cast<void*>(static_cast<intptr_t>(-1)); } return prevbrk; } extern "C" void* sbrk(intptr_t increment) __THROW { MallocHook::InvokePreSbrkHook(increment); void *result = do_sbrk(increment); MallocHook::InvokeSbrkHook(result, increment); return result; } So all the code needs is to call hooks before and after the sbrk call, right?
For the !PIC case, what is cited above would probably work, but it requires that libc is also linked statically. For the dynamic case, it should be enough to use dlsym(RTLD_NEXT, "sbrk") to get a pointer to the interposed symbol in the wrapper. It assumes that the library which interposes sbrk is loaded before libc, but it is relatively hard to violate this. Something along these lines, obviously not even compiled: #ifdef PIC void * sbrk(intptr_t incr) { static void *(*libc_sbrk)(intptr_t); void *ret; if (libc_sbrk == NULL) libc_sbrk = dlsym(RTLD_NEXT, "sbrk"); sbrk_pre_hook(incr); ret = libc_sbrk(incr); sbrk_post_hook(ret, incr); return (ret); } #endif The sample is in fact thread-safe, or rather, not any more unsafe than an sbrk() use itself.
Re: What is the correct way to declare assembler global variable ?
On Fri, May 03, 2013 at 01:10:29PM -0700, Yuri wrote: I am trying to compile this code fragment into my program (taken from lib/libc/amd64/sys/sbrk.S): void my_func() { ... __asm__ __volatile__( "movq .curbrk(%%rip), %%rax; lea .curbrk(%%rip), %%rdx; movq %%rax, %0; movq %%rdx, %1;" : "=r" (my_curbrk), "=r" (my_curbrk_ptr) :: "%rax", "%rdx"); ... } I get a warning: /usr/bin/ld: warning: type and size of dynamic symbol `.curbrk@@FBSDprivate_1.0' are not defined What is the correct way to declare .curbrk in inline assembly? The formal answer is for you to read about the .type directive in the GNU as manual. Also, you need to read about either common symbols, or about the .size directive. But, note that you cannot access hidden libc symbols from code which links to libc (dynamically). You probably need to reconsider the higher-level approach to your issue.
Re: potential future proofing fix for aicasm build.
On Thu, May 02, 2013 at 10:57:23AM -0700, Xin Li wrote: On 05/02/13 00:19, Dimitry Andric wrote: On May 1, 2013, at 18:44, Alfred Perlstein alf...@ixsystems.com wrote: I took a shot at fixing this issue with building aicasm as part of buildkernel of an older 9.0 src on a machine running HEAD. aicasm.o: In function `__getCurrentRuneLocale': /usr/include/runetype.h:96: undefined reference to `_ThreadRuneLocale' I don't understand this error message... It seems like a linker error, but it also seems to refer to an incorrect include file? Is this during linking or compiling? This is because of the locale code being a macro: #if defined(__NO_TLS) || defined(__RUNETYPE_INTERNAL) extern const _RuneLocale *__getCurrentRuneLocale(void); #else extern _Thread_local const _RuneLocale *_ThreadRuneLocale; static __inline const _RuneLocale *__getCurrentRuneLocale(void) { if (_ThreadRuneLocale) return _ThreadRuneLocale; if (_CurrentRuneLocale) return _CurrentRuneLocale; return _DefaultRuneLocale; } #endif /* __NO_TLS || __RUNETYPE_INTERNAL */ What really puzzles me is why the build picks up headers from the running system. Because the utility being built is intended to be run on the running system.
Re: vmspace_fork()
On Tue, Apr 23, 2013 at 10:44:56AM -0700, Vijay Singh wrote: Hackers, what does hitting the following assert in vmspace_fork() imply (sys/vm/vm_map.c, lines 3101-3103)? new_map = &vm2->vm_map; locked = vm_map_trylock(new_map); /* trylock to silence WITNESS */ KASSERT(locked, ("vmspace_fork: lock failed")); I have a hard time reading this HTML. I am hitting the assert on line 3103, and it seems like the assumption is that the trylock will always get the lock? Assuming you are referencing line 3103 of sys/vm/vm_map.c, yes, the try-lock operation must always succeed. The reason is that the new_map, AKA vm2, is freshly allocated and its pointer is not yet published for other threads to see. So nothing could own the lock. The trylock is done to silence the WITNESS warning about the same lock type, as indicated by the comment.
Re: [patch] pipe2
On Sun, Apr 21, 2013 at 11:02:43PM +0200, Jilles Tjoelker wrote: On Sat, Apr 20, 2013 at 12:48:39AM +0200, Jilles Tjoelker wrote: I'm also working on pipe2() (using linuxulator work) and dup3() (patch from Jukka A. Ukkonen). This is an implementation of pipe2. As with the accept4 patch, make sysent needs to be run in sys/kern and sys/compat/freebsd32. As a bonus, new architectures might implement pipe(p) as pipe2(p, 0) avoiding the need for assembler (but behaviour is different if the pointer is invalid). I do not see anything wrong with the patch. That said, I prefer the pipe(2) approach of delegating the access to the usermode memory to usermode, which avoids the need of the calls to kern_close().
Re: MADV_FREE and wait4 EFAULT
On Fri, Apr 19, 2013 at 11:53:22AM -0700, Carl Shapiro wrote: On Fri, Apr 19, 2013 at 5:49 AM, Konstantin Belousov kostik...@gmail.com wrote: It would be of some interest to see the evidence. Certainly. Here are some of the debugging messages that I added to my application. The first line is a print statement that executes after the system call returns. (As an aside, we issue system calls directly and do not link against the C library.) The other 2 lines of interest are output from the dump of /proc/curproc/map that correspond to the status and rusage addresses. wait4 returned EFAULT, status is 0xc20021c0e8 rusage is 0xc2000eaf30 ... 0xc1 0xc20010 245 0 0xfe07cd9ebbc8 rw- 1 0 0x3000 COW NNC default - CH 1001 ... 0xc20020 0xc20030 250 0 0xfe0215e3ed98 rw- 1 0 0x3000 COW NNC default - CH 1001 I realize this might not be satisfying but I am happy to provide any other information you might be interested in. Is your code multithreaded ? Yes. Just observations/speculations: the addresses you showed are not the (usual) addresses for the malloc heap on amd64. It is possible to allocate and map enough shared libraries to make malloc start using these addresses, but more likely your app is using custom mmap() calls, possibly with explicit address hints or MAP_FIXED. Am I right? If yes, and your app is multithreaded, it is possible that another thread performs some manipulations on the address space while the current thread tries to access the range. The test case is required to decide whether the bug is in the application or in the OS. To be clear, I do not have a strong reason to believe the bug is in FreeBSD. My original enquiry was solely into whether we were misusing MADV_FREE pages. However, the wait4 failure is very suspicious because the only two addresses written to are out parameters. In fact, the question is, why are you trying to access the memory after the MADV_FREE? Was it correctly marked as 'still used' to prevent mmaps there?
Note that all this is pure speculation.
Re: MADV_FREE and wait4 EFAULT
On Thu, Apr 18, 2013 at 02:51:43PM -0700, Carl Shapiro wrote: On Wed, Apr 17, 2013 at 1:21 AM, Konstantin Belousov kostik...@gmail.com wrote: Did you ensure with e.g. ktrace and procstat -v that your assumptions hold, i.e. the addresses supplied as wait4(2) arguments are valid? Please provide a minimal test case demonstrating the behaviour. Yes. I instrumented my code to check for a wait4 failure, print the addresses of the status and rusage arguments, and dump the contents of /proc/curproc/map. The addresses of the status and rusage arguments are always in the range of a mapping and marked as read/write. It would be of some interest to see the evidence. Is your code multithreaded ? I have yet to distill the failure to a minimal test case. The test case I do have is the test harness for the Go language. After running for about 45 minutes I can observe a failure. I have been working to produce something smaller and faster. The test case is required to decide whether the bug is in the application or in the OS. MADV_FREE should only result in the possible loss of the previous content of the page, not in the faulting of the page access. From inspection of the code, I do not see how MADV_FREE could result in the memory address becoming invalid. I see. What has led us to believe this might be an issue with page faults is that writing zeroes to the page with memset before passing it to wait4 makes the error go away. There is no difference in the access performed by copyout vs. the access caused by the usermode write. Do you have any advice about how one might go about instrumenting wait4 to generate more information about a failed copyout? Are tools such as dtrace useful in these situations or might it be too invasive? Because of the protracted test cycle and my lack of knowledge in this area, conducting experiments is quite painful at the moment. No, I cannot give advice; I think we should first decide which code to blame.
BTW, you could try enabling the sysctl machdep.uprintf_signal. Oh, you did not specify the architecture and version of the system.
Re: Synchronizing TSC
On Thu, Apr 18, 2013 at 05:16:22PM +0300, Alexander Motin wrote: On 17.04.2013 11:50, Konstantin Belousov wrote: On Wed, Apr 17, 2013 at 09:46:15AM +0300, Alexander Motin wrote: On 17.04.2013 03:25, Jim Harris wrote: On Tue, Apr 16, 2013 at 3:04 PM, Alexander Motin m...@freebsd.org wrote: Hi. Recently I've got 6-core/12-thread system on Sandy Bridge-E Core i7-3930K CPU and was unpleasantly surprised to see that TSCs are not synchronized there. While all 11 APs were synchronized, BSP was far behind them. Since it is single-socket system, I don't know any good reason for such behavior except some BIOS bug. But I've recalled that somewhere was some discussions about possible TSC synchronization. I've implemented patch below that allows to adjust TSC values of BSPs to AP's one on boot using CPU MSRs, hoping that they should not diverge after that: http://people.freebsd.org/~mav/tsc_adj2.patch I don't know very much about all different TSC hardware to predict when it is safe to enable the functionality, but at least on my system being enabled via loader tunable it seems working well. Comments? You may be remembering this thread on r238755 last year: http://lists.freebsd.org/pipermail/svn-src-head/2012-July/038992.html This was a bug fix in the TSC synchronization test code though, not anything for trying to adjust out-of-sync TSCs. I remember that thread, but I think I've seen somebody told somewhere that it could be interesting to implement some MI mechanism. Never mind. The Intel SDM (volume 3, section 17.13 of March 2013 revision) says earlier models can only write to lower 32 bits of IA32_TIME_STAMP_COUNTER, but these models also should not have invariant TSC so they would never even get to your new routine. So your patch seems OK for Intel CPUs, at least as a tunable that is disabled by default. Thanks. My only concern would be why TSC on the BSP started out-of-sync on your system.
Theoretically, BIOS could adjust TSCs in SMM to try to hide SMI code execution from the OS, which could then make them out-of-sync again. Not sure if that's what's happening here, but might be worth a test putting the TSC test code on a periodic timer to see if they ever get out of sync again. I made one more interesting observation: on every reboot the drift between BSP and APs grows proportionally to the previous system power-on time. On first boot it is -3878361036 (just above one second), after a reboot some minutes later it is -1123454492776 (about 6 minutes), after another reboot it is -1853033521804 (about 10 minutes). Unless my adjustment code were active, I would guess that the APs' TSC runs linearly while the BSP's for some reason resets to zero on every reboot. But since I am synchronizing them on each boot, the only possibility I see is that there is some other timer(s) / counter(s) not affected by MSR writes that ticks linearly and reloads the APs' TSC, but for some reason not the BSP's. To me it sounds like a BIOS bug, indeed. Could you verify the content of IA32_TSC_ADJUST on all cores (I believe it is present on E5)? Also, using TSC_ADJUST to correct the skew seems to be preferable, according to the Intel docs. The IA32_TSC_ADJUST register seems not to be present there. At least cpucontrol doesn't want to read it. In the Intel docs I also see it mentioned only in the context of the future Haswell generation. And I don't see the Standard Extended Features line in dmesg. Why do you use cpuid in the assembly sequence? As I understand, you ensure that there is a serialization point, but why do you need it? The idea was to minimize the time distance between the following MSR read and write. But maybe it is not needed; I am not exactly sure about that magic. Well, I do not believe that such a trick is useful. A patch with the cpuid removed and with resync disabled by default (as it is now, AFAIR) would IMO be fine for the commit.
Re: MADV_FREE and wait4 EFAULT
On Tue, Apr 16, 2013 at 02:12:54PM -0700, Carl Shapiro wrote: I am seeing wait4 system calls failing with an EFAULT and I am trying to understand what might be going wrong. An inspection of the wait4 implementation suggests the opportunity for EFAULT is within its invocations of copyout. In my situation, the status and rusage pointer arguments contain addresses of mmaped pages which have been madvised as MADV_FREE. Is it permissible to pass pages which have been madvised MADV_FREE to wait4, or any other system call for that matter? Might there be another opportunity for a wait4 to EFAULT? Did you ensure with e.g. ktrace and procstat -v that your assumptions hold, i.e. the addresses supplied as wait4(2) arguments are valid? Please provide a minimal test case demonstrating the behaviour. MADV_FREE should only result in the possible loss of the previous content of the page, not in the faulting of the page access. From inspection of the code, I do not see how MADV_FREE could result in the memory address becoming invalid.
Re: Synchronizing TSC
On Wed, Apr 17, 2013 at 09:46:15AM +0300, Alexander Motin wrote: On 17.04.2013 03:25, Jim Harris wrote: On Tue, Apr 16, 2013 at 3:04 PM, Alexander Motin m...@freebsd.org wrote: Hi. Recently I've got 6-core/12-thread system on Sandy Bridge-E Core i7-3930K CPU and was unpleasantly surprised to see that TSCs are not synchronized there. While all 11 APs were synchronized, BSP was far behind them. Since it is single-socket system, I don't know any good reason for such behavior except some BIOS bug. But I've recalled that somewhere was some discussions about possible TSC synchronization. I've implemented patch below that allows to adjust TSC values of BSPs to AP's one on boot using CPU MSRs, hoping that they should not diverge after that: http://people.freebsd.org/~mav/tsc_adj2.patch I don't know very much about all different TSC hardware to predict when it is safe to enable the functionality, but at least on my system being enabled via loader tunable it seems working well. Comments? You may be remembering this thread on r238755 last year: http://lists.freebsd.org/pipermail/svn-src-head/2012-July/038992.html This was a bug fix in the TSC synchronization test code though, not anything for trying to adjust out-of-sync TSCs. I remember that thread, but I think I've seen somebody told somewhere that it could be interesting to implement some MI mechanism. Never mind. The Intel SDM (volume 3, section 17.13 of March 2013 revision) says earlier models can only write to lower 32 bits of IA32_TIME_STAMP_COUNTER, but these models also should not have invariant TSC so they would never even get to your new routine. So your patch seems OK for Intel CPUs, at least as a tunable that is disabled by default. Thanks. My only concern would be why TSC on the BSP started out-of-sync on your system.
Theoretically, the BIOS could adjust TSCs in SMM to try to hide SMI code execution from the OS, which could then make them out-of-sync again. Not sure if that's what's happening here, but it might be worth a test putting the TSC test code on a periodic timer to see if they ever get out of sync again. I made one more interesting observation: on every reboot the drift between the BSP and the APs grows proportionally to the previous system power-on time. On first boot it is -3878361036 (just above one second), after a reboot some minutes later it is -1123454492776 (about 6 minutes), after another reboot it is -1853033521804 (about 10 minutes). Were my adjustment code not active, I would guess that the APs' TSC runs linearly while the BSP's for some reason resets to zero on every reboot. But since I am synchronizing them on each boot, the only possibility I see is that there is some other timer(s)/counter(s) not affected by MSR writes that ticks linearly and reloads the APs' TSC, but for some reason does not reload the BSP's. To me it indeed sounds like a BIOS bug. Could you verify the content of IA32_TSC_ADJUST on all cores (I believe it is present on E5)? Also, using TSC_ADJUST to correct the skew seems to be preferable, according to the Intel docs. Why do you use cpuid in the assembly sequence? As I understand it, you ensure that there is a serialization point, but why do you need it? The common knowledge is that for CPUs with invariant TSC, the TSC counter is single-instance and located in the uncore. For single-socket configurations, your patch would be fine. But for multi-socket machines, each package has its own counter, and the counters might drift. As a result, the initial synchronization would still allow eventual de-sync, and this is problematic.
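The cpuid question above comes down to serialization: CPUID completes all prior instructions before RDTSC samples the counter, so readings cannot be reordered by the pipeline. A hedged userspace sketch of that idiom (not the actual kernel patch; the non-x86 branch is a stand-in so the sketch compiles anywhere):

```c
#include <stdint.h>
#include <time.h>

/* Serialized TSC read: CPUID is a serializing instruction, so it
 * drains the pipeline before RDTSC executes. */
static inline uint64_t tsc_read(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t hi, lo;
    __asm__ __volatile__(
        "cpuid\n\t"   /* serialize: retire all earlier instructions */
        "rdtsc"       /* then sample the time-stamp counter         */
        : "=a"(lo), "=d"(hi) : "a"(0) : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback for non-x86 builds of this sketch only. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
#endif
}

/* With a sane (synchronized, invariant) TSC, consecutive serialized
 * reads from one thread must be monotonic; a violation would be the
 * kind of skew discussed in the thread.  Returns 1 if monotonic. */
int tsc_monotonic(int iters)
{
    uint64_t prev = tsc_read();
    while (iters-- > 0) {
        uint64_t now = tsc_read();
        if (now < prev)
            return 0;
        prev = now;
    }
    return 1;
}
```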
Re: devstat overhead VS precision
On Mon, Apr 15, 2013 at 08:42:03PM +0200, Pawel Jakub Dawidek wrote: On a mostly unrelated note, when two threads (T0 and T1) call get*time() on two different cores, but T0 does that a bit earlier, is it possible that T0 can get a later time than T1? Define earlier first. If you have taken sufficient measures to prevent preemption and interruption, e.g. by entering a spinlock before the fragment that calls get*, then no, it is impossible, at least not with any x86 timekeeping hardware we use. On the other hand, if interrupts are allowed, all bets are off.
Re: devstat overhead VS precision
On Mon, Apr 15, 2013 at 09:37:30PM +0200, Pawel Jakub Dawidek wrote: On Mon, Apr 15, 2013 at 10:18:15PM +0300, Konstantin Belousov wrote: On Mon, Apr 15, 2013 at 08:42:03PM +0200, Pawel Jakub Dawidek wrote: On a mostly unrelated note, when two threads (T0 and T1) call get*time() on two different cores, but T0 does that a bit earlier, is it possible that T0 can get a later time than T1? Define earlier first. If you have taken sufficient measures to prevent preemption and interruption, e.g. by entering a spinlock before the fragment that calls get*, then no, it is impossible, at least not with any x86 timekeeping hardware we use. On the other hand, if interrupts are allowed, all bets are off. So if we consider only one thread, it is not possible for it to obtain time t0, be scheduled to a different CPU and obtain t1 where t1 < t0? Yes, I believe this scenario is also not possible. The context switch ensures that the thread's view of the global memory is consistent. At least it is so on x86, and I think it must be the same on all other architectures. Otherwise the compiler-emitted code would not work.
Re: libprocstat(3): retrieve process command line args and environment
On Fri, Mar 29, 2013 at 02:31:57PM +0200, Mikolaj Golub wrote: On Fri, Mar 29, 2013 at 11:22:45AM +0200, Konstantin Belousov wrote: On Thu, Mar 28, 2013 at 11:18:21PM +0200, Mikolaj Golub wrote: On Thu, Mar 28, 2013 at 12:51:34PM +0200, Konstantin Belousov wrote: In the generic ELF 64-bit draft specification I have, the notes sections are specified to consist of entries, each of which is an array of 8-byte words. I think we are right in using the 8-byte alignment. I have the impression many implementations use 4-byte alignment. E.g. in NetBSD, sys/kern/core_elf32.c:

	#define ELFROUNDSIZE	4	/* XXX Should it be sizeof(Elf_Word)? */
	#define elfround(x)	roundup((x), ELFROUNDSIZE)

Note that this is core_elf32. I am concerned with the 64-bit cores. core_elf64.c:

	#define ELFSIZE 64
	#include "core_elf32.c"

Also, the 4-byte alignment is described in the comments in glibc's csu/abi-note.S. Also, we have an inconsistency with imgact_elf.c/parse_notes(), which uses 4-byte alignment:

	note = (const Elf_Note *)((const char *)(note + 1) +
	    roundup2(note->n_namesz, sizeof(Elf32_Addr)) +
	    roundup2(note->n_descsz, sizeof(Elf32_Addr)));

I suppose there were no issues before because, accidentally, the sizes of all notes we had were 8-byte aligned. Indeed, both the ABI and NOINIT notes have sizes which are multiples of 8. Now, when I add new notes, it will break things. I don't have a strong opinion; it will be OK for me to leave 8-byte alignment and fix the issues, I just want to have strong support here :-) Well, while the issue is discussed and decided, you could just make your new notes' sizes be multiples of 8 too. I thought about this too. Then I need to be more careful when extracting stats from notes, because the length returned by procstat_core_get() can be more than the real payload. Ok, I will try this way.
I could add a length to the note header, which currently contains only structsize, so it would become something like:

	struct {
		int structsize;
		int length;
	}

But I am not sure it is worth doing, especially if the forced 8-byte alignment is a temporary measure. No, it is definitely not worth it. I inspected imgact_elf.c:parse_note(), imgact_elf.c:putnote() and rtld.c:digest_notes(). Only putnote() uses 8-byte alignment. Every other OS and our !coredump code assumes 4-byte alignment. Does changing putnote() to align on the 4-byte boundary cause a real change in the core file notes layout?
Re: libprocstat(3): retrieve process command line args and environment
On Sun, Mar 31, 2013 at 06:53:00PM +0300, Mikolaj Golub wrote: On Sun, Mar 31, 2013 at 04:40:47PM +0300, Konstantin Belousov wrote: I inspected imgact_elf.c:parse_note(), imgact_elf.c:putnote() and rtld.c:digest_notes(). Only putnote() uses 8-byte alignment. Every other OS and our !coredump code assumes 4-byte alignment. Thanks! Does changing putnote() to align on the 4-byte boundary cause a real change in the core file notes layout? Currently, we store only 4 types of notes in a core file:

	#define NT_PRSTATUS	1	/* Process status. */
	#define NT_FPREGSET	2	/* Floating point registers. */
	#define NT_PRPSINFO	3	/* Process state info. */
	#define NT_THRMISC	7	/* Thread miscellaneous info. */

I checked the sizes of the structures inserted into the notes, and on amd64 they are all multiples of 8:

	(kgdb) p sizeof(prpsinfo_t) % 8
	$1 = 0
	(kgdb) p sizeof(prstatus_t) % 8
	$2 = 0
	(kgdb) p sizeof(prfpregset_t) % 8
	$3 = 0
	(kgdb) p sizeof(thrmisc_t) % 8
	$4 = 0

so they are both 4-byte and 8-byte aligned. Well, FreeBSD supports some more 64-bit architectures besides amd64. At least on powerpc64, I get:

	prpsinfo	120	0
	prstatus_t	344	0
	prfpregset_t	264	0
	thrmisc_t	24	0

The second column is sizeof(), the third is sizeof() % 8. This is in fact not much of a surprise, since all ABIs define the alignment of a structure as the alignment of its most demanding member. And, because 64-bit architectures have 8-byte registers, it is indeed expected that size % 8 == 0. I believe that the patch below will not change the current core file notes layout, will make things consistent in our tree, and will make adding my procstat notes easier if I use 4-byte alignment. Are you ok if I commit it before introducing my changes? Yes, I believe this is the right thing to do.
Index: sys/kern/imgact_elf.c
===================================================================
--- sys/kern/imgact_elf.c	(revision 248706)
+++ sys/kern/imgact_elf.c	(working copy)
@@ -1538,10 +1538,10 @@ __elfN(putnote)(void *dst, size_t *off, const char
 	*off += sizeof note;
 	if (dst != NULL)
 		bcopy(name, (char *)dst + *off, note.n_namesz);
-	*off += roundup2(note.n_namesz, sizeof(Elf_Size));
+	*off += roundup2(note.n_namesz, sizeof(Elf32_Size));
 	if (dst != NULL)
 		bcopy(desc, (char *)dst + *off, note.n_descsz);
-	*off += roundup2(note.n_descsz, sizeof(Elf_Size));
+	*off += roundup2(note.n_descsz, sizeof(Elf32_Size));
 }
 
 static boolean_t

Also, shouldn't we then update the following comment in sys/elf_common.h?

/*
 * Note header.  The .note section contains an array of notes.  Each
 * begins with this header, aligned to a word boundary.  Immediately
 * following the note header is n_namesz bytes of name, padded to the
 * next word boundary.  Then comes n_descsz bytes of descriptor, again
 * padded to a word boundary.  The values of n_namesz and n_descsz do
 * not include the padding.
 */

-- Mikolaj Golub
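What is at stake in the 4-byte vs 8-byte choice can be made concrete by computing the size of one note entry under each alignment. This is a hedged userspace sketch using FreeBSD's roundup2() idiom; note_entry_size() is a hypothetical helper, and the example sizes (name "FreeBSD\0" = 8 bytes, a 108-byte descriptor) are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* FreeBSD's roundup2(): round x up to a power-of-two boundary y. */
#define roundup2(x, y)  (((x) + ((y) - 1)) & ~((uintptr_t)(y) - 1))

/* Total bytes one ELF note entry occupies: the three 4-byte header
 * words (n_namesz, n_descsz, n_type) plus name and descriptor, each
 * padded to `align`.  With align == 4, as the patched putnote() uses,
 * the layout matches parse_notes() and other OSes; align == 8 only
 * happened to agree because existing notes' sizes were multiples of 8. */
size_t note_entry_size(size_t namesz, size_t descsz, size_t align)
{
    return 3 * sizeof(uint32_t)
        + roundup2(namesz, align)
        + roundup2(descsz, align);
}
```

For a descriptor whose size is not a multiple of 8, the two alignments diverge — exactly the breakage the new procstat notes would have triggered.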
Re: libprocstat(3): retrieve process command line args and environment
On Thu, Mar 28, 2013 at 11:18:21PM +0200, Mikolaj Golub wrote: On Thu, Mar 28, 2013 at 12:51:34PM +0200, Konstantin Belousov wrote: In the generic ELF 64-bit draft specification I have, the notes sections are specified to consist of entries, each of which is an array of 8-byte words. I think we are right in using the 8-byte alignment. I have the impression many implementations use 4-byte alignment. E.g. in NetBSD, sys/kern/core_elf32.c:

	#define ELFROUNDSIZE	4	/* XXX Should it be sizeof(Elf_Word)? */
	#define elfround(x)	roundup((x), ELFROUNDSIZE)

Note that this is core_elf32. I am concerned with the 64-bit cores. Also, we have an inconsistency with imgact_elf.c/parse_notes(), which uses 4-byte alignment:

	note = (const Elf_Note *)((const char *)(note + 1) +
	    roundup2(note->n_namesz, sizeof(Elf32_Addr)) +
	    roundup2(note->n_descsz, sizeof(Elf32_Addr)));

I suppose there were no issues before because, accidentally, the sizes of all notes we had were 8-byte aligned. Indeed, both the ABI and NOINIT notes have sizes which are multiples of 8. Now, when I add new notes, it will break things. I don't have a strong opinion; it will be OK for me to leave 8-byte alignment and fix the issues, I just want to have strong support here :-) Well, while the issue is discussed and decided, you could just make your new notes' sizes be multiples of 8 too. BTW, looking at the NetBSD code I see they set p_align in the note segment to ELFROUNDSIZE:

	/* Write out the PT_NOTE header. */
	ws.psections->p_type = PT_NOTE;
	ws.psections->p_offset = notestart;
	ws.psections->p_vaddr = 0;
	ws.psections->p_paddr = 0;
	ws.psections->p_filesz = notesize;
	ws.psections->p_memsz = 0;
	ws.psections->p_flags = PF_R;
	ws.psections->p_align = ELFROUNDSIZE;

while we set it to 0:

	/* The note segment. */
	phdr->p_type = PT_NOTE;
	phdr->p_offset = hdrsize;
	phdr->p_vaddr = 0;
	phdr->p_paddr = 0;
	phdr->p_filesz = notesz;
	phdr->p_memsz = 0;
	phdr->p_flags = 0;
	phdr->p_align = 0;

You mean, for the core dumps? Shouldn't we set it to the alignment size too?
Note also that they set the "segment is readable" flag (PF_R). I think both changes are fine. I skimmed over the usermode parts of the patch. One thing I did not like is the mishandling of the read() return values. If there is a short read, the errno value is meaningless, but warn() would still append it to the message. I suggest explicitly distinguishing the -1 and >= 0 returns from reads.
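The read() review comment boils down to three distinct outcomes that must not be conflated when reporting errors. A hedged sketch of such a wrapper (read_exact() is a hypothetical helper, not code from the patch under review):

```c
#include <errno.h>
#include <unistd.h>

/* Outcomes of reading exactly `len` bytes. */
enum { RD_OK = 0, RD_SHORT = 1, RD_ERR = 2 };

/* Distinguishes the cases the review asks for: -1 with errno set
 * (a real error, warn() may report errno), a short read / early EOF
 * (errno is meaningless and must NOT be reported), and success. */
int read_exact(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;      /* interrupted: just retry */
            return RD_ERR;     /* errno is meaningful here */
        }
        if (n == 0)
            return RD_SHORT;   /* EOF before len bytes; no errno */
        p += n;
        len -= (size_t)n;
    }
    return RD_OK;
}
```

A caller would then use warn() only for RD_ERR and warnx() (no errno) for RD_SHORT.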
Re: libprocstat(3): retrieve process command line args and environment
On Sun, Mar 24, 2013 at 05:54:28PM +0200, Mikolaj Golub wrote: Here is an updated patch set, which I think includes all kib's suggestions. It also introduces procstat groups, umask, rlimit, and osrel notes. http://people.freebsd.org/~trociny/procstat_core.3.patch The sbuf section interface looks like below:

	/* start a section */
	sbuf_start_section(sb, NULL);
	/* write something to sbuf */
	...
	/* start a subsection (need to save the current section length) */
	sbuf_start_section(sb, old_len);
	/* write something to sbuf */
	...
	/* end a subsection, padding to 4 bytes with '\0' bytes (need to
	   provide the previously saved section length) */
	sbuf_end_section(sb, old_len, 4, 0);
	...
	/* align the whole section to page size */
	sbuf_end_section(sb, -1, PAGE_SIZE, 0);

This looks fine. Open issues/questions: 1) I would also like to make libprocstat(3) extract the environment, args, and auxv from the core. It looks like I don't need to store envv and argv in notes, as they are already present in the core. But I think I need to know psstrings to find them. Would it be ok if I add auxv and psstrings notes, and extract envv and argv from a program section in the core? Yes, I agree that not doing this in the kernel at the time of the core dump is preferred. As usual, both args/env and the aux vector could be overridden by userspace, so reading them is only a best-effort operation. 2) I started the NT_PROCSTAT constants from the first unused number in elf_common.h, i.e. 8.
But in this case they conflict with those available on other systems, contrib/binutils/include/elf/common.h:

	#define NT_PSTATUS	10	/* Has a struct pstatus */
	#define NT_FPREGS	12	/* Has a struct fpregset */
	#define NT_PSINFO	13	/* Has a struct psinfo */
	#define NT_LWPSTATUS	16	/* Has a struct lwpstatus_t */
	#define NT_LWPSINFO	17	/* Has a struct lwpsinfo_t */
	#define NT_WIN32PSTATUS	18	/* Has a struct win32_pstatus */

Although the note name "FreeBSD" should have disambiguated this, readelf appears to ignore it, and its output for my core on i386 looks confusing:

	Owner		Data size	Description
	FreeBSD		0x006c		NT_PRPSINFO (prpsinfo structure)
	FreeBSD		0x0068		NT_PRSTATUS (prstatus structure)
	FreeBSD		0x00b0		NT_FPREGSET (floating point registers)
	FreeBSD		0x0018		NT_THRMISC (thrmisc structure)
	FreeBSD		0x0304		Unknown note type: (0x0008)
	FreeBSD		0x0a6c		Unknown note type: (0x0009)
	FreeBSD		0x08d4		NT_PSTATUS (pstatus structure)
	FreeBSD		0x000c		Unknown note type: (0x000b)
	FreeBSD		0x0006		NT_FPREGS (floating point registers)
	FreeBSD		0x00d4		NT_PSINFO (psinfo structure)
	FreeBSD		0x0008		Unknown note type: (0x000e)

Should I use some other range for the NT_PROCSTAT notes? I think the only possible solution is to post on the binutils list and fix readelf's ignoring of the note vendor. 3) Following our current code I align notes to sizeof(Elf_Size), which is 4 on i386 and 8 on amd64. But I have an issue reading the notes with readelf on amd64, which always uses 4-byte alignment, contrib/binutils/binutils/readelf.c:

	next = (Elf_External_Note *)(inote.descdata +
	    align_power (inote.descsz, 2));

where

	#define align_power(addr, align) \
	    (((addr) + ((bfd_vma) 1 << (align)) - 1) & ((bfd_vma) -1 << (align)))

Should I change the alignment to 4 bytes? In the generic ELF 64-bit draft specification I have, the notes sections are specified to consist of entries, each of which is an array of 8-byte words. I think we are right in using the 8-byte alignment.
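The align_power macro quoted from readelf is worth pinning down, since it explains why readelf's fixed 4-byte rounding (align = 2, i.e. 1<<2 bytes) disagrees with FreeBSD's sizeof(Elf_Size) alignment on amd64. A small self-contained rendering (bfd_vma replaced with uint64_t for the sketch):

```c
#include <stdint.h>

/* binutils' align_power(addr, align): round addr up to the next
 * multiple of 1 << align.  readelf calls it with align == 2, so note
 * descriptors always advance by 4-byte-rounded sizes. */
#define align_power(addr, align) \
    (((addr) + ((uint64_t)1 << (align)) - 1) & ((uint64_t)-1 << (align)))

/* Offset to the next note entry after a descriptor of descsz bytes,
 * as readelf computes it. */
uint64_t note_desc_next(uint64_t descsz)
{
    return align_power(descsz, 2);
}
```

A descriptor like the 0x304-byte procstat note above rounds to 0x304 (already a multiple of 4) under readelf's rule, but to 0x308 under 8-byte alignment — hence the mismatch when reading back amd64 cores.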
4) In libprocstat I added new functions and placed them under the already existing FBSD_1.3 version section in Symbol.map. Shouldn't I bump the version? Won't I need any additional care if I want to MFC the code to stable/9 and maybe 8? Version of what? MFC does not require any additional actions; FBSD_1.3 is a valid version namespace in 9 and 8.
Re: syscalls per process ?
On Tue, Mar 26, 2013 at 11:58:02AM +1000, Paul Koch wrote: Hi, At the start of all our programs, we set up a SIGALRM signal handler to retrieve rusage stats each second and log this info for analysis. It is useful for long-running programs. One of the things we would really, really like to get is the number of system calls the process has performed. Is there a simple way for a process to retrieve the number of syscalls it has performed? Pity it's not available via getrusage(). We don't want to run an external program (e.g. truss/dtrace) on each program. AFAIR, per-thread syscalls are not accounted. The accounting was performed for some short time, but it was only used as a workaround for some NFS client issue, and since an alternate approach was adopted, the td_syscalls field was removed. Only a global system count of the syscalls is maintained.
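The sampler described above reduces to periodic getrusage(2) calls. Since struct rusage carries no syscall count (the point of the thread), the closest portable proxies are the context-switch counters. A hedged sketch of one sample (sample_ctx_switches() is a hypothetical helper name):

```c
#include <sys/resource.h>
#include <sys/time.h>

/* One sample of the kind the per-second SIGALRM handler would log.
 * struct rusage has no syscall counter; ru_nvcsw/ru_nivcsw (voluntary
 * and involuntary context switches) are the nearest available fields.
 * Returns the combined switch count, or -1 on error. */
long sample_ctx_switches(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;
    return ru.ru_nvcsw + ru.ru_nivcsw;
}
```

In a real program this would run from the SIGALRM handler (getrusage is async-signal-safe on FreeBSD) and log deltas between consecutive samples.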
Re: [patch] SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC
On Sun, Mar 17, 2013 at 10:23:53PM +0100, Jilles Tjoelker wrote: Here are some more modifications to allow creating file descriptors with close-on-exec set. As in linux/glibc, SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed into socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec. The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_*. I do not pass the SOCK_* flags to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. I expect audit to cope with the new flags. For MSG_CMSG_CLOEXEC, I had to change unp_externalize to take a flags argument. This looks fine to me. The only note I have, which is not directly related to your patch, is the recvmsg(2) behaviour when an undefined flag is passed. The syscall silently ignores the flags. I think this is quite wrong, and it would cause interesting (security) implications if a program using MSG_CMSG_CLOEXEC is run on an older kernel which does not interpret the flag. Maybe we should start returning EINVAL for unknown flags, despite SUSv4 not specifying the condition?
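The point of the patch is atomicity: socket() followed by fcntl(F_SETFD) leaves a window in which a concurrent fork+exec can leak the descriptor. A hedged sketch of the atomic form, with a check that the flag actually landed (socket_cloexec() is a hypothetical wrapper; SOCK_CLOEXEC works the same way on Linux and on FreeBSD 10+):

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a socket with FD_CLOEXEC set atomically in the type
 * argument, instead of the racy socket() + fcntl() pair.
 * Returns the socket, or -1 on failure. */
int socket_cloexec(int domain, int type)
{
    int s = socket(domain, type | SOCK_CLOEXEC, 0);
    if (s == -1)
        return -1;
    /* Verify the close-on-exec flag is present on the descriptor. */
    int fl = fcntl(s, F_GETFD);
    if (fl == -1 || !(fl & FD_CLOEXEC)) {
        close(s);
        return -1;
    }
    return s;
}
```

The verification step also illustrates the kernel-compatibility worry raised above: on a kernel that silently ignores unknown type bits, the F_GETFD check is the only way to notice the flag was dropped.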
Re: libprocstat(3): retrieve process command line args and environment
On Sun, Mar 17, 2013 at 12:35:20AM +0200, Mikolaj Golub wrote: On Sat, Mar 16, 2013 at 09:16:05PM +0200, Konstantin Belousov wrote: IMO sbuf_pad() should be moved to subr_sbuf.c. I find the KPI of sbuf_pad() wrong. You have two separate numbers, the amount to pad to and the current amount. A natural interface would take the two numbers separately instead of the difference. Also, could sbuf_pad() deduce the current amount on its own? That would be the best. Hm, I am not sure about this. So are you proposing, instead of something like

	sbuf_pad(sb, roundup(x, y) - x, 0);

to have

	sbuf_pad(sb, x, roundup(x, y), 0);

? Although it is a little easier to write, it looks less intuitive to me. E.g. I have trouble documenting and explaining this. I can't refer to x as the current position in the sbuf, because it might not be. It is just some position, roundup(x, y) is another one, and only their difference makes sense for sbuf_pad, so why not just provide this difference? So sbuf_pad(sb, from, to, c) looks less intuitive to me than sbuf_pad(sb, len, c). A KPI that would be natural for my case:

	/* start a section that is going to be aligned to
	   sizeof(Elf_Size), using byte '0' for padding */
	sbuf_padded_start(sb, sizeof(Elf_Size), 0);
	/* write something to sbuf */
	sbuf_bcat(sb, data, len);
	/* align the sbuf section */
	sbuf_pad(sb);

This might look a little awkward and would add some overhead for the normal case though... This looks fine, in fact. You might want to call them sbuf_start_section() and sbuf_end_section()? In register_note(), put spaces around '|' in the malloc line. It seems that you did not move the 'ABI hack' for the ENOMEM situation into the rewritten sysctl_kern_proc_filedesc(). Ah, this is a thing I wanted to discuss but forgot! As I understand it, the idea of the 'ABI hack' is: if the output buffer is smaller than the size of the data we have, truncate our data to the last successfully written kinfo_file structure and return without error.
In my code I do reset ENOMEM to 0 (see sysctl_kern_proc_filedesc), but I don't see how, using the sbuf interface, I can truncate req->oldidx to the last successfully written file: sbuf_bcat() (to the internal buffer) may be successful, and the error might appear only later, when draining. I suspect it will break things if I return with a partial kinfo_file, but I haven't come up with a solution yet... All you need is to reset req->oldidx. This could be done outside the sbuf interface, in the top-level function implementing the sysctl? What you propose in the follow-up message should work too; I do not have any preference. Please commit the changes to use pget() in the sysctl handlers separately. The indents after the else clauses in kern_proc_out() are wrong. Do you mean the indents after the '#ifdef COMPAT_FREEBSD32' block? I did it that way so that if the COMPAT_FREEBSD32 sections were removed from the code, the indentation would be correct. I saw this practice through the code and used it myself before. The sections are not going to be removed. IMHO code should be formatted as if the preprocessor directive lines were not present. Could you point out an example of existing code consistent with your indentation? Since you are moving KERN_PROC_ZOMBMASK out of kern_proc.c, a comment is needed to describe its usage. I would find it very confusing if grep revealed no line of code that sets the flag. In the comment for sbuf_drain_core_output(), s/drainig/draining/, s/sefely/safely/ and s/hold/call us with the process lock held/. I do not see much sense in the kstack note. The core is dumped when the threads have been moved into a safe state in the kernel, so you cannot catch 'living' stack backtraces for the kernel side of things. I was afraid of that after I had tried it on real dumps :-( Ok, will remove the kstack note. On the other hand, things like umask, resources and osrel might be of interest for post-mortem analysis. This is on my TODO list. I did not look at the usermode changes.
Thanks for your suggestions! Will do them. -- Mikolaj Golub
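The section interface agreed on in this thread — record the current length at "start", pad everything written since then at "end" — can be modeled outside the kernel. This is a hedged sketch over a plain growable buffer; the real sbuf_start_section()/sbuf_end_section() API differs in detail (error handling, nesting via the saved old length), this only captures the idea:

```c
#include <stdlib.h>
#include <string.h>

/* Minimal growable byte buffer standing in for an sbuf. */
struct buf { char *data; size_t len, cap; };

void buf_put(struct buf *b, const void *p, size_t n)
{
    if (b->len + n > b->cap) {
        b->cap = (b->len + n) * 2;
        b->data = realloc(b->data, b->cap);   /* sketch: no OOM check */
    }
    memcpy(b->data + b->len, p, n);
    b->len += n;
}

/* "Start a section": remember where it begins (the saved old_len). */
size_t section_start(const struct buf *b)
{
    return b->len;
}

/* "End a section": pad the bytes written since old_len up to `align`
 * with fill byte c; returns the padded section length. */
size_t section_end(struct buf *b, size_t old_len, size_t align, char c)
{
    size_t sec = b->len - old_len;
    while (sec % align != 0) {
        buf_put(b, &c, 1);
        sec++;
    }
    return sec;
}
```

Nesting falls out naturally: an inner section saves its own old_len, pads itself to 4 bytes, and the outer section later pads the whole thing to, say, page size — the usage pattern shown earlier in the thread.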
Re: Linking produces unusable executable
On Sun, Mar 10, 2013 at 09:00:02PM -0400, Ilya Kaliman wrote: Hello, hackers! I have a strange problem with a big executable. I have a piece of scientific software and some C++ module in it with a LOT of template code. When compiled, this module produces a 450 MB archive file (w/o debugging symbols). Now, when I compile the program without this module, everything works perfectly. With this module turned on, the linker produces an executable (around 180 MB in size) without any errors or warnings. But when I try to start the final program, zsh just says: abort. ldd on this exe says: ./a.out: signal 6. I watched the memory consumption during linking and it doesn't seem to exhaust all available memory (the linker seems to stop allocating at around 2 GB). I've also tried to enable --no-keep-memory for ld with no luck - linking still produces no errors but the resulting executable is unusable. I've tried it on 9.1 and 10-CURRENT with both gcc/g++/ld from the base system and from ports (gcc 4.7.3, binutils 2.23.1) and with clang. I've tried to build some of the bigger ports like chromium (just in case): all work fine. Everything works on Linux though (with the same gcc/ld). With debugging symbols the exe is around 1 GB, without them it's around 200 MB. It works fine in every case with different optimization levels. Any ideas how to solve this? For a start, it would be nice to provide some useful information together with, or even instead of, the long story. What is the architecture? Show at least the output of size(1) on the final binary. Show the exact shell message on the attempt to run the binary. Show the ktrace/kdump of the start attempt. As a guess, look at the sysctl kern.maxtsiz and compare it to the text size reported by size(1). The same for the initialized data segment and kern.maxdsiz.
Re: clang generated code sometimes confuses fbt
On Sat, Mar 02, 2013 at 09:23:15PM +0100, Dimitry Andric wrote: On 2013-03-02 18:52, Andriy Gapon wrote: on 02/03/2013 19:35 Andriy Gapon said the following: Now, I am not quite sure why ctfconvert skips bpobj_iterate_impl in the clang-generated code. Seems like some sort of a bug in ctfconvert. It seems that gcc and clang put different names in the symbol of type FILE. clang:

	readelf -a -W /usr/obj/usr/src/sys/TRANT/bpobj.o | fgrep -w FILE
	     1: 0     0 FILE    LOCAL  DEFAULT  ABS /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/bpobj.c

gcc:

	readelf -a -W /usr/obj/usr/src/sys/ODYSSEY/bpobj.o | fgrep -w FILE
	     1: 0     0 FILE    LOCAL  DEFAULT  ABS bpobj.c

ctfconvert seems to compare this value with bpobj.c, and so in the clang case it doesn't recognize the static symbols. Does my analysis seem reasonable? Have you verified that ctfconvert does the right thing if you modify the FILE symbol to have just the filename? Indeed, clang puts the original filename from the command line in the .file directive, while gcc explicitly removes any directory names; see contrib/gcc/toplev.c, around line 680:

	void
	output_file_directive (FILE *asm_file, const char *input_name)
	{
	  int len;
	  const char *na;

	  if (input_name == NULL)
	    input_name = "stdin";
	  len = strlen (input_name);
	  na = input_name + len;

	  /* NA gets INPUT_NAME sans directory names.  */
	  while (na > input_name)
	    {
	      if (IS_DIR_SEPARATOR (na[-1]))
	        break;
	      na--;
	    }
	  ...

That "NA gets INPUT_NAME sans directory names" comment was inserted by rms in r279. :-) So I guess this is the way gcc has done it from the start, but there is no explanation as to why rms chose to remove those directory names. I do not see the problem, except maybe for having reproducible builds? It seems that at least gdb also depends on stripping the path for the stabs (which is not dwarf) debugging format interpretation. On a FreeBSD system, do 'info', select 'stabs' -> Stab Sections -> ELF Linker Relocation. The last paragraph documents the gdb requirements.
So it seems that a stripped path in the STT_FILE symbol is the common expectation.
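The gcc loop quoted above — walking backwards from the end of the path until a directory separator — is short enough to lift out and exercise directly. A hedged sketch (strip_dirs() is a hypothetical name; IS_DIR_SEPARATOR is reduced to '/' for POSIX):

```c
#include <string.h>

/* The core of gcc's output_file_directive: return a pointer to the
 * basename component of input_name, i.e. the path sans directories.
 * This is what ends up in gcc's .file directive and thus in the
 * STT_FILE symbol that ctfconvert compares against. */
const char *strip_dirs(const char *input_name)
{
    const char *na = input_name + strlen(input_name);
    while (na > input_name) {
        if (na[-1] == '/')      /* IS_DIR_SEPARATOR on POSIX */
            break;
        na--;
    }
    return na;
}
```

Applied to the clang-emitted symbol from the readelf output above, this yields exactly the gcc-style name ctfconvert expects.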
Re: [patch] statfs does not detect -t nullfs -o union as a union mount
On Sat, Mar 02, 2013 at 01:00:33AM +0100, Jilles Tjoelker wrote: On Thu, Feb 28, 2013 at 12:21:44AM +0200, Konstantin Belousov wrote: On Wed, Feb 27, 2013 at 10:31:42PM +0100, Jilles Tjoelker wrote: While testing recent changes to opendir(), I noticed that fstatfs() does not return the MNT_UNION flag for a -t nullfs -o union mount. As a result, opendir()/readdir() return files that exist in both the top and bottom directories twice (at least . and ..). Other -o union mounts and -t unionfs mounts work correctly in this regard. The patch below passes through just the MNT_UNION flag of the nullfs mount itself. Perhaps more flags should be passed through.

commit fce32a779af4eb977c9b96feb6e4f811d89f2881
Author: Jilles Tjoelker jil...@stack.nl
Date:   Sat Feb 23 22:22:39 2013 +0100

    nullfs: Preserve the MNT_UNION flag of the nullfs mount itself.
    
    This is needed so that opendir() can properly detect a union mount
    like mount -t nullfs -o union dir1 dir2.

diff --git a/sys/fs/nullfs/null_vfsops.c b/sys/fs/nullfs/null_vfsops.c
index 3724e0a..ff06f57 100644
--- a/sys/fs/nullfs/null_vfsops.c
+++ b/sys/fs/nullfs/null_vfsops.c
@@ -313,7 +313,7 @@ nullfs_statfs(mp, sbp)
 
 	/* now copy across the interesting information and fake the rest */
 	sbp->f_type = mstat.f_type;
-	sbp->f_flags = mstat.f_flags;
+	sbp->f_flags = (sbp->f_flags & MNT_UNION) | mstat.f_flags;
 	sbp->f_bsize = mstat.f_bsize;
 	sbp->f_iosize = mstat.f_iosize;
 	sbp->f_blocks = mstat.f_blocks;

Would it make sense to preserve more flags from the upper mount? I see a use for MNT_NOEXEC as well, at least. Yes, preserving MNT_NOEXEC will make -t nullfs -o noexec work better, in particular rtld's check for libraries loaded via environment variables. In the same way MNT_RDONLY, MNT_NOSUID and MNT_NOSYMFOLLOW could be preserved. On the other hand, MNT_ROOTFS should probably not be passed through from the underlying filesystem.
This would give

	sbp->f_flags = (sbp->f_flags & (MNT_RDONLY | MNT_NOEXEC |
	    MNT_NOSUID | MNT_UNION | MNT_NOSYMFOLLOW)) |
	    (mstat.f_flags & ~MNT_ROOTFS);

I think this is fine.
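The masking expression agreed on above can be checked in isolation. A hedged sketch with the flag values as defined in FreeBSD's sys/mount.h at the time (null_statfs_flags() is a hypothetical helper; the real code manipulates sbp->f_flags in place):

```c
#include <stdint.h>

/* MNT_* bits as in FreeBSD's <sys/mount.h>. */
#define MNT_RDONLY       0x00000001u
#define MNT_NOEXEC       0x00000004u
#define MNT_NOSUID       0x00000008u
#define MNT_UNION        0x00000020u
#define MNT_NOSYMFOLLOW  0x00000400u
#define MNT_ROOTFS       0x00004000u

/* Keep the nullfs mount's own restriction/union bits, take everything
 * else from the lower mount, and never let the lower mount's
 * MNT_ROOTFS leak into the nullfs statfs result. */
uint32_t null_statfs_flags(uint32_t upper, uint32_t lower)
{
    return (upper & (MNT_RDONLY | MNT_NOEXEC | MNT_NOSUID |
        MNT_UNION | MNT_NOSYMFOLLOW)) | (lower & ~MNT_ROOTFS);
}
```

This makes both fixes from the thread visible: the union flag survives (so opendir() deduplicates entries), and mounting the root filesystem via nullfs no longer claims MNT_ROOTFS.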
Re: [patch] statfs does not detect -t nullfs -o union as a union mount
On Wed, Feb 27, 2013 at 10:31:42PM +0100, Jilles Tjoelker wrote: While testing recent changes to opendir(), I noticed that fstatfs() does not return the MNT_UNION flag for a -t nullfs -o union mount. As a result, opendir()/readdir() return files that exist in both the top and bottom directories twice (at least . and ..). Other -o union mounts and -t unionfs mounts work correctly in this regard. The patch below passes through just the MNT_UNION flag of the nullfs mount itself. Perhaps more flags should be passed through.

commit fce32a779af4eb977c9b96feb6e4f811d89f2881
Author: Jilles Tjoelker jil...@stack.nl
Date:   Sat Feb 23 22:22:39 2013 +0100

    nullfs: Preserve the MNT_UNION flag of the nullfs mount itself.
    
    This is needed so that opendir() can properly detect a union mount
    like mount -t nullfs -o union dir1 dir2.

diff --git a/sys/fs/nullfs/null_vfsops.c b/sys/fs/nullfs/null_vfsops.c
index 3724e0a..ff06f57 100644
--- a/sys/fs/nullfs/null_vfsops.c
+++ b/sys/fs/nullfs/null_vfsops.c
@@ -313,7 +313,7 @@ nullfs_statfs(mp, sbp)
 
 	/* now copy across the interesting information and fake the rest */
 	sbp->f_type = mstat.f_type;
-	sbp->f_flags = mstat.f_flags;
+	sbp->f_flags = (sbp->f_flags & MNT_UNION) | mstat.f_flags;
 	sbp->f_bsize = mstat.f_bsize;
 	sbp->f_iosize = mstat.f_iosize;
 	sbp->f_blocks = mstat.f_blocks;

Would it make sense to preserve more flags from the upper mount? I see a use for MNT_NOEXEC as well, at least.
Re: [patch] Wine DLL base address patches
On Tue, Feb 26, 2013 at 10:52:15PM +0200, Damjan Jovanovic wrote: On Fri, Feb 22, 2013 at 5:19 AM, Damjan Jovanovic damjan@gmail.com wrote: On Thu, Feb 21, 2013 at 5:44 PM, Konstantin Belousov kostik...@gmail.com wrote: On Thu, Feb 21, 2013 at 12:57:45AM +0200, Damjan Jovanovic wrote: On Wed, Feb 20, 2013 at 10:51 PM, Tijl Coosemans t...@coosemans.org wrote: On 20-02-2013 16:48, Konstantin Belousov wrote: On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote: Hi Wine needs some of its libraries to be loaded at specific base addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently lacks. I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that loads libraries at their preferred base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port of Prelink to FreeBSD which Wine uses to set base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-), the changed dynamic loader doesn't show any problems in a few days of testing, and prelink works with the --reloc-only option as used by Wine. Please review/test/comment/commit. Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings which could exist at the specified address. I've simplified the rtld patch to a single line. The second patch makes Wine use -Ttext-segment linker flag instead of prelink. This requires binutils from ports, but it's easier than porting prelink. All of that occurred to me as well. The problem with that one-line rtld patch is that loading an application will now fail if any of its libraries cannot be loaded at their requested address. But this is intended behaviour. Also, the default virtaddr base for the shared libraries is 0, so the existing binaries should be not affected. In that case, and since failing to load a library only causes the process to exit when starting up and not when it calls dlopen(), I approve. The problem with -Ttext-segment (and isn't it just -Ttext?) 
is that it doesn't seem to work: the base_vaddr seen by rtld will remain 0, and the address listed in /proc/.../map is different from what it should be. Also run readelf -l on a library compiled that way and compare with the output of one run through prelink --reloc-only; you'll see the lowest VirtAddr and PhysAddr in the LOAD headers change only with prelink. I really ported prelink because there was no other choice. The -Ttext-segment option does work. As indicated by Tijl, you need recent binutils. I just verified that ld 2.32.1 obeys -Ttext-segment. You can also take a look at the default linker script to see how -Ttext-segment is used; look for SEGMENT_START(text-segment). My apologies: I confused -Ttext, which is documented but doesn't work, with -Ttext-segment, which is undocumented in FreeBSD 9.1 and might work. I would test it further, but -CURRENT doesn't installworld (ERROR: Required auditdistd user is missing, see /usr/src/UPDATING.) and I am away until next week. Prelink is now in Ports. What I'd recommend is checking if the binaries are the same, and if not, doing a diff between the readelf -a outputs of the prelinked binary vs the -Ttext-segmented binary. Also run this a few times and make sure the address is what's expected:

	#include <windows.h>
	#include <stdio.h>

	int main(int argc, char **argv)
	{
	    printf("%p\n", LoadLibrary("KERNEL32"));
	    return 0;
	}

	mingw32-gcc hello.c -o hello.exe
	wine hello.exe

With binutils 2.23.1 (in ports), comparing the output of ld -Ttext-segment=0x7b80 and prelink --reloc-only 0x7b80 using diffs of readelf -a outputs gives this:

	-    11: 7b80  0 OBJECT  LOCAL  DEFAULT  6 _GLOBAL_OFFSET_TABLE_
	+    11: 0     0 OBJECT  LOCAL  DEFAULT  6 _GLOBAL_OFFSET_TABLE_

in other words, prelink also shifts the global offset table to the requested base address, while ld does not. I don't think this matters, since it's only ELF segments that get loaded - sections are irrelevant. I suspect that it is some sort of bug in ld.
On the other hand, the _GLOBAL_OFFSET_TABLE_ symbol should not be used for real relocations, because both i386 and amd64 define specific relocations which allow referencing the beginning of the GOT. The symbol is mostly a symbolic way to generate the relocations. So indeed, this should be fine. objdump -s finds no differences. So I am happy with all of Tijl's patches, please commit them. I expect Tijl to do it himself.
Re: Building a program to be used only during make buildkernel
On Mon, Feb 25, 2013 at 09:34:46AM +0100, Jean-Sébastien Pédron wrote: I haven't tried to build the module into the kernel yet. IMO this is not a target for now. KMS is currently not integrated with syscons at all, so starting a KMS driver causes a blank screen and a lost console. The vidconsole is functional, but it is of no use to a user who cannot see the output. This is the state of the i915 driver, which does not provide the 'files' glue for static linking into the kernel, for the reason described. Solving the integration issues with syscons is rather pointless, both due to the baroque interface of syscons and due to its supposed replacement, available in base/user/ed/newcons.
Re: Building a program to be used only during make buildkernel
On Fri, Feb 22, 2013 at 02:36:42PM +0100, Jean-Sébastien Pédron wrote: Hello, During the build of the radeon DRM driver in Linux, a C program is first built to generate headers included by the driver. The program is not meant to be installed. In FreeBSD, how would one build and use such a program during the build of a kernel module? And what if the driver is built into the kernel instead? Look at the genassym program, used by the kernel and several ABI modules for exactly this purpose.
Re: [patch] Wine DLL base address patches
On Thu, Feb 21, 2013 at 12:57:45AM +0200, Damjan Jovanovic wrote: On Wed, Feb 20, 2013 at 10:51 PM, Tijl Coosemans t...@coosemans.org wrote: On 20-02-2013 16:48, Konstantin Belousov wrote: On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote: Hi Wine needs some of its libraries to be loaded at specific base addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently lacks. I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that loads libraries at their preferred base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port of Prelink to FreeBSD which Wine uses to set base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-), the changed dynamic loader doesn't show any problems in a few days of testing, and prelink works with the --reloc-only option as used by Wine. Please review/test/comment/commit. Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings which could exist at the specified address. I've simplified the rtld patch to a single line. The second patch makes Wine use -Ttext-segment linker flag instead of prelink. This requires binutils from ports, but it's easier than porting prelink. All of that occurred to me as well. The problem with that one-line rtld patch is that loading an application will now fail if any of its libraries cannot be loaded at their requested address. But this is intended behaviour. Also, the default virtaddr base for the shared libraries is 0, so the existing binaries should be not affected. The problem with -Ttext-segment (and isn't it just -Ttext?) is that it doesn't seem to work: the base_vaddr seen by rtld will remain 0, and the address listed in /proc/.../map is different from what it should be. Also run readelf -l on a library compiled that way and compare with the output of one run through prelink --reloc-only, you'll see the lowest VirtAddr and PhysAddr in LOAD headers change only with prelink. 
I really ported prelink because there was no other choice. The -Ttext-segment does work. As indicated by Tijl, you need recent binutils. I just verified that ld 2.32.1 obeys -Ttext-segment. You can also take a look at the default linker script to see how -Ttext-segment is used; look for SEGMENT_START("text-segment"). See my attached test cases, and examine the contents of /proc/.../map while they run.
Re: [patch] Wine DLL base address patches
On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote: Hi Wine needs some of its libraries to be loaded at specific base addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently lacks. I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that loads libraries at their preferred base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port of Prelink to FreeBSD which Wine uses to set base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-), the changed dynamic loader doesn't show any problems in a few days of testing, and prelink works with the --reloc-only option as used by Wine. Please review/test/comment/commit. Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings which could exist at the specified address. What you could do is: - either specify the desired base without MAP_FIXED. The kernel takes the base as a hint, but only uses it if the request fits completely into the hinted area without overriding any existing mapping. Also, the heap is explicitly avoided. The code would need to check whether the kernel honored the hint by comparing the actual map address with the desired address, and act accordingly. - or add a new mmap(2) flag specifying that the mapping should be placed at the given address only if it does not overlap any existing mapping, and fail otherwise.
Re: [patch] Wine DLL base address patches
On Wed, Feb 20, 2013 at 09:51:37PM +0100, Tijl Coosemans wrote: On 20-02-2013 16:48, Konstantin Belousov wrote: On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote: Hi Wine needs some of its libraries to be loaded at specific base addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently lacks. I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that loads libraries at their preferred base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port of Prelink to FreeBSD which Wine uses to set base addresses (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-), the changed dynamic loader doesn't show any problems in a few days of testing, and prelink works with the --reloc-only option as used by Wine. Please review/test/comment/commit. Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings which could exist at the specified address. I've simplified the rtld patch to a single line. The second patch makes Wine use -Ttext-segment linker flag instead of prelink. This requires binutils from ports, but it's easier than porting prelink. Index: libexec/rtld-elf/map_object.c === --- libexec/rtld-elf/map_object.c (revision 246986) +++ libexec/rtld-elf/map_object.c (working copy) @@ -175,7 +175,7 @@ map_object(int fd, const char *path, const struct base_vaddr = trunc_page(segs[0]-p_vaddr); base_vlimit = round_page(segs[nsegs]-p_vaddr + segs[nsegs]-p_memsz); mapsize = base_vlimit - base_vaddr; -base_addr = hdr-e_type == ET_EXEC ? (caddr_t) base_vaddr : NULL; +base_addr = (caddr_t) base_vaddr; mapbase = mmap(base_addr, mapsize, PROT_NONE, MAP_ANON | MAP_PRIVATE | MAP_NOCORE, -1, 0); If this is enough for wine, I definitely have no objection. The typical dso has zero virtual base for the first segment, so the patch should change nothing for dso built without special map file. 
Index: Makefile === --- Makefile (revision 312556) +++ Makefile (working copy) @@ -28,7 +28,7 @@ LATEST_LINK= wine-devel CPPFLAGS+= -I${LOCALBASE}/include LDFLAGS+=-L${LOCALBASE}/lib -USE_GCC= any +USE_GCC= 4.7 GNU_CONFIGURE= yes CONFIGURE_ARGS+=--verbose --disable-tests \ --without-alsa --without-capi --without-dbus \ Index: files/patch-tools-winegcc-utils.h === --- files/patch-tools-winegcc-utils.h (revision 0) +++ files/patch-tools-winegcc-utils.h (working copy) @@ -0,0 +1,12 @@ +--- tools/winegcc/utils.h.orig tools/winegcc/utils.h +@@ -42,7 +42,8 @@ + + enum target_platform + { +-PLATFORM_UNSPECIFIED, PLATFORM_APPLE, PLATFORM_SOLARIS, PLATFORM_WINDOWS, PLATFORM_CYGWIN ++PLATFORM_UNSPECIFIED, PLATFORM_APPLE, PLATFORM_FREEBSD, PLATFORM_SOLARIS, ++PLATFORM_WINDOWS, PLATFORM_CYGWIN + }; + + void error(const char* s, ...) DECLSPEC_NORETURN; Index: files/patch-tools-winegcc-winegcc.c === --- files/patch-tools-winegcc-winegcc.c (revision 0) +++ files/patch-tools-winegcc-winegcc.c (working copy) @@ -0,0 +1,32 @@ +--- tools/winegcc/winegcc.c.orig tools/winegcc/winegcc.c +@@ -172,6 +172,7 @@ + { + { macos, PLATFORM_APPLE }, + { darwin, PLATFORM_APPLE }, ++{ freebsd, PLATFORM_FREEBSD }, + { solaris, PLATFORM_SOLARIS }, + { cygwin, PLATFORM_CYGWIN }, + { mingw32, PLATFORM_WINDOWS }, +@@ -232,6 +233,8 @@ + + #ifdef __APPLE__ + static enum target_platform build_platform = PLATFORM_APPLE; ++#elif defined(__FreeBSD__) ++static enum target_platform build_platform = PLATFORM_FREEBSD; + #elif defined(__sun) + static enum target_platform build_platform = PLATFORM_SOLARIS; + #elif defined(__CYGWIN__) +@@ -1020,6 +1023,12 @@ + if (opts-strip) + strarray_add(link_args, -Wl,-x); + break; ++case PLATFORM_FREEBSD: ++if (opts-image_base) ++{ ++strarray_add(link_args, strmake(-Wl,-Ttext-segment=%s, opts-image_base)); ++} ++break; + case PLATFORM_SOLARIS: + { + char *mapfile = get_temp_file( output_name, .map ); pgpAviZFPInSj.pgp Description: PGP signature
Re: Stupid question about integer sizes
On Tue, Feb 19, 2013 at 06:52:34AM -0800, m...@freebsd.org wrote: On Tue, Feb 19, 2013 at 5:11 AM, Borja Marcos bor...@sarenet.es wrote: Hello, I'm really sorry if this is a stupid question, but as far as I know, u_int64_t defined in /usr/include/sys/types.h should *always* be a 64 bit unsigned integer, right? Seems there's a bug (or I need more and stronger coffee). Compiling a program on a 64 bit system with -m32 gets the 64 bit integer types wrong. % cat tachin.c #include <sys/types.h> #include <stdio.h> main() { printf("sizeof uint64_t = %d\n", sizeof(uint64_t)); printf("sizeof u_int64_t = %d\n", sizeof(u_int64_t)); } uname -a FreeBSD splunk 9.1-RELEASE FreeBSD 9.1-RELEASE #14: Wed Jan 23 17:24:05 CET 2013 root@splunk:/usr/obj/usr/src/sys/SPLUNK amd64 % gcc -o tachin tachin.c % ./tachin sizeof uint64_t = 8 sizeof u_int64_t = 8 % gcc -m32 -o tachin tachin.c % ./tachin sizeof uint64_t = 4 === WRONG!! sizeof u_int64_t = 4 === WRONG!! The same happens with clang. % clang -m32 -o tachin tachin.c tachin.c:5:1: warning: type specifier missing, defaults to 'int' [-Wimplicit-int] main() ^~~~ 1 warning generated. % ./tachin sizeof uint64_t = 4 === WRONG!! sizeof u_int64_t = 4 === WRONG!! If I do the same on an i386 system (8.2-RELEASE, but it should be irrelevant) the u_int64 types have the correct size. % gcc -o tachin tachin.c % ./tachin sizeof uint64_t = 8 sizeof u_int64_t = 8 Am I missing anything? Seems like too stupid a problem to be a real bug. Sorry if I am wrong. Last I knew, -m32 still wasn't quite supported on 9.1. This is fixed in CURRENT. The problem is in the machine-dependent headers. sys/types.h includes sys/_types.h which includes machine/_types.h. But the machine alias can only point to one place; it's the amd64 headers since you're running the amd64 kernel. sys/amd64/include/_types.h defines __uint64_t as unsigned long, which is correct in 64-bit mode but wrong in 32-bit mode.
On CURRENT there is a merged x86/_types.h which uses #ifdef guards to define __uint64_t as unsigned long or unsigned long long, depending on __LP64__, so that the size is correct on a 32-bit compiler invocation. Yes, but there are still some unconverted headers, mostly related to the machine context and signal handlers. Quick look: machine/sigframe.h (probably not very useful for usermode), machine/ucontext.h, machine/frame.h (again, usermode most likely does not need it), machine/signal.h, machine/elf.h, and the libm machdep headers.
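Code that depends on the fixed-width types being the advertised sizes can protect itself at compile time, so a header bug like the one described would break the build instead of silently producing 4-byte "64-bit" integers. A minimal sketch (the `widths_ok` helper is illustrative, not from any FreeBSD header):

```c
#include <stdint.h>

/*
 * Compile-time check that the fixed-width types really have the sizes
 * their names promise.  On a system with the machine/_types.h bug
 * described above, compiling this unit with -m32 would fail here
 * immediately rather than miscompute sizes at run time.
 */
_Static_assert(sizeof(uint64_t) == 8, "uint64_t is not 8 bytes");
_Static_assert(sizeof(uint32_t) == 4, "uint32_t is not 4 bytes");

/* Runtime variant of the same check, usable with pre-C11 compilers. */
static int
widths_ok(void)
{
	return (sizeof(uint64_t) == 8 && sizeof(uint32_t) == 4);
}
```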
Re: SIGSEGV/SIGBUS when accessing after end of mmapped file; why it differs with GCC?
On Wed, Feb 13, 2013 at 12:13:58PM -0500, Ryan Stone wrote: On Wed, Feb 13, 2013 at 11:18 AM, nat...@centrum.cz wrote: Hello, I am porting an application which maps files into the memory and works directly with the memory. When doing this, it can happen that when someone resizes the file so that part of the previously mapped region is no longer backed by the file, synchronous signal is sent to the process which needs to be handled. On all other platforms than FreeBSD I have tested (Solaris, Linux, Darwin, HP-UX) the signal in question is SIGBUS. However on FreeBSD, depending on the compiler used, it is either SIGBUS or SIGSEGV. When I compile the binary with gcc version 4.2.1 20070831 patched [FreeBSD], amd64, the signal is SIGBUS, when i use gcc version 4.7.3 20121103 (prerelease) (FreeBSD Ports Collection), the signal is SIGSEGV. Please note that one of the versions of gcc between 4.2 and 4.7 also caused SIGSEFV to be sent but this did not matter for me as I did not need to use that version; I however need gcc 4.7 to work because of c++11 stuff that my project has recently started to use. Unfortunately registering signal handler on SIGSEGV is very inconvenient for me; I would prefer to somehow switch the behavior to be sane. Please anyone has an idea whether or how this could be achieved? I have tried to find out why, on single machine, just because of different gcc version, kernel sends different signal but I have never worked with fbsd kernel before and so my search did not succeed so far. Machine in question runs amd64 FreeBSD 9.1-RC2, but this has also happened to me with older version of FreeBSD. gcc 4.2 (same happens also with libstdc++ and co. from gcc 4.2): Program received signal SIGBUS, Bus error. 
(gdb) i shared From To Syms Read Shared Object Library 0x000800f2ef70 0x000800f3ee68 Yes (*) /libexec/ld-elf.so.1 0x00080114d710 0x000801159748 Yes (*) /lib/libthr.so.3 0x0008013c2d30 0x00080142c656 Yes /usr/local/lib/gcc47/libstdc++.so.6 0x0008016737c0 0x0008016891b8 Yes (*) /lib/libm.so.5 0x000801893930 0x0008018a3088 Yes /usr/local/lib/gcc47/libgcc_s.so.1 0x000801ad71d0 0x000801ba9358 Yes (*) /lib/libc.so.7 gcc 4.7: Program received signal SIGSEGV, Segmentation fault. (gdb) i shared From To Syms Read Shared Object Library 0x000800f5ef70 0x000800f6ee68 Yes (*) /libexec/ld-elf.so.1 0x00080117d710 0x000801189748 Yes (*) /lib/libthr.so.3 0x0008013f2d30 0x00080145c656 Yes /usr/local/lib/gcc47/libstdc++.so.6 0x0008016a37c0 0x0008016b91b8 Yes (*) /lib/libm.so.5 0x0008018c3930 0x0008018d3088 Yes /usr/local/lib/gcc47/libgcc_s.so.1 0x000801b071d0 0x000801bd9358 Yes (*) /lib/libc.so.7 I would be glad for any hint or information. Kind Regards, Ondrej Kolacek ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org I think that setting the sysctl machdep.prot_fault_translation=1 would do what you want. This might be an indication of an issue with the toolchains you use, in particular with the linker or csu. The default selection for the signal is based on the osver note, linked into the binary from crt1.o. Either the linker omits the note, or the wrong crt1 is used. You did not specify anything about the version of FreeBSD used, nor the exact compiler invocations. Using the crystal ball, I see r244600 for HEAD and r244904 for stable/9, if you use --gc-sections flags. This is more or less consistent with what you reported, since gcc from ports uses binutils from ports, which has a newer ld with the bugfix already applied.
Re: Request for review, time_pps_fetch() enhancement
On Wed, Feb 13, 2013 at 08:16:32AM -0700, Ian Lepore wrote: On Tue, 2013-02-12 at 22:34 +0200, Konstantin Belousov wrote: On Tue, Feb 12, 2013 at 09:03:39AM -0700, Ian Lepore wrote: On Sun, 2013-02-10 at 12:37 +0200, Konstantin Belousov wrote: On Sat, Feb 09, 2013 at 02:47:06PM +0100, Jilles Tjoelker wrote: On Wed, Feb 06, 2013 at 05:58:30PM +0200, Konstantin Belousov wrote: On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote: I'd like feedback on the attached patch, which adds support to our time_pps_fetch() implementation for the blocking behaviors described in section 3.4.3 of RFC 2783. The existing implementation can only return the most recently captured data without blocking. These changes add the ability to block (forever or with timeout) until a new event occurs. Index: sys/kern/kern_tc.c === --- sys/kern/kern_tc.c(revision 246337) +++ sys/kern/kern_tc.c(working copy) @@ -1446,6 +1446,50 @@ * RFC 2783 PPS-API implementation. */ +static int +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps) +{ [snip] + aseq = pps-ppsinfo.assert_sequence; + cseq = pps-ppsinfo.clear_sequence; + while (aseq == pps-ppsinfo.assert_sequence + cseq == pps-ppsinfo.clear_sequence) { Note that compilers are allowed to optimize these accesses even over the sequential point, which is the tsleep() call. Only accesses to volatile objects are forbidden to be rearranged. I suggest to add volatile casts to pps in the loop condition. The memory pointed to by pps is global (other code may have a pointer to it); therefore, the compiler must assume that the tsleep() call (which invokes code in a different compilation unit) may modify it. Because volatile does not make concurrent access by multiple threads defined either, adding it here only seems to slow down the code (potentially). The volatile guarantees that the compiler indeed reloads the value on read access. 
Conceptually, the tsleep() does not modify or even access the checked fields, and the compiler is allowed to note this by whatever methods (LTO?). More, the standard says that an implementation is allowed to not evaluate part of the expression if no side effects are produced, even by calling a function. I agree that for practical means, the _currently_ used compilers should consider the tsleep() call as a sequence point. But then the volatile qualifier cast applied for the given access would not change the code as well. Doesn't this then imply that essentially every driver has this problem, and for that matter, every sequence of code anywhere in the base involving looping while repeatedly sleeping, then waking and checking the state of some data for changes? I sure haven't seen that many volatile qualifiers scattered around the code. No, it does not imply that every driver has this problem. A typical driver provides mutual exclusion for access to the shared data, which means using locks. Locks include the necessary barriers to ensure the visibility of the changes, in particular the compiler barriers. Oh, I had never considered that using mutexes had other side effects. So is there a correct MI way to invoke the right barrier magic in a situation like this? My belief is that you do not need a barrier there. The only (slightly) problematic issue there is a purely theoretical possibility that a very smart compiler would omit the reload step. The volatile qualifier for the dereference in the loop condition should close this, as I described in the very first reply.
Re: Request for review, time_pps_fetch() enhancement
On Tue, Feb 12, 2013 at 09:03:39AM -0700, Ian Lepore wrote: On Sun, 2013-02-10 at 12:37 +0200, Konstantin Belousov wrote: On Sat, Feb 09, 2013 at 02:47:06PM +0100, Jilles Tjoelker wrote: On Wed, Feb 06, 2013 at 05:58:30PM +0200, Konstantin Belousov wrote: On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote: I'd like feedback on the attached patch, which adds support to our time_pps_fetch() implementation for the blocking behaviors described in section 3.4.3 of RFC 2783. The existing implementation can only return the most recently captured data without blocking. These changes add the ability to block (forever or with timeout) until a new event occurs. Index: sys/kern/kern_tc.c === --- sys/kern/kern_tc.c(revision 246337) +++ sys/kern/kern_tc.c(working copy) @@ -1446,6 +1446,50 @@ * RFC 2783 PPS-API implementation. */ +static int +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps) +{ [snip] + aseq = pps-ppsinfo.assert_sequence; + cseq = pps-ppsinfo.clear_sequence; + while (aseq == pps-ppsinfo.assert_sequence + cseq == pps-ppsinfo.clear_sequence) { Note that compilers are allowed to optimize these accesses even over the sequential point, which is the tsleep() call. Only accesses to volatile objects are forbidden to be rearranged. I suggest to add volatile casts to pps in the loop condition. The memory pointed to by pps is global (other code may have a pointer to it); therefore, the compiler must assume that the tsleep() call (which invokes code in a different compilation unit) may modify it. Because volatile does not make concurrent access by multiple threads defined either, adding it here only seems to slow down the code (potentially). The volatile guarantees that the compiler indeed reloads the value on read access. Conceptually, the tsleep() does not modify or even access the checked fields, and compiler is allowed to note this by whatever methods (LTO ?). 
More, the standard says that an implementation is allowed to not evaluate part of the expression if no side effects are produced, even by calling a function. I agree that for practical means, the _currently_ used compilers should consider the tsleep() call as a sequence point. But then the volatile qualifier cast applied for the given access would not change the code as well. Doesn't this then imply that essentially every driver has this problem, and for that matter, every sequence of code anywhere in the base involving looping while repeatedly sleeping, then waking and checking the state of some data for changes? I sure haven't seen that many volatile qualifiers scattered around the code. No, it does not imply that every driver has this problem. A typical driver provides mutual exclusion for access to the shared data, which means using locks. Locks include the necessary barriers to ensure the visibility of the changes, in particular the compiler barriers.
Re: Request for review, time_pps_fetch() enhancement
On Sat, Feb 09, 2013 at 02:47:06PM +0100, Jilles Tjoelker wrote: On Wed, Feb 06, 2013 at 05:58:30PM +0200, Konstantin Belousov wrote: On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote: I'd like feedback on the attached patch, which adds support to our time_pps_fetch() implementation for the blocking behaviors described in section 3.4.3 of RFC 2783. The existing implementation can only return the most recently captured data without blocking. These changes add the ability to block (forever or with timeout) until a new event occurs. Index: sys/kern/kern_tc.c === --- sys/kern/kern_tc.c(revision 246337) +++ sys/kern/kern_tc.c(working copy) @@ -1446,6 +1446,50 @@ * RFC 2783 PPS-API implementation. */ +static int +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps) +{ [snip] + aseq = pps-ppsinfo.assert_sequence; + cseq = pps-ppsinfo.clear_sequence; + while (aseq == pps-ppsinfo.assert_sequence + cseq == pps-ppsinfo.clear_sequence) { Note that compilers are allowed to optimize these accesses even over the sequential point, which is the tsleep() call. Only accesses to volatile objects are forbidden to be rearranged. I suggest to add volatile casts to pps in the loop condition. The memory pointed to by pps is global (other code may have a pointer to it); therefore, the compiler must assume that the tsleep() call (which invokes code in a different compilation unit) may modify it. Because volatile does not make concurrent access by multiple threads defined either, adding it here only seems to slow down the code (potentially). The volatile guarantees that the compiler indeed reloads the value on read access. Conceptually, the tsleep() does not modify or even access the checked fields, and compiler is allowed to note this by whatever methods (LTO ?). More, the standard says that an implementation is allowed to not evaluate part of the expression if no side effects are produced, even by calling a function. 
I agree that for practical means, the _currently_ used compilers should consider the tsleep() call as a sequence point. But then the volatile qualifier cast applied for the given access would not change the code as well. + err = tsleep(pps, PCATCH, "ppsfch", timo); + if (err == EWOULDBLOCK && fapi->timeout.tv_sec == -1) { + continue; + } else if (err != 0) { + return (err); + } + } + } -- Jilles Tjoelker
Re: Request for review, time_pps_fetch() enhancement
On Fri, Feb 08, 2013 at 04:13:40PM -0700, Ian Lepore wrote: On Wed, 2013-02-06 at 17:58 +0200, Konstantin Belousov wrote: On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote: I'd like feedback on the attached patch, which adds support to our time_pps_fetch() implementation for the blocking behaviors described in section 3.4.3 of RFC 2783. The existing implementation can only return the most recently captured data without blocking. These changes add the ability to block (forever or with timeout) until a new event occurs. -- Ian Index: sys/kern/kern_tc.c === --- sys/kern/kern_tc.c(revision 246337) +++ sys/kern/kern_tc.c(working copy) @@ -1446,6 +1446,50 @@ * RFC 2783 PPS-API implementation. */ +static int +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps) +{ + int err, timo; + pps_seq_t aseq, cseq; + struct timeval tv; + + if (fapi-tsformat fapi-tsformat != PPS_TSFMT_TSPEC) + return (EINVAL); + + /* + * If no timeout is requested, immediately return whatever values were + * most recently captured. If timeout seconds is -1, that's a request + * to block without a timeout. WITNESS won't let us sleep forever + * without a lock (we really don't need a lock), so just repeatedly + * sleep a long time. + */ Regarding no need for the lock, it would just move the implementation into the low quality one, for the case when one timestamp capture is lost and caller of time_pps_fetch() sleeps until next pps event is generated. I understand the desire to avoid lock, esp. in the pps_event() called from the arbitrary driver context. But the race is also real. What race? A user of the pps interface understands that there is one event per second, and understands that if you ask to block until the next event at approximately the time that event is expected to occur, then it is ambiguous whether the call completes almost-immediately or in about 1 second. 
Looking at it another way, if a blocking call is made right around the time of the PPS, the thread could get preempted before getting to the pps_fetch() function and not get control again until after the PPS has occurred. In that case it's going to block for about a full second, even though the call was made before top-of-second. That situation is exactly the same with or without locking, so what extra functionality is gained with locking? What guarantee does locking let us make to the caller that the lockless code doesn't? No guarantees, but I noted in the original reply that this is about the quality of the implementation and not about correctness. As I said there as well, I am not sure that any locking can be useful for the situation at all.
Re: fcntl(2) F_READAHEAD set to zero doesn't work [patch]
On Fri, Feb 08, 2013 at 04:58:09PM -0700, Ian Lepore wrote: I discovered today that fcntl(fd, F_READAHEAD, 0) doesn't work as advertised. It's supposed to disable readahead, but instead it restores the default readahead behavior (if it had previously been changed), and there is no way to disable readahead.[1] I think the attached patch fixes it, but it's not immediately clear from the patch why; here's the deal... The amount of readahead is calculated by sequential_heuristic() in vfs_vnops.c. If the FRDAHEAD flag is set on the file it uses the value stored in the file's f_seqcount, otherwise it calculates a value (and updates f_seqcount, which doesn't ever happen when FRDAHEAD is set). So the patch causes the FRDAHEAD flag to be set even in the case of the readahead amount being zero. Because it seems like a useful concept, it still allows the readahead to be restored to default behavior, now by passing a negative value. Does this look right to those of you who understand this part of the system better than I do? It looks correct. I do wonder if it is better to keep the actual behaviour as-is and set the readahead amount to the default on arg == 0, disabling readahead for arg < 0. That is too extreme, most likely; please proceed. -- Ian [1] No way using F_READAHEAD; I know about POSIX_FADV_RANDOM. Index: sys/kern/kern_descrip.c === --- sys/kern/kern_descrip.c (revision 246337) +++ sys/kern/kern_descrip.c (working copy) @@ -776,7 +776,7 @@ } fhold(fp); FILEDESC_SUNLOCK(fdp); - if (arg != 0) { + if (arg >= 0) { vp = fp->f_vnode; error = vn_lock(vp, LK_SHARED); if (error != 0) { Index: lib/libc/sys/fcntl.2 === --- lib/libc/sys/fcntl.2 (revision 246337) +++ lib/libc/sys/fcntl.2 (working copy) @@ -28,7 +28,7 @@ .\" @(#)fcntl.2 8.2 (Berkeley) 1/12/94 .\" $FreeBSD$ .\" -.Dd July 27, 2012 +.Dd February 8, 2013 .Dt FCNTL 2 .Os .Sh NAME @@ -171,7 +171,7 @@ which is rounded up to the nearest block size. A zero value in .Fa arg -turns off read ahead.
+turns off read ahead, a negative value restores the system default. .It Dv F_RDAHEAD Equivalent to Darwin counterpart which sets read ahead amount of 128KB when the third argument,
Re: Request for review, time_pps_fetch() enhancement
On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote: I'd like feedback on the attached patch, which adds support to our time_pps_fetch() implementation for the blocking behaviors described in section 3.4.3 of RFC 2783. The existing implementation can only return the most recently captured data without blocking. These changes add the ability to block (forever or with timeout) until a new event occurs. -- Ian Index: sys/kern/kern_tc.c === --- sys/kern/kern_tc.c(revision 246337) +++ sys/kern/kern_tc.c(working copy) @@ -1446,6 +1446,50 @@ * RFC 2783 PPS-API implementation. */ +static int +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps) +{ + int err, timo; + pps_seq_t aseq, cseq; + struct timeval tv; + + if (fapi-tsformat fapi-tsformat != PPS_TSFMT_TSPEC) + return (EINVAL); + + /* + * If no timeout is requested, immediately return whatever values were + * most recently captured. If timeout seconds is -1, that's a request + * to block without a timeout. WITNESS won't let us sleep forever + * without a lock (we really don't need a lock), so just repeatedly + * sleep a long time. + */ Regarding no need for the lock, it would just move the implementation into the low quality one, for the case when one timestamp capture is lost and caller of time_pps_fetch() sleeps until next pps event is generated. I understand the desire to avoid lock, esp. in the pps_event() called from the arbitrary driver context. But the race is also real. + if (fapi-timeout.tv_sec || fapi-timeout.tv_nsec) { + if (fapi-timeout.tv_sec == -1) + timo = 0x7fff; + else { + tv.tv_sec = fapi-timeout.tv_sec; + tv.tv_usec = fapi-timeout.tv_nsec / 1000; + timo = tvtohz(tv); + } + aseq = pps-ppsinfo.assert_sequence; + cseq = pps-ppsinfo.clear_sequence; + while (aseq == pps-ppsinfo.assert_sequence + cseq == pps-ppsinfo.clear_sequence) { Note that compilers are allowed to optimize these accesses even over the sequential point, which is the tsleep() call. 
Only accesses to volatile objects are forbidden to be rearranged. I suggest to add volatile casts to pps in the loop condition. + err = tsleep(pps, PCATCH, "ppsfch", timo); + if (err == EWOULDBLOCK && fapi->timeout.tv_sec == -1) { + continue; + } else if (err != 0) { + return (err); + } + } + } + + pps->ppsinfo.current_mode = pps->ppsparam.mode; + fapi->pps_info_buf = pps->ppsinfo; + + return (0); +} + int pps_ioctl(u_long cmd, caddr_t data, struct pps_state *pps) { @@ -1485,13 +1529,7 @@ return (0); case PPS_IOC_FETCH: fapi = (struct pps_fetch_args *)data; - if (fapi->tsformat && fapi->tsformat != PPS_TSFMT_TSPEC) - return (EINVAL); - if (fapi->timeout.tv_sec || fapi->timeout.tv_nsec) - return (EOPNOTSUPP); - pps->ppsinfo.current_mode = pps->ppsparam.mode; - fapi->pps_info_buf = pps->ppsinfo; - return (0); + return (pps_fetch(fapi, pps)); #ifdef FFCLOCK case PPS_IOC_FETCH_FFCOUNTER: fapi_ffc = (struct pps_fetch_ffc_args *)data; @@ -1540,7 +1578,7 @@ void pps_init(struct pps_state *pps) { - pps->ppscap |= PPS_TSFMT_TSPEC; + pps->ppscap |= PPS_TSFMT_TSPEC | PPS_CANWAIT; if (pps->ppscap & PPS_CAPTUREASSERT) pps->ppscap |= PPS_OFFSETASSERT; if (pps->ppscap & PPS_CAPTURECLEAR) @@ -1680,6 +1718,9 @@ hardpps(tsp, ts.tv_nsec + 1000000000 * ts.tv_sec); } #endif + + /* Wakeup anyone sleeping in pps_fetch(). */ + wakeup(pps); } /* ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: dynamically calculating NKPT [was: Re: huge ktr buffer]
On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote: Hi, I have a patch to dynamically calculate NKPT for amd64 kernels. This should fix the various issues that people pointed out in the email thread. Please review and let me know if there are any objections to committing this. Also, thanks to Alan (alc@) for reviewing and providing feedback on the initial version of the patch. Patch (also available at http://people.freebsd.org/~neel/patches/nkpt_diff.txt): Index: sys/amd64/include/pmap.h === --- sys/amd64/include/pmap.h (revision 246277) +++ sys/amd64/include/pmap.h (working copy) @@ -113,13 +113,7 @@ ((unsigned long)(l2) << PDRSHIFT) | \ ((unsigned long)(l1) << PAGE_SHIFT)) -/* Initial number of kernel page tables. */ -#ifndef NKPT -#define NKPT 32 -#endif - #define NKPML4E 1 /* number of kernel PML4 slots */ -#define NKPDPE howmany(NKPT, NPDEPG) /* number of kernel PDP slots */ #define NUPML4E (NPML4EPG/2) /* number of userland PML4 pages */ #define NUPDPE (NUPML4E*NPDPEPG) /* number of userland PDP pages */ @@ -181,6 +175,7 @@ #define PML4map ((pd_entry_t *)(addr_PML4map)) #define PML4pml4e ((pd_entry_t *)(addr_PML4pml4e)) +extern int nkpt; /* Initial number of kernel page tables */ extern u_int64_t KPDPphys; /* physical address of kernel level 3 */ extern u_int64_t KPML4phys; /* physical address of kernel level 4 */ Index: sys/amd64/amd64/minidump_machdep.c === --- sys/amd64/amd64/minidump_machdep.c (revision 246277) +++ sys/amd64/amd64/minidump_machdep.c (working copy) @@ -232,7 +232,7 @@ /* Walk page table pages, set bits in vm_page_dump */ pmapsize = 0; pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); ) { /* * We always write a page, even if it is zero.
Each @@ -364,7 +364,7 @@ /* Dump kernel page directory pages */ bzero(fakepd, sizeof(fakepd)); pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); va += NBPDP) { i = (va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1); Index: sys/amd64/amd64/pmap.c === --- sys/amd64/amd64/pmap.c (revision 246277) +++ sys/amd64/amd64/pmap.c (working copy) @@ -202,6 +202,10 @@ vm_offset_t virtual_avail; /* VA of first avail page (after kernel bss) */ vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */ +int nkpt; +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, &nkpt, 0, +"Number of kernel page table pages allocated on bootup"); + static int ndmpdp; static vm_paddr_t dmaplimit; vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS; @@ -495,17 +499,42 @@ CTASSERT(powerof2(NDMPML4E)); +/* number of kernel PDP slots */ +#define NKPDPE(ptpgs) howmany((ptpgs), NPDEPG) + static void +nkpt_init(vm_paddr_t addr) +{ + int pt_pages; + +#ifdef NKPT + pt_pages = NKPT; +#else + pt_pages = howmany(addr, 1 << PDRSHIFT); + pt_pages += NKPDPE(pt_pages); + + /* + * Add some slop beyond the bare minimum required for bootstrapping + * the kernel. + * + * This is quite important when allocating KVA for kernel modules. + * The modules are required to be linked in the negative 2GB of + * the address space. If we run out of KVA in this region then + * pmap_growkernel() will need to allocate page table pages to map + * the entire 512GB of KVA space which is an unnecessary tax on + * physical memory. + */ + pt_pages += 4; /* 8MB additional slop for kernel modules */ 8MB might be too low. I just checked one of my machines with fully modularized kernel, it takes slightly more than 6 MB to load 50 modules. I think that 16MB would be safer, but it probably needs to be scaled down based on the available phys memory. amd64 kernel could be booted on 128MB machine still.
Re: dynamically calculating NKPT [was: Re: huge ktr buffer]
On Tue, Feb 05, 2013 at 07:45:24AM -0800, m...@freebsd.org wrote: On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote: Hi, I have a patch to dynamically calculate NKPT for amd64 kernels. This should fix the various issues that people pointed out in the email thread. [...] + /* + * Add some slop beyond the bare minimum required for bootstrapping + * the kernel. + * + * This is quite important when allocating KVA for kernel modules. + * The modules are required to be linked in the negative 2GB of + * the address space. If we run out of KVA in this region then + * pmap_growkernel() will need to allocate page table pages to map + * the entire 512GB of KVA space which is an unnecessary tax on + * physical memory. + */ + pt_pages += 4; /* 8MB additional slop for kernel modules */ 8MB might be too low. I just checked one of my machines with fully modularized kernel, it takes slightly more than 6 MB to load 50 modules. I think that 16MB would be safer, but it probably needs to be scaled down based on the available phys memory. amd64 kernel could be booted on 128MB machine still.
Is there no way to not map the entire 512GB? Otherwise this patch could really hose some vendors. E.g. the kernel module for the OneFS file system is around 8MB all by itself.
Re: dynamically calculating NKPT [was: Re: huge ktr buffer]
On Tue, Feb 05, 2013 at 09:12:41AM -0800, Neel Natu wrote: Hi Konstantin, Sounds fine. I can bump it up to 8 pages. Also, wrt your comment about scaling this number based on available memory, I wonder if it makes sense to optimize for 16KB of additional space. If it boots on 128MB as is, I am completely fine with the existing patch (after slack increase) as well. I would much rather work with you and Alan to fix pmap_growkernel() so we don't need to care about this slack in the first place :-) Sure, if you consider this important enough. I do agree with mdf that having this happen without any user intervention is good. But having just a tunable for the slack might be much less labor-intensive and good enough.
Re: scheduler-swapper, SI_SUB_RUN_SCHEDULER-SI_SUB_LAST
On Sat, Feb 02, 2013 at 01:50:40PM +0200, Andriy Gapon wrote: I would like to propose the following mostly cosmetic change: http://people.freebsd.org/~avg/scheduler-swapper.diff This is something that bit me early in my FreeBSD days, so I am kind of obsessed with it. The current naming is confusing/misleading indeed. And the magic properties of SI_SUB_RUN_SCHEDULER:SI_ORDER_LAST are a hidden gem. You may remove the Giant unlock from the scheduler()/swapper() as well then, doing it before the swapper() call in the mi_startup(). Note that the wait chain for the idle swapper is still called "sched".
Re: Sockets programming question
On Mon, Jan 28, 2013 at 08:11:47AM -0700, Ian Lepore wrote: I've got a question that isn't exactly freebsd-specific, but implementation-specific behavior may be involved. I've got a server process that accepts connections from clients on a PF_LOCAL stream socket. Multiple clients can be connected at once; a list of them is tracked internally. The server occasionally sends data to each client. The time between messages to clients can range literally from milliseconds to months. Clients never send any data to the server, indeed they may shutdown that side of the connection and just receive data. The only way I can find to discover that a client has disappeared is by trying to send them a message and getting an error because they've closed the socket or died completely. At that point I can reap the resources and remove them from the client list. This is a problem because of the months between messages thing. A lot of clients can come and go during those months and I've got this ever-growing list of open socket descriptors because I never had anything to say the whole time they were connected. By trial and error I've discovered that I can sort of poll for their presence by writing a zero-length message. If the other end of the connection is gone I get the expected error and can reap the client, otherwise it appears to quietly write nothing and return zero and have no other side effects than polling the status of the server-client side of the pipe. My problem with this polling is that I can't find anything in writing that sanctions this behavior. Would this amount to relying on a non-portable accident of the current implementation? Also, am I missing something simple and there's a canonical way to handle this? In all the years I've done client/server stuff I've never had quite this type of interaction (or lack thereof) between client and server before. Check for the IN events as well.
I would not trust the mere presence of the IN in the poll result, but a subsequent read should return EOF, and this is a good indicator of the dead client.
Re: Processes' FIBs
On Tue, Jan 17, 2012 at 01:21:27PM +0100, Oliver Fromme wrote: Kostik Belousov kostik...@gmail.com wrote: The patch misses compat32 bits and breaks compat32 ps/top. Right, thank you for pointing it out! I missed it because I only have i386 for testing. I've created new patch sets for releng8 and current. These include compat32 support and an entry for the manual page. Would someone with amd64 please test the compat32 part? I've been using this code on i386 for a few days without any problems. I've attached the patch for current below. Both patch sets are also available from this URL: http://www.secnetix.de/olli/tmp/ki_fibnum/ Testing is easy: Apply the patch, rebuild bin/ps and kernel. Make sure that your kernel config has options ROUTETABLES=16 so multiple FIBs are supported. Reboot. Open a shell with setfib, e.g. setfib 3 /bin/sh (no root required), type ps -ax -o user,pid,fib,command or something similar, and verify that the shell process and its children are listed with the correct FIB. When testing on amd64, use both the native ps and an i386 binary. Thank you very much! Best regards Oliver --- sys/sys/user.h.orig 2011-11-07 22:13:19.0 +0100 +++ sys/sys/user.h 2012-01-17 11:33:59.0 +0100 @@ -83,7 +83,7 @@ * it in two places: function fill_kinfo_proc in sys/kern/kern_proc.c and * function kvm_proclist in lib/libkvm/kvm_proc.c.
*/ -#define KI_NSPARE_INT 9 +#define KI_NSPARE_INT 8 #define KI_NSPARE_LONG 12 #define KI_NSPARE_PTR 6 @@ -186,6 +186,7 @@ */ char ki_sparestrings[50]; /* spare string space */ int ki_spareints[KI_NSPARE_INT]; /* spare room for growth */ + int ki_fibnum; /* Default FIB number */ u_int ki_cr_flags; /* Credential flags */ int ki_jid; /* Process jail ID */ int ki_numthreads; /* XXXKSE number of threads in total */ --- sys/kern/kern_proc.c.orig 2012-01-15 19:47:24.0 +0100 +++ sys/kern/kern_proc.c 2012-01-17 12:52:36.0 +0100 @@ -836,6 +836,7 @@ kp->ki_swtime = (ticks - p->p_swtick) / hz; kp->ki_pid = p->p_pid; kp->ki_nice = p->p_nice; + kp->ki_fibnum = p->p_fibnum; kp->ki_start = p->p_stats->p_start; timevaladd(&kp->ki_start, &boottime); PROC_SLOCK(p); @@ -1121,6 +1122,7 @@ bcopy(ki->ki_comm, ki32->ki_comm, COMMLEN + 1); bcopy(ki->ki_emul, ki32->ki_emul, KI_EMULNAMELEN + 1); bcopy(ki->ki_loginclass, ki32->ki_loginclass, LOGINCLASSLEN + 1); + CP(*ki, *ki32, ki_fibnum); CP(*ki, *ki32, ki_cr_flags); CP(*ki, *ki32, ki_jid); CP(*ki, *ki32, ki_numthreads); --- sys/compat/freebsd32/freebsd32.h.orig 2011-11-11 08:17:00.0 +0100 +++ sys/compat/freebsd32/freebsd32.h 2012-01-17 11:34:00.0 +0100 @@ -319,6 +319,7 @@ char ki_loginclass[LOGINCLASSLEN+1]; char ki_sparestrings[50]; int ki_spareints[KI_NSPARE_INT]; + int ki_fibnum; u_int ki_cr_flags; int ki_jid; int ki_numthreads; --- bin/ps/keyword.c.orig 2011-09-29 08:31:42.0 +0200 +++ bin/ps/keyword.c 2012-01-17 12:54:49.0 +0100 @@ -85,6 +85,7 @@ {"etimes", "ELAPSED", NULL, USER, elapseds, 0, CHAR, NULL, 0}, {"euid", "", "uid", 0, NULL, 0, CHAR, NULL, 0}, {"f", "F", NULL, 0, kvar, KOFF(ki_flag), INT, "x", 0}, + {"fib", "FIB", NULL, 0, kvar, NULL, 2, KOFF(ki_fibnum), INT, "d", 0}, {"flags", "", "f", 0, NULL, 0, CHAR, NULL, 0}, {"gid", "GID", NULL, 0, kvar, KOFF(ki_groups), UINT, UIDFMT, 0}, {"group", "GROUP", NULL, LJUST, egroupname, 0, CHAR, NULL, 0}, --- bin/ps/ps.1.orig 2011-11-22 22:53:06.0 +0100 +++ bin/ps/ps.1 2012-01-17 12:56:17.0 +0100 @@ -29,7 +29,7 @@ .\" @(#)ps.1 8.3 (Berkeley) 4/18/94
.\" $FreeBSD: src/bin/ps/ps.1,v 1.112 2011/11/22 21:53:06 trociny Exp $ .\" -.Dd November 22, 2011 +.Dd January 17, 2012 .Dt PS 1 .Os .Sh NAME @@ -506,6 +506,9 @@ minutes:seconds. .It Cm etimes elapsed running time, in decimal integer seconds +.It Cm fib +default FIB number, see +.Xr setfib 1 .It Cm flags the process flags, in hexadecimal (alias .Cm f ) Just reviving the recent thread after the ping. The patch looks fine to me, and is still not committed.
Re: Getting the current thread ID without a syscall?
On Tue, Jan 15, 2013 at 03:54:03PM -0500, Trent Nelson wrote: Howdy, I have an unusual requirement: I need to get the current thread ID in as few instructions as possible. On Windows, I managed to come up with this glorious hack: #ifdef WITH_INTRINSICS # ifdef MS_WINDOWS # include <intrin.h> # if defined(MS_WIN64) # pragma intrinsic(__readgsdword) # define _Py_get_current_process_id() (__readgsdword(0x40)) # define _Py_get_current_thread_id() (__readgsdword(0x48)) # elif defined(MS_WIN32) # pragma intrinsic(__readfsdword) # define _Py_get_current_process_id() (__readfsdword(0x20)) # define _Py_get_current_thread_id() (__readfsdword(0x24)) That exploits the fact that Windows uses the FS/GS registers to store thread/process metadata. Could I use a similar approach on FreeBSD to get the thread ID without the need for syscalls? The layout of the per-thread structure used by libthr is private and is not guaranteed to be stable even on the stable branches. Yes, you could obtain the tid this way, but note explicitly that using it makes your application not binary compatible with any version of the FreeBSD except the one you compiled on. You could read the _thread_off_tid integer variable and use the value as offset from the %fs base to the long containing the unique thread id. But don't use this in anything except the private code. (I technically don't need the thread ID, I just need to get some form of unique identifier for the current thread such that I can compare it to a known global value that's been set to the main thread, in order to determine if I'm currently that thread or not. As long as it's unique for each thread, and static for the lifetime of the thread, that's fine.) The "am I the main thread?" comparison is made every ~50-100 opcodes, which is why it needs to have the lowest overhead possible. On newer CPUs in amd64 mode, there is the rdfsbase instruction which reads the %fs register base. System guarantees that %fs base is unique among live threads.
Re: Getting the current thread ID without a syscall?
On Tue, Jan 15, 2013 at 04:35:14PM -0500, Trent Nelson wrote: On Tue, Jan 15, 2013 at 01:16:41PM -0800, Konstantin Belousov wrote: On Tue, Jan 15, 2013 at 03:54:03PM -0500, Trent Nelson wrote: [...] The layout of the per-thread structure used by libthr is private and is not guaranteed to be stable even on the stable branches. Yes, you could obtain the tid this way, but note explicitly that using it makes your application not binary compatible with any version of the FreeBSD except the one you compiled on. Luckily it's for an open source project (Python), so recompilation isn't a big deal. (I also check the intrinsic result versus the syscall result during startup to verify the same ID is returned, falling back to the syscall by default.) For you, may be. For your users, it definitely will be a problem. And worse, the problem will be blamed on the operating system and not on the broken application. You could read the _thread_off_tid integer variable and use the value as offset from the %fs base to the long containing the unique thread id. But don't use this in anything except the private code. Ah, thanks, that's what I was interested in knowing.
(I technically don't need the thread ID, I just need to get some form of unique identifier for the current thread such that I can compare it to a known global value that's been set to the main thread, in order to determine if I'm currently that thread or not. As long as it's unique for each thread, and static for the lifetime of the thread, that's fine.) The "am I the main thread?" comparison is made every ~50-100 opcodes, which is why it needs to have the lowest overhead possible. On newer CPUs in amd64 mode, there is the rdfsbase instruction which reads the %fs register base. System guarantees that %fs base is unique among live threads. Interesting. I was aware of those instructions, but never assessed them in detail once I'd figured out the readgsdword approach. I definitely didn't realize they return unique values per thread (although it makes sense now that I think about it). Thanks Konstantin, very helpful. Trent.
Re: Getting the current thread ID without a syscall?
On Tue, Jan 15, 2013 at 06:03:30PM -0500, Trent Nelson wrote: On Tue, Jan 15, 2013 at 02:33:41PM -0800, Ian Lepore wrote: On Tue, 2013-01-15 at 14:29 -0800, Alfred Perlstein wrote: On 1/15/13 1:43 PM, Konstantin Belousov wrote: On Tue, Jan 15, 2013 at 04:35:14PM -0500, Trent Nelson wrote: Luckily it's for an open source project (Python), so recompilation isn't a big deal. (I also check the intrinsic result versus the syscall result during startup to verify the same ID is returned, falling back to the syscall by default.) For you, may be. For your users, it definitely will be a problem. And worse, the problem will be blamed on the operating system and not to the broken application. Anything we can do to avoid this would be best. The reason is that we are still dealing with an optimization that perl did, it reached inside of the opaque struct FILE to do nasty things. Now it is very difficult for us to fix struct FILE. We are still paying for this years later. Any way we can make this a supported interface? -Alfred Re-reading the original question, I've got to ask why pthread_self() isn't the right answer? The requirement wasn't I need to know what the OS calls me it was I need a unique ID per thread within a process. The identity check is performed hundreds of times per second. The overhead of (Py_MainThreadId == __readgsdword(0x48) ? A() : B()) is negligible -- I can't say the same for a system/function call. (I'm experimenting with an idea I had to parallelize Python such that it can exploit all cores without impeding the performance of normal single-threaded execution (like previous-GIL-removal attempts and STM). It's very promising so far -- presuming we can get the current thread ID in a couple of instructions. If not, single-threaded performance suffers too much.) If the only thing you ever need is to get a unique handle for the current thread, without the requirement that it corresponds to any other identifier, everything becomes much easier. 
On amd64, use 'movq %fs:0,%register', on i386 'movl %gs:0,%register'. This instruction is guaranteed to return the thread-unique address of the tcb. See an article about ELF TLS for more details. Even better, this instruction is portable among all ELF Unixes which support TLS.
Re: malloc+utrace, tracking memory leaks in a running program.
On Thu, Jan 10, 2013 at 10:16:46AM -0500, Alfred Perlstein wrote: On 1/10/13 2:38 AM, Konstantin Belousov wrote: On Thu, Jan 10, 2013 at 01:56:48AM -0500, Alfred Perlstein wrote: Here are more convenient links that give diffs against FreeBSD and jemalloc for the proposed changes: FreeBSD: https://github.com/alfredperlstein/freebsd/compare/13e7228d5b83c8fcfc63a0803a374212018f6b68~1...utrace2 Why do you need to expedite the records through the ktrace at all ? Wouldn't direct write(2)s to a file allow for better performance due to not stressing kernel memory allocator and single writing thread ? Also, the malloc coupling to the single-system interface would be prevented. I believe that other usermode tracers also behave in the similar way, using writes and not private kernel interface. Also, what /proc issues did you mention? There is sysctl kern.proc.vmmap which is much more convenient than /proc/pid/map and does not require /proc mounted. jemalloc: https://github.com/alfredperlstein/jemalloc/compare/master...utrace2 Konstantin, you are right, it is a strange thing this utrace. I am not sure why it was done this way. You are correct in that much more efficient system could be made using writes gathered into a single write(2). Even without writes gathering, non-coalesced writes should be faster than utrace. Do you think there is any reason they may have re-used the kernel paths for ktrace even at the cost of efficiency? I can only speculate. The utracing of the malloc calls in the context of the ktrace stream is useful for the human reading the trace. Instead of seeing the sequence of unexplainable calls allocating and freeing memory, you would see something more penetrable. For example, you would see accept/malloc/read/write/free, which could be usefully interpreted as network server serving the client. This context is not needed for a leak detector. About kern.proc.vmmap I will look into that. -Alfred
Re: malloc+utrace, tracking memory leaks in a running program.
On Thu, Jan 10, 2013 at 01:29:38PM -0500, Alfred Perlstein wrote: On 1/10/13 1:05 PM, Konstantin Belousov wrote: On Thu, Jan 10, 2013 at 10:16:46AM -0500, Alfred Perlstein wrote: [...] Do you think there is any reason they may have re-used the kernel paths for ktrace even at the cost of efficiency? I can only speculate. The utracing of the malloc calls in the context of the ktrace stream is useful for the human reading the trace. Instead of seeing the sequence of unexplainable calls allocating and freeing memory, you would see something more penetrable. For example, you would see accept/malloc/read/write/free, which could be usefully interpreted as network server serving the client. This context is not needed for a leak detector.
Now I may be wrong here, but I think it's an artifact of someone noticing how useful it would be to fit this into the ktrace system and leverage existing code. Even though there are significant performance deficiencies, the actual utility of the existing framework may have been such a stepping stone towards tracing that it was just used. Right now the code already exists, however it logs just {operation, size, ptr}, example: malloc, 512, -> 0xdeadbeef free, 0, 0xdeadbeef realloc, 512, 0 -> 0xdeadc0de realloc, 1024, 0xdeadc0de -> 0x free, 0, 0x What do you think of just adding the address of the caller of malloc/free/realloc to these already existing tracepoints? In most real-world applications I saw, malloc() was not a function called to do the allocation. Usually, there is either an app-specific wrapper, or the language runtime system which calls malloc(), e.g. the new operator for the C++ code. Then, the caller address becomes constant for the whole duration of the program run. What would be useful is the full backtrace of each allocation. The tools like libunwind are indeed optimized for this usage pattern. From this POV, the libc malloc(3) might be better offering a set of the well-defined hooks for a pluggable tracer to utilize. I am on the fence there, you could override the malloc/free without hooks, by the ELF symbol interposing technique, but hooks would also offer features not easily implementable with the interposing.
Re: malloc+utrace, tracking memory leaks in a running program.
On Thu, Jan 10, 2013 at 01:56:48AM -0500, Alfred Perlstein wrote: Here are more convenient links that give diffs against FreeBSD and jemalloc for the proposed changes: FreeBSD: https://github.com/alfredperlstein/freebsd/compare/13e7228d5b83c8fcfc63a0803a374212018f6b68~1...utrace2 Why do you need to expedite the records through the ktrace at all ? Wouldn't direct write(2)s to a file allow for better performance due to not stressing kernel memory allocator and single writing thread ? Also, the malloc coupling to the single-system interface would be prevented. I believe that other usermode tracers also behave in the similar way, using writes and not private kernel interface. Also, what /proc issues did you mention? There is sysctl kern.proc.vmmap which is much more convenient than /proc/pid/map and does not require /proc mounted. jemalloc: https://github.com/alfredperlstein/jemalloc/compare/master...utrace2 -Alfred
Re: cpufreq not working as module on i386/amd64
On Tue, Jan 08, 2013 at 01:15:59PM +0800, Jia-Shiun Li wrote: Hi all, move to -hackers as it seems a better place to discuss. Turns out, at acpi_cpu_attach(), the bus probe will query ACPI_GET_FEATURES() to get cpu_features flags from each cpufreq driver. Since it is only executed once at booting, get_features() will not be called for modules loaded after booting, and they will not be able to express interest in terms of ACPI_CAP_* flags. And if these flags were not set first, acpi_perf won't get attached. Later when loading cpufreq.ko, est_attach(), etc. will not be able to find acpi_perf and thus fail to attach. After referring to Intel doc. #302223 mentioned in dev/acpica/acpivar.h, I got confused by the meaning of these bits. According to the document, capability bits probably should be set as long as OSPM is *capable* of using them, no matter when they will be used. I am not familiar with ACPI, but I suppose it will expose states/methods to OSPM according to these bits? To confirm this, I added a line to simply OR (ACPI_CAP_PERF_MSRS | ACPI_CAP_C1_IO_HALT) into sc->cpu_features after the loop calling ACPI_GET_FEATURES(). And then I am able to load cpufreq.ko after boot. powerd also works fine with it. If this is true, then probably we don't need get_features() for acpi interface/drivers. It is sufficient to just turn on proper bits directly for acpi_cpu when related code is added to the repository. Any comments? -8<-8<-8<- # svn diff Index: sys/dev/acpica/acpi_cpu.c === --- sys/dev/acpica/acpi_cpu.c (revision 245148) +++ sys/dev/acpica/acpi_cpu.c (working copy) @@ -350,6 +350,7 @@ sc->cpu_features |= features; } free(drivers, M_TEMP); + sc->cpu_features |= ACPI_CAP_PERF_MSRS | ACPI_CAP_C1_IO_HALT; } I had an almost word-for-word identical conversation with Andrey Gapon, might be a year ago. I Cc:ed him. /* -8<-8<-8<- Jia-Shiun On Wed, Dec 26, 2012 at 2:10 AM, Jia-Shiun Li jiash...@gmail.com wrote: I was cleaning up hard drive and found these old logs.
Anyway I added some printf()s and saw that the process failed at device_find_child(..., acpi_perf, ...) in est_acpi_info(), i.e. it cannot find the acpi_perf device. devinfo did confirm the absence of acpi_perf. Comparing the dmesgs revealed the main difference: the IST (i-state?) OEM tables in the SSDT seem not to be loaded if cpufreq was not compiled into the kernel, as shown below. Before diving into the ACPI part, can anyone familiar with how ACPI works shed some light? -8-8-8- % diff -bu cpufreq-nb.log cpufreq-no.log ... @@ -158,17 +158,11 @@ acpi0: Power Button (fixed) cpu0: Processor \\_PR_.CPU0 (ACPI ID 1) - APIC ID 0 cpu0: ACPI CPU on acpi0 -ACPI: SSDT 0xbfd8dc98 00223 (v01 PmRef Cpu0Ist 3000 INTL 20051117) -ACPI: Dynamic OEM Table Load: -ACPI: SSDT 0 00223 (v01 PmRef Cpu0Ist 3000 INTL 20051117) ACPI: SSDT 0xbfd8b598 00537 (v01 PmRef Cpu0Cst 3001 INTL 20051117) ACPI: Dynamic OEM Table Load: ACPI: SSDT 0 00537 (v01 PmRef Cpu0Cst 3001 INTL 20051117) cpu1: Processor \\_PR_.CPU1 (ACPI ID 2) - APIC ID 1 cpu1: ACPI CPU on acpi0 -ACPI: SSDT 0xbfd8ce18 001CF (v01 PmRef ApIst 3000 INTL 20051117) -ACPI: Dynamic OEM Table Load: -ACPI: SSDT 0 001CF (v01 PmRef ApIst 3000 INTL 20051117) ACPI: SSDT 0xbfd8df18 0008D (v01 PmRef ApCst 3000 INTL 20051117) ACPI: Dynamic OEM Table Load: ACPI: SSDT 0 0008D (v01 PmRef ApCst 3000 INTL 20051117) On Sat, Jan 29, 2011 at 4:41 PM, Alexander Best arun...@freebsd.org wrote: On Sat Jan 29 11, Jia-Shiun Li wrote: Hi all, I found that the cpufreq driver failed to attach when compiled as a module and loaded, but it works fine when compiled into the kernel. I am wondering if this is due to some kind of limitation, or can it be fixed? that's rather odd. for me neither the module nor the kernel code works, since my cpu isn't supported by sys/x86/cpufreq/est.c. actually only pentium mobile cpus seem to be supported. maybe you can add some printf's to est.c:est_get_info() to identify at which point error gets set.
also you might want to make the printf("est: cpu_vendor %s, msr %0jx\n", cpu_vendor, msr); non-conditional. maybe the output differs in kernel/module mode. cheers. alex Tested on a Pentium E5200 desktop (i386) and a Pentium T4200 laptop (amd64). Both got the same result. dmesg of the T4200 attached. kldload module: -8-8-8- est0: Enhanced SpeedStep Frequency Control on cpu0 est: CPU supports Enhanced Speedstep, but is not recognized. est: cpu_vendor GenuineIntel, msr 6194c1a06004c1a device_attach: est0 attach returned 6 est1: Enhanced SpeedStep Frequency Control on cpu1 est: CPU supports Enhanced Speedstep, but is not recognized. est:
Re: About QUOTA support in stock kernel (resent)
On Tue, Dec 25, 2012 at 09:34:30PM +0800, Patrick Dung wrote: Hi, I would like to know why quota is not enabled in the stock kernel. I remember that it has not been enabled since the FreeBSD 3.5 or FreeBSD 4 generation. Now in FreeBSD 9.0, it still needs a kernel rebuild. I have heard quota has a performance issue (Giant lock). Enabling quota by default would cause a small overhead, like one mutex acquire, for each inode and block alloc/dealloc, even for mounts without quotas enabled. It might be reasonable to just enable it now. Unless somebody provides valid objections and I do not forget, I will do it in a week for HEAD.
Re: About QUOTA support in stock kernel (resent)
On Tue, Dec 25, 2012 at 10:23:26AM -0500, Eitan Adler wrote: On 25 December 2012 10:07, Konstantin Belousov kostik...@gmail.com wrote: Enabling quota by default would cause a small overhead, like one mutex acquire, for each inode and block alloc/dealloc, even for mounts without quotas enabled. Why is this, and can it be avoided (for mounts without quotas)? Because the system must check whether quota is enabled before doing the accounting. It might be reasonable to just enable it now. Unless somebody provides valid objections and I do not forget, I will do it in a week for HEAD. -- Eitan Adler
Re: Fix overlinking in base aka import pkgconf
On Sun, Dec 16, 2012 at 02:01:00PM +0100, Jeremie Le Hen wrote: Hi bapt, kib, On Sat, Dec 15, 2012 at 11:56:43AM +0100, Baptiste Daroussin wrote: On Sat, Dec 15, 2012 at 03:22:34AM +0200, Konstantin Belousov wrote: On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote: Hi, Some of our binaries are overlinked; the way we handle linking doesn't help with that. What do you mean there? Do you mean that some libraries specified for the linking stage of the final binary are not needed for the execution? I mean some libraries are registered in the binary with a DT_NEEDED tag where they don't need to be. One proposition could be to use pkgconf https://github.com/pkgconf/pkgconf which is a BSD-licensed pkg-config implementation, 100% compatible with pkg-config. What I propose is to create a new PCADD variable for the Makefiles. PCADD will invoke pkgconf to gather the libraries and the cflags for a given project. The second thing would be to create .pc files for all of our libraries. For example: the usr.bin/fstat dynamic build is overlinked. And how is this better than just removing the unneeded library from the Makefile? In that case, to build statically you need -lkvm -lutil -lprocstat, but if you build dynamically you will only need -lprocstat, meaning you don't need to be directly linked to libutil and libkvm. This means a breakage of libutil will only have impact on procstat and no longer on fstat, for example. I think this could be solved by an implicit linker script contained in .so and .a files, pointing to the real libraries. We already have the SHLIB_LDSCRIPT variable to accomplish this for .so files. It may be possible to do the same for .a files, though this would require renaming the real archives to something like .a.0 (here I'm just taking a scheme similar to the one used for DSOs). As a matter of fact, libprocstat.a would contain: GROUP ( /usr/lib/libprocstat.a.0 /usr/lib/libkvm.a /usr/lib/libutil.a ) Note that libkvm.a and libutil.a could be linker scripts as well.
Kib, do you see any problem with this proposition? Wouldn't you need to completely rewrite the handling of the .a files in share/mk? I somewhat dislike the mere thought that .a is not an archive any longer. Does it make sense from the overhead and complexity POV, for such a small goal?
Re: Fix overlinking in base aka import pkgconf
On Sat, Dec 15, 2012 at 08:58:14AM +0100, Bernhard Fröhlich wrote: On Sat, 15 Dec 2012 02:22:34 CET, Konstantin Belousov kostik...@gmail.com wrote: On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote: Hi, Some of our binaries are overlinked; the way we handle linking doesn't help with that. What do you mean there? Do you mean that some libraries specified for the linking stage of the final binary are not needed for the execution? One proposition could be to use pkgconf https://github.com/pkgconf/pkgconf which is a BSD-licensed pkg-config implementation, 100% compatible with pkg-config. What I propose is to create a new PCADD variable for the Makefiles. PCADD will invoke pkgconf to gather the libraries and the cflags for a given project. The second thing would be to create .pc files for all of our libraries. For example: the usr.bin/fstat dynamic build is overlinked. And how is this better than just removing the unneeded library from the Makefile? For the port consumption, I believe that the better solution is to provide a pack of the .pc files describing base libraries, most likely as a port. We should definitely generate some .pc files for our base libraries. We already have quite a few ports that need to hack around missing .pc files, for ssl for example. This is fine. But making the base build depend on pkgconf does not give us anything except complications and overhead, in my opinion.
Re: postgres, initdb, FreeBSD bug?
On Fri, Dec 14, 2012 at 10:43:00PM -0600, David Noel wrote: I've been fighting with a bug I can't quite seem to figure out and was told that this might be the place to come. I'm running postgresql-9.2.2 on FreeBSD 8.3-RELEASE-p5 and am having things break down when I try to run initdb. I got in contact with the pgsql-general mailing list and we debugged the issue to the point where it seemed that this might be a FreeBSD-related error. Relevant excerpts from several emails are below that piece together the error: I'm running into the following error message when running initdb (FreeBSD host): ygg# /usr/local/etc/rc.d/postgresql initdb -D /zdb/pgsql/data --debug The files belonging to this database system will be owned by user pgsql. This user must also own the server process. The database cluster will be initialized with locales COLLATE: C CTYPE: en_US.UTF-8 MESSAGES: en_US.UTF-8 MONETARY: en_US.UTF-8 NUMERIC: en_US.UTF-8 TIME: en_US.UTF-8 The default text search configuration will be set to english. creating directory /zdb/pgsql/data ... ok creating subdirectories ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 32MB creating configuration files ... ok creating template1 database in /zdb/pgsql/data/base/1 ... FATAL: could not open file pg_xlog/00010001 (log file 0, segment 1): No such file or directory child process exited with exit code 1 initdb: removing data directory /zdb/pgsql/data ... Interestingly, I have a second--virtually identical--server that I just tried initdb on. FreeBSD 8.3-RELEASE-p5, postgresql-server-9.2.2. Exact same FATAL: could not open file pg_xlog error. So it is reproducible. ...
The relevant part of the ktrace output is 71502 postgres CALL unlink(0x7fffc130) 71502 postgres NAMI pg_xlog/xlogtemp.71502 71502 postgres RET unlink -1 errno 2 No such file or directory 71502 postgres CALL open(0x7fffc130,O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR) 71502 postgres NAMI pg_xlog/xlogtemp.71502 71502 postgres RET open 3 71502 postgres CALL write(0x3,0x801a56030,0x2000) 71502 postgres GIO fd 3 wrote 4096 bytes a lot of uninteresting write() calls snipped ... 71502 postgres RET write 8192/0x2000 71502 postgres CALL close(0x3) 71502 postgres RET close 0 71502 postgres CALL unlink(0x7fffbc60) 71502 postgres NAMI pg_xlog/00010001 71502 postgres RET unlink -1 errno 2 No such file or directory 71502 postgres CALL link(0x7fffc130,0x7fffbc60) 71502 postgres NAMI pg_xlog/xlogtemp.71502 71502 postgres NAMI pg_xlog/00010001 71502 postgres RET link -1 errno 1 Operation not permitted 71502 postgres CALL unlink(0x7fffc130) 71502 postgres NAMI pg_xlog/xlogtemp.71502 71502 postgres RET unlink 0 71502 postgres CALL open(0x7fffc530,O_RDWR,unused0x180) 71502 postgres NAMI pg_xlog/00010001 71502 postgres RET open -1 errno 2 No such file or directory This corresponds to the execution of XLogFileInit(), and what's evidently happening is that we successfully create and zero-fill the first xlog segment file under a temporary name, but then the attempt to rename it into place with link() fails with EPERM. This is really a WTF kind of failure, I think. The directory is certainly writable --- it was made under our own UID, and what's more we just managed to create the file there under its temp name. So how can we get an EPERM failure from link()? I think this is a kernel bug. regards, tom lane PS: one odd thing here is that the ereport(LOG) in InstallXLogFileSegment isn't doing anything; otherwise we'd have gotten a much more helpful error report about could not link file. 
I don't think we run the bootstrap mode with log_min_messages set high enough to disable LOG messages, so why isn't it printing? Nonetheless, this error shouldn't have occurred. ... Where to from here? The freebsd-datab...@freebsd.org mailing list? The postgresql port maintainer? Who should I be in touch with? ... You need to talk to some FreeBSD kernel hackers about why link() might be failing here. Since you see it on UFS too, we can probably exonerate the ZFS filesystem-specific code. I did some googling and found that EPERM can be issued if the filesystem doesn't support hard links (which shouldn't apply to ZFS I trust). Also, Linux has a protected_hardlinks option that causes certain attempts at creating hard links to fail --- but our use-case here doesn't fall foul of any of those restrictions AFAICS, and of course FreeBSD isn't Linux. Still, I wonder if you're running into some misdesigned or misimplemented
Re: Fix overlinking in base aka import pkgconf
On Sat, Dec 15, 2012 at 11:56:43AM +0100, Baptiste Daroussin wrote: On Sat, Dec 15, 2012 at 03:22:34AM +0200, Konstantin Belousov wrote: On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote: Hi, Some of our binaries are overlinked; the way we handle linking doesn't help with that. What do you mean there? Do you mean that some libraries specified for the linking stage of the final binary are not needed for the execution? I mean some libraries are registered in the binary with a DT_NEEDED tag where they don't need to be. One proposition could be to use pkgconf https://github.com/pkgconf/pkgconf which is a BSD-licensed pkg-config implementation, 100% compatible with pkg-config. What I propose is to create a new PCADD variable for the Makefiles. PCADD will invoke pkgconf to gather the libraries and the cflags for a given project. The second thing would be to create .pc files for all of our libraries. For example: the usr.bin/fstat dynamic build is overlinked. And how is this better than just removing the unneeded library from the Makefile? In that case, to build statically you need -lkvm -lutil -lprocstat, but if you build dynamically you will only need -lprocstat, meaning you don't need to be directly linked to libutil and libkvm. This means a breakage of libutil will only have impact on procstat and no longer on fstat, for example. For the port consumption, I believe that the better solution is to provide a pack of the .pc files describing base libraries, most likely as a port. Yeah, the port is another thing which can probably be done that way. Using .pc for the base system build is overkill; it does not add anything that cannot be accomplished by our existing build system, IMO. Probably. The thing is, with pkgconf fstat does not need to know that procstat needs libkvm and libutil for a static link; it just has to know it needs procstat, and pkgconf does all the magic.
and pkgconf is really small (only 48k on an amd64 box). Another solution would be to reinvent the same thing using our framework? Maybe an LDSADD (LD STATIC ADD) to differentiate both? First, there is a linker flag, --as-needed, which does exactly what you want: it only registers the dso as a dependency using the DT_NEEDED dynamic tag when there are real references from the linked object to the symbols in the dso. On the other hand, I am not a big fan of this feature. The dynamic linker is permissive in letting the main binary reference symbols from implicit dependencies of the explicit dependencies. If a sudden change causes the main binary to start using a symbol from a 'second order' dso, you get unnecessary complications there. More, recently there was a trend in the Linux world to disallow the static linker from resolving symbols from second order dsos. See --no-add-needed and its recent switch in the newer ld. Overall, I like the fact that we explicitly list all depended-upon base libraries in the link command, for the base system.
Re: postgres, initdb, FreeBSD bug?
On Sat, Dec 15, 2012 at 09:02:34AM -0600, David Noel wrote: Ahh... the security.bsd sysctl output had the answer: I had security.bsd.hardlink_check_gid and security.bsd.hardlink_check_uid set to 1 in sysctl.conf. Removing that fixed the problem. Then, why did you set it? The sysctl indeed has an interesting interaction with the fact that the created inode inherits its gid from the parent directory. If you use it, you must ensure that the directory group is in the group set of the process. Many thanks, -David Show the ktrace from the same error on UFS. 5179 postgres CALL unlink(0x7fffe570) 5179 postgres NAMI pg_notify/ 5179 postgres RET unlink 0 5179 postgres CALL getdirentries(0x3,0x801a4b000,0x1000,0x801a4a068) 5179 postgres RET getdirentries 0 5179 postgres CALL lseek(0x3,0,SEEK_SET) 5179 postgres RET lseek 0 5179 postgres CALL close(0x3) 5179 postgres RET close 0 5179 postgres CALL open(0x7fffe580,O_RDWR|O_CREAT,S_IRUSR|S_IWUSR) 5179 postgres NAMI pg_notify/ 5179 postgres RET open 3 5179 postgres CALL lseek(0x3,0,SEEK_SET) 5179 postgres RET lseek 0 5179 postgres CALL write(0x3,0x8041c1b40,0x2000) 5179 postgres GIO fd 3 wrote 4096 bytes ... 5179 postgres RET write 8192/0x2000 5179 postgres CALL close(0x3) 5179 postgres RET close 0 5179 postgres CALL unlink(0x7fffbc60) 5179 postgres NAMI pg_xlog/00010001 5179 postgres RET unlink -1 errno 2 No such file or directory 5179 postgres CALL link(0x7fffc130,0x7fffbc60) 5179 postgres NAMI pg_xlog/xlogtemp.5179 5179 postgres NAMI pg_xlog/00010001 5179 postgres RET link -1 errno 1 Operation not permitted 5179 postgres CALL unlink(0x7fffc130) 5179 postgres NAMI pg_xlog/xlogtemp.5179 5179 postgres RET unlink 0 5179 postgres CALL open(0x7fffc530,O_RDWR,unused0x180) 5179 postgres NAMI pg_xlog/00010001 5179 postgres RET open -1 errno 2 No such file or directory Show the security.bsd sysctl settings, in particular hardlink_check_{u,g}id. ygg# sysctl security.bsd.
security.bsd.map_at_zero: 0 security.bsd.suser_enabled: 1 security.bsd.unprivileged_proc_debug: 0 security.bsd.conservative_signals: 1 security.bsd.see_other_gids: 0 security.bsd.see_other_uids: 0 security.bsd.unprivileged_idprio: 0 security.bsd.unprivileged_read_msgbuf: 0 security.bsd.hardlink_check_gid: 1 security.bsd.hardlink_check_uid: 1 security.bsd.unprivileged_get_quota: 0 security.bsd.stack_guard_page: 0 Show the ls -la output for the pg_xlog directory. ygg:~ ls -la /zdb/pgsql/data/pg_xlog/ total 5 drwx-- 3 pgsql wheel 3 Dec 15 08:39 . drwx-- 14 pgsql wheel 18 Dec 15 08:39 .. drwx-- 2 pgsql wheel 2 Dec 15 08:39 archive_status
Re: Fix overlinking in base aka import pkgconf
On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote: Hi, Some of our binaries are overlinked; the way we handle linking doesn't help with that. What do you mean there? Do you mean that some libraries specified for the linking stage of the final binary are not needed for the execution? One proposition could be to use pkgconf https://github.com/pkgconf/pkgconf which is a BSD-licensed pkg-config implementation, 100% compatible with pkg-config. What I propose is to create a new PCADD variable for the Makefiles. PCADD will invoke pkgconf to gather the libraries and the cflags for a given project. The second thing would be to create .pc files for all of our libraries. For example: the usr.bin/fstat dynamic build is overlinked. And how is this better than just removing the unneeded library from the Makefile? For the port consumption, I believe that the better solution is to provide a pack of the .pc files describing base libraries, most likely as a port. Using .pc for the base system build is overkill; it does not add anything that cannot be accomplished by our existing build system, IMO.
Re: [PATCH] Shared library debug .symbols files
On Thu, Dec 13, 2012 at 04:08:47PM +, Ed Maste wrote: I've been working on generating userland debugging symbols, with the goal that we'll build them for each release. The user could install them along with the system, or later on when needed for debugging. The symbols files will also be useful for profiling and for tools such as Valgrind, without needing to build and install world first. This patch enables .symbols files for shared libraries when DEBUG_FLAGS is set. Future changes will be needed to address static libraries and base system binaries, and the release build bits. You cannot strip static libraries, at least if you want the result to be useful for subsequent use.
Re: ELF symtab and ddbsymtab difference
On Wed, Dec 05, 2012 at 12:13:24PM +0530, Shrikanth Kamath wrote: This is regarding the fields in the structure elf_file_t in link_elf.c. For some kernel modules the symtab field is different from the ddbsymtab field, for some it is the same. I would like to know what the difference between the two is, and how to enable ddbsymtab? Assuming we are talking about link_elf.c and not about link_elf_obj.c: the symtab and ddbsymtab are first initialized from the dynamic symbol table in the module, and later, in the process of loading the module, if a non-dynamic symbol table is present, ddbsymtab is rewritten to point to that table. Does enabling -g in CFLAGS make the binary build the ddbsymtab different from symtab? No. It is strip run on the module which could result in the removal of the non-dynamic symtab. The problem is that a lookup for some symbols in the kernel module that I built returns undefined; on inspecting, it was getting ENOENT from the function link_elf_lookup_symbol() { ... /* If we have not found it, look at the full table (if loaded) */ if (ef->symtab == ef->ddbsymtab) return (ENOENT); ... } That is not the problem. It just means that your dynamic symbol table does not contain the needed symbol, which is the real problem. As a coincidence, you also stripped your module, which exposed the problem. I guess that you should look at the EXPORT_SYMS feature of the module Makefiles. But I also remember there were some bugs.
Re: naming a .h file for kernel use only
On Tue, Dec 04, 2012 at 03:28:37PM -0500, Rick Macklem wrote: Hi, For my NFSv4.1 client work, I've taken a few definitions out of a kernel rpc .c file and put them in a .h file so that they can be included in other sys/rpc .c files. I've currently named the file _krpc.h. I thought I'd check whether this is a reasonable name before doing the big commit of the NFSv4.1 stuff to head. (I have a vague notion that a leading _ would indicate not for public use, but I am not sure?) Thanks in advance for naming suggestions for this file, rick I believe that _thing.h is the convention for headers that are never directly included by .c files. It is there to hide some shared parts of other includes, for consumption by other includes. Why not move the stuff to sys/nfs/krpc.h? BTW, this file lacks include guards.
Re: Two Intel E31220L 9.0-Stable systems, 'kern.random' missing on one?
On Mon, Dec 03, 2012 at 03:11:21PM +, Karl Pielorz wrote: Hi, I have two SuperMicro E31220L based systems - both had identical /etc/sysctl.conf - I then shifted them from 9.0-R to 9.0-Stable (as of 2012/12/03). Now I've noticed one of them complains at boot time that a bunch of OIDs are missing - and sure enough: sysctl kern.random sysctl: unknown oid 'kern.random' But the other system returns these fine. One system IDs as 'Intel E31220L @ 2.20Ghz' - the other IDs as 'Intel E31220L V2 @ 2.30Ghz'; the V2 system is apparently 'missing' the kern.random stuff. The only reason I can think of for the OIDs not being present is if one system is using a hardware RNG? - Though 'man 4 random' states: The only hardware implementation currently is for the VIA C3 Nehemiah (stepping 3 or greater) CPU. More will be added in the future. Is there any other reason why they would have 'disappeared' on the V2 system? (in fact, looking at the Features2 line I can see 'RDRAND' on the V2 system, hmm, so I'm guessing that's it?!) You noted that RDRAND is supported by your hardware. And indeed, FreeBSD got RDRAND (aka Bull Mountain) hardware random number generator support recently. From my reading of the sys/dev/random code, the kern.random OID is only instantiated if the software implementation is selected. If you prefer software, you can disable RDRAND with the hw.ivy_rng_enable=0 tunable.
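For completeness, a tunable like this is set at boot time rather than via sysctl.conf; a sketch of the loader.conf entry, assuming the tunable name given in the message above:

```
# /boot/loader.conf -- prefer the software RNG (restoring the kern.random
# OIDs) by disabling the RDRAND-backed generator; tunable name as quoted
# in the reply above
hw.ivy_rng_enable=0
```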
Re: FreeBSD needs Git to ensure repo integrity [was: 2012 incident]
On Sat, Nov 17, 2012 at 11:05:40PM -0800, Perry Hutchison wrote: [trimmed some of the lists] Chris Rees utis...@gmail.com wrote: ... git doesn't work with our workflow. I'm sure the workflow itself is documented somewhere, but is there a good writeup of _how_ git doesn't work with it, e.g. what capabilit{y,ies} is/are missing? Seems this might be of interest to the git developers, not because they necessarily want to support FreeBSD as such, but as an example of a real-world workflow that git currently does not handle well. Git would work well with our workflow. It supports the centralized repository model, which the project employs right now. The biggest issues, assuming the project indeed decides to move to Git right now, are the migration costs, both in terms of the technical effort needed and the learning curve for most of the committers. A relatively minor problem, at least with the current rate of commits, would be a commit race, when the shared repo head has moved forward due to a parallel commit. That issue should be somewhat mitigated by Git's ability to push a set of changes in one push.
Re: Memory reserves or lack thereof
On Thu, Nov 15, 2012 at 11:32:18AM -0600, Alan Cox wrote: On 11/13/2012 05:54, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote: On 11/12/2012 3:48 PM, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote: This patch still doesn't address the issue of M_NOWAIT calls driving the memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep. This is out of scope of the change. But it is required for any further adjustments. I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify dig deep. Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, This is out of scope ..., I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change. I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, my meaning of 'out of scope' is about tweaking the 'two pages reserve' in some way. Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed man page update to malloc(9): Index: share/man/man9/malloc.9 === --- share/man/man9/malloc.9 (revision 243091) +++ share/man/man9/malloc.9 (working copy) @@ -29,7 +29,7 @@ .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $ .\" $FreeBSD$ .\" -.Dd January 28, 2012 +.Dd November 15, 2012 .Dt MALLOC 9 .Os .Sh NAME @@ -153,13 +153,12 @@ if .Dv M_WAITOK is specified. .It Dv M_USE_RESERVE -Indicates that the system can dig into its reserve in order to obtain the -requested memory.
-This option used to be called -.Dv M_KERNEL -but has been renamed to something more obvious. -This option has been deprecated and is slowly being removed from the kernel, -and so should not be used with any new programming. +Indicates that the system can use its reserve of memory to satisfy the +request. +This option should only be used in combination with +.Dv M_NOWAIT +when an allocation failure cannot be tolerated by the caller without +catastrophic effects on the system. .El .Pp Exactly one of either The text looks fine. Shouldn't the requirement for M_USE_RESERVE be also expressed in a KASSERT, like this: diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index d9e4692..f8a4f70 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -353,6 +351,9 @@ malloc2vm_flags(int malloc_flags) { int pflags; + KASSERT((malloc_flags & M_USE_RESERVE) == 0 || + (malloc_flags & M_NOWAIT) != 0, + ("M_USE_RESERVE requires M_NOWAIT")); pflags = (malloc_flags & M_USE_RESERVE) != 0 ? VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM; if ((malloc_flags & M_ZERO) != 0) I understand that this could be added at the allocators' entry points, but I think that the page allocations are fine too.
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote: On 11/12/2012 3:48 PM, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote: This patch still doesn't address the issue of M_NOWAIT calls driving the memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep. This is out of scope of the change. But it is required for any further adjustments. I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify dig deep. Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, This is out of scope ..., I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change. I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, my meaning of 'out of scope' is about tweaking the 'two pages reserve' in some way.
Re: Memory reserves or lack thereof
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote: On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote: On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote: I have a memory subsystem design question that I'm hoping someone can answer. I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating. Most allocators come down to vm_page_alloc(), which has this guard: if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) { page_req = VM_ALLOC_SYSTEM; }; if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved || (page_req == VM_ALLOC_SYSTEM && cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) || (page_req == VM_ALLOC_INTERRUPT && cnt.v_free_count + cnt.v_cache_count > 0)) { The key observation is that if VM_ALLOC_INTERRUPT is set, it will allocate every last page. From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so; see kmem_malloc() or uma_small_alloc(), which both contain this mapping: if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT) pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED; else pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED; Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines. What this means is that essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available. There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.
It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover. Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory. Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK. Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers. The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues. When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before. This had the unintended consequence of introducing a lot of memory allocations in the top half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues. Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was OK to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page. In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce an M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.
The trouble is that we then need to identify all of the places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues, so that we can add an explicit M_USE_RESERVE. Alan P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages? I am almost sure that most people who use the M_NOWAIT flag do not know about the 'allow a deeper drain of the free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectation is that the problematic places would be in the swapout path. I found a single explicit use of M_USE_RESERVE in the kernel, so the flip
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 11:35:42AM -0600, Alan Cox wrote: Agreed. Most recently I eliminated several uses from the arm pmap implementations. There is, however, one other use: ofed/include/linux/gfp.h:#define GFP_ATOMIC (M_NOWAIT | M_USE_RESERVE) Yes, I forgot to mention this. I have no idea about the semantics of the GFP_ATOMIC compat flag. Below is the updated patch with your two notes applied.

diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
index dc9c307..ab1e869 100644
--- a/sys/amd64/amd64/uma_machdep.c
+++ b/sys/amd64/amd64/uma_machdep.c
@@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/param.h>
 #include <sys/lock.h>
+#include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/systm.h>
 #include <vm/vm.h>
@@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;

 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags);
 		if (m == NULL) {
diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
index f60cdb1..75366e3 100644
--- a/sys/arm/arm/vm_machdep.c
+++ b/sys/arm/arm/vm_machdep.c
@@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 		ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
 		return (ret);
 	}
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
 		if (m == NULL) {
diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
index 71caa29..2ce1ca6 100644
--- a/sys/fs/devfs/devfs_devs.c
+++ b/sys/fs/devfs/devfs_devs.c
@@ -121,7 +121,7 @@ devfs_alloc(int flags)
 	struct cdev *cdev;
 	struct timespec ts;

-	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
+	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
 	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
 	if (cdp == NULL)
 		return (NULL);
diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
index 37353ff..9f77762 100644
--- a/sys/ia64/ia64/uma_machdep.c
+++ b/sys/ia64/ia64/uma_machdep.c
@@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;

 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
index 798e632..24baef0 100644
--- a/sys/mips/mips/uma_machdep.c
+++ b/sys/mips/mips/uma_machdep.c
@@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	void *va;

 	*flags = UMA_SLAB_PRIV;
-
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT;
-	else
-		pflags = VM_ALLOC_SYSTEM;
+	pflags = m2vm_flags(wait, 0);
 	for (;;) {
 		m = pmap_alloc_direct_page(0, pflags);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index a491680..3e320b9 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	*flags = UMA_SLAB_PRIV;
 	needed_lock = !PMAP_LOCKED(kernel_pmap);
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
index 162c7fb..3882bfa 100644
--- a/sys/powerpc/aim/slb.c
+++ b/sys/powerpc/aim/slb.c
@@ -483,12 +483,7 @@
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote: This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just be a non-sleeping version of M_WAITOK, and an M_USE_RESERVE flag to dig deep. This is out of scope of the change. But it is required for any further adjustments.
Re: Memory reserves or lack thereof
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote: I have a memory subsystem design question that I'm hoping someone can answer. I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating. Most allocators come down to vm_page_alloc(), which has this guard:

	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
		page_req = VM_ALLOC_SYSTEM;
	};
	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
	    (page_req == VM_ALLOC_SYSTEM &&
	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
	    (page_req == VM_ALLOC_INTERRUPT &&
	    cnt.v_free_count + cnt.v_cache_count > 0)) {

The key observation is that if VM_ALLOC_INTERRUPT is set, it will allocate every last page. From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so; see kmem_malloc() or uma_small_alloc(), which both contain this mapping:

	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
	else
		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;

Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines. What this means is that essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available. There is no documentation stating that M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is used literally all over. It looks to me like the design goal of the BSD allocators is recovery; they will give all pages away, knowing they can recover. Am I missing anything?
I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory. Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.
Re: [patch] rtld: fix fd leak with parallel dlopen and fork/exec
On Sun, Nov 04, 2012 at 03:37:27PM +0100, Jilles Tjoelker wrote: Rtld does not set FD_CLOEXEC on its internal file descriptors; therefore, such a file descriptor may be passed to a process created by another thread running in parallel to dlopen() or fdlopen(). The race is easy to trigger with the below dlopen_exec_race.c, which performs the two in parallel repeatedly, for example: ./dlopen_exec_race /lib/libedit.so.7 | grep lib. No other threads are expected to be running during parsing of the hints and libmap files, but the file descriptors need not be passed to child processes, so I added O_CLOEXEC there as well. The O_CLOEXEC flag is ignored by older kernels, so it will not cause breakage when people try the unsupported action of running a new rtld on an old kernel. However, fcntl(F_DUPFD_CLOEXEC) will fail with [EINVAL] on an old kernel, so this patch will break fdlopen() with old kernels. I suppose this is OK because fdlopen() is not used in the vital parts of buildworld and the like. The patch should be fine. We definitely do not support running newer binaries on older kernels, so the worries about fdlopen(3) are unfounded. Please commit.

Index: libexec/rtld-elf/libmap.c
===================================================================
--- libexec/rtld-elf/libmap.c (revision 240561)
+++ libexec/rtld-elf/libmap.c (working copy)
@@ -121,7 +121,7 @@
 		}
 	}

-	fd = open(rpath, O_RDONLY);
+	fd = open(rpath, O_RDONLY | O_CLOEXEC);
 	if (fd == -1) {
 		dbg("lm_parse_file: open(\"%s\") failed, %s", rpath,
 		    rtld_strerror(errno));
Index: libexec/rtld-elf/rtld.c
===================================================================
--- libexec/rtld-elf/rtld.c (revision 240561)
+++ libexec/rtld-elf/rtld.c (working copy)
@@ -1598,7 +1598,7 @@
 	/* Keep from trying again in case the hints file is bad. */
 	hints = "";
-	if ((fd = open(ld_elf_hints_path, O_RDONLY)) == -1)
+	if ((fd = open(ld_elf_hints_path, O_RDONLY | O_CLOEXEC)) == -1)
 		return (NULL);
 	if (read(fd, &hdr, sizeof hdr) != sizeof hdr ||
 	    hdr.magic != ELFHINTS_MAGIC ||
@@ -2046,13 +2046,13 @@
 	 */
 	fd = -1;
 	if (fd_u == -1) {
-		if ((fd = open(path, O_RDONLY)) == -1) {
+		if ((fd = open(path, O_RDONLY | O_CLOEXEC)) == -1) {
 			_rtld_error("Cannot open \"%s\"", path);
 			free(path);
 			return (NULL);
 		}
 	} else {
-		fd = dup(fd_u);
+		fd = fcntl(fd_u, F_DUPFD_CLOEXEC, 0);
 		if (fd == -1) {
 			_rtld_error("Cannot dup fd");
 			free(path);

dlopen_exec_race.c:

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <dlfcn.h>
	#include <err.h>
	#include <errno.h>
	#include <pthread.h>
	#include <spawn.h>
	#include <stdio.h>

	extern char **environ;

	static void *
	dlopen_thread(void *arg)
	{
		const char *path = arg;
		void *handle;

		for (;;) {
			handle = dlopen(path, RTLD_LOCAL | RTLD_NOW);
			if (handle == NULL)
				continue;
			dlclose(handle);
		}
		return NULL;
	}

	static void
	filestat_loop(void)
	{
		int error, status;
		pid_t pid, wpid;

		for (;;) {
			error = posix_spawnp(&pid, "sh", NULL, NULL,
			    (char *[]){ "sh", "-c", "procstat -f $$", NULL },
			    environ);
			if (error != 0)
				errc(1, error, "posix_spawnp");
			do
				wpid = waitpid(pid, &status, 0);
			while (wpid == -1 && errno == EINTR);
			if (wpid == -1)
				err(1, "waitpid");
			if (status != 0)
				errx(1, "sh -c failed");
		}
	}

	int
	main(int argc, char *argv[])
	{
		pthread_t td;
		int error;

		if (argc != 2) {
			fprintf(stderr, "Usage: %s dso\n", argv[0]);
			return 1;
		}
		error = pthread_create(&td, NULL, dlopen_thread, argv[1]);
		if (error != 0)
			errc(1, error, "pthread_create");
		filestat_loop();
		return 0;
	}

fdlopen_exec_race.c:

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <dlfcn.h>
	#include <err.h>
	#include <errno.h>
	#include <fcntl.h>
	#include <pthread.h>
	#include <spawn.h>
	#include <stdio.h>
	#include <unistd.h>

	extern char **environ;

	static void *
	dlopen_thread(void *arg)
	{
		const char *path = arg;
		int fd;
		void *handle;

		for (;;) {
			fd = open(path, O_RDONLY | O_CLOEXEC);
			if (fd == -1)
				err(1, "open %s", path);
			handle = fdlopen(fd, RTLD_LOCAL | RTLD_NOW);
			close(fd);
			if (handle == NULL)
				continue;
			dlclose(handle);
		}
		return NULL;
	}

	static void
Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..
On Sat, Nov 03, 2012 at 01:11:17PM -0500, Alan Cox wrote: On Wed, Oct 31, 2012 at 2:06 PM, Konstantin Belousov kostik...@gmail.com wrote: On Wed, Oct 31, 2012 at 11:52:06AM -0700, Adrian Chadd wrote: On 31 October 2012 11:20, Ian Lepore free...@damnhippie.dyndns.org wrote: I think there are some things we should be investigating about the growth of memory usage. I just noticed this: FreeBSD 6.2 on an arm processor: 369 root 1 8 -88 1752K 748K nanslp 3:00 0.00% watchdogd FreeBSD 10.0 on the same system: 367 root 1 -52 r0 10232K 10160K nanslp 10:04 0.00% watchdogd The 10.0 system is built with MALLOC_PRODUCTION (without that defined the system won't even boot; it only has 64MB of RAM). That's a crazy amount of growth for a relatively simple daemon. Would you please, _please_ do some digging into this? It's quite possible there's something in the libraries that is allocating some memory upon first call invocation - yes, that's jemalloc, but it could also be other things like stdio. We really, really need to fix this userland bloat; it's terribly ridiculous at this point. There's no reason a watchdog daemon should take 10 megabytes of RAM. Watchdogd was recently changed to mlock its memory. This is the cause of the RSS increase. Is it also statically linked? No. I do not think that it is reasonable to statically link watchdogd. It might result in some memory saving, but I dislike the whole idea of static linkage on Tier 1 platforms.
Re: watchdogd, jemalloc, and mlockall
On Sat, Nov 03, 2012 at 12:38:39PM -0600, Ian Lepore wrote: In an attempt to un-hijack the thread about the memory usage increase between 6.4 and 9.x, I'm starting a new thread here related to my recent discovery that watchdogd uses a lot more memory since it began using mlockall(2). I tried statically linking watchdogd and it made a small difference in RSS, presumably because it doesn't wire down all of libc and libm.

	VSZ	RSS
	10236	10164	Dynamic
	8624	8636	Static

Those numbers are from ps -u on an arm platform. I just updated the PR (bin/173332) with some procstat -v output comparing with/without mlockall(). It appears that the bulk of the new RSS bloat comes from jemalloc allocating vmspace in 8MB chunks. With mlockall(MCL_FUTURE) in effect, that leads to wiring 8MB to satisfy what probably amounts to a few hundred bytes of malloc'd memory. It would probably also be a good idea to remove the floating point from watchdogd to avoid wiring all of libm. The floating point is used just to turn the timeout-in-seconds into a power-of-two-nanoseconds value. There's probably a reasonably efficient way to do that without calling log(), considering that it only happens once at program startup. No, I propose to add a switch to turn the mlockall() call on or off. I have no opinion on the default value of the suggested switch.
Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..
On Wed, Oct 31, 2012 at 09:49:21AM +, Karl Pielorz wrote: --On 30 October 2012 19:51 +0200 Konstantin Belousov kostik...@gmail.com wrote: I suggest to take a look at where the actual memory goes. Start with procstat -v. Ok, running that for the milter PID I seem to be able to see smallish chunks used for things like 'libmilter.so', and 'libthr.so' / 'libc.so' - e.g. Since you neglected to provide the verbatim output of procstat, nothing conclusive can be said. Obviously, you can make an investigation on your own.

	2010 0x40 0x413000 r-x 19 0 1 0 CN-- vn /usr/local/sbin/milter
	2010 0x613000 0x80 rw- 15 0 1 0 df
	2010 0x800613000 0x80062b000 r-x 24 0 97 0 CN-- vn /libexec/ld-elf.so.1
	2010 0x80062b000 0x800634000 rw- 9 0 1 0 df
	2010 0x800634000 0x80065f000 rw- 33 0 1 0 df
	2010 0x80082b000 0x80082d000 rw- 2 0 1 0 df
	2010 0x80082d000 0x800839000 r-x 12 12 2 1 CN-- vn /usr/lib/libmilter.so.5
	2010 0x800839000 0x800a38000 --- 0 0 1 0 df
	2010 0x800a38000 0x800a39000 rw- 1 0 1 0 C--- vn /usr/lib/libmilter.so.5
	2010 0x800a39000 0x800a3c000 rw- 0 0 0 0 --

Then there's a bunch of 'large' blocks, e.g.:

	PID  START    END      PRT RES  PRES REF SHD FL TP PATH
	2010 0x801c0  0x80280  rw- 2869 0    4   0      df
	2010 0x80280  0x80340  rw- 1880 0    1   0      df
	2010 0x80340  0x80380  rw- 92   0    1   0      df
	2010 0x80380  0x80400  rw- 865  0    1   0      df
	2010 0x80400  0x80440  rw- 1024 0    4   0      df
	2010 0x80440  0x804c0  rw- 627  0    1   0      df
	2010 0x804c0  0x80500  rw- 1021 0    4   0      df

Most likely, these are malloc arenas. Then lots of 'little' blocks:

	2010 0x70161000 0x70181000 rw- 16 0 1 0 ---D df
	2010 0x70362000 0x70382000 rw- 16 0 1 0 ---D df
	2010 0x70563000 0x70583000 rw- 16 0 1 0 ---D df
	2010 0x70764000 0x70784000 rw- 16 0 1 0 ---D df
	2010 0x70965000 0x70985000 rw- 16 0 1 0 ---D df
	2010 0x70b66000 0x70b86000 rw- 16 0 1 0 ---D df
	2010 0x70d67000 0x70d87000 rw- 16 0 1 0 ---D df
	2010 0x70f68000 0x70f88000 rw- 16 0 1 0 ---D df
	2010 0x71169000 0x71189000 rw- 16 0 1 0 ---D df

And those are thread stacks.
The memory usage figures seem to have 'stabilized' now - SIZE/RES seem to track around 9 times the size of the 6.4 system, for a comparative load. We can probably live with this (as they have come back down overnight as load fell off) - i.e. they don't appear to be continuously growing (the binaries were heavily profiled under 6.x and found to have no internal leaks). -Karl
Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..
On Wed, Oct 31, 2012 at 02:44:05PM +, Karl Pielorz wrote: --On 31 October 2012 16:06 +0200 Konstantin Belousov kostik...@gmail.com wrote: Since you neglected to provide the verbatim output of procstat, nothing conclusive can be said. Obviously, you can make an investigation on your own. Sorry - when I ran it this morning the output was several hundred lines - I didn't want to post all of that to the list; 99% of the lines are very similar. I can email it to you off-list if having the whole lot will help? Just put it somewhere on an HTTP server and provide the URL. Then there's a bunch of 'large' blocks, e.g.:

	PID  START    END      PRT RES  PRES REF SHD FL TP PATH
	2010 0x801c0  0x80280  rw- 2869 0    4   0      df
	2010 0x80280  0x80340  rw- 1880 0    1   0      df

Most likely, these are malloc arenas. Ok, that's the heaviest usage. So it is, most likely, either the app doing more malloc allocations, or the difference is due to more aggressive jemalloc behaviour. If this indeed bothers you, you could try to use plug-in replacements for malloc. They are usually LD_PRELOAD'ed; search the ports. But I would not bother. Then lots of 'little' blocks:

	2010 0x70161000 0x70181000 rw- 16 0 1 0 ---D df

And those are thread stacks. Ok, lots of those (lots of threads going on) - but they're all pretty small. My code only has a single call to malloc, which allocates around 20k per thread. You should look both at the virtual address space allocation for the block, which is specified by (start, end), and at the actual resident page count. Obviously there's other libraries and stuff running with the code - so would I be correct in guessing that they are more than likely responsible for most of these large blocks? No idea.
Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..
On Wed, Oct 31, 2012 at 11:52:06AM -0700, Adrian Chadd wrote: On 31 October 2012 11:20, Ian Lepore free...@damnhippie.dyndns.org wrote: I think there are some things we should be investigating about the growth of memory usage. I just noticed this: Freebsd 6.2 on an arm processor: 369 root 1 8 -88 1752K 748K nanslp 3:00 0.00% watchdogd Freebsd 10.0 on the same system: 367 root 1 -52 r0 10232K 10160K nanslp 10:04 0.00% watchdogd The 10.0 system is built with MALLOC_PRODUCTION (without that defined the system won't even boot, it only has 64MB of ram). That's a crazy amount of growth for a relatively simple daemon. Would you please, _please_ do some digging into this? It's quite possible there's something in the libraries that are allocating some memory upon first call invocation - yes, that's jemalloc, but it could also be other things like stdio. We really, really need to fix this userland bloat; it's terribly ridiculous at this point. There's no reason a watchdog daemon should take 10megabytes of RAM. Watchdogd was recently changed to mlock its memory. This is the cause of the RSS increase. If not wired, swapout might cause a delay of the next pat, leading to panic.
Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..
On Tue, Oct 30, 2012 at 10:48:03AM -0700, Alfred Perlstein wrote: Some suggestions here: jemalloc and kernel threads are good ones. Another issue may just be some change to the default thread stack size. This would explain why the RESIDENT set is the same but the VIRTUAL grew. I suggest to take a look at where the actual memory goes. Start with procstat -v. -Alfred On 10/30/12 9:56 AM, Karl Pielorz wrote: --On 30 October 2012 19:43 +0700 Erich Dollansky erichfreebsdl...@ovitrap.com wrote: Depends how you mean 'the same' - on the 6.4 system it shows: cc (GCC) 3.4.6 [FreeBSD] 20060305 And, on the 9.0-S it shows: cc (GCC) 4.2.1 20070831 patched [FreeBSD] So 'same' - but different versions. Did you check the default data sizes? How do you mean? Now they've been running for an hour or so - they've gotten a little larger: 552M/154M and 703M/75M. If it's not harmful I can live with it - it was just a bit of a surprise. And a reason to spend more money on memory. Knowing the real reason would be better. I can understand your surprise. Hehe, more 'concern' than surprise I guess now... The sendmail milter has grown to a SIZE/RES of 1045M / 454M under 9.0. The original 6.4 machine under heavier load (more connections) shows a SIZE/RES of 85M/52M. The TCP listener code is now showing a SIZE/RES of 815M/80M under 9.0, with the original 6.4 box showing 44M/9.5M. The 9.0 box says it has 185M active, 472M inactive, 693M wired, 543M buf, and 4554M free. At this stage I'm just a bit concerned that at least the milter code is going to grow, and grow - and die. I would think it would last overnight, so I'll see what the figures are in the morning. Thanks for the replies...
-Karl ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program
On Fri, Oct 26, 2012 at 11:12:53AM -0700, Simon J. Gerraty wrote: On Fri, 26 Oct 2012 11:41:46 -0600, Warner Losh writes: It's called a transition period for a reason. The historical use has permeated itself into many places, not all of which are obvious. It would seem that leaving FreeBSD make as make for the transition period, and installing bmake as bmake, would cause the least disruption to everyone. This was the original proposal presented at BSDCan in 2011. FreeBSD make already groks the :tl and :tu modifiers, so it is quite simple for the two to coexist for some period. The only reason we are talking about having to frob ports etc. now is a new requirement introduced this year (by yourself, I think) that bmake replace make in base rather than allowing coexistence. If we are all happy to go back to the original plan, we can ease the concerns of the folk you speak of. The only downside is we wait a few more years for major build improvements. Can a system build, initiated by make, call bmake immediately? I suppose it could be fine even to error out if make is typed instead of bmake for src/.