Re: mmap() question

2013-10-12 Thread Konstantin Belousov
On Fri, Oct 11, 2013 at 09:57:24AM +0400, Dmitry Sivachenko wrote:
 
 On 11.10.2013, at 9:17, Konstantin Belousov kostik...@gmail.com wrote:
 
  On Wed, Oct 09, 2013 at 03:42:27PM +0400, Dmitry Sivachenko wrote:
  Hello!
  
  I have a program which mmap()s a lot of large files (total size more that 
  RAM and I have no swap), but it needs only small parts of that files at a 
  time.
  
  My understanding is that when using mmap when I access some memory region 
  OS reads the relevant portion of that file from disk and caches the result 
  in memory.  If there is no free memory, OS will purge previously read part 
  of mmap'ed file to free memory for the new chunk.
  
  But this is not the case.  I use the following simple program which gets 
  list of files as command line arguments, mmap()s them all and then selects 
  random file and random 1K parts of that file and computes a XOR of bytes 
  from that region.
  After some time the program dies:
  pid 63251 (a.out), uid 1232, was killed: out of swap space
  
  It seems I incorrectly understand how mmap() works, can you please clarify 
  what's going wrong?
  
  I expect that program to run indefinitely, purging some regions out of RAM 
  and reading the relevant parts of files.
  
  
  You did not specified several very important parameters for your test:
  1. total amount of RAM installed
 
 
 24GB
 
 
  2. count of the test files and size of the files
 
 To be precise: I used 57 files with size varied form 74MB to 19GB.
 The total size of these files is 270GB.
 
  3. which filesystem files are located at
 
 
 UFS @ SSD drive
 
  4. version of the system.
 
 
 FreeBSD 9.2-PRERELEASE #0 r254880M: Wed Aug 28 11:07:54 MSK 2013

I was not able to reproduce the situation locally. I even tried to start
a lot of threads accessing the mapped regions, to try to outrun the
pagedaemon. The user threads sleep on the disk read, while pagedaemon
has a lot of time to rebalance the queues. It might be a case when SSD
indeed makes a difference.

Still, I see how this situation could appear. The code, which triggers
OOM, never fires if there is a free space in the swapfile, so the
absense of swap is neccessary condition to trigger the bug.  Next, OOM
calculation does not account for a possibility that almost all pages on
the queues can be reused. It just fires if free pages depleted too much
or free target cannot be reached.

IMO one of the possible solution is to account the queued pages in
addition to the swap space.  This is not entirely accurate, since some
pages on the queues cannot be reused, at least transiently.  Most precise
algorithm would count the hold and busy pages globally, and substract
this count from queues length, but it is probably too costly.

Instead, I think we could rely on the numbers which are counted by
pagedaemon threads during the passes.  Due to the transient nature of the
pagedaemon failures, this should be fine.

Below is the prototype patch, against HEAD.  It is not applicable to
stable, please use HEAD kernel for test.

diff --git a/sys/sys/vmmeter.h b/sys/sys/vmmeter.h
index d2ad920..ee5159a 100644
--- a/sys/sys/vmmeter.h
+++ b/sys/sys/vmmeter.h
@@ -93,9 +93,10 @@ struct vmmeter {
u_int v_free_min;   /* (c) pages desired free */
u_int v_free_count; /* (f) pages free */
u_int v_wire_count; /* (a) pages wired down */
-   u_int v_active_count;   /* (q) pages active */
+   u_int v_active_count;   /* (a) pages active */
u_int v_inactive_target; /* (c) pages desired inactive */
-   u_int v_inactive_count; /* (q) pages inactive */
+   u_int v_inactive_count; /* (a) pages inactive */
+   u_int v_queue_sticky;   /* (a) pages on queues but cannot process */
u_int v_cache_count;/* (f) pages on cache queue */
u_int v_cache_min;  /* (c) min pages desired on cache queue */
u_int v_cache_max;  /* (c) max pages in cached obj (unused) */
diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c
index 713a2be..4bb1f1f 100644
--- a/sys/vm/vm_meter.c
+++ b/sys/vm/vm_meter.c
@@ -316,6 +316,7 @@ VM_STATS_VM(v_active_count, Active pages);
 VM_STATS_VM(v_inactive_target, Desired inactive pages);
 VM_STATS_VM(v_inactive_count, Inactive pages);
 VM_STATS_VM(v_cache_count, Pages on cache queue);
+VM_STATS_VM(v_queue_sticky, Pages which cannot be moved from queues);
 VM_STATS_VM(v_cache_min, Min pages on cache queue);
 VM_STATS_VM(v_cache_max, Max pages on cached queue);
 VM_STATS_VM(v_pageout_free_min, Min pages reserved for kernel);
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 7846702..6943a0e 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -226,6 +226,7 @@ struct vm_domain {
long vmd_segs;  /* bitmask of the segments */
boolean_t vmd_oom;
int vmd_pass;   /* local pagedaemon pass */
+   int vmd_queue_sticky;   /* pages on queues which cannot be processed */
struct vm_page vmd_marker; /* marker for pagedaemon private

Re: mmap() question

2013-10-12 Thread Konstantin Belousov
On Sat, Oct 12, 2013 at 04:04:31PM +0400, Dmitry Sivachenko wrote:
 
 On 12.10.2013, at 13:59, Konstantin Belousov kostik...@gmail.com wrote:
  
  I was not able to reproduce the situation locally. I even tried to start
  a lot of threads accessing the mapped regions, to try to outrun the
  pagedaemon. The user threads sleep on the disk read, while pagedaemon
  has a lot of time to rebalance the queues. It might be a case when SSD
  indeed makes a difference.
  
 
 
 With ordinary SATA drive it will take hours just to read 20GB of data from 
 disk because of random access, it will do a lot of seeks and reading speed 
 will be extremely low.
 
 SSD dramatically improves reading speed.
 
 
  Still, I see how this situation could appear. The code, which triggers
  OOM, never fires if there is a free space in the swapfile, so the
  absense of swap is neccessary condition to trigger the bug.  Next, OOM
  calculation does not account for a possibility that almost all pages on
  the queues can be reused. It just fires if free pages depleted too much
  or free target cannot be reached.
 
 
 First I tried with some swap space configured.  The OS started to swap out my 
 process after it reached about 20GB which is also not what I expected:  what 
 is the reason to swap out regions of read-only mmap()ed files?  Is it the 
 expected behaviour?
 
How did you concluded that the pages from your r/o mappings were paged out ?
VM never does this.  Only anonymous memory could be written to swap file,
including the shadow pages for the writeable COW mappings.  I suspect that
you have another 20GB of something used on the machine meantime.

 
  
  Below is the prototype patch, against HEAD.  It is not applicable to
  stable, please use HEAD kernel for test.
 
 
 Thanks, I will test the patch soon and report the results.


pgp4mxTG6rGdf.pgp
Description: PGP signature


Re: mmap() question

2013-10-10 Thread Konstantin Belousov
On Wed, Oct 09, 2013 at 03:42:27PM +0400, Dmitry Sivachenko wrote:
 Hello!
 
 I have a program which mmap()s a lot of large files (total size more that RAM 
 and I have no swap), but it needs only small parts of that files at a time.
 
 My understanding is that when using mmap when I access some memory region OS 
 reads the relevant portion of that file from disk and caches the result in 
 memory.  If there is no free memory, OS will purge previously read part of 
 mmap'ed file to free memory for the new chunk.
 
 But this is not the case.  I use the following simple program which gets list 
 of files as command line arguments, mmap()s them all and then selects random 
 file and random 1K parts of that file and computes a XOR of bytes from that 
 region.
 After some time the program dies:
 pid 63251 (a.out), uid 1232, was killed: out of swap space
 
 It seems I incorrectly understand how mmap() works, can you please clarify 
 what's going wrong?
 
 I expect that program to run indefinitely, purging some regions out of RAM 
 and reading the relevant parts of files.
 

You did not specified several very important parameters for your test:
1. total amount of RAM installed
2. count of the test files and size of the files
3. which filesystem files are located at
4. version of the system.


pgpu_xJMm2QsF.pgp
Description: PGP signature


Re: Mixing amd64 kernel with i386 world

2013-09-28 Thread Konstantin Belousov
On Sat, Sep 28, 2013 at 01:25:39PM +0200, Dimitry Andric wrote:
 On Sep 28, 2013, at 12:37, Peter Jeremy pe...@rulingia.com wrote:
  I have a system with 4GB RAM and hence need to use an amd64 kernel to use
  all the RAM (I can only access 3GB RAM with an i386 kernel).  OTOH, amd64
  processes are significantly (50-100%) larger than equivalent i386 processes
  and none none of the applications I'll be running on the system need to be
  64-bit.
  
  This implies that the optimal approach is an amd64 kernel with i386
  userland (I'm ignoring PAE as a useable approach).  I've successfully
  run i386 jails on amd64 systems so I know this mostly works.  I also
  know that there are some gotchas:
  - kdump needs to match the kernel
  - anything accessing /dev/mem or /dev/kmem (which implies anything that
   uses libkvm) probably needs to match the kernel.
  
  Has anyone investigated this approach?
 
 I have only run i386 jails on amd64, just like you.  Though for the use
 case you are describing, the really optimal approach would be to have
 real x32 support.  Unfortunately, that is quite a lot of work... :)

The management interfaces usually do not work for compat32 on amd64. The
biggest exception I am aware of is nmount(2). But you would be without
network. The same is true for almost all sysctls, again kern.proc.*
hierarchy is an exception and works.

Syscalls required to normal usermode operation do work, at least most
developers care about this, except for capsicum.

Using 32bit jail or, even easier, chroot, for now, is the easiest proposition.


pgpAZZ52Y_U8w.pgp
Description: PGP signature


Re: mmap on emulated i386

2013-09-18 Thread Konstantin Belousov
On Tue, Sep 17, 2013 at 07:01:16PM -0700, Rui Paulo wrote:
 Hi,
 
 I'm trying to figure out why the following fails when compiled on amd64 with 
 -m32:
 
   mmap(NULL, 0x7, PROT_READ|PROT_WRITE|PROT_EXEC, 
 MAP_ANON|MAP_PRIVATE, -1, 0);
 
 It returns EINVAL. I looked around everywhere but I couldn't find where the 
 EINVAL is coming from. The length argument doesn't really make any 
 difference. At this point I'm thinking it's a bug...

What is the version of your system, including the userspace ? -m32 only
works on HEAD.  Do you have r255657/r255658 in your kernel ?

Please provide kdump of the ktraced execution of your test program, as
well as the test program itself.

N.B. The following worked fine for me, both in 64 and 32 bit binary.

#include sys/mman.h
#include err.h
#include stdlib.h

int
main(void)
{
char *p;

p = mmap(NULL, 0x7, PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_ANON | MAP_PRIVATE, -1, 0);
if (p == MAP_FAILED)
err(1, mmap);
return (0);
}



pgpIkDQh3vpHS.pgp
Description: PGP signature


Re: ps_strings

2013-08-20 Thread Konstantin Belousov
On Mon, Aug 19, 2013 at 06:49:55PM -0600, Chris Torek wrote:
 Yes, p_args caches the arguments, but not always.  Right now, kernel
 does not cache arguments if the string is longer than 256 bytes.  Look
 for ps_arg_cache_limit in kern_exec.c.
 
 setproctitle() always informs the kernel with sysctl and sets the
 pointers in ps_strings. kern.proc.args sysctl first tries the p_args,
 and falls back to reading ps_strings and following the pointers if
 p_args is NULL.
 
 Ah, that's what I get for scanning through years of updates too fast.
 :-)
 
 This seems a bit of a worst of both worlds: there's now some
 extra kernel code for poking through the ps_strings and the
 pointer-vectors (this code is no longer in libkvm at all --
 that was where I looked first and found the sysctl), for the no
 p_args case.  It seems like perhaps there could just be a sysctl
 to return the ps_strings address, and leave the follow argv
 pointers code in libkvm, if there is to be code for that.
There is a demand for other things besides arguments, e.g. environment,
and most important for the debuggers, ELF aux vector. Also, moving
this code to libkvm would mean that mismatched ABIs cannot be easily
supported, like 64bit binary trying to get 32bit binary information.

I would say that the current placement fullfill its goals.
 
 (The kernel saves a bit of time for the presumably-usual p_args
 not NULL case, and finding the location of the ps_strings
 structure when it *is* used is automatically correct.  So that
 part is a straight-up improvement, at least.)
 
 Not that big a deal either way, but it does seem as though there
 should be documentation for ps_strings.
 
 Chris


pgpeki4CRh2uH.pgp
Description: PGP signature


Re: ps_strings

2013-08-19 Thread Konstantin Belousov
On Sun, Aug 18, 2013 at 04:05:11PM -0600, Chris Torek wrote:
 FreeBSD now, however, uses a per-process p_args field in the
 proc structure, with sysctl()s to set and get p_args.  (I had
 nothing to do with this new code, but I approve, as if anyone
 cares. :-) )  This removes the fixed-virtual-address limitation.
 The cost is a bit more kernel code (for the sysctl()s) and this
 per-process data, but there is no more messing-about with where
 is ps_strings in this memory-layout / emulation etc.  (Meanwhile
 libkvm still retrieves the arguments.  It just does it now with
 sysctl().)

Yes, p_args caches the arguments, but not always.  Right now, kernel
does not cache arguments if the string is longer than 256 bytes.  Look
for ps_arg_cache_limit in kern_exec.c.

setproctitle() always informs the kernel with sysctl and sets the
pointers in ps_strings. kern.proc.args sysctl first tries the p_args,
and falls back to reading ps_strings and following the pointers if
p_args is NULL.


pgpfNuTdP710t.pgp
Description: PGP signature


Re: kldload ipfw, with IPFIREWALL_DEFAULT_TO_ACCEPT

2013-07-29 Thread Konstantin Belousov
On Mon, Jul 29, 2013 at 12:27:40PM +0100, Karl Pielorz wrote:
 
 
 --On 29 July 2013 13:02 +0200 Stefan Esser s...@freebsd.org wrote:
 
  I guess you were looking for:
 
  net.inet.ip.fw.default_to_accept=1
 
  which is a tunable to be set in /boot/loader.conf ...
 
 Very probably - but that's at boot time :( - Is there nothing I can do at 
 kldload time to have the initial kldload give me a 'allow ip from any to 
 any' rule as it loads? (thus not affecting traffic on the machine, or more 
 importantly the CARP interfaces)?

kenv net.inet.ip.fw.default_to_accept=1
should have the same effect after the usermode is booted.  Kenv must
be set before the module is loaded.


pgpMRufLjGHVb.pgp
Description: PGP signature


Re: b_freelist TAILQ/SLIST

2013-06-29 Thread Konstantin Belousov
On Sat, Jun 29, 2013 at 10:06:11AM +0300, Alexander Motin wrote:
 I understand that lock attempt will steal cache line from lock owner. 
 What I don't very understand is why avoiding it helps performance in 
 this case. Indeed, having mutex on own cache line will not let other 
 cores to steal also bswlist, but it also means that bswlist should be 
 prefetched separately (and profiling shows resource stalls there). Or in 
 this case separate speculative prefetch will be better then forced one 
 which could be stolen? Is there cases when it is not, or the only reason 
 to not pad all global mutexes is only saving memory?

I can speculate that it is the case when speculative execution helps.
If mutex and list head are on the different cache lines, then cpu
could speculatively read the head, and then prove that executing the
read before the lock acquisition does not break the ordering rules
(because lock protects the head, other core indeed cannot modify
the head if the lock acquisition was successfull).

I think it is very similar reason why locked instructions as barriers
are faster then the explicit barriers, cpu could still do the speculative
execution after the lock prefix if the ordering is provable consistent.

Please see the Intel IA32 architecture optimization manual 8.4.5 for
the recommendations (but not much explanation).

Yes, I think putting all locks on dedicated cache lines is the waste,
only hot locks need this.


pgpREr_akYKys.pgp
Description: PGP signature


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Konstantin Belousov
On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
 Hi.
 
 While doing some profiles of GEOM/CAM IOPS scalability, on some test 
 patterns I've noticed serious congestion with spinning on global 
 pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is 
 already very simple, I've tried to optimize probably the only thing 
 possible there: switch bswlist from TAILQ to SLIST. As I can see, 
 b_freelist field of struct buf is really used as TAILQ in some other 
 places, so I've just added another SLIST_ENTRY field. And result 
 appeared to be surprising -- I can no longer reproduce the issue at all. 
 May be it was just unlucky synchronization of specific test, but I've 
 seen in on two different systems and rechecked results with/without 
 patch three times.
This is too unbelievable.  Could it be, e.g. some cache line conflicts
which cause the trashing, in fact ? Does it help if you add void *b_pad
before b_freelist instead of adding b_freeslist ?

 
 The present patch is here:
 http://people.freebsd.org/~mav/buf_slist.patch
 
 The question is how to do it better? What is the KPI/KBI policy for 
 struct buf? I could replace b_freelist by a union and keep KBI, but 
 partially break KPI. Or I could add another field, probably breaking 
 KBI, but keeping KPI. Or I could do something handmade with no breakage. 
 Or this change is just a bad idea?
The same question about using union for b_freelist/b_freeslist, does the
effect of magically fixing the contention still there if b_freeslist
is on the same offset as the b_freelist ?

There are no K{B,P}I policy for struct buf in HEAD, just change it as
it fits.


pgpGt3DQKL6vB.pgp
Description: PGP signature


Re: expanding amd64 past the 1TB limit

2013-06-28 Thread Konstantin Belousov
On Thu, Jun 27, 2013 at 03:23:36PM -0600, Chris Torek wrote:
 OK, I wasted :-) way too much time, but here's a text file that
 can be comment-ified or stored somewhere alongside the code or
 whatever...
I think it would be a nice addition to the VM article in the doc
collection.  The content is correct and useful, IMO.
See some notes below.

 The reason for having those empty (all-zero) PTE pages is that
 whenever new processes are created, in pmap_pinit(), they have
 their (new) L4 PTE page set up to point to the *same* physical
 pages that the kernel is using.  Thus, if the kernel creates or
 destroys any level-3-or-below mapping by writing into any of the
 above four pages, that mapping is also created/destroyed in all
 processes.  Similarly, the NDMPML4 pages starting at DMPDPphys are
 mapped identically in all processes.  The kernel can therefore
 borrow a user pmap at any time, i.e., there's no need to adjust
 the CPU's CR4 on entry to the kernel.
It is %cr3.

 
 (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
 entries into the new pmap, the L3 pages would not have to be
 physically contiguous, but the KVA ones would still all have to
 exist.  It's free to allocate physically contiguous pages here
 anyway though.)
I do not see how the physical continuity of the allocated page table
pages is relevant there.

Copying the L4 or L3 PTEs would cause serious complications.  In fact,
this is how i386 operates, since due to limited VA, pmap does not have
a luxurly of allocating entire L(top-1) page table pages to the kernel.
As result, pmap_kenter_pde() have to iterate over the whole pmap
list, which is maintained just to be able to update KVA mappings.

 
 So, the last NKPML4E slots in KPML4phys point to the following
 page tables, which use all of L3, L2, and L1 style PTEs.  (Note
 that we did not need any L1 PTEs for the direct map, which always
 uses 2MB or 1GB super-pages.)
This is not quite true. In the initial state, indeed all PTEs for direct
map are superpages, either 1G or 2M. But Intel states that a situation
when the physical page has mappings with different caching modes causes
undefined behaviour. As result, if a page is remapped with non-write
back caching attributes, the direct map has to demote the superpage and
adjust the mapping attribute of the page frame for the page.

As result, it is possible to have small mappings in the direct map.
See pmap_page_set_memattr(), which delegates the work to
pmap_change_attr(_locked).


pgpHAwuqmlvZy.pgp
Description: PGP signature


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Konstantin Belousov
On Fri, Jun 28, 2013 at 08:14:42AM -0700, Adrian Chadd wrote:
 .. i'd rather you narrow down _why_ it's performing better before committing 
 it.
 
 Otherwise it may just creep up again after someone does another change
 in an unrelated part of the kernel.
Or penalize some other set of machines where this is currently not a problem.

The cause should be identified before any change is committed.
 
 You're using instructions-retired; how about using l1/l2 cache loads,
 stores, etc? There's a lot more CPU counters available.
 
 You have a very cool problem to solve. If I could reproduce it locally
 I'd give you a hand.
 
 Thanks,
 
 
 
 -adrian


pgpY5K5ioAQsN.pgp
Description: PGP signature


Re: expanding amd64 past the 1TB limit

2013-06-28 Thread Konstantin Belousov
On Wed, Jun 26, 2013 at 10:11:44AM -0600, Chris Torek wrote:
 diff --git a/conf/options.amd64 b/conf/options.amd64
 index 90348b7..f3ce505 100644
 --- a/conf/options.amd64
 +++ b/conf/options.amd64
 @@ -1,6 +1,7 @@
  # $FreeBSD$
  # Options specific to AMD64 platform kernels
  
 +AMD64_HUGE   opt_global.h

Is this option needed ? The SMAP is already parsed at the time of
pmap_bootstrap() call, so you could determine the amount of physical
memory and size the KVA map accordingly ?


pgpuR4X3THSPE.pgp
Description: PGP signature


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread Konstantin Belousov
On Sat, Jun 29, 2013 at 01:15:19AM +0300, Alexander Motin wrote:
 On 28.06.2013 09:57, Konstantin Belousov wrote:
  On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
  While doing some profiles of GEOM/CAM IOPS scalability, on some test
  patterns I've noticed serious congestion with spinning on global
  pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
  already very simple, I've tried to optimize probably the only thing
  possible there: switch bswlist from TAILQ to SLIST. As I can see,
  b_freelist field of struct buf is really used as TAILQ in some other
  places, so I've just added another SLIST_ENTRY field. And result
  appeared to be surprising -- I can no longer reproduce the issue at all.
  May be it was just unlucky synchronization of specific test, but I've
  seen in on two different systems and rechecked results with/without
  patch three times.
  This is too unbelievable.  Could it be, e.g. some cache line conflicts
  which cause the trashing, in fact ?
 
 I think it indeed may be a cache trashing. I've made some profiling for 
 getpbuf()/relpbuf() and found interesting results. With patched kernel 
 using SLIST profiling shows mostly one point of RESOURCE_STALLS.ANY in 
 relpbuf() -- first lock acquisition causes 78% of them. Later memory 
 accesses including the lock release are hitting the same cache line and 
 almost free. With clean kernel using TAILQ I see RESOURCE_STALLS.ANY 
 spread almost equally between lock acquisition, bswlist access and lock 
 release. It looks like the cache line is constantly erased by something.
 
 My guess was that patch somehow changed cache line sharing. But several 
 checks with nm shown that, while memory allocation indeed changed 
 slightly, in both cases content of the cache line in question is 
 absolutely the same, just shifted in memory by 128 bytes.
 
 I guess the cache line could be trashed by threads doing adaptive 
 spinning on lock after collision happened. That trashing increases lock 
 hold time and even more increases chance of additional collisions. May 
 be switch from TAILQ to SLIST slightly reduces lock hold time, reducing 
 chance of cumulative effect. The difference is not big, but in this test 
 this global lock acquired 1.5M times per second by 256 threads on 24 
 CPUs (12xL2 and 2xL3 caches).
 
 Another guess was that we have some bad case of false cache line 
 sharing, but I don't know how that can be either checked or avoided.
 
 At the last moment mostly for luck I've tried to switch pbuf_mtx from 
 mtx to mtx_padalign on clean kernel. For my surprise that also seems 
 fixed the congestion problem, but I can't explain why. 
 RESOURCE_STALLS.ANY still show there is cache trashing, but the lock 
 spinning has gone.
 
 Any ideas about what is going on there?

FWIW, Jeff just changed pbuf_mtx allocation to use padalign, it is a
somewhat unrelated change in r252330.

Are pbuf_mtx and bswlist are located next to next in your kernel ?

If yes, then I would expect that the explanation is how the MESI
protocol and atomics work. When performing the locked op, CPU takes the
whole cache line into the exclusive ownership. Since our locks try the
cmpset as the first operation, and then 'adaptive' loop interleaving
cmpset and check for the ownership, false cache line sharing between
pbuf_mtx and bswlist should result exactly in such effects.  Different
cores should bounce the ownership of the cache line, slowing down
the accesses.

AFAIR Intel exports some MESI events as performance counters, but I was
not able to find a reference in the architecturally defined events. It
seems to be present in the model-specific part.


pgpzr8SbYK8oQ.pgp
Description: PGP signature


Re: please review, patch for lost camisr

2013-06-18 Thread Konstantin Belousov
On Wed, May 29, 2013 at 10:16:45AM +0300, Konstantin Belousov wrote:
 Well, if you and I are right, the minimal patch should be
 
 diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
 index 8d63c9b..7c21015 100644
 --- a/sys/kern/kern_intr.c
 +++ b/sys/kern/kern_intr.c
 @@ -1349,7 +1349,7 @@ ithread_loop(void *arg)
* we are running, it will set it_need to note that we
* should make another pass.
*/
 - while (ithd-it_need) {
 + while (atomic_load_acq_int(ithd-it_need)) {
   /*
* This might need a full read and write barrier
* to make sure that this write posts before any
 

So, was it tested ? If yes, did the patch helped ?


pgpDMd2znDb_c.pgp
Description: PGP signature


Re: Where is stack of program?

2013-06-10 Thread Konstantin Belousov
On Mon, Jun 10, 2013 at 08:44:25PM +0400, Vagner wrote:
 Hi!
 I need method of finding in struct vm_map or vm_object segments of
 memory which is situated in the stack.
 Can you help me please?

Note that the stack is per-thread.  The concept is somewhat machine-specific,
some architectures utilize two stacks, on some the stack is purely software
convention.

You did not specified what context your code is to be run, e.g. is
it kernel or user space ? Assuming it is kernel since you mentioned
vm_something. The least error prone route is the thread context (frame)
- tf_esp on i386 (or tf_rsp on amd64) - lookup of the map entry in the
process p_vmspace.

In reality, the stack is often fragmented, since the stack grow code
does not coalesce the adjacent grow-down map entries.


pgpcOQLckCgs_.pgp
Description: PGP signature


Re: please review, patch for lost camisr

2013-05-29 Thread Konstantin Belousov
On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote:
 On 5/28/13 10:08 PM, Konstantin Belousov wrote:
  On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
  [[  moved to hackers@ from private mail. ]]
 
  On 5/28/13 1:13 PM, John Baldwin wrote:
  On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
  On 5/28/13 9:04 AM, John Baldwin wrote:
  On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
  Hey folks,
 
  I had a talk with Nathan Whitehorn about the camisr issue.  The issue 
  we
  are seeing we mostly know, but to summarize, we are losing the camisr
  signal and the camisr is not being run.
 
  I gave him a summary of what we have been seeing and pointed him to the
  code I am concerned about here:
  http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).
 
  What I think that is happening is that the setting of it_need to 0
  inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
  correctly and it is being delayed until AFTER the call to
  ithread_execute_handlers() right below the atomic_store_rel_int().
  This seems highly unlikely, to the extent that if this were true all our
  locking primitives would be broken.  The store_rel is actually a release
  barrier which acts like more a read/write fence.  No memory accesses 
  (read or
  write) from before the release can be scheduled after the associated 
  store,
  either by the compiler or CPU.  That is what Konstantin is referring to 
  in his
  commit when he says release semantics.
  Yes, that makes sense, however does it specify that the writes *must*
  occur at that *point*?  If it only enforces ordering then we may have
  some issue, specifically because the setting of it to '1' inside of
  intr_event_schedule_thread has no barrier other than the acq semantics
  of the thread lock.  I am wondering what is forcing out the '1' there.
  Nothing ever forces writes.  You would have to invalidate the cache to do 
  that
  and that is horribly expensive.  It is always only about ordering and 
  knowing
  that if you can complete another operation on the same cookie variable 
  with
  acquire semantics that earlier writes will be visible.
  By cookie, you mean a specific memory address, basically a lock? This is
  starting to reinforce my suspicions as the setting of it_need is done
  with release semantics, however the acq on the other CPU is done on the
  thread lock.  Maybe that is irrelevant.  We will find out shortly.
 
  See below as I think we have proof that this is somehow happening.
  Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest 
  fix
  is probably to stop using atomics on it_need and just grab the thread lock
  in the main ithread loop and hold it while checking and clearing it_need.
 
  OK, we have some code that will either prove this, or perturb the memory
  ordering enough to make the bug go away, or prove this assertion wrong.
 
  We will update on our findings hopefully in the next few days.
  IMO the read of it_need in the 'while (ithd-it_need)' should
  have acquire semantic, otherwise the future reads in the
  ithread_execute_handlers(), in particular, of the ih_need, could pass
  the read of it_need and cause the situation you reported.  I do not
  see any acquire barrier between a condition in the while() statement
  and the read of ih_need in the execute_handlers().
 
  It is probably true that the issue you see was caused by r236456, in the
  sense that implicitely locked xchgl instruction on x86 has a full barrier
  semantic.  As result, the store_rel() was actually an acquire too, making
  this reordering impossible.  I argue that this is not a bug in r236456,
  but the issue in the kern_intr.c.
 If I remember the code correctly that would probably explain why we see 
 it only on 9.1 system.
 
  On the other hand, the John' suggestion to move the manipulations of
  it_need under the lock is probably the best anyway.
 
 I was wondering if it would be lower latency to maintain it_need, 
 however to keep another variable it_needlocked under the thread lock.  
 This would result in potential superfluous interrupts, however under 
 load you would allow the ithread to loop without taking the thread lock 
 some number of times.
 
 I am not really sure if this is really worth the optimization 
 (especially since it can result in superfluous interrupts) however it 
 may reduce latency and that might be important.
 
 Is there some people that I can pass the patch onto for help with 
 performance once we confirm that this is the actual bug?   We can do 
 internal testing, but I am worried about regressing performance of any 
 form of IO for the kernel.
 
 I'll show the patch soon.
 
 Thank you for the information.  This is promising.

Well, if you and I are right, the minimal patch should be

diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
index 8d63c9b..7c21015 100644
--- a/sys/kern/kern_intr.c
+++ b/sys/kern/kern_intr.c
@@ -1349,7

Re: please review, patch for lost camisr

2013-05-29 Thread Konstantin Belousov
On Wed, May 29, 2013 at 12:27:46AM -0700, Alfred Perlstein wrote:
 On 5/29/13 12:16 AM, Konstantin Belousov wrote:
  On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote:
  On 5/28/13 10:08 PM, Konstantin Belousov wrote:
  On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
  [[  moved to hackers@ from private mail. ]]
 
  On 5/28/13 1:13 PM, John Baldwin wrote:
  On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
  On 5/28/13 9:04 AM, John Baldwin wrote:
  On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
  Hey folks,
 
  I had a talk with Nathan Whitehorn about the camisr issue.  The 
  issue we
  are seeing we mostly know, but to summarize, we are losing the camisr
  signal and the camisr is not being run.
 
  I gave him a summary of what we have been seeing and pointed him to 
  the
  code I am concerned about here:
  http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).
 
  What I think that is happening is that the setting of it_need to 0
  inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
  correctly and it is being delayed until AFTER the call to
  ithread_execute_handlers() right below the atomic_store_rel_int().
  This seems highly unlikely, to the extent that if this were true all 
  our
  locking primitives would be broken.  The store_rel is actually a 
  release
  barrier which acts like more a read/write fence.  No memory accesses 
  (read or
  write) from before the release can be scheduled after the associated 
  store,
  either by the compiler or CPU.  That is what Konstantin is referring 
  to in his
  commit when he says release semantics.
  Yes, that makes sense, however does it specify that the writes *must*
  occur at that *point*?  If it only enforces ordering then we may have
  some issue, specifically because the setting of it to '1' inside of
  intr_event_schedule_thread has no barrier other than the acq semantics
  of the thread lock.  I am wondering what is forcing out the '1' there.
  Nothing ever forces writes.  You would have to invalidate the cache to 
  do that
  and that is horribly expensive.  It is always only about ordering and 
  knowing
  that if you can complete another operation on the same cookie 
  variable with
  acquire semantics that earlier writes will be visible.
  By cookie, you mean a specific memory address, basically a lock? This is
  starting to reinforce my suspicions as the setting of it_need is done
  with release semantics, however the acq on the other CPU is done on the
  thread lock.  Maybe that is irrelevant.  We will find out shortly.
 
  See below as I think we have proof that this is somehow happening.
  Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest 
  fix
  is probably to stop using atomics on it_need and just grab the thread 
  lock
  in the main ithread loop and hold it while checking and clearing 
  it_need.
 
  OK, we have some code that will either prove this, or perturb the memory
  ordering enough to make the bug go away, or prove this assertion wrong.
 
  We will update on our findings hopefully in the next few days.
  IMO the read of it_need in the 'while (ithd-it_need)' should
  have acquire semantic, otherwise the future reads in the
  ithread_execute_handlers(), in particular, of the ih_need, could pass
  the read of it_need and cause the situation you reported.  I do not
  see any acquire barrier between a condition in the while() statement
  and the read of ih_need in the execute_handlers().
 
  It is probably true that the issue you see was caused by r236456, in the
  sense that implicitely locked xchgl instruction on x86 has a full barrier
  semantic.  As result, the store_rel() was actually an acquire too, making
  this reordering impossible.  I argue that this is not a bug in r236456,
  but the issue in the kern_intr.c.
  If I remember the code correctly that would probably explain why we see
  it only on 9.1 system.
  On the other hand, the John' suggestion to move the manipulations of
  it_need under the lock is probably the best anyway.
 
  I was wondering if it would be lower latency to maintain it_need,
  however to keep another variable it_needlocked under the thread lock.
  This would result in potential superfluous interrupts, however under
  load you would allow the ithread to loop without taking the thread lock
  some number of times.
 
  I am not really sure if this is really worth the optimization
  (especially since it can result in superfluous interrupts) however it
  may reduce latency and that might be important.
 
  Is there some people that I can pass the patch onto for help with
  performance once we confirm that this is the actual bug?   We can do
  internal testing, but I am worried about regressing performance of any
  form of IO for the kernel.
 
  I'll show the patch soon.
 
  Thank you for the information.  This is promising.
  Well, if you and I are right, the minimal patch should be
 
  diff --git

Re: please review, patch for lost camisr

2013-05-28 Thread Konstantin Belousov
On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
 [[  moved to hackers@ from private mail. ]]
 
 On 5/28/13 1:13 PM, John Baldwin wrote:
  On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
  On 5/28/13 9:04 AM, John Baldwin wrote:
  On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
  Hey folks,
 
  I had a talk with Nathan Whitehorn about the camisr issue.  The issue we
  are seeing we mostly know, but to summarize, we are losing the camisr
  signal and the camisr is not being run.
 
  I gave him a summary of what we have been seeing and pointed him to the
  code I am concerned about here:
  http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).
 
  What I think that is happening is that the setting of it_need to 0
  inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
  correctly and it is being delayed until AFTER the call to
  ithread_execute_handlers() right below the atomic_store_rel_int().
  This seems highly unlikely, to the extent that if this were true all our
  locking primitives would be broken.  The store_rel is actually a release
  barrier which acts like more a read/write fence.  No memory accesses 
  (read or
  write) from before the release can be scheduled after the associated 
  store,
  either by the compiler or CPU.  That is what Konstantin is referring to 
  in his
  commit when he says release semantics.
  Yes, that makes sense, however does it specify that the writes *must*
  occur at that *point*?  If it only enforces ordering then we may have
  some issue, specifically because the setting of it to '1' inside of
  intr_event_schedule_thread has no barrier other than the acq semantics
  of the thread lock.  I am wondering what is forcing out the '1' there.
  Nothing ever forces writes.  You would have to invalidate the cache to do 
  that
  and that is horribly expensive.  It is always only about ordering and 
  knowing
  that if you can complete another operation on the same cookie variable 
  with
  acquire semantics that earlier writes will be visible.
 
 By cookie, you mean a specific memory address, basically a lock? This is 
 starting to reinforce my suspicions as the setting of it_need is done 
 with release semantics, however the acq on the other CPU is done on the 
 thread lock.  Maybe that is irrelevant.  We will find out shortly.
 
 
  See below as I think we have proof that this is somehow happening.
  Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest fix
  is probably to stop using atomics on it_need and just grab the thread lock
  in the main ithread loop and hold it while checking and clearing it_need.
 
 
 OK, we have some code that will either prove this, or perturb the memory 
 ordering enough to make the bug go away, or prove this assertion wrong.
 
 We will update on our findings hopefully in the next few days.

IMO the read of it_need in the 'while (ithd-it_need)' should
have acquire semantic, otherwise the future reads in the
ithread_execute_handlers(), in particular, of the ih_need, could pass
the read of it_need and cause the situation you reported.  I do not
see any acquire barrier between a condition in the while() statement
and the read of ih_need in the execute_handlers().

It is probably true that the issue you see was caused by r236456, in the
sense that implicitely locked xchgl instruction on x86 has a full barrier
semantic.  As result, the store_rel() was actually an acquire too, making
this reordering impossible.  I argue that this is not a bug in r236456,
but the issue in the kern_intr.c.

On the other hand, the John' suggestion to move the manipulations of
it_need under the lock is probably the best anyway.



pgpL971MsQjuU.pgp
Description: PGP signature


Re: preemptive kernel

2013-05-27 Thread Konstantin Belousov
On Mon, May 27, 2013 at 05:00:13AM +, Orit Moskovich wrote:
 Just to be more specific - On x86, during a filter routine all interrupts are 
 disabled on the cpu currently handling a filter routine or only interrupts on 
 the IRQ that generated the interrupt? 

The CPU enters the handler through the interrupt gate, which clears the
PSL_I.  Interrupts are not re-enabled until the handler return.

 Is it possible that on a different cpu an interrupt (filter or ithread) of 
 the same IRQ will run? Should worry about locking data because an ithread can 
 preempt the same ithread or they'll run concurrently on different cpus?

The APIC or PCI MSI registers are programmed in the physical destination
mode.  This, together with the fact that interrupts are disabled on the
target CPU, prevents parallel or nested execution of the filter.

Having both filter and ithread for the same interrupt is apparently possible
but weird.  I do not see anything which would prevent interrupt filter
from being executed while the ithread is running.  But again, this is
very unusual setup.

ithread has single stack, it cannot be run simultaneously on several CPUs.

 
 -Original Message-
 From: Konstantin Belousov [mailto:kostik...@gmail.com] 
 Sent: Sunday, May 26, 2013 06:48 PM
 To: Orit Moskovich
 Cc: freebsd-hackers@freebsd.org
 Subject: Re: preemptive kernel
 
 On Sun, May 26, 2013 at 11:09:03AM +, Orit Moskovich wrote:
  Can a filter routine preempt another filter routine? And can an interrupt 
  thread (or a filter routine) preempt another ithread?
 
 Filter handler borrows the context from the thread executed at the time of 
 the interrupt.  At least on x86 (i.e. i386 and amd64) interrupt handlers are 
 executed with the interrupts disabled.  So the filter cannot interrupt 
 another filter.
 
 Interrupt threads are executed with the interrupts enabled. So if an 
 interrupt is delivered to the CPU while the CPU executed the ithread, the 
 filter 'preempts' the ithread by executing in its context.  For the same 
 reason, if the interrupt delivered causes a higher-priority thread become 
 ready to run, then the new runnable thread could preempt the interrupt thread 
 according to the priority, affinity and other factors considered by a 
 scheduler.


pgpn6qvXMc7PX.pgp
Description: PGP signature


Re: preemptive kernel

2013-05-26 Thread Konstantin Belousov
On Sun, May 26, 2013 at 11:09:03AM +, Orit Moskovich wrote:
 Can a filter routine preempt another filter routine? And can an interrupt 
 thread (or a filter routine) preempt another ithread?

Filter handler borrows the context from the thread executed at the
time of the interrupt.  At least on x86 (i.e. i386 and amd64) interrupt
handlers are executed with the interrupts disabled.  So the filter
cannot interrupt another filter.

Interrupt threads are executed with the interrupts enabled. So if an
interrupt is delivered to the CPU while the CPU executed the ithread,
the filter 'preempts' the ithread by executing in its context.  For
the same reason, if the interrupt delivered causes a higher-priority
thread become ready to run, then the new runnable thread could preempt
the interrupt thread according to the priority, affinity and other
factors considered by a scheduler.


pgpsGRsiThClG.pgp
Description: PGP signature


Re: How to get ucred/xucred in user space?

2013-05-14 Thread Konstantin Belousov
On Tue, May 14, 2013 at 10:00:21AM +0200, Mike Ma wrote:
 Hi folks,
 
 Can I ask if there's any way to get ucred/xucred of a process in user space?
 As I'm trying to port glustertfs and it's a userland filesystem, I need to
 get secondary groups of a process.
 
 AFAIK, Linux gets them in /proc and NetBSD gets them in this way:
 int name[] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, frame-root-pid
 };
 size_t namelen = sizeof name / sizeof name[0];
 struct kinfo_proc kp;
 size_t kplen = sizeof(kp);
 int i, ngroups;
 if (sysctl(name, namelen, kp, kplen, NULL, 0) != 0)
 return;
 ngroups = MIN(kp.kp_eproc.e_ucred.cr_ngroups, GF_REQUEST_MAXGROUPS);
 
 I realized none of them would work in FreeBSD.
 I'm wondering if there's any alternative way to get group information?

The sysctl to retrieve the list of the groups the process belongs to
is CTL_KERN/KERN_PROC/KERN_PROC_GROUPS. On HEAD, the libprocstat(3)
contains the helper procstat_getgroups() which would be more
conventient to use, or you could borrow a code from it.


pgpmzj11Nd2Ld.pgp
Description: PGP signature


Re: [patch] export CPU physical and virtual address sizes in sysctl oids using do_cpuid

2013-05-05 Thread Konstantin Belousov
On Mon, May 06, 2013 at 01:37:01AM +0200, Sofian Brabez wrote:
 This patch uses do_cpuid function to fetch CPU Physical
 and Virtual address sizes and adds 2 new sysctl machine
 dependant OIDs (machdep.cpu_physical_address_bits and
 machdep.cpu_virtual_address_bits).

 In order to retrieve the information, it calls do_cpuid by setting eax
 register to 0x8008 value like referenced in chapter 5.2.7 in the
 CPUID specification [1]
The patch itself is not quite right. The cpuid with the high %eax value
should be only called when the high value is supported by processor. If
you look at the function you changing more closely, you would see the
already present conditional call to do_cpuid(0x8008), done in the
correct way.

Also, I think that the patch breaks i386 build.

 Apple, Inc. in xnu kernel do the same thing but they created
 a specific node called 'machdep.cpu' to store CPU Vendor
 information, the sysctls are machdep.cpu.address_bits.virtual and
 machdep.cpu.address_bits.virtual.

 I really would like to see this patch in our operating system because
 it's a valuable information nowdays and should be provided like
 others.
And, the real question about your patch is why this information is of
_any_ interest for the usermode. The width of the physical address space
only affects BIOS, I believe, since the reported available memory cannot
be bigger then the width. The width of the virtual address space is more
interesting for the OS, but the only layer where the value makes any
sense is pmap, and all existing CPUs only support the 4-level pages,
which makes the reported width rather theoretical. Pmap hardwires the
4-level pages and the 48-bit of the linear address, which is also
encoded in the memory map of a 64-bit processes.

Also, note that cpuid is not a privileged instruction, so any code which
calls sysctl(2) could as well perform cpuid and get all the information it
needs, instead of interpreting second-hand bits. For example, you can
take a look at sysutils/x86info.


 Thus, I would like advices to see if before to be imported it needs
 modification to fit like Apple (i.e using a new sysctl node) or stay
 like that.

 Also, I profited of this changes to patch sys/modules/linprocfs
 in order to display the address sizes values in the output of
 /usr/compat/linux/proc/cpuinfo like it's done in Linux procfs [2].
For linux cpuinfo, I agree that the information needs to be provided
if native linux exports it. I believe it is better to call do_cpuid()
in the linprocfs_docpuinfo() instead.


 I filled a pr referenced as amd64/178357 [3]. My kernel was rebuilt
 without any problems in r250287 and it works as expected.


pgpIJxSgHAKWJ.pgp
Description: PGP signature


Re: What is the correct way to declare assembler global variable ?

2013-05-04 Thread Konstantin Belousov
On Fri, May 03, 2013 at 03:26:36PM -0700, Yuri wrote:
 On 05/03/2013 13:30, Konstantin Belousov wrote:
  Formal answer is for you to read about the .type directive in the GNU
  as manual. Also, you need to read about either common symbols, or about
  the .size directive.
 
  But, note that you cannot access hidden libc symbols from the code which
  links to libc (dynamically). You probably need to re-consider higher-level
  approach to your issue.
 
 
 I didn't write this code. This is google-perftools-2.0 I am trying to port.
 They added the procedure do_sbrk there (see below), that attempts to 
 emulate what libc is doing for sbrk. It fails, they probably didn't test it.
 The idea is that they need to have hooks before and after sbrk call.
 I am not sure what the best approach would be here?
 How can the memory allocation library override sbrk in libc and still be 
 able to attach to the original libc version of sbrk?
 
 On Linux there is another symbol __sbrk, and they just use it to conenct 
 to the original sbrk. But there is no such thing of FreeBSD.
 
 Yuri
 
 
 static inline void* do_sbrk(intptr_t increment) {
void* curbrk = 0;
 
 #if defined(__x86_64__) || defined(__amd64__)
 # ifdef PIC
__asm__ __volatile__(
movq .curbrk@GOTPCREL(%%rip), %%rdx;
movq (%%rdx), %%rax;
movq %%rax, %0;
: =r (curbrk)
:: %rdx, %rax);
 # else
__asm__ __volatile__(
movq .curbrk(%%rip), %%rax;
movq %%rax, %0;
: =r (curbrk)
:: %rax);
 # endif
 #else
__asm__ __volatile__(
movl .curbrk, %%eax;
movl %%eax, %0;
: =r (curbrk)
:: %eax);
 #endif
 
if (increment == 0) {
  return curbrk;
}
 
char* prevbrk = static_castchar*(curbrk);
void* newbrk = prevbrk + increment;
 
if (brk(newbrk) == -1) {
  return reinterpret_castvoid*(static_castintptr_t(-1));
}
 
return prevbrk;
 }
 
 extern C void* sbrk(intptr_t increment) __THROW {
MallocHook::InvokePreSbrkHook(increment);
void *result = do_sbrk(increment);
MallocHook::InvokeSbrkHook(result, increment);
return result;
 }

So all the code needs is to call hooks before and after the sbrk call,
right ?

For the !PIC case, what is cited above would probably work, but it requires
that libc is also linked static.

For the dynamic case, it should be enough to use the dlsym(RTLD_NEXT,
sbrk) to get the pointer to interposed symbol in the wrapper. It
assumes that the library which interposes sbrk is loaded before libc,
but it is relatively non-trivial to fail this. Something along the
lines, obviously not even compiled:

#ifdef PIC
void *
sbrk(intptr_t incr)
{
static void *(*libc_sbrk)(intptr_t);
void *ret;

if (libc_sbrk == NULL)
libc_sbrk = dlsym(RTLD_NEXT, sbrk);
sbrk_pre_hook(incr);
ret = libc_sbrk(incr);
sbrk_post_hook(ret, incr);
return (ret);
}
#endif

The sample is in fact thread-safe, or rather, not any more unsafe than
an sbrk() use itself.


pgpmpPyOYHTwM.pgp
Description: PGP signature


Re: What is the correct way to declare assembler global variable ?

2013-05-03 Thread Konstantin Belousov
On Fri, May 03, 2013 at 01:10:29PM -0700, Yuri wrote:
 I am trying to compile this code fragment into my program (taken from 
 lib/libc/amd64/sys/sbrk.S):
 void my_func() {
   ...
__asm__ __volatile__(
movq .curbrk(%%rip), %%rax;
lea  .curbrk(%%rip), %%rdx;
movq %%rax, %0;
movq %%rdx, %1;
: =r (my_curbrk),
  =r (my_curbrk_ptr)
:: %rax, %rdx);
...
 }
 
 I get a warning:
 /usr/bin/ld: warning: type and size of dynamic symbol 
 `.curbrk@@FBSDprivate_1.0' are not defined
 
 What is the correct way to declare .curbrk in in-place assembly?

Formal answer is for you to read about the .type directive in the GNU
as manual. Also, you need to read about either common symbols, or about
the .size directive.

But, note that you cannot access hidden libc symbols from the code which
links to libc (dynamically). You probably need to re-consider higher-level
approach to your issue.



pgp9v5aJINKcp.pgp
Description: PGP signature


Re: potential future proofing fix for aicasm build.

2013-05-02 Thread Konstantin Belousov
On Thu, May 02, 2013 at 10:57:23AM -0700, Xin Li wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA512
 
 On 05/02/13 00:19, Dimitry Andric wrote:
  On May 1, 2013, at 18:44, Alfred Perlstein alf...@ixsystems.com
  wrote:
  I took a shot at fixing this issue with building aicasm as part
  of buildkernel of an older 9.0 src on a machine running HEAD.
  
  aicasm.o: In function `__getCurrentRuneLocale': 
  /usr/include/runetype.h:96: undefined reference to
  `_ThreadRuneLocale'
  
  I don't understand this error message... It seems like a linker
  error, but it also seems to refer to an incorrect include file?  Is
  this during linking or compiling?
 
 This is because the locale code being a macro:
 
 #if defined(__NO_TLS) || defined(__RUNETYPE_INTERNAL)
 extern const _RuneLocale *__getCurrentRuneLocale(void);
 #else
 extern _Thread_local const _RuneLocale *_ThreadRuneLocale;
 static __inline const _RuneLocale *__getCurrentRuneLocale(void)
 {
 
 if (_ThreadRuneLocale)
 return _ThreadRuneLocale;
 if (_CurrentRuneLocale)
 return _CurrentRuneLocale;
 return _DefaultRuneLocale;
 }
 #endif /* __NO_TLS || __RUNETYPE_INTERNAL */
 
 What really puzzles me is that why the build picks up headers from the
 running system.

Because the utitily being build is intended to be run on the running system.


pgpuuWI9Zum8E.pgp
Description: PGP signature


Re: vmspace_fork()

2013-04-23 Thread Konstantin Belousov
On Tue, Apr 23, 2013 at 10:44:56AM -0700, Vijay Singh wrote:
 Hackers, what does hitting the following assert in vmspace_fork() imply?
 
  3101 http://fxr.watson.org/fxr/source/vm/vm_map.c#L3101
 new_map = vm2 http://fxr.watson.org/fxr/ident?im=3;i=vm2-vm_map
 http://fxr.watson.org/fxr/ident?im=3;i=vm_map;
  3102 http://fxr.watson.org/fxr/source/vm/vm_map.c#L3102
 locked http://fxr.watson.org/fxr/ident?im=3;i=locked =
 vm_map_trylock 
 http://fxr.watson.org/fxr/ident?im=3;i=vm_map_trylock(new_map);
 */* trylock to silence WITNESS */*
  3103 http://fxr.watson.org/fxr/source/vm/vm_map.c#L3103
 KASSERT http://fxr.watson.org/fxr/ident?im=3;i=KASSERT(locked
 http://fxr.watson.org/fxr/ident?im=3;i=locked, (*vmspace_fork: lock
 failed*));
I have hard time reading this html.

 
 I am hitting the assert in line 3103 and it seems like the assumption
 is that the trylock will
 always get the lock?
Assuming you are referencing line 3103 of the sys/vm/vm_map.c, yes,
the try lock operation must always succeed. The reason is that the new_map
AKA vm2 is freshly allocated and its pointer is not yet published for other
threads to see. So nothing could own the lock.

The try is done to silence the witness warning about the same lock type,
as indicated by the comment.


pgpkS59pQBGmW.pgp
Description: PGP signature


Re: [patch] pipe2

2013-04-22 Thread Konstantin Belousov
On Sun, Apr 21, 2013 at 11:02:43PM +0200, Jilles Tjoelker wrote:
 On Sat, Apr 20, 2013 at 12:48:39AM +0200, Jilles Tjoelker wrote:
  I'm also working on pipe2() (using linuxulator work) and dup3() (patch
  from Jukka A. Ukkonen).
 
 This is an implementation of pipe2. As with the accept4 patch, make
 sysent needs to be run in sys/kern and sys/compat/freebsd32.
 
 As a bonus, new architectures might implement pipe(p) as pipe2(p, 0)
 avoiding the need for assembler (but behaviour is different if the
 pointer is invalid).

I do not see anything wrong with the patch.  That said, I prefer the
pipe(2) approach of delegating the access to the usermode memory to
usermode, which avoids the need of the calls to kern_close().


pgphScrSccGS0.pgp
Description: PGP signature


Re: MADV_FREE and wait4 EFAULT

2013-04-20 Thread Konstantin Belousov
On Fri, Apr 19, 2013 at 11:53:22AM -0700, Carl Shapiro wrote:
 On Fri, Apr 19, 2013 at 5:49 AM, Konstantin Belousov 
 kostik...@gmail.comwrote:
 
  It would be of some interest to see the evidence.
 
 
 Certainly.  Here is some of the debugging messages that I added to my
 application.  The first line is a print statement that executes after the
 system call returns.  (As an aside, we issue system calls directly and do
 not link against the C library.)  The other 2 lines of interest are output
 from the dump of /proc/curproc/map that correspond to the status and rusage
 addresses.
 
 wait4 returned EFAULT, status is 0xc20021c0e8 rusage is 0xc2000eaf30
 ...
 0xc1 0xc20010 245 0 0xfe07cd9ebbc8 rw- 1 0 0x3000 COW NNC
 default - CH 1001
 ...
 0xc20020 0xc20030 250 0 0xfe0215e3ed98 rw- 1 0 0x3000 COW NNC
 default - CH 1001
 
 I realize this might not be satisfying but I am happy to provide any other
 information you might be interested in.
 
 Is your code multithreaded ?
 
 
 Yes.
Just observations/speculations:

the addresses you shown are not the (usual) addresses for the malloc heap
on amd64. It is possible to allocate and map enough shared libraries to
make malloc start using these addresses, but more likely your app is
using custom mmap() calls, possibly with the explicite address hints or
MAP_FIXED. Am I right ?

If yes and your app is multithreaded, it is possible that other thread
performs some manipulations on the address space while current thread
tries to access the range.

 
 The test case is required to decide whether the bug is in the application
  or in the OS.
 
 
 To be clear, I do not have a strong reason to believe there bug is in
 FreeBSD.  My original enquiry was solely into whether we were misusing
 MADV_FREE pages.  However, the wait4 failure is very suspicious because the
 only two addresses written to are out parameters.
In fact, the question is, why are you trying to access the memory after the
MADV_FREE. Was it correctly marked as 'still used' to prevent mmaps there ?

Note that all this is pure speculation.


pgp9JSagGP6_S.pgp
Description: PGP signature


Re: MADV_FREE and wait4 EFAULT

2013-04-19 Thread Konstantin Belousov
On Thu, Apr 18, 2013 at 02:51:43PM -0700, Carl Shapiro wrote:
 On Wed, Apr 17, 2013 at 1:21 AM, Konstantin Belousov 
 kostik...@gmail.comwrote:
 
  Did you ensured with e.g. ktrace and procstat -v that your assumptions
  hold, i.e. the addresses supplied as wait4(2) arguments are valid ?
  Please provide the minimal test case demonstrating the behaviour.
 
 
 Yes.  I instrumented my code to check for a wait4 failure, print the
 addresses of the status and rusage arguments, and dump the contents of
 /proc/curproc/map.  The addresses of the status and rusage arguments are
 always in the range of a mapping and marked as read write.
It would be of some interest to see the evidence.

Is your code multithreaded ?

 
 I have yet to distill the failure to a minimal test case.  The test case I
 do have is the test harness for the Go language.  After running for about
 45 minutes I can observe a failure.  I have been working to produce
 something smaller and faster.
The test case is required to decide whether the bug is in the application
or in the OS.

 
 
  MADV_FREE should only result in the possible lost of the previous
  content of the page, not in the faulting of the page access. From the
  inspection of the code, I do not see how MADV_FREE could result in
  the memory address becoming invalid.
 
 
 I see.  What has lead us to believe this might be an issue with page faults
 is that writing zeroes to the page with memset before passing it to wait4
 makes the error go away.
There is no difference in the access performed by copyout vs. access caused
by the usermode write.

 
 Do you have any advice about how one might go about instrumenting wait4 to
 generate more information about a failed copyout?  Are tools such as dtrace
 useful in these situations or might it be too invasive?  Because of the
 protracted test cycle and my lack of knowledge in this area, conducting
 experiments is quite painful at the moment.

No, I cannot give an advice, I think we should first decide which code
to blame.

BTW, you could try enabling sysctl machdep.uprintf_signal. Oh, you did not
specified the architecture and version of the system.


pgp0WoI0ucOnE.pgp
Description: PGP signature


Re: Synchronizing TSC

2013-04-18 Thread Konstantin Belousov
On Thu, Apr 18, 2013 at 05:16:22PM +0300, Alexander Motin wrote:
 On 17.04.2013 11:50, Konstantin Belousov wrote:
  On Wed, Apr 17, 2013 at 09:46:15AM +0300, Alexander Motin wrote:
  On 17.04.2013 03:25, Jim Harris wrote:
 
  On Tue, Apr 16, 2013 at 3:04 PM, Alexander Motin m...@freebsd.org
  mailto:m...@freebsd.org wrote:
 
   Hi.
 
   Recently I've got 6-core/12-thread system on Sandy Bridge-E Core
   i7-3930K CPU and was unpleasantly surprised to see that TSCs are not
   synchronized there. While all 11 APs were synchronized, BSP was far
   behind them. Since it is single-socket system, I don't know any good
   reason for such behavior except some BIOS bug. But I've recalled
   that somewhere was some discussions about possible TSC
   synchronization. I've implemented patch below that allows to adjust
   TSC values of BSPs to AP's one on boot using CPU MSRs, hoping that
   they should not diverge after that:
   http://people.freebsd.org/~__mav/tsc_adj2.patch
   http://people.freebsd.org/~mav/tsc_adj2.patch
 
   I don't know very much about all different TSC hardware to predict
   when it is safe to enable the functionality, but at least on my
   system being enabled via loader tunable it seems working well.
 
   Comments?
 
 
  You may be remembering this thread on r238755 last year:
 
  http://lists.freebsd.org/pipermail/svn-src-head/2012-July/038992.html
 
  This was a bug fix in the TSC synchronization test code though, not
  anything for trying to adjust out-of-sync TSCs.
 
  I remember that thread, but I think I've seen somebody told somewhere
  that it could be interesting to implement some MI mechanism. Never mind.
 
  The Intel SDM (volume 3, section 17.13 of March 2013 revision) says
  earlier models can only write to lower 32 bits of
  IA32_TIME_STAMP_COUNTER, but these models also should not have invariant
  TSC so they would never even get to your new routine.  So your patch
  seems OK for Intel CPUs, at least as a tunable that is disabled by 
  default.
 
  Thanks.
 
  My only concern would be why TSC on the BSP started out-of-sync on your
  system.  Theoretically, BIOS could adjust TSCs in SMM to try to hide SMI
  code execution from the OS, which could then make them out-of-sync
  again.  Not sure if that's what's happening here, but might be worth a
  test putting the TSC test code on a periodic timer to see if they ever
  get out of sync again.
 
  I did one more interesting observation: on every reboot drift between
  BSP and APs is growing proportionally to the previous system power-on
  time. On first boot it is -3878361036 (just above one second), after
  reboot some minutes later it is -1123454492776 (about 6 minutes), after
  another reboot it is -1853033521804 (about 10 minutes).
 
  Unless my adjustment code would be active, I would guess that AP's TSC
  is running linearly while BSP's for some reason reset to zero on every
  reboot. But since I am synchronizing them on each boot, the only
  possibility for it I see is that there is some other timer(s) /
  counter(s) not affected by MSR writes that ticks linearly and reloading
  AP's TSC, but for some reason not reloading BSP's.
 
  For me it sounds as the BIOS bug, indeed. Could you verify the content
  of IA32_TSC_ADJUST on all cores (I believe it is present on E5) ?
  Also, using TSC_ADJUST to correct the skew seems to be preferrable,
  according to the Intel docs.
 
 IA32_TSC_ADJUST register seems not present there. At least cpucontrol 
 doesn't want to read it. In Intel docs I also see it mentioned only in 
 context of future Haswell generation. And I don't see Standard Extended 
 Features line in dmesg.
 
  Why do you use cpuid in the assembly sequence ? As I understand, you
  ensure that there is a serialization point, but why do you need it ?
 
 The idea was to minimize time distance between following MSR read and 
 write. But may be it is not needed, I am not exactly sure about that magic.

Well, I do not believe that such trick is useful.

Patch with removed cpuid and with resync disabled by default (as it is
now, AFAIR) would be IMO fine for the commit.


pgpzjLRbnEjZP.pgp
Description: PGP signature


Re: MADV_FREE and wait4 EFAULT

2013-04-17 Thread Konstantin Belousov
On Tue, Apr 16, 2013 at 02:12:54PM -0700, Carl Shapiro wrote:
 I am seeing wait4 system calls failing with an EFAULT and I am trying to
 understand what might be going wrong.
 
 An inspection of the wait4 implementation suggests the  opportunity for
 EFAULT is within its invocations of copyout.  In my situation, the status
 and rusage pointer arguments contain addresses to mmaped pages which have
 been madvised as MADV_FREE.
 
 Is it permissible to pass pages which have been madvised MADV_FREE to wait4
 or any other system call for that matter?  Might there be another
 opportunity for a wait4 to EFAULT?

Did you ensured with e.g. ktrace and procstat -v that your assumptions
hold, i.e. the addresses supplied as wait4(2) arguments are valid ?
Please provide the minimal test case demonstrating the behaviour.

MADV_FREE should only result in the possible lost of the previous
content of the page, not in the faulting of the page access. From the
inspection of the code, I do not see how MADV_FREE could result in
the memory address becoming invalid.


pgprDGMZd7Ffy.pgp
Description: PGP signature


Re: Synchronizing TSC

2013-04-17 Thread Konstantin Belousov
On Wed, Apr 17, 2013 at 09:46:15AM +0300, Alexander Motin wrote:
 On 17.04.2013 03:25, Jim Harris wrote:
 
  On Tue, Apr 16, 2013 at 3:04 PM, Alexander Motin m...@freebsd.org
  mailto:m...@freebsd.org wrote:
 
  Hi.
 
  Recently I've got 6-core/12-thread system on Sandy Bridge-E Core
  i7-3930K CPU and was unpleasantly surprised to see that TSCs are not
  synchronized there. While all 11 APs were synchronized, BSP was far
  behind them. Since it is single-socket system, I don't know any good
  reason for such behavior except some BIOS bug. But I've recalled
  that somewhere was some discussions about possible TSC
  synchronization. I've implemented patch below that allows to adjust
  TSC values of BSPs to AP's one on boot using CPU MSRs, hoping that
  they should not diverge after that:
  http://people.freebsd.org/~__mav/tsc_adj2.patch
  http://people.freebsd.org/~mav/tsc_adj2.patch
 
  I don't know very much about all different TSC hardware to predict
  when it is safe to enable the functionality, but at least on my
  system being enabled via loader tunable it seems working well.
 
  Comments?
 
 
  You may be remembering this thread on r238755 last year:
 
  http://lists.freebsd.org/pipermail/svn-src-head/2012-July/038992.html
 
  This was a bug fix in the TSC synchronization test code though, not
  anything for trying to adjust out-of-sync TSCs.
 
 I remember that thread, but I think I've seen somebody told somewhere 
 that it could be interesting to implement some MI mechanism. Never mind.
 
  The Intel SDM (volume 3, section 17.13 of March 2013 revision) says
  earlier models can only write to lower 32 bits of
  IA32_TIME_STAMP_COUNTER, but these models also should not have invariant
  TSC so they would never even get to your new routine.  So your patch
  seems OK for Intel CPUs, at least as a tunable that is disabled by default.
 
 Thanks.
 
  My only concern would be why TSC on the BSP started out-of-sync on your
  system.  Theoretically, BIOS could adjust TSCs in SMM to try to hide SMI
  code execution from the OS, which could then make them out-of-sync
  again.  Not sure if that's what's happening here, but might be worth a
  test putting the TSC test code on a periodic timer to see if they ever
  get out of sync again.
 
 I did one more interesting observation: on every reboot drift between 
 BSP and APs is growing proportionally to the previous system power-on 
 time. On first boot it is -3878361036 (just above one second), after 
 reboot some minutes later it is -1123454492776 (about 6 minutes), after 
 another reboot it is -1853033521804 (about 10 minutes).
 
 Unless my adjustment code would be active, I would guess that AP's TSC 
 is running linearly while BSP's for some reason reset to zero on every 
 reboot. But since I am synchronizing them on each boot, the only 
 possibility for it I see is that there is some other timer(s) / 
 counter(s) not affected by MSR writes that ticks linearly and reloading 
 AP's TSC, but for some reason not reloading BSP's.

For me it sounds as the BIOS bug, indeed. Could you verify the content
of IA32_TSC_ADJUST on all cores (I believe it is present on E5) ?
Also, using TSC_ADJUST to correct the skew seems to be preferrable,
according to the Intel docs.

Why do you use cpuid in the assembly sequence ? As I understand, you
ensure that there is a serialization point, but why do you need it ?

The common knowledge is that for CPUs with invariant TSC, the TSC
counter is single-instance and located on uncore. For single-socket
configurations, your patch would be fine. But, for multi-socket
machines, each package has its own counter, and counters might drift.
As result, the initial synchronization would still allow the eventual
de-sync and this is problematic.


pgpUKPVTz52N0.pgp
Description: PGP signature


Re: devstat overhead VS precision

2013-04-15 Thread Konstantin Belousov
On Mon, Apr 15, 2013 at 08:42:03PM +0200, Pawel Jakub Dawidek wrote:
 On a mostly unrelated note when two threads (T0 and T1) call get*time()
 on two different cores, but T0 does that a bit earlier is it possible
 that T0 can get later time than T1?

Define earlier first.

If you have taken sufficient measures to prevent preemption and interruption,
e.g. by entering spinlock before the fragment that calls get*, then no,
it is impossible, at least not with any x86 timekeeping hardware we use.

On the other hand, if interrupts are allowed, all bets are off.


pgp5apjuytuPs.pgp
Description: PGP signature


Re: devstat overhead VS precision

2013-04-15 Thread Konstantin Belousov
On Mon, Apr 15, 2013 at 09:37:30PM +0200, Pawel Jakub Dawidek wrote:
 On Mon, Apr 15, 2013 at 10:18:15PM +0300, Konstantin Belousov wrote:
  On Mon, Apr 15, 2013 at 08:42:03PM +0200, Pawel Jakub Dawidek wrote:
   On a mostly unrelated note when two threads (T0 and T1) call get*time()
   on two different cores, but T0 does that a bit earlier is it possible
   that T0 can get later time than T1?
  
  Define earlier first.
  
  If you have taken sufficient measures to prevent preemption and 
  interruption,
  e.g. by entering spinlock before the fragment that calls get*, then no,
  it is impossible, at least not with any x86 timekeeping hardware we use.
  
  On the other hand, if interrupts are allowed, all bets are off.
 
 So if we consider only one thread, it is not possible for it to obtain
 time t0, be scheduled to different CPU and obtain t1 where t1  t0?

Yes, I believe this scenario is also not possible. The context switching
ensures that thread's view on the global memory is consistent. At least
it is so on x86, and I think it must be the same on all other architectures.
Otherwise the compiler emited code would not work.


pgprLMOfbzxUH.pgp
Description: PGP signature


Re: libprocstat(3): retrieve process command line args and environment

2013-03-31 Thread Konstantin Belousov
On Fri, Mar 29, 2013 at 02:31:57PM +0200, Mikolaj Golub wrote:
 On Fri, Mar 29, 2013 at 11:22:45AM +0200, Konstantin Belousov wrote:
  On Thu, Mar 28, 2013 at 11:18:21PM +0200, Mikolaj Golub wrote:
   On Thu, Mar 28, 2013 at 12:51:34PM +0200, Konstantin Belousov wrote:
   
In the generic Elf 64bit draft specification I have, the notes sections
are specified to consists of entries, each of which is an array of 
8-byte
words. I think we are right using the 8-byte alignment.
   
   I have impression many implementations use 4-byte alignment. E.g. in
   NetBSD:
   
   sys/kern/core_elf32.c:
   
   #define ELFROUNDSIZE4   /* XXX Should it be sizeof(Elf_Word)? */
   #define elfround(x) roundup((x), ELFROUNDSIZE)
  Note that this is core_elf32. I am concerned with the 64-bit cores.
 
 core_elf64.c:
 
 #define   ELFSIZE 64
 
 #include core_elf32.c
Also, the 4-bytes alignment is described in the comments in the glibc
csu/abi-note.S.

 
   
   Also, we have inconsistency with imgactl_elf.c/parse_notes(), which
   uses 4-byte alignment:
   
 note = (const Elf_Note *)((const char *)(note + 1) +
 roundup2(note-n_namesz, sizeof(Elf32_Addr)) +
 roundup2(note-n_descsz, sizeof(Elf32_Addr)));
   
   I suppose there were no issues before, because accidentally the sizes
   of all notes we had were 8 bytes aligned.
  Indeed, both ABI and NOINIT notes have size which is multiple of 8.
  
   
   Now, when I add new notes it will break things. I don't have strong
   opinion, it will be ok for me to leave 8-byte alignment and fix
   issues, just want to have strong support here :-)
  Well, while the issue is discussed and decided, you could just make
  your new notes size be multiple of 8 too.
 
 I thought about this too. Then I need to be more caerful when
 extracting stats from notes, because the length returned by
 procstat_core_get() can be more than a real payload.
 
 Ok, I will try this way.
 
 I could add length to the note header, which is currently contains
 only structsize, so it would became something like:
 
 struct {
   int structsize;
   int lenght;
 }
 
 But not sure it is worth doing, especially if the forced 8-bit
 alignment is a temporary mesure.
No, it is definitely not worth it.

I inspected imgact_elf.c:parse_note(), imgact_elf.c:putnote() and
rtld.c:digest_notes(). Only  putnote() uses 8-byte alignment.
Every other OS and our !coredump code assumes 4-byte alignment.

Does changing the putnote() to align on the 4-byte boundary cause
real change in the core file notes layout ?


pgppHqvLntCg7.pgp
Description: PGP signature


Re: libprocstat(3): retrieve process command line args and environment

2013-03-31 Thread Konstantin Belousov
On Sun, Mar 31, 2013 at 06:53:00PM +0300, Mikolaj Golub wrote:
 On Sun, Mar 31, 2013 at 04:40:47PM +0300, Konstantin Belousov wrote:
 
  I inspected imgact_elf.c:parse_note(), imgact_elf.c:putnote() and
  rtld.c:digest_notes(). Only  putnote() uses 8-byte alignment.
  Every other OS and our !coredump code assumes 4-byte alignment.
 
 Thanks!
  
  Does changing the putnote() to align on the 4-byte boundary cause
  real change in the core file notes layout ?
 
 Currently, we store only 4 types of notes in a core file:
 
 #define   NT_PRSTATUS 1   /* Process status. */
 #define   NT_FPREGSET 2   /* Floating point registers. */
 #define   NT_PRPSINFO 3   /* Process state info. */
 #define   NT_THRMISC  7   /* Thread miscellaneous info. */
 
 I checked the sizes of structures inserted into the notes, and on amd64
 they all are multiple of 8:
 
 (kgdb) p sizeof(prpsinfo_t) % 8
 $1 = 0
 (kgdb) p sizeof(prstatus_t) % 8
 $2 = 0
 (kgdb) p sizeof(prfpregset_t) % 8
 $3 = 0
 (kgdb) p sizeof(thrmisc_t) % 8
 $4 = 0
 
 so both 4-byte and 8-byte aligned.
Well, FreeBSD supports some more 64bit architectures, besides amd64.
At least on powerpc64, I get
prpsinfo 120 0
prstatus_t 344 0
prfpregset_t 264 0
thrmisc_t 24 0

Second column is sizeof(), third is sizeof() % 8. This is in fact not
much surprising, since all ABIs define the alignment of the structure
as the alignment of the most demanding member. And, because 64bit
architectures have 8-byte registers, it is indeed expected that the
size % 8 == 0.

 
 I believe that the patch below will not change the current core file
 notes layout, will make things consistent in our tree, and will make
 adding my procstat notes easier, if I use 4-byte alignment.
 
 Are you ok if I commit it before introducing my changes?
Yes, I believe this is the right thing to do.

 
 Index: sys/kern/imgact_elf.c
 ===
 --- sys/kern/imgact_elf.c (revision 248706)
 +++ sys/kern/imgact_elf.c (working copy)
 @@ -1538,10 +1538,10 @@ __elfN(putnote)(void *dst, size_t *off, const char
   *off += sizeof note;
   if (dst != NULL)
   bcopy(name, (char *)dst + *off, note.n_namesz);
 - *off += roundup2(note.n_namesz, sizeof(Elf_Size));
 + *off += roundup2(note.n_namesz, sizeof(Elf32_Size));
   if (dst != NULL)
   bcopy(desc, (char *)dst + *off, note.n_descsz);
 - *off += roundup2(note.n_descsz, sizeof(Elf_Size));
 + *off += roundup2(note.n_descsz, sizeof(Elf32_Size));
  }
  
  static boolean_t
 
 Also, shouldn't we update then the following comment in sys/elf_common.h?
 
 /*
  * Note header.  The .note section contains an array of notes.  Each
  * begins with this header, aligned to a word boundary.  Immediately
  * following the note header is n_namesz bytes of name, padded to the
  * next word boundary.  Then comes n_descsz bytes of descriptor, again
  * padded to a word boundary.  The values of n_namesz and n_descsz do
  * not include the padding.
  */
 
 -- 
 Mikolaj Golub


pgpt3ztmYjZwj.pgp
Description: PGP signature


Re: libprocstat(3): retrieve process command line args and environment

2013-03-29 Thread Konstantin Belousov
On Thu, Mar 28, 2013 at 11:18:21PM +0200, Mikolaj Golub wrote:
 On Thu, Mar 28, 2013 at 12:51:34PM +0200, Konstantin Belousov wrote:
 
  In the generic Elf 64bit draft specification I have, the notes sections
  are specified to consists of entries, each of which is an array of 8-byte
  words. I think we are right using the 8-byte alignment.
 
 I have impression many implementations use 4-byte alignment. E.g. in
 NetBSD:
 
 sys/kern/core_elf32.c:
 
 #define ELFROUNDSIZE4   /* XXX Should it be sizeof(Elf_Word)? */
 #define elfround(x) roundup((x), ELFROUNDSIZE)
Note that this is core_elf32. I am concerned with the 64-bit cores.

 
 Also, we have inconsistency with imgactl_elf.c/parse_notes(), which
 uses 4-byte alignment:
 
   note = (const Elf_Note *)((const char *)(note + 1) +
   roundup2(note-n_namesz, sizeof(Elf32_Addr)) +
   roundup2(note-n_descsz, sizeof(Elf32_Addr)));
 
 I suppose there were no issues before, because accidentally the sizes
 of all notes we had were 8 bytes aligned.
Indeed, both ABI and NOINIT notes have size which is multiple of 8.

 
 Now, when I add new notes it will break things. I don't have strong
 opinion, it will be ok for me to leave 8-byte alignment and fix
 issues, just want to have strong support here :-)
Well, while the issue is discussed and decided, you could just make
your new notes size be multiple of 8 too.

 
 BTW, looking at NetBSD code I see they set p_align in the note
 segement to ELFROUNDSIZE:
 
 /* Write out the PT_NOTE header. */
 ws.psections-p_type = PT_NOTE;
 ws.psections-p_offset = notestart;
 ws.psections-p_vaddr = 0;
 ws.psections-p_paddr = 0;
 ws.psections-p_filesz = notesize;
 ws.psections-p_memsz = 0;
 ws.psections-p_flags = PF_R;
 ws.psections-p_align = ELFROUNDSIZE;
 
 while we set to 0:
 
   /* The note segement. */
   phdr-p_type = PT_NOTE;
   phdr-p_offset = hdrsize;
   phdr-p_vaddr = 0;
   phdr-p_paddr = 0;
   phdr-p_filesz = notesz;
   phdr-p_memsz = 0;
   phdr-p_flags = 0;
   phdr-p_align = 0;
You mean, for the core dumps ?

 
 Shouldn't we set it to alignment size too? Note also, they set
 Segment is readable flag.
I think both changes are fine.

I skimmed over the usermode parts of the patch. One thing I did not liked
is the mis-handling of the read() return values. If there is short read,
the errno value is meaningless, but warn() would still append it to
the message. I suggest to explicitely distinguish -1 and = 0 returns
from reads.


pgp4aWv2zWVyL.pgp
Description: PGP signature


Re: libprocstat(3): retrieve process command line args and environment

2013-03-28 Thread Konstantin Belousov
On Sun, Mar 24, 2013 at 05:54:28PM +0200, Mikolaj Golub wrote:
 Here is an updated patch set, which I think includes all kib's
 suggestions. It also introduces procstat groups, umask, rlimit, and
 osrel notes.
 
 http://people.freebsd.org/~trociny/procstat_core.3.patch
 
 Sbuf section interface looks like below:
 
 /* start a section */
 sbuf_start_section(sb, NULL);
 /* write something to sbuf */
 ...
 /* start a subsection (need to save the current section length) */
 sbuf_start_section(sb, old_len);
 /* write something to sbuf */
 ...
 /* end a subsection padding to 4 bytes with '\0' bytes 
   (need to provide the previously saved section length) */
 sbuf_end_section(sb, old_len, 4, 0);
 ...
  /* aling the whole section to page size */
 sbuf_end_section(sb, -1, PAGE_SIZE, 0);
This looks fine.

 
 Open issues/questions:
 
 1) I would also like to make libprocstat(3) extract environment, args,
 and auxv from the core. It looks like I don't need to store envv and
 argv in notes, as it is already present in the core. But I think I
 need to know psstrings to find them.
 
 Would it be ok, if I add auxv and psstrings notes, and extract envv
 and agrv from a program section in the core?
Yes, I agree that not doing this in kernel at the time of the core
dump is preferred.

As usual, both args/env and aux vector could be overriden by the userspace,
so its reading is only best-efforts operation.

 
 2) I started NT_PROCSTAT constants from the first not used number in
 elf_common.h, i.e. 8. But in this case they conflict with those
 available on other systems:
 
 contrib/binutils/include/elf/common.h:
 
 #define NT_PSTATUS  10  /* Has a struct pstatus */
 #define NT_FPREGS   12  /* Has a struct fpregset */
 #define NT_PSINFO   13  /* Has a struct psinfo */
 #define NT_LWPSTATUS16  /* Has a struct lwpstatus_t */
 #define NT_LWPSINFO 17  /* Has a struct lwpsinfo_t */
 #define NT_WIN32PSTATUS 18  /* Has a struct win32_pstatus */
 
 Although note name = FreeBSD should have disambiguated this, readelf
 looks like ignores this and its output for my core on i386 looks
 confusing:
 
   Owner Data size   Description
   FreeBSD   0x006c  NT_PRPSINFO (prpsinfo structure)
   FreeBSD   0x0068  NT_PRSTATUS (prstatus structure)
   FreeBSD   0x00b0  NT_FPREGSET (floating point registers)
   FreeBSD   0x0018  NT_THRMISC (thrmisc structure)
   FreeBSD   0x0304  Unknown note type: (0x0008)
   FreeBSD   0x0a6c  Unknown note type: (0x0009)
   FreeBSD   0x08d4  NT_PSTATUS (pstatus structure)
   FreeBSD   0x000c  Unknown note type: (0x000b)
   FreeBSD   0x0006  NT_FPREGS (floating point registers)
   FreeBSD   0x00d4  NT_PSINFO (psinfo structure)
   FreeBSD   0x0008  Unknown note type: (0x000e)
 
 Should I use some other range for NT_PROCSTAT notes?
I think the only possible solution is to post on the binutils list
and fix the readelf ignore of the note vendor.

 
 3) Following our current code I align notes to sizeof(Elf_Size), which
 is 4 on i386 and 8 on amd64.
 
 But I have an issue reading the notes by readelf on amd64, which alway
 uses 4 byte alignment:
 
 contrib/binutils/binutils/readelf.c:
 
   next = (Elf_External_Note *)(inote.descdata + align_power 
 (inote.descsz, 2));
 
 where
 
   #define align_power(addr, align)\
   (((addr) + ((bfd_vma) 1  (align)) - 1)  ((bfd_vma) -1  (align)))
 
 Should I change alignment to 4 bytes?
In the generic Elf 64bit draft specification I have, the notes sections
are specified to consists of entries, each of which is an array of 8-byte
words. I think we are right using the 8-byte alignment.

 
 4) In libprocstat I added new functions and placed them under already
 existent FBSD_1.3 version section in Symbol.map.
 
 Shouldn't I bump the version? Won't I need any additional care if I
 want to MFC the code to stable/9 and may be 8?
Version of what ? MFC does not require any additional actions, FBSD_1.3
is the valid version namespace in 9 and 8.



pgpJEnpLmwEkV.pgp
Description: PGP signature


Re: syscalls per process ?

2013-03-26 Thread Konstantin Belousov
On Tue, Mar 26, 2013 at 11:58:02AM +1000, Paul Koch wrote:
 Hi,
 
 At the start of all our programs, we set up a SIGALRM signal handler
 to retrieve rusage stats each second and log this info for analysis.
 It is useful for long running programs.  One of the things we
 would really really like to get is the number of system calls the
 process has performed.
 
 Is there a simple way for a process to retrieve its own number of
 syscalls it has performed ?   Pity it's not available via getrusage().
 
 We don't want to run an external program (eg. truss/dtrace) on each
 program.

AFAIR, the per-threads syscalls are not accounted. The accounting was
performed for some short time, but it was only used as a workaround for
some nfs client issue, and since the alternate approach was used, the
td_syscalls was removed.

Only global system count of the syscalls is maintained.


pgphES023wWDl.pgp
Description: PGP signature


Re: [patch] SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC

2013-03-18 Thread Konstantin Belousov
On Sun, Mar 17, 2013 at 10:23:53PM +0100, Jilles Tjoelker wrote:
 Here are some more modifications to allow creating file descriptors with
 close-on-exec set. Like in linux/glibc, SOCK_CLOEXEC and SOCK_NONBLOCK
 can be OR'ed in socket() and socketpair()'s type parameter, and
 MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS)
 atomically close-on-exec.
 
 The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in
 NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_*.
 
 I do not pass the SOCK_* flags to MAC because this may cause incorrect
 failures and can be done later via fcntl() anyway. I expect audit to
 cope with the new flags.
 
 For MSG_CMSG_CLOEXEC, I had to change unp_externalize to take a flags
 argument.

This looks fine to me.

The only note I have, which is not directly related to your patch,
is the recvmsg(2) behaviour when the undefined flag is passed.
The syscall silently ignores the flags. I think this is quite wrong,
and would cause interesting (security) implications if the program
using the MSG_CMSG_CLOEXEC is run on older kernel which does not
interpret the flag.

Might be, we should start returning EINVAL for unknown flag, despite
SUSv4 not specifying the condition ?


pgp6vw8MKGyIu.pgp
Description: PGP signature


Re: libprocstat(3): retrieve process command line args and environment

2013-03-17 Thread Konstantin Belousov
On Sun, Mar 17, 2013 at 12:35:20AM +0200, Mikolaj Golub wrote:
 On Sat, Mar 16, 2013 at 09:16:05PM +0200, Konstantin Belousov wrote:
 
  IMO sbuf_pad() should be moved to subr_sbuf.c. I find the KPI of
  the sbuf_pad() wrong. You have two separate number, the amount to
  pad to, and the current amount. Natural interface would take the
  two numbers separate instead of the difference. Also, could the
  sbuf_pad() deduce the current amount on its own, this would be the
  best ?
 
 Hm, I am not sure about this.
 
 So are you proposing instead of something like this
 
   sbuf_pad(sb, roundup(x, y) - x, 0);
 
 to have
 
   sbuf_pad(sb, x, roundup(x, y), 0)?
 
 Although it is a little easier to write, it looks less intuitive for
 me.  E.g. I have troubles how to document this and explain. I can't
 reffer x as a current position in sbuf, because it might not be. It is
 just a some position, roundup(x,y) is another one, and only their
 difference makes sence for sbuf_pad, so why not just to provide this
 difference? So
 
   sbuf_pad(sb, from, to, c);
 
 looks for me less intutive than
 
   sbuf_pad(sb, len, c);
 
 A KPI that would be natural for my case: 
 
   /* start a section that is going to be aligned to sizeof(Elf_Size),
  using byte '0' for padding */
   sbuf_padded_start(sb, sizeof(Elf_Size), 0);
   /* write something to sbuf */
   sbuf_bcat(sb, data, len);
   /* align the sbuf section */
   sbuf_pad(sb);
 
 This might look a little awkward and would add some overhead for the normal
 case though...
This looks fine, in fact. You might want to call it sbuf_start_section()
and sbuf_end_section() ?

 
  In register_note(), put spaces around '|' for the malloc line.
  
  It seems that you did not moved the 'ABI hack' for ENOMEM situation for
  sysctl_kern_proc_filedesc() into the rewritten function.
 
 
 Ah, this is a thing I wanted to discuss but forgot! As I understand
 the idea of the 'ABI hack' is: if the output buffer is less than the
 size of data we have, truncate our data to the last successfully
 written kinfo_file structure and return without error.
 
 In my code I do reset ENOMEM to 0 (see sysctl_kern_proc_filedesc), but
 I don't see a way how using sbuf interface I can truncate req-oldidx
 to the last successfully written file: sbuf_bcat() (to internal
 buffer) may be successful and the error might appear only later, when
 draining. I suspect it will break things if I return with a partial
 kinfo_file, but haven't come with a solution yet...
All you need is to reset req-oldidx. This could be done outside the
sbuf interface, in the top level function implementing the sysctl ?

What you propose in the follow-up message should work too, I do not
have any preference.
 
  Please commit the changes to use pget() in the sysctl handlers separately.
  
  Indents after the else clauses in kern_proc_out() are wrong.
 
 Do you mean indents after '#ifdef COMPAT_FREEBSD32' block? I did it
 that way so if COMPAT_FREEBSD32 sections were removed from the code
 the indentation would be correct. I saw this practice through the code
 and used it myself before.
The sections are not going to be removed. IMHO code should be formatted
as if the preprocessor directive lines are not present. Could you point
out an example of existing code consistent with your indentation ?

 
  Since you are moving the KERN_PROC_ZOMBMASK out of kern_proc.c,
  a comment is needed to describe its usage. I would find it very
  confusing if grep reveals no like of code that sets the flags.
  
  In the comment for sbuf_drain_core_output(), s/drainig/draining/,
  s/sefely/safely/ and s/hold/call us with the process lock held/.
  
  I do not see much sense in the kstack note. The core is dumped when
  the threads are moved into the safe state in the kernel, so you
  cannot catch 'living' stack backtraces for the kernel side of
  things.
 
 I was afraid of it after I had tried it on real dumps :-( Ok, will
 remove the kstack note.
 
  On the other hand, things like umask, resources and osrel might be
  of the interest for post-morted analysis.
 
 This is in my TODO list.
 
  I did not looked at the usermode changes.
 
 Thanks for your suggestions! Will do them.
 
 -- 
 Mikolaj Golub


pgpB2nVH0GTw4.pgp
Description: PGP signature


Re: Linking produces unusable executable

2013-03-10 Thread Konstantin Belousov
On Sun, Mar 10, 2013 at 09:00:02PM -0400, Ilya Kaliman wrote:
 Hello, hackers!
 
 I have a strange problem with a big executable. I have a piece of scientific
 software and some C++ module in it with a LOT of template code. When compiled
 this module produces 450 MB archive file (w/o debugging symbols). Now, when I
 compile the program without this module everything works perfect. With this
 module turned on the linker produces an executable (around 180 MB in size)
 without any errors or warnings. But when I try to start the final program zsh
 just says: abort. ldd on this exe says: ./a.out: signal 6.
 
 I watched the memory consumption during linking and it doesn't seem to exhaust
 all available memory (the linker seem to stop allocating on around 2 GB). I've
 also tried to enable --no-keep-memory for ld with no luck - linking still
 produces no errors but the resulting executable is unusable.
 
 I've tried it on 9.1 and 10-CURRENT with both gcc/g++/ld from the base system
 and from ports (gcc 4.7.3, binutils 2.23.1) and with clang.
 
 I've tried to build some of the bigger ports like chromium (just in case):
 all works fine.
 
 Everything works on linux though (with the same gcc/ld). With debugging 
 symbols
 the exe is around 1GB, without them its around 200MB. Works fine in every case
 with different optimization levels.
 
 Any ideas how to solve this?

For start, it would be nice to provide some useful information together
or even instead of the long story.

What is the architecture ? Show at least the output of the size(1) on
the final binary. Show exact shell message on the attempt of the binary
run. Show the ktrace/kdump of the start attempt.

As a guess, look at the sysctl kern.maxtsiz and compare it to the text
size reported by the size(1). The same for the initialized data segment
and kern.maxdsiz.


pgph__UaQKK53.pgp
Description: PGP signature


Re: clang generated code sometimes confuses fbt

2013-03-02 Thread Konstantin Belousov
On Sat, Mar 02, 2013 at 09:23:15PM +0100, Dimitry Andric wrote:
 On 2013-03-02 18:52, Andriy Gapon wrote:
  on 02/03/2013 19:35 Andriy Gapon said the following:
  Now, I am not quite sure why ctfconvert skips bpobj_iterate_impl in the
  clang-generated code.  Seems like some sort of a bug in ctfconvert.
 
  It seems that gcc and clang put different names for symbol of type FILE:
  clang:
  readelf -a -W /usr/obj/usr/src/sys/TRANT/bpobj.o| fgrep -w FILE
1:  0 FILELOCAL  DEFAULT  ABS
  /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/bpobj.c
 
  gcc:
  readelf -a -W /usr/obj/usr/src/sys/ODYSSEY/bpobj.o| fgrep -w FILE
1:  0 FILELOCAL  DEFAULT  ABS bpobj.c
 
  ctfconvert seems to compare this value with bpobj.c and so in the clang 
  case
  it doesn't recognize the static symbols.
 
  Does my analysis seem reasonable?
 
 Have you verified that ctfconvert does the right thing, if you modify
 the FILE symbol to have just the filename?
 
 Indeed, clang puts the original filename from the command line in the
 .file directive, while gcc explicitly removes any directory names; see
 contrib/gcc/toplev.c, around line 680:
 
void
output_file_directive (FILE *asm_file, const char *input_name)
{
  int len;
  const char *na;
 
  if (input_name == NULL)
input_name = stdin;
 
  len = strlen (input_name);
  na = input_name + len;
 
  /* NA gets INPUT_NAME sans directory names.  */
  while (na  input_name)
{
   if (IS_DIR_SEPARATOR (na[-1]))
 break;
   na--;
}
 ...
 
 That NA gets INPUT_NAME sans directory names comment was inserted by
 rms in r279. :-)  So I guess this is the way gcc has done it from the
 start, but there is no explanation as to why rms chose to remove those
 directory names.  I do not see the problem, except maybe for having
 reproducible builds?

I seems that at least gdb also depends on the stripping the path for stabs
(which is not dwarf) debugging format interpretation. On the FreeBSD
system, do 'info', select 'stabs' - Stab Sections - Elf Linker
Relocation. The last paragraph documents the gdb requirements.

So it seems that stripped path in the STT_FILE is the common expectation.


pgpIIMEAO2ALX.pgp
Description: PGP signature


Re: [patch] statfs does not detect -t nullfs -o union as a union mount

2013-03-01 Thread Konstantin Belousov
On Sat, Mar 02, 2013 at 01:00:33AM +0100, Jilles Tjoelker wrote:
 On Thu, Feb 28, 2013 at 12:21:44AM +0200, Konstantin Belousov wrote:
  On Wed, Feb 27, 2013 at 10:31:42PM +0100, Jilles Tjoelker wrote:
   While testing recent changes to opendir(), I noticed that fstatfs() does
   not return the MNT_UNION flag for a -t nullfs -o union mount. As a
   result, opendir()/readdir() return files that exist in both top and
   bottom directories twice (at least . and ..). Other -o union mounts and
   -t unionfs mounts work correctly in this regard.
 
   The below patch passes through just the MNT_UNION flag of the nullfs
   mount itself. Perhaps more flags should be passed through.
 
   commit fce32a779af4eb977c9b96feb6e4f811d89f2881
   Author: Jilles Tjoelker jil...@stack.nl
   Date:   Sat Feb 23 22:22:39 2013 +0100
   
   nullfs: Preserve the MNT_UNION flag of the nullfs mount itself.
   
   This is needed so that opendir() can properly detect a union mount 
   like
   mount -t nullfs -o union dir1 dir2.
   
   diff --git a/sys/fs/nullfs/null_vfsops.c b/sys/fs/nullfs/null_vfsops.c
   index 3724e0a..ff06f57 100644
   --- a/sys/fs/nullfs/null_vfsops.c
   +++ b/sys/fs/nullfs/null_vfsops.c
   @@ -313,7 +313,7 @@ nullfs_statfs(mp, sbp)

 /* now copy across the interesting information and fake the rest */
 sbp-f_type = mstat.f_type;
   - sbp-f_flags = mstat.f_flags;
   + sbp-f_flags = (sbp-f_flags  MNT_UNION) | mstat.f_flags;
 sbp-f_bsize = mstat.f_bsize;
 sbp-f_iosize = mstat.f_iosize;
 sbp-f_blocks = mstat.f_blocks;
 
  Would it make sense to preserve more flags from the upper mount ?
  I see a use for MNT_NOEXEC as well, at least.
 
 Yes, preserving MNT_NOEXEC will make -t nullfs -o noexec work better, in
 particular rtld's check for libraries loaded via environment variables.
 
 In the same way MNT_RDONLY, MNT_NOSUID and MNT_NOSYMFOLLOW could be
 preserved.
 
 On the other hand, MNT_ROOTFS should probably not be passed through from
 the underlying filesystem.
 
 This would give
 sbp-f_flags = (sbp-f_flags  (MNT_RDONLY | MNT_NOEXEC | MNT_NOSUID |
 MNT_UNION | MNT_NOSYMFOLLOW) | (mstat.f_flags  ~MNT_ROOTFS);
I think this is fine.


pgpbxLGJDTdz9.pgp
Description: PGP signature


Re: [patch] statfs does not detect -t nullfs -o union as a union mount

2013-02-27 Thread Konstantin Belousov
On Wed, Feb 27, 2013 at 10:31:42PM +0100, Jilles Tjoelker wrote:
 While testing recent changes to opendir(), I noticed that fstatfs() does
 not return the MNT_UNION flag for a -t nullfs -o union mount. As a
 result, opendir()/readdir() return files that exist in both top and
 bottom directories twice (at least . and ..). Other -o union mounts and
 -t unionfs mounts work correctly in this regard.
 
 The below patch passes through just the MNT_UNION flag of the nullfs
 mount itself. Perhaps more flags should be passed through.
 
 commit fce32a779af4eb977c9b96feb6e4f811d89f2881
 Author: Jilles Tjoelker jil...@stack.nl
 Date:   Sat Feb 23 22:22:39 2013 +0100
 
 nullfs: Preserve the MNT_UNION flag of the nullfs mount itself.
 
 This is needed so that opendir() can properly detect a union mount like
 mount -t nullfs -o union dir1 dir2.
 
 diff --git a/sys/fs/nullfs/null_vfsops.c b/sys/fs/nullfs/null_vfsops.c
 index 3724e0a..ff06f57 100644
 --- a/sys/fs/nullfs/null_vfsops.c
 +++ b/sys/fs/nullfs/null_vfsops.c
 @@ -313,7 +313,7 @@ nullfs_statfs(mp, sbp)
  
   /* now copy across the interesting information and fake the rest */
   sbp-f_type = mstat.f_type;
 - sbp-f_flags = mstat.f_flags;
 + sbp-f_flags = (sbp-f_flags  MNT_UNION) | mstat.f_flags;
   sbp-f_bsize = mstat.f_bsize;
   sbp-f_iosize = mstat.f_iosize;
   sbp-f_blocks = mstat.f_blocks;

Would it make sense to preserve more flags from the upper mount ?
I see a use for MNT_NOEXEC as well, at least.


pgpZ9zKLu606K.pgp
Description: PGP signature


Re: [patch] Wine DLL base address patches

2013-02-26 Thread Konstantin Belousov
On Tue, Feb 26, 2013 at 10:52:15PM +0200, Damjan Jovanovic wrote:
 On Fri, Feb 22, 2013 at 5:19 AM, Damjan Jovanovic damjan@gmail.com 
 wrote:
  On Thu, Feb 21, 2013 at 5:44 PM, Konstantin Belousov
  kostik...@gmail.com wrote:
  On Thu, Feb 21, 2013 at 12:57:45AM +0200, Damjan Jovanovic wrote:
  On Wed, Feb 20, 2013 at 10:51 PM, Tijl Coosemans t...@coosemans.org 
  wrote:
   On 20-02-2013 16:48, Konstantin Belousov wrote:
   On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote:
   Hi
  
   Wine needs some of its libraries to be loaded at specific base
   addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently
   lacks.
  
   I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that
   loads libraries at their preferred base addresses
   (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port
   of Prelink to FreeBSD which Wine uses to set base addresses
   (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-),
   the changed dynamic loader doesn't show any problems in a few days of
   testing, and prelink works with the --reloc-only option as used by
   Wine.
  
   Please review/test/comment/commit.
  
   Unfortunately, it is not safe. MAP_FIXED overrides any previous 
   mappings
   which could exist at the specified address.
  
   I've simplified the rtld patch to a single line. The second patch makes
   Wine use -Ttext-segment linker flag instead of prelink. This requires
   binutils from ports, but it's easier than porting prelink.
  
 
  All of that occurred to me as well.
 
  The problem with that one-line rtld patch is that loading an
  application will now fail if any of its libraries cannot be loaded at
  their requested address.
  But this is intended behaviour. Also, the default virtaddr base for the
  shared libraries is 0, so the existing binaries should be not affected.
 
  In that case, and since failing to load a library only causes the
  process to exit when starting up and not when it calls dlopen(), I
  approve.
 
 
  The problem with -Ttext-segment (and isn't it just -Ttext?) is that it
  doesn't seem to work: the base_vaddr seen by rtld will remain 0, and
  the address listed in /proc/.../map is different from what it should
  be. Also run readelf -l on a library compiled that way and compare
  with the output of one run through prelink --reloc-only, you'll see
  the lowest VirtAddr and PhysAddr in LOAD headers change only with
  prelink. I really ported prelink because there was no other choice.
  The -Ttext-segment does work. As indicated by Tijl, you need recent
  binutils.  I just verified that ld 2.32.1 obeys -Ttext-segment.
 
  You can also take a look at the default linker script to see how
  -Ttext-segment is used, look for SEGMENT_START(text-segment).
 
 
  My apologies: I confused -Ttext which is documented but doesn't work,
  with -Ttext-segment which is undocumented in FreeBSD 9.1 and might
  work. I would test it further, but -CURRENT doesn't installworld
  (ERROR: Required auditdistd user is missing, see /usr/src/UPDATING.)
  and I am away until next week.
 
  Prelink is now in Ports. What I'd recommend is checking if the
  binaries are the same, and if not, doing a diff between readelf -a
  outputs of the prelinked binary vs -Ttext-segmented binary. Also run
  this a few times and make sure the address is what's expected:
 
  #include windows.h
  #include stdio.h
  int main(int argc, char **argv)
  {
printf(%p\n, LoadLibrary(KERNEL32));
return 0;
  }
 
  mingw32-gcc hello.c -o hello.exe
  wine hello.exe
 
 
 With binutils 2.23.1 (in ports), comparing the output of ld
 -Ttext-segment=0x7b80 and prelink --reloc-only 0x7b80 using
 diffs of readelf -a outputs gives this:
 -11: 7b80 0 OBJECT  LOCAL  DEFAULT6
 _GLOBAL_OFFSET_TABLE_
 +11:  0 OBJECT  LOCAL  DEFAULT6
 _GLOBAL_OFFSET_TABLE_
 in other words, prelink also shifts the global offset table to the
 requested base address, ld does not. I don't think this matters since
 it's only ELF segments that get loaded - sections are irrelevant.
I suspect that it is sort of bug in ld. On the other hand, _G_O_T_ symbol
should be not used for real relocations, because both i386 and amd64 define
specific relocations which allow to reference the begining of the GOT.
The symbol is mostly a symbolic way to generate the relocations.

So indeed, this should be fine.
 
 objdump -s finds no differences.
 
 So I am happy with all of Tijl's patches, please commit them.

I expect Tijl to do it himself.


pgpg3p0c_4gD1.pgp
Description: PGP signature


Re: Building a program to be used only during make buildkernel

2013-02-25 Thread Konstantin Belousov
On Mon, Feb 25, 2013 at 09:34:46AM +0100, Jean-S?bastien P?dron wrote:
 I haven't tried to build the module into the kernel yet.

IMO this is not a target for now.  The KMS currently is not integrated
with syscons at all, so startup of a KMS driver causes blank screen and
lost console. The vidconsole is functional, but it has no use for the
user who cannot see the output.  This is the state of i915 driver, which
does not provide a files glue for static linking into the kernel, for
the described reason.

Solving integration issues with syscons is rather pointless, both due
to the baroque interface of the syscons, and due to the supposed replacement
of syscons, available in the base/user/ed/newcons.


pgpG41nDnQLxk.pgp
Description: PGP signature


Re: Building a program to be used only during make buildkernel

2013-02-22 Thread Konstantin Belousov
On Fri, Feb 22, 2013 at 02:36:42PM +0100, Jean-S?bastien P?dron wrote:
 Hello,
 
 During the build of the radeon DRM driver in Linux, a C program is
 first built to generate headers included by the driver. The program is
 not meant to be installed.
 
 In FreeBSD, how would one build and use such a program during the build
 of a kernel module? And what if the driver is built into the kernel instead?

Look at the genasssym program, used by the kernel and several ABI modules
for exactly this use.


pgpI9C7XfLA3X.pgp
Description: PGP signature


Re: [patch] Wine DLL base address patches

2013-02-21 Thread Konstantin Belousov
On Thu, Feb 21, 2013 at 12:57:45AM +0200, Damjan Jovanovic wrote:
 On Wed, Feb 20, 2013 at 10:51 PM, Tijl Coosemans t...@coosemans.org wrote:
  On 20-02-2013 16:48, Konstantin Belousov wrote:
  On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote:
  Hi
 
  Wine needs some of its libraries to be loaded at specific base
  addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently
  lacks.
 
  I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that
  loads libraries at their preferred base addresses
  (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port
  of Prelink to FreeBSD which Wine uses to set base addresses
  (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-),
  the changed dynamic loader doesn't show any problems in a few days of
  testing, and prelink works with the --reloc-only option as used by
  Wine.
 
  Please review/test/comment/commit.
 
  Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings
  which could exist at the specified address.
 
  I've simplified the rtld patch to a single line. The second patch makes
  Wine use -Ttext-segment linker flag instead of prelink. This requires
  binutils from ports, but it's easier than porting prelink.
 
 
 All of that occurred to me as well.
 
 The problem with that one-line rtld patch is that loading an
 application will now fail if any of its libraries cannot be loaded at
 their requested address.
But this is intended behaviour. Also, the default virtaddr base for the
shared libraries is 0, so the existing binaries should be not affected.

 
 The problem with -Ttext-segment (and isn't it just -Ttext?) is that it
 doesn't seem to work: the base_vaddr seen by rtld will remain 0, and
 the address listed in /proc/.../map is different from what it should
 be. Also run readelf -l on a library compiled that way and compare
 with the output of one run through prelink --reloc-only, you'll see
 the lowest VirtAddr and PhysAddr in LOAD headers change only with
 prelink. I really ported prelink because there was no other choice.
The -Ttext-segment does work. As indicated by Tijl, you need recent
binutils.  I just verified that ld 2.32.1 obeys -Ttext-segment.

You can also take a look at the default linker script to see how
-Ttext-segment is used, look for SEGMENT_START(text-segment).

 
 See my attached test cases, and examine the contents of /proc/.../map
 while they run.




pgpsVpqSyF9L5.pgp
Description: PGP signature


Re: [patch] Wine DLL base address patches

2013-02-20 Thread Konstantin Belousov
On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote:
 Hi
 
 Wine needs some of its libraries to be loaded at specific base
 addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently
 lacks.
 
 I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that
 loads libraries at their preferred base addresses
 (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port
 of Prelink to FreeBSD which Wine uses to set base addresses
 (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-),
 the changed dynamic loader doesn't show any problems in a few days of
 testing, and prelink works with the --reloc-only option as used by
 Wine.
 
 Please review/test/comment/commit.

Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings
which could exist at the specified address.

What you could do, is
- either specify the desired base without MAP_FIXED. Kernel takes the base
  as the hint, but only uses it if the request completely fits into the
  hinted area without overriding any existing mapping. Also, the heap
  is explicitely avoided.
  The code would need to check if kernel honored hint by comparing the
  actual map address with the desired address, and act accordingly.
- or add a new mmap(2) flag, which would specify that mapping should be
  done at that address, if possible to not overlap with existing mapping,
  otherwise the mapping should fail.


pgpTmtvgtVj_B.pgp
Description: PGP signature


Re: [patch] Wine DLL base address patches

2013-02-20 Thread Konstantin Belousov
On Wed, Feb 20, 2013 at 09:51:37PM +0100, Tijl Coosemans wrote:
 On 20-02-2013 16:48, Konstantin Belousov wrote:
  On Wed, Feb 20, 2013 at 05:29:01PM +0200, Damjan Jovanovic wrote:
  Hi
 
  Wine needs some of its libraries to be loaded at specific base
  addresses (https://wiki.freebsd.org/Wine), something FreeBSD currently
  lacks.
 
  I've written a patch to the dynamic loader (/libexec/ld-elf.so.1) that
  loads libraries at their preferred base addresses
  (http://www.freebsd.org/cgi/query-pr.cgi?pr=176216), as well as a port
  of Prelink to FreeBSD which Wine uses to set base addresses
  (http://www.freebsd.org/cgi/query-pr.cgi?pr=176283). Both work :-),
  the changed dynamic loader doesn't show any problems in a few days of
  testing, and prelink works with the --reloc-only option as used by
  Wine.
 
  Please review/test/comment/commit.
  
  Unfortunately, it is not safe. MAP_FIXED overrides any previous mappings
  which could exist at the specified address.
 
 I've simplified the rtld patch to a single line. The second patch makes
 Wine use -Ttext-segment linker flag instead of prelink. This requires
 binutils from ports, but it's easier than porting prelink.
 
 Index: libexec/rtld-elf/map_object.c
 ===
 --- libexec/rtld-elf/map_object.c (revision 246986)
 +++ libexec/rtld-elf/map_object.c (working copy)
 @@ -175,7 +175,7 @@ map_object(int fd, const char *path, const struct
  base_vaddr = trunc_page(segs[0]-p_vaddr);
  base_vlimit = round_page(segs[nsegs]-p_vaddr + segs[nsegs]-p_memsz);
  mapsize = base_vlimit - base_vaddr;
 -base_addr = hdr-e_type == ET_EXEC ? (caddr_t) base_vaddr : NULL;
 +base_addr = (caddr_t) base_vaddr;
  
  mapbase = mmap(base_addr, mapsize, PROT_NONE, MAP_ANON | MAP_PRIVATE |
MAP_NOCORE, -1, 0);

If this is enough for wine, I definitely have no objection.  The typical
dso has zero virtual base for the first segment, so the patch should change
nothing for dso built without special map file.


 Index: Makefile
 ===
 --- Makefile  (revision 312556)
 +++ Makefile  (working copy)
 @@ -28,7 +28,7 @@
  LATEST_LINK= wine-devel
  CPPFLAGS+=   -I${LOCALBASE}/include
  LDFLAGS+=-L${LOCALBASE}/lib
 -USE_GCC= any
 +USE_GCC= 4.7
  GNU_CONFIGURE=   yes
  CONFIGURE_ARGS+=--verbose --disable-tests \
   --without-alsa --without-capi --without-dbus \
 Index: files/patch-tools-winegcc-utils.h
 ===
 --- files/patch-tools-winegcc-utils.h (revision 0)
 +++ files/patch-tools-winegcc-utils.h (working copy)
 @@ -0,0 +1,12 @@
 +--- tools/winegcc/utils.h.orig
  tools/winegcc/utils.h
 +@@ -42,7 +42,8 @@
 + 
 + enum target_platform
 + {
 +-PLATFORM_UNSPECIFIED, PLATFORM_APPLE, PLATFORM_SOLARIS, 
 PLATFORM_WINDOWS, PLATFORM_CYGWIN
 ++PLATFORM_UNSPECIFIED, PLATFORM_APPLE, PLATFORM_FREEBSD, 
 PLATFORM_SOLARIS,
 ++PLATFORM_WINDOWS, PLATFORM_CYGWIN
 + };
 + 
 + void error(const char* s, ...) DECLSPEC_NORETURN;
 Index: files/patch-tools-winegcc-winegcc.c
 ===
 --- files/patch-tools-winegcc-winegcc.c   (revision 0)
 +++ files/patch-tools-winegcc-winegcc.c   (working copy)
 @@ -0,0 +1,32 @@
 +--- tools/winegcc/winegcc.c.orig
  tools/winegcc/winegcc.c
 +@@ -172,6 +172,7 @@
 + {
 + { macos,   PLATFORM_APPLE },
 + { darwin,  PLATFORM_APPLE },
 ++{ freebsd, PLATFORM_FREEBSD },
 + { solaris, PLATFORM_SOLARIS },
 + { cygwin,  PLATFORM_CYGWIN },
 + { mingw32, PLATFORM_WINDOWS },
 +@@ -232,6 +233,8 @@
 + 
 + #ifdef __APPLE__
 + static enum target_platform build_platform = PLATFORM_APPLE;
 ++#elif defined(__FreeBSD__)
 ++static enum target_platform build_platform = PLATFORM_FREEBSD;
 + #elif defined(__sun)
 + static enum target_platform build_platform = PLATFORM_SOLARIS;
 + #elif defined(__CYGWIN__)
 +@@ -1020,6 +1023,12 @@
 + if (opts-strip)
 + strarray_add(link_args, -Wl,-x);
 + break;
 ++case PLATFORM_FREEBSD:
 ++if (opts-image_base)
 ++{
 ++strarray_add(link_args, strmake(-Wl,-Ttext-segment=%s, 
 opts-image_base));
 ++}
 ++break;
 + case PLATFORM_SOLARIS:
 + {
 + char *mapfile = get_temp_file( output_name, .map );





pgpAviZFPInSj.pgp
Description: PGP signature


Re: Stupid question about integer sizes

2013-02-19 Thread Konstantin Belousov
On Tue, Feb 19, 2013 at 06:52:34AM -0800, m...@freebsd.org wrote:
 On Tue, Feb 19, 2013 at 5:11 AM, Borja Marcos bor...@sarenet.es wrote:
 
  Hello,
 
  I'm really sorry if this is a stupid question, but as far as I know, 
  u_int64_t defined in /usr/include/sys/types.h should *always* be
  a 64 bit unsigned integer, right?
 
  Seems there's a bug (or I need more and stronger coffee). Compiling a 
  program on a 64 bit system with -m32 gets the 64 bit integer types wrong.
 
  % cat tachin.c
  #include sys/types.h
  #include stdio.h
 
 
  main()
  {
  printf(sizeof uint64_t = %d\n, sizeof(uint64_t));
  printf(sizeof u_int64_t = %d\n, sizeof(u_int64_t));
  }
 
 
 
  uname -a
  FreeBSD splunk 9.1-RELEASE FreeBSD 9.1-RELEASE #14: Wed Jan 23 17:24:05 CET 
  2013 root@splunk:/usr/obj/usr/src/sys/SPLUNK  amd64
 
  % gcc -o tachin tachin.c
  % ./tachin
  sizeof uint64_t = 8
  sizeof u_int64_t = 8
 
  % ./tachin
  sizeof uint64_t = 4   === WRONG!!
  sizeof u_int64_t = 4=== WRONG!!
 
  The same happens with clang.
 
  % clang -m32 -o tachin tachin.c
  tachin.c:5:1: warning: type specifier missing, defaults to 'int' 
  [-Wimplicit-int]
  main()
  ^~~~
  1 warning generated.
  % ./tachin
  sizeof uint64_t = 4 === WRONG!!
  sizeof u_int64_t = 4=== WRONG!!
 
 
  if I do the same on a i386 system (8.2-RELEASE, but it should be 
  irrelevant) the u_int64 types have the correct size.
 
  %gcc -o tachin tachin.c
  %./tachin
  sizeof uint64_t = 8
  sizeof u_int64_t = 8
 
 
 
 
 
  Am I missing anything? Seems like a too stupid problem to be a real bug. 
  Sorry if I am wrong.
 
 
 Last I knew -m32 still wasn't quite supported on 9.1.  This is fixed
 in CURRENT.  The problem in in the machine-dependent headers.
 sys/types.h includes sys/_types.h which includes machine/_types.h.
 But the machine alias can only point to one place; it's the amd64
 headers since you're running the amd64 kernel.
 sys/amd64/include/_types.h defines __uint64_t as unsigned long, which
 is correct in 64-bit mode but wrong in 32-bit mode.
 
 On CURRENT there is a merge x86/_types.h which uses #ifdef guards to
 define __uint64_t as unsigned long or unsigned long long, depending on
 __LP64__, so that the size is correct on a 32-bit compiler invocation.

Yes, but there are still some unconverted headers, mostly related to
the machine context and signal handlers.

Quick look:

machine/sigframe.h (probably not much useful for usermode)
machine/ucontext.h
machine/frame.h (again, usermode most likely does not need it)
machine/signal.h
machine/elf.h

libm machdep headers.


pgpSY1dpkwpFP.pgp
Description: PGP signature


Re: SIGSEGV/SIGBUS when accessing after end of mmapped file; why it differs with GCC?

2013-02-13 Thread Konstantin Belousov
On Wed, Feb 13, 2013 at 12:13:58PM -0500, Ryan Stone wrote:
 On Wed, Feb 13, 2013 at 11:18 AM, nat...@centrum.cz wrote:
 
  Hello,
  I am porting an application which maps files into the memory and works
  directly with the memory. When doing this, it can happen that when someone
  resizes the file so that part of the previously mapped region is no longer
  backed by the file, synchronous signal is sent to the process which needs
  to be handled.
 
  On all other platforms than FreeBSD I have tested (Solaris, Linux, Darwin,
  HP-UX) the signal in question is SIGBUS. However on FreeBSD, depending on
  the compiler used, it is either SIGBUS or SIGSEGV. When I compile the
  binary with gcc version 4.2.1 20070831 patched [FreeBSD], amd64, the
  signal is SIGBUS, when i use gcc version 4.7.3 20121103 (prerelease)
  (FreeBSD Ports Collection), the signal is SIGSEGV. Please note that one of
  the versions of gcc between 4.2 and 4.7 also caused SIGSEFV to be sent but
  this did not matter for me as I did not need to use that version; I however
  need gcc 4.7 to work because of c++11 stuff that my project has recently
  started to use.
 
  Unfortunately registering signal handler on SIGSEGV is very inconvenient
  for me; I would prefer to somehow switch the behavior to be sane.
 
  Please anyone has an idea whether or how this could be achieved? I have
  tried to find out why, on single machine, just because of different gcc
  version, kernel sends different signal but I have never worked with fbsd
  kernel before and so my search did not succeed so far.
 
  Machine in question runs amd64 FreeBSD 9.1-RC2, but this has also happened
  to me with older version of FreeBSD.
 
 
  gcc 4.2 (same happens also with libstdc++ and co. from gcc 4.2):
  Program received signal SIGBUS, Bus error.
  (gdb) i shared
  FromTo  Syms Read   Shared Object Library
  0x000800f2ef70  0x000800f3ee68  Yes (*) /libexec/ld-elf.so.1
  0x00080114d710  0x000801159748  Yes (*) /lib/libthr.so.3
  0x0008013c2d30  0x00080142c656  Yes
  /usr/local/lib/gcc47/libstdc++.so.6
  0x0008016737c0  0x0008016891b8  Yes (*) /lib/libm.so.5
  0x000801893930  0x0008018a3088  Yes
  /usr/local/lib/gcc47/libgcc_s.so.1
  0x000801ad71d0  0x000801ba9358  Yes (*) /lib/libc.so.7
 
  gcc 4.7:
  Program received signal SIGSEGV, Segmentation fault.
  (gdb) i shared
  FromTo  Syms Read   Shared Object Library
  0x000800f5ef70  0x000800f6ee68  Yes (*) /libexec/ld-elf.so.1
  0x00080117d710  0x000801189748  Yes (*) /lib/libthr.so.3
  0x0008013f2d30  0x00080145c656  Yes
  /usr/local/lib/gcc47/libstdc++.so.6
  0x0008016a37c0  0x0008016b91b8  Yes (*) /lib/libm.so.5
  0x0008018c3930  0x0008018d3088  Yes
  /usr/local/lib/gcc47/libgcc_s.so.1
  0x000801b071d0  0x000801bd9358  Yes (*) /lib/libc.so.7
 
 
  I would be glad for any hint or information.
  Kind Regards,
  Ondrej Kolacek
  ___
  freebsd-hackers@freebsd.org mailing list
  http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
  To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
 
 
 
 I think that setting sysctl machdep.prot_fault_translation=1 would do what
 you want.

This might be an indication of the issue with the toolchains you use,
in particular, with the linker or csu.  The default selection for the signal
is based on the note osver, linked into the binary from crt1.o. Either
linker omits the note, or wrong crt1 is used.

You did not specified anything about version of the FreeBSD used, nor
the exact compiler invocations. Using the crystal ball, I see the
r244600 for HEAD and r244904 for stable/9, if you use --gc-sections
flags. This is more or less consistent with what you reported, since
gcc from ports uses binutils from ports, which have newer ld with
bugfix already applied.


pgpKqdnUDY0de.pgp
Description: PGP signature


Re: Request for review, time_pps_fetch() enhancement

2013-02-13 Thread Konstantin Belousov
On Wed, Feb 13, 2013 at 08:16:32AM -0700, Ian Lepore wrote:
 On Tue, 2013-02-12 at 22:34 +0200, Konstantin Belousov wrote:
  On Tue, Feb 12, 2013 at 09:03:39AM -0700, Ian Lepore wrote:
   On Sun, 2013-02-10 at 12:37 +0200, Konstantin Belousov wrote:
On Sat, Feb 09, 2013 at 02:47:06PM +0100, Jilles Tjoelker wrote:
 On Wed, Feb 06, 2013 at 05:58:30PM +0200, Konstantin Belousov wrote:
  On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote:
   I'd like feedback on the attached patch, which adds support to our
   time_pps_fetch() implementation for the blocking behaviors 
   described in
   section 3.4.3 of RFC 2783.  The existing implementation can only 
   return
   the most recently captured data without blocking.  These changes 
   add the
   ability to block (forever or with timeout) until a new event 
   occurs.
 
   Index: sys/kern/kern_tc.c
   ===
   --- sys/kern/kern_tc.c(revision 246337)
   +++ sys/kern/kern_tc.c(working copy)
   @@ -1446,6 +1446,50 @@
 * RFC 2783 PPS-API implementation.
 */

   +static int
   +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps)
   +{
   [snip]
   + aseq = pps-ppsinfo.assert_sequence;
   + cseq = pps-ppsinfo.clear_sequence;
   + while (aseq == pps-ppsinfo.assert_sequence 
   + cseq == pps-ppsinfo.clear_sequence) {
  Note that compilers are allowed to optimize these accesses even over
  the sequential point, which is the tsleep() call. Only accesses to
  volatile objects are forbidden to be rearranged.
 
  I suggest to add volatile casts to pps in the loop condition.
 
 The memory pointed to by pps is global (other code may have a pointer 
 to
 it); therefore, the compiler must assume that the tsleep() call (which
 invokes code in a different compilation unit) may modify it.
 
 Because volatile does not make concurrent access by multiple threads
 defined either, adding it here only seems to slow down the code
 (potentially).
The volatile guarantees that the compiler indeed reloads the value on
read access. Conceptually, the tsleep() does not modify or even access
the checked fields, and compiler is allowed to note this by whatever
methods (LTO ?).

More, the standard says that an implementation is allowed to not 
evaluate
part of the expression if no side effects are produced, even by calling
a function.

I agree that for practical means, the _currently_ used compilers should
consider the tsleep() call as the sequential point. But then the 
volatile
qualifier cast applied for the given access would not change the code as
well.

   
   Doesn't this then imply that essentially every driver has this problem,
   and for that matter, every sequence of code anywhere in the base
   involving loop while repeatedly sleeping, then waking and checking the
   state of some data for changes?  I sure haven't seen that many volatile
   qualifiers scattered around the code.
  
  No, it does not imply that every driver has this problem.
  A typical driver provides the mutual exclusion for access of
  the shared data, which means using locks. Locks include neccessary
  barries to ensure the visibility of the changes, in particular the
  compiler barriers.
 
 O.   I had never considered that using mutexes had other side
 effects.  So is there a correct MI way to invoke the right barrier magic
 in a situation like this?

My belief is that you do not need a barrier there.  The only (slightly)
problematic issue there is a purely theoretical possibility that a
very smart compiler would omit the reload step.  The volatile qualifier
for the dererefence in the loop condition should close this, as I described
in the very first reply.


pgpmciXFSJ0R5.pgp
Description: PGP signature


Re: Request for review, time_pps_fetch() enhancement

2013-02-12 Thread Konstantin Belousov
On Tue, Feb 12, 2013 at 09:03:39AM -0700, Ian Lepore wrote:
 On Sun, 2013-02-10 at 12:37 +0200, Konstantin Belousov wrote:
  On Sat, Feb 09, 2013 at 02:47:06PM +0100, Jilles Tjoelker wrote:
   On Wed, Feb 06, 2013 at 05:58:30PM +0200, Konstantin Belousov wrote:
On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote:
 I'd like feedback on the attached patch, which adds support to our
 time_pps_fetch() implementation for the blocking behaviors described 
 in
 section 3.4.3 of RFC 2783.  The existing implementation can only 
 return
 the most recently captured data without blocking.  These changes add 
 the
 ability to block (forever or with timeout) until a new event occurs.
   
 Index: sys/kern/kern_tc.c
 ===
 --- sys/kern/kern_tc.c(revision 246337)
 +++ sys/kern/kern_tc.c(working copy)
 @@ -1446,6 +1446,50 @@
   * RFC 2783 PPS-API implementation.
   */
  
 +static int
 +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps)
 +{
 [snip]
 + aseq = pps-ppsinfo.assert_sequence;
 + cseq = pps-ppsinfo.clear_sequence;
 + while (aseq == pps-ppsinfo.assert_sequence 
 + cseq == pps-ppsinfo.clear_sequence) {
Note that compilers are allowed to optimize these accesses even over
the sequential point, which is the tsleep() call. Only accesses to
volatile objects are forbidden to be rearranged.
   
I suggest to add volatile casts to pps in the loop condition.
   
   The memory pointed to by pps is global (other code may have a pointer to
   it); therefore, the compiler must assume that the tsleep() call (which
   invokes code in a different compilation unit) may modify it.
   
   Because volatile does not make concurrent access by multiple threads
   defined either, adding it here only seems to slow down the code
   (potentially).
  The volatile guarantees that the compiler indeed reloads the value on
  read access. Conceptually, the tsleep() does not modify or even access
  the checked fields, and compiler is allowed to note this by whatever
  methods (LTO ?).
  
  More, the standard says that an implementation is allowed to not evaluate
  part of the expression if no side effects are produced, even by calling
  a function.
  
  I agree that for practical means, the _currently_ used compilers should
  consider the tsleep() call as the sequential point. But then the volatile
  qualifier cast applied for the given access would not change the code as
  well.
  
 
 Doesn't this then imply that essentially every driver has this problem,
 and for that matter, every sequence of code anywhere in the base
 involving loop while repeatedly sleeping, then waking and checking the
 state of some data for changes?  I sure haven't seen that many volatile
 qualifiers scattered around the code.

No, it does not imply that every driver has this problem.
A typical driver provides the mutual exclusion for access of
the shared data, which means using locks. Locks include neccessary
barries to ensure the visibility of the changes, in particular the
compiler barriers.


pgpoT0FDmflHp.pgp
Description: PGP signature


Re: Request for review, time_pps_fetch() enhancement

2013-02-10 Thread Konstantin Belousov
On Sat, Feb 09, 2013 at 02:47:06PM +0100, Jilles Tjoelker wrote:
 On Wed, Feb 06, 2013 at 05:58:30PM +0200, Konstantin Belousov wrote:
  On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote:
   I'd like feedback on the attached patch, which adds support to our
   time_pps_fetch() implementation for the blocking behaviors described in
   section 3.4.3 of RFC 2783.  The existing implementation can only return
   the most recently captured data without blocking.  These changes add the
   ability to block (forever or with timeout) until a new event occurs.
 
   Index: sys/kern/kern_tc.c
   ===
   --- sys/kern/kern_tc.c(revision 246337)
   +++ sys/kern/kern_tc.c(working copy)
   @@ -1446,6 +1446,50 @@
 * RFC 2783 PPS-API implementation.
 */

   +static int
   +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps)
   +{
   [snip]
   + aseq = pps-ppsinfo.assert_sequence;
   + cseq = pps-ppsinfo.clear_sequence;
   + while (aseq == pps-ppsinfo.assert_sequence 
   + cseq == pps-ppsinfo.clear_sequence) {
  Note that compilers are allowed to optimize these accesses even over
  the sequential point, which is the tsleep() call. Only accesses to
  volatile objects are forbidden to be rearranged.
 
  I suggest to add volatile casts to pps in the loop condition.
 
 The memory pointed to by pps is global (other code may have a pointer to
 it); therefore, the compiler must assume that the tsleep() call (which
 invokes code in a different compilation unit) may modify it.
 
 Because volatile does not make concurrent access by multiple threads
 defined either, adding it here only seems to slow down the code
 (potentially).
The volatile guarantees that the compiler indeed reloads the value on
read access. Conceptually, the tsleep() does not modify or even access
the checked fields, and compiler is allowed to note this by whatever
methods (LTO ?).

More, the standard says that an implementation is allowed to not evaluate
part of the expression if no side effects are produced, even by calling
a function.

I agree that for practical means, the _currently_ used compilers should
consider the tsleep() call as the sequential point. But then the volatile
qualifier cast applied for the given access would not change the code as
well.

 
   + err = tsleep(pps, PCATCH, ppsfch, timo);
   + if (err == EWOULDBLOCK  fapi-timeout.tv_sec == -1) {
   + continue;
   + } else if (err != 0) {
   + return (err);
   + }
   + }
   + }
 -- 
 Jilles Tjoelker


pgpcFkMsoB_QP.pgp
Description: PGP signature


Re: Request for review, time_pps_fetch() enhancement

2013-02-10 Thread Konstantin Belousov
On Fri, Feb 08, 2013 at 04:13:40PM -0700, Ian Lepore wrote:
 On Wed, 2013-02-06 at 17:58 +0200, Konstantin Belousov wrote:
  On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote:
   I'd like feedback on the attached patch, which adds support to our
   time_pps_fetch() implementation for the blocking behaviors described in
   section 3.4.3 of RFC 2783.  The existing implementation can only return
   the most recently captured data without blocking.  These changes add the
   ability to block (forever or with timeout) until a new event occurs.
   
   -- Ian
   
  
   Index: sys/kern/kern_tc.c
   ===
   --- sys/kern/kern_tc.c(revision 246337)
   +++ sys/kern/kern_tc.c(working copy)
   @@ -1446,6 +1446,50 @@
 * RFC 2783 PPS-API implementation.
 */

   +static int
   +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps)
   +{
   + int err, timo;
   + pps_seq_t aseq, cseq;
   + struct timeval tv;
   +
   + if (fapi-tsformat  fapi-tsformat != PPS_TSFMT_TSPEC)
   + return (EINVAL);
   +
   + /*
   +  * If no timeout is requested, immediately return whatever values were
   +  * most recently captured.  If timeout seconds is -1, that's a request
   +  * to block without a timeout.  WITNESS won't let us sleep forever
   +  * without a lock (we really don't need a lock), so just repeatedly
   +  * sleep a long time.
   +  */
  Regarding no need for the lock, it would just move the implementation into
  the low quality one, for the case when one timestamp capture is lost
  and caller of time_pps_fetch() sleeps until next pps event is generated.
  
  I understand the desire to avoid lock, esp. in the pps_event() called
  from the arbitrary driver context. But the race is also real.
  
 
 What race?  A user of the pps interface understands that there is one
 event per second, and understands that if you ask to block until the
 next event at approximately the time that event is expected to occur,
 then it is ambiguous whether the call completes almost-immediately or in
 about 1 second.
 
 Looking at it another way, if a blocking call is made right around the
 time of the PPS, the thread could get preempted before getting to
 pps_fetch() function and not get control again until after the PPS has
 occurred.  In that case it's going to block for about a full second,
 even though the call was made before top-of-second.  That situation is
 exactly the same with or without locking, so what extra functionality is
 gained with locking?  What guarantee does locking let us make to the
 caller that the lockless code doesn't?

No guarantees, but I noted in the original reply that this is about the
quality of the implementation and not about correctness.

As I said there as well, I am not sure that any locking can be useful
for the situation at all.


pgpNVBaNfjQbj.pgp
Description: PGP signature


Re: fcntl(2) F_READAHEAD set to zero doesn't work [patch]

2013-02-10 Thread Konstantin Belousov
On Fri, Feb 08, 2013 at 04:58:09PM -0700, Ian Lepore wrote:
 I discovered today that fcntl(fd, F_READAHEAD, 0) doesn't work as
 advertised.  It's supposed to disable readahead, but instead it restores
 the default readahead behavior (if it had previously been changed), and
 there is no way to disable readahead.[1]  I think the attached patch
 fixes it, but it's not immediately clear from the patch why; here's the
 deal...
 
 The amount of readahead is calculated by sequential_heuristic() in
 vfs_vnops.c.  If the FRDAHEAD flag is set on the file it uses the value
 stored in the file's f_seqcount, otherwise it calculates a value (and
 updates f_seqcount, which doesn't ever happen when FRDAHEAD is set).
 
 So the patch causes the FRDAHEAD flag to be set even in the case of the
 readahead amount being zero.  Because it seems like a useful concept, it
 still allows the readahead to be restored to default behavior, now by
 passing a negative value.
 
 Does this look right to those of you who understand this part of the
 system better than I do?
It looks correct.

I do wonder if it better to keep the actual behaviour as is and set
the readahead amount to default on the arg == 0, disabling readahead
for arg  0. It is too exteme, most likely, please proceed.

 
 -- Ian
 
 [1] No way using F_READAHEAD; I know about POSIX_FADV_RANDOM.

 Index: sys/kern/kern_descrip.c
 ===
 --- sys/kern/kern_descrip.c   (revision 246337)
 +++ sys/kern/kern_descrip.c   (working copy)
 @@ -776,7 +776,7 @@
   }
   fhold(fp);
   FILEDESC_SUNLOCK(fdp);
 - if (arg != 0) {
 + if (arg = 0) {
   vp = fp-f_vnode;
   error = vn_lock(vp, LK_SHARED);
   if (error != 0) {
 Index: lib/libc/sys/fcntl.2
 ===
 --- lib/libc/sys/fcntl.2  (revision 246337)
 +++ lib/libc/sys/fcntl.2  (working copy)
 @@ -28,7 +28,7 @@
  .\ @(#)fcntl.2  8.2 (Berkeley) 1/12/94
  .\ $FreeBSD$
  .\
 -.Dd July 27, 2012
 +.Dd February 8, 2013
  .Dt FCNTL 2
  .Os
  .Sh NAME
 @@ -171,7 +171,7 @@
  which is rounded up to the nearest block size.
  A zero value in
  .Fa arg
 -turns off read ahead.
 +turns off read ahead, a negative value restores the system default.
  .It Dv F_RDAHEAD
  Equivalent to Darwin counterpart which sets read ahead amount of 128KB
  when the third argument,

 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org



pgp_z8JoFEici.pgp
Description: PGP signature


Re: Request for review, time_pps_fetch() enhancement

2013-02-06 Thread Konstantin Belousov
On Tue, Feb 05, 2013 at 09:41:38PM -0700, Ian Lepore wrote:
 I'd like feedback on the attached patch, which adds support to our
 time_pps_fetch() implementation for the blocking behaviors described in
 section 3.4.3 of RFC 2783.  The existing implementation can only return
 the most recently captured data without blocking.  These changes add the
 ability to block (forever or with timeout) until a new event occurs.
 
 -- Ian
 

 Index: sys/kern/kern_tc.c
 ===
 --- sys/kern/kern_tc.c(revision 246337)
 +++ sys/kern/kern_tc.c(working copy)
 @@ -1446,6 +1446,50 @@
   * RFC 2783 PPS-API implementation.
   */
  
 +static int
 +pps_fetch(struct pps_fetch_args *fapi, struct pps_state *pps)
 +{
 + int err, timo;
 + pps_seq_t aseq, cseq;
 + struct timeval tv;
 +
 + if (fapi-tsformat  fapi-tsformat != PPS_TSFMT_TSPEC)
 + return (EINVAL);
 +
 + /*
 +  * If no timeout is requested, immediately return whatever values were
 +  * most recently captured.  If timeout seconds is -1, that's a request
 +  * to block without a timeout.  WITNESS won't let us sleep forever
 +  * without a lock (we really don't need a lock), so just repeatedly
 +  * sleep a long time.
 +  */
Regarding no need for the lock, it would just move the implementation into
the low quality one, for the case when one timestamp capture is lost
and caller of time_pps_fetch() sleeps until next pps event is generated.

I understand the desire to avoid lock, esp. in the pps_event() called
from the arbitrary driver context. But the race is also real.

 + if (fapi-timeout.tv_sec || fapi-timeout.tv_nsec) {
 + if (fapi-timeout.tv_sec == -1)
 + timo = 0x7fff;
 + else {
 + tv.tv_sec = fapi-timeout.tv_sec;
 + tv.tv_usec = fapi-timeout.tv_nsec / 1000;
 + timo = tvtohz(tv);
 + }
 + aseq = pps-ppsinfo.assert_sequence;
 + cseq = pps-ppsinfo.clear_sequence;
 + while (aseq == pps-ppsinfo.assert_sequence 
 + cseq == pps-ppsinfo.clear_sequence) {
Note that compilers are allowed to optimize these accesses even over
the sequential point, which is the tsleep() call. Only accesses to
volatile objects are forbidden to be rearranged.

I suggest to add volatile casts to pps in the loop condition.

 + err = tsleep(pps, PCATCH, ppsfch, timo);
 + if (err == EWOULDBLOCK  fapi-timeout.tv_sec == -1) {
 + continue;
 + } else if (err != 0) {
 + return (err);
 + }
 + }
 + }
 +
 + pps-ppsinfo.current_mode = pps-ppsparam.mode;
 + fapi-pps_info_buf = pps-ppsinfo;
 +
 + return (0);
 +}
 +
  int
  pps_ioctl(u_long cmd, caddr_t data, struct pps_state *pps)
  {
 @@ -1485,13 +1529,7 @@
   return (0);
   case PPS_IOC_FETCH:
   fapi = (struct pps_fetch_args *)data;
 - if (fapi-tsformat  fapi-tsformat != PPS_TSFMT_TSPEC)
 - return (EINVAL);
 - if (fapi-timeout.tv_sec || fapi-timeout.tv_nsec)
 - return (EOPNOTSUPP);
 - pps-ppsinfo.current_mode = pps-ppsparam.mode;
 - fapi-pps_info_buf = pps-ppsinfo;
 - return (0);
 + return (pps_fetch(fapi, pps));
  #ifdef FFCLOCK
   case PPS_IOC_FETCH_FFCOUNTER:
   fapi_ffc = (struct pps_fetch_ffc_args *)data;
 @@ -1540,7 +1578,7 @@
  void
  pps_init(struct pps_state *pps)
  {
 - pps-ppscap |= PPS_TSFMT_TSPEC;
 + pps-ppscap |= PPS_TSFMT_TSPEC | PPS_CANWAIT;
   if (pps-ppscap  PPS_CAPTUREASSERT)
   pps-ppscap |= PPS_OFFSETASSERT;
   if (pps-ppscap  PPS_CAPTURECLEAR)
 @@ -1680,6 +1718,9 @@
   hardpps(tsp, ts.tv_nsec + 10 * ts.tv_sec);
   }
  #endif
 +
 + /* Wakeup anyone sleeping in pps_fetch().  */
 + wakeup(pps);
  }
  
  /*

 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org



pgpO2zZzzQm3w.pgp
Description: PGP signature


Re: dynamically calculating NKPT [was: Re: huge ktr buffer]

2013-02-05 Thread Konstantin Belousov
On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote:
 Hi,
 
 I have a patch to dynamically calculate NKPT for amd64 kernels. This
 should fix the various issues that people pointed out in the email
 thread.
 
 Please review and let me know if there are any objections to committing this.
 
 Also, thanks to Alan (alc@) for reviewing and providing feedback on
 the initial version of the patch.
 
 Patch (also available at 
 http://people.freebsd.org/~neel/patches/nkpt_diff.txt):
 
 Index: sys/amd64/include/pmap.h
 ===
 --- sys/amd64/include/pmap.h  (revision 246277)
 +++ sys/amd64/include/pmap.h  (working copy)
 @@ -113,13 +113,7 @@
   ((unsigned long)(l2)  PDRSHIFT) | \
   ((unsigned long)(l1)  PAGE_SHIFT))
 
 -/* Initial number of kernel page tables. */
 -#ifndef NKPT
 -#define  NKPT32
 -#endif
 -
  #define NKPML4E  1   /* number of kernel PML4 slots 
 */
 -#define NKPDPE   howmany(NKPT, NPDEPG)/* number of kernel PDP 
 slots */
 
  #define  NUPML4E (NPML4EPG/2)/* number of userland PML4 
 pages */
  #define  NUPDPE  (NUPML4E*NPDPEPG)/* number of userland PDP 
 pages */
 @@ -181,6 +175,7 @@
  #define  PML4map ((pd_entry_t *)(addr_PML4map))
  #define  PML4pml4e   ((pd_entry_t *)(addr_PML4pml4e))
 
 +extern int nkpt; /* Initial number of kernel page tables */
  extern u_int64_t KPDPphys;   /* physical address of kernel level 3 */
  extern u_int64_t KPML4phys;  /* physical address of kernel level 4 */
 
 Index: sys/amd64/amd64/minidump_machdep.c
 ===
 --- sys/amd64/amd64/minidump_machdep.c(revision 246277)
 +++ sys/amd64/amd64/minidump_machdep.c(working copy)
 @@ -232,7 +232,7 @@
   /* Walk page table pages, set bits in vm_page_dump */
   pmapsize = 0;
   pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
 - for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + NKPT * NBPDR,
 + for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + nkpt * NBPDR,
   kernel_vm_end); ) {
   /*
* We always write a page, even if it is zero. Each
 @@ -364,7 +364,7 @@
   /* Dump kernel page directory pages */
   bzero(fakepd, sizeof(fakepd));
   pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
 - for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + NKPT * NBPDR,
 + for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + nkpt * NBPDR,
   kernel_vm_end); va += NBPDP) {
   i = (va  PDPSHIFT)  ((1ul  NPDPEPGSHIFT) - 1);
 
 Index: sys/amd64/amd64/pmap.c
 ===
 --- sys/amd64/amd64/pmap.c(revision 246277)
 +++ sys/amd64/amd64/pmap.c(working copy)
 @@ -202,6 +202,10 @@
  vm_offset_t virtual_avail;   /* VA of first avail page (after kernel bss) */
  vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */
 
 +int nkpt;
 +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, nkpt, 0,
 +Number of kernel page table pages allocated on bootup);
 +
  static int ndmpdp;
  static vm_paddr_t dmaplimit;
  vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
 @@ -495,17 +499,42 @@
 
  CTASSERT(powerof2(NDMPML4E));
 
 +/* number of kernel PDP slots */
 +#define  NKPDPE(ptpgs)   howmany((ptpgs), NPDEPG)
 +
  static void
 +nkpt_init(vm_paddr_t addr)
 +{
 + int pt_pages;
 + 
 +#ifdef NKPT
 + pt_pages = NKPT;
 +#else
 + pt_pages = howmany(addr, 1  PDRSHIFT);
 + pt_pages += NKPDPE(pt_pages);
 +
 + /*
 +  * Add some slop beyond the bare minimum required for bootstrapping
 +  * the kernel.
 +  *
 +  * This is quite important when allocating KVA for kernel modules.
 +  * The modules are required to be linked in the negative 2GB of
 +  * the address space.  If we run out of KVA in this region then
 +  * pmap_growkernel() will need to allocate page table pages to map
 +  * the entire 512GB of KVA space which is an unnecessary tax on
 +  * physical memory.
 +  */
 + pt_pages += 4;  /* 8MB additional slop for kernel modules */
8MB might be to low. I just checked one of my machines with fully
modularized kernel, it takes slightly more than 6 MB to load 50 modules.
I think that 16MB would be safer, but it probably needs to be scaled
down based on the available phys memory. amd64 kernel could be booted
on 128MB machine still.


pgpp2cL7wub9D.pgp
Description: PGP signature


Re: dynamically calculating NKPT [was: Re: huge ktr buffer]

2013-02-05 Thread Konstantin Belousov
On Tue, Feb 05, 2013 at 07:45:24AM -0800, m...@freebsd.org wrote:
 On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com 
 wrote:
  On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote:
  Hi,
 
  I have a patch to dynamically calculate NKPT for amd64 kernels. This
  should fix the various issues that people pointed out in the email
  thread.
 
  Please review and let me know if there are any objections to committing 
  this.
 
  Also, thanks to Alan (alc@) for reviewing and providing feedback on
  the initial version of the patch.
 
  Patch (also available at 
  http://people.freebsd.org/~neel/patches/nkpt_diff.txt):
 
  Index: sys/amd64/include/pmap.h
  ===
  --- sys/amd64/include/pmap.h  (revision 246277)
  +++ sys/amd64/include/pmap.h  (working copy)
  @@ -113,13 +113,7 @@
((unsigned long)(l2)  PDRSHIFT) | \
((unsigned long)(l1)  PAGE_SHIFT))
 
  -/* Initial number of kernel page tables. */
  -#ifndef NKPT
  -#define  NKPT32
  -#endif
  -
   #define NKPML4E  1   /* number of kernel PML4 
  slots */
  -#define NKPDPE   howmany(NKPT, NPDEPG)/* number of kernel PDP 
  slots */
 
   #define  NUPML4E (NPML4EPG/2)/* number of userland PML4 
  pages */
   #define  NUPDPE  (NUPML4E*NPDPEPG)/* number of userland PDP 
  pages */
  @@ -181,6 +175,7 @@
   #define  PML4map ((pd_entry_t *)(addr_PML4map))
   #define  PML4pml4e   ((pd_entry_t *)(addr_PML4pml4e))
 
  +extern int nkpt; /* Initial number of kernel page tables */
   extern u_int64_t KPDPphys;   /* physical address of kernel level 3 */
   extern u_int64_t KPML4phys;  /* physical address of kernel level 4 */
 
  Index: sys/amd64/amd64/minidump_machdep.c
  ===
  --- sys/amd64/amd64/minidump_machdep.c(revision 246277)
  +++ sys/amd64/amd64/minidump_machdep.c(working copy)
  @@ -232,7 +232,7 @@
/* Walk page table pages, set bits in vm_page_dump */
pmapsize = 0;
pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
  - for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + NKPT * NBPDR,
  + for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + nkpt * NBPDR,
kernel_vm_end); ) {
/*
 * We always write a page, even if it is zero. Each
  @@ -364,7 +364,7 @@
/* Dump kernel page directory pages */
bzero(fakepd, sizeof(fakepd));
pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
  - for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + NKPT * NBPDR,
  + for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + nkpt * NBPDR,
kernel_vm_end); va += NBPDP) {
i = (va  PDPSHIFT)  ((1ul  NPDPEPGSHIFT) - 1);
 
  Index: sys/amd64/amd64/pmap.c
  ===
  --- sys/amd64/amd64/pmap.c(revision 246277)
  +++ sys/amd64/amd64/pmap.c(working copy)
  @@ -202,6 +202,10 @@
   vm_offset_t virtual_avail;   /* VA of first avail page (after kernel bss) 
  */
   vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) 
  */
 
  +int nkpt;
  +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, nkpt, 0,
  +Number of kernel page table pages allocated on bootup);
  +
   static int ndmpdp;
   static vm_paddr_t dmaplimit;
   vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
  @@ -495,17 +499,42 @@
 
   CTASSERT(powerof2(NDMPML4E));
 
  +/* number of kernel PDP slots */
  +#define  NKPDPE(ptpgs)   howmany((ptpgs), NPDEPG)
  +
   static void
  +nkpt_init(vm_paddr_t addr)
  +{
  + int pt_pages;
  +
  +#ifdef NKPT
  + pt_pages = NKPT;
  +#else
  + pt_pages = howmany(addr, 1  PDRSHIFT);
  + pt_pages += NKPDPE(pt_pages);
  +
  + /*
  +  * Add some slop beyond the bare minimum required for bootstrapping
  +  * the kernel.
  +  *
  +  * This is quite important when allocating KVA for kernel modules.
  +  * The modules are required to be linked in the negative 2GB of
  +  * the address space.  If we run out of KVA in this region then
  +  * pmap_growkernel() will need to allocate page table pages to map
  +  * the entire 512GB of KVA space which is an unnecessary tax on
  +  * physical memory.
  +  */
  + pt_pages += 4;  /* 8MB additional slop for kernel modules */
  8MB might be to low. I just checked one of my machines with fully
  modularized kernel, it takes slightly more than 6 MB to load 50 modules.
  I think that 16MB would be safer, but it probably needs to be scaled
  down based on the available phys memory. amd64 kernel could be booted
  on 128MB machine still.
 
 Is there no way to not map the entire 512GB?  Otherwise this patch
 could really hose some vendors.  E.g. the kernel module for the OneFS
 file system is around 8MB all by itself

Re: dynamically calculating NKPT [was: Re: huge ktr buffer]

2013-02-05 Thread Konstantin Belousov
On Tue, Feb 05, 2013 at 09:12:41AM -0800, Neel Natu wrote:
 Hi Konstantin,
 
 Sounds fine. I can bump it up to 8 pages.
 
 Also, wrt your comment about scaling this number based on available
 memory, I wonder if it makes sense to optimize for 16KB of additional
 space.
If it boots on 128MB as is, I am completely fine with the existing patch
(after slack increase) as well.

 
 I would much rather work with you and Alan to fix pmap_growkernel() so
 we don't need to care about this slack in the first place :-)

Sure, if you consider this important enough.  I do agree with mdf that
having this happen without any user intervention is good. But having
just a tunable for the slack might be much less labor-intensive and
good enough.


pgp7buUfBNk5B.pgp
Description: PGP signature


Re: scheduler-swapper, SI_SUB_RUN_SCHEDULER-SI_SUB_LAST

2013-02-02 Thread Konstantin Belousov
On Sat, Feb 02, 2013 at 01:50:40PM +0200, Andriy Gapon wrote:
 
 I would like to propose the following mostly cosmetic change:
 http://people.freebsd.org/~avg/scheduler-swapper.diff
 
 This is something that bit me early in my FreeBSD days, so I am kind of 
 obsessed
 with it.
 The current naming is confusing/misleading indeed.
 And magic properties of SI_SUB_RUN_SCHEDULER:SI_ORDER_LAST is a hidden gem.

You may remove the Giant unlock from the scheduler()/swapper() as well
then, doing it before the swapper() call in the mi_startup().

Note that the wait chain for the idle swapper is still called sched.



pgp0gyHIR2xWU.pgp
Description: PGP signature


Re: Sockets programming question

2013-01-28 Thread Konstantin Belousov
On Mon, Jan 28, 2013 at 08:11:47AM -0700, Ian Lepore wrote:
 I've got a question that isn't exactly freebsd-specific, but
 implemenation-specific behavior may be involved.
 
 I've got a server process that accepts connections from clients on a
 PF_LOCAL stream socket.  Multiple clients can be connected at once; a
 list of them is tracked internally.  The server occasionally sends data
 to each client.  The time between messages to clients can range
 literally from milliseconds to months.  Clients never send any data to
 the server, indeed they may shutdown that side of the connection and
 just receive data.
 
 The only way I can find to discover that a client has disappeared is by
 trying to send them a message and getting an error because they've
 closed the socket or died completely.  At that point I can reap the
 resources and remove them from the client list.  This is problem because
 of the months between messages thing.  A lot of clients can come and
 go during those months and I've got this ever-growing list of open
 socket descriptors because I never had anything to say the whole time
 they were connected.
 
 By trial and error I've discovered that I can sort of poll for their
 presence by writing a zero-length message.  If the other end of the
 connection is gone I get the expected error and can reap the client,
 otherwise it appears to quietly write nothing and return zero and have
 no other side effects than polling the status of the server-client side
 of the pipe.
 
 My problem with this polling is that I can't find anything in writing
 that sanctions this behavior.  Would this amount to relying on a
 non-portable accident of the current implementation?  
 
 Also, am I missing something simple and there's a cannonical way to
 handle this?  In all the years I've done client/server stuff I've never
 had quite this type of interaction (or lack thereof) between client and
 server before.

Check for the IN events as well. I would not trust the mere presence
of the IN in the poll result, but consequent read should return EOF
and this is good indicator of the dead client.


pgpetdMu2vn_M.pgp
Description: PGP signature


Re: Processes' FIBs

2013-01-22 Thread Konstantin Belousov
On Tue, Jan 17, 2012 at 01:21:27PM +0100, Oliver Fromme wrote:
 Kostik Belousov kostik...@gmail.com wrote:
   The patch misses compat32 bits and breaks compat32 ps/top.
 
 Right, thank you for pointing it out!  I missed it because
 I only have i386 for testing.
 
 I've created new patch sets for releng8 and current.  These
 include compat32 support and an entry for the manual page.
 
 Would someone with amd64 please test the compat32 part?
 I've been using this code on i386 for a few days without
 any problems.
 
 I've attached the patch for current below.  Both patch sets
 are also available from this URL:
 http://www.secnetix.de/olli/tmp/ki_fibnum/
 
 Testing is easy:  Apply the patch, rebuild bin/ps and kernel.
 Make sure that your kernel config has options ROUTETABLES=16
 so multiple FIBs are supported.  Reboot.  Open a shell with
 setfib, e.g. setfib 3 /bin/sh (no root required), type
 ps -ax -o user,pid,fib,command or something similar, and
 verify that the shell process and its children are listed
 with the correct FIB.  When testing on amd64, use both the
 native ps and an i386 binary.
 
 Thank you very much!
 
 Best regards
   Oliver
 
 --- sys/sys/user.h.orig   2011-11-07 22:13:19.0 +0100
 +++ sys/sys/user.h2012-01-17 11:33:59.0 +0100
 @@ -83,7 +83,7 @@
   * it in two places: function fill_kinfo_proc in sys/kern/kern_proc.c and
   * function kvm_proclist in lib/libkvm/kvm_proc.c .
   */
 -#define  KI_NSPARE_INT   9
 +#define  KI_NSPARE_INT   8
  #define  KI_NSPARE_LONG  12
  #define  KI_NSPARE_PTR   6
  
 @@ -186,6 +186,7 @@
*/
   charki_sparestrings[50];/* spare string space */
   int ki_spareints[KI_NSPARE_INT];/* spare room for growth */
 + int ki_fibnum;  /* Default FIB number */
   u_int   ki_cr_flags;/* Credential flags */
   int ki_jid; /* Process jail ID */
   int ki_numthreads;  /* XXXKSE number of threads in total */
 --- sys/kern/kern_proc.c.orig 2012-01-15 19:47:24.0 +0100
 +++ sys/kern/kern_proc.c  2012-01-17 12:52:36.0 +0100
 @@ -836,6 +836,7 @@
   kp-ki_swtime = (ticks - p-p_swtick) / hz;
   kp-ki_pid = p-p_pid;
   kp-ki_nice = p-p_nice;
 + kp-ki_fibnum = p-p_fibnum;
   kp-ki_start = p-p_stats-p_start;
   timevaladd(kp-ki_start, boottime);
   PROC_SLOCK(p);
 @@ -1121,6 +1122,7 @@
   bcopy(ki-ki_comm, ki32-ki_comm, COMMLEN + 1);
   bcopy(ki-ki_emul, ki32-ki_emul, KI_EMULNAMELEN + 1);
   bcopy(ki-ki_loginclass, ki32-ki_loginclass, LOGINCLASSLEN + 1);
 + CP(*ki, *ki32, ki_fibnum);
   CP(*ki, *ki32, ki_cr_flags);
   CP(*ki, *ki32, ki_jid);
   CP(*ki, *ki32, ki_numthreads);
 --- sys/compat/freebsd32/freebsd32.h.orig 2011-11-11 08:17:00.0 
 +0100
 +++ sys/compat/freebsd32/freebsd32.h  2012-01-17 11:34:00.0 +0100
 @@ -319,6 +319,7 @@
   charki_loginclass[LOGINCLASSLEN+1];
   charki_sparestrings[50];
   int ki_spareints[KI_NSPARE_INT];
 + int ki_fibnum;
   u_int   ki_cr_flags;
   int ki_jid;
   int ki_numthreads;
 --- bin/ps/keyword.c.orig 2011-09-29 08:31:42.0 +0200
 +++ bin/ps/keyword.c  2012-01-17 12:54:49.0 +0100
 @@ -85,6 +85,7 @@
   {etimes, ELAPSED, NULL, USER, elapseds, 0, CHAR, NULL, 0},
   {euid, , uid, 0, NULL, 0, CHAR, NULL, 0},
   {f, F, NULL, 0, kvar, KOFF(ki_flag), INT, x, 0},
 + {fib, FIB, NULL, 0, kvar, NULL, 2, KOFF(ki_fibnum), INT, d, 0},
   {flags, , f, 0, NULL, 0, CHAR, NULL, 0},
   {gid, GID, NULL, 0, kvar, KOFF(ki_groups), UINT, UIDFMT, 0},
   {group, GROUP, NULL, LJUST, egroupname, 0, CHAR, NULL, 0},
 --- bin/ps/ps.1.orig  2011-11-22 22:53:06.0 +0100
 +++ bin/ps/ps.1   2012-01-17 12:56:17.0 +0100
 @@ -29,7 +29,7 @@
  .\ @(#)ps.1 8.3 (Berkeley) 4/18/94
  .\ $FreeBSD: src/bin/ps/ps.1,v 1.112 2011/11/22 21:53:06 trociny Exp $
  .\
 -.Dd November 22, 2011
 +.Dd January 17, 2012
  .Dt PS 1
  .Os
  .Sh NAME
 @@ -506,6 +506,9 @@
  minutes:seconds.
  .It Cm etimes
  elapsed running time, in decimal integer seconds
 +.It Cm fib
 +default FIB number, see
 +.Xr setfib 1
  .It Cm flags
  the process flags, in hexadecimal (alias
  .Cm f )
Just reviving the recent thread after the ping.

The patch looks fine to me, and is still not committed.


pgpXl5s07HuIv.pgp
Description: PGP signature


Re: Getting the current thread ID without a syscall?

2013-01-15 Thread Konstantin Belousov
On Tue, Jan 15, 2013 at 03:54:03PM -0500, Trent Nelson wrote:
 Howdy,
 
 I have an unusual requirement: I need to get the current thread ID
 in as few instructions as possible.  On Windows, I managed to come
 up with this glorious hack:
 
 #ifdef WITH_INTRINSICS
 #   ifdef MS_WINDOWS
 #   include intrin.h
 #   if defined(MS_WIN64)
 #   pragma intrinsic(__readgsdword)
 #   define _Py_get_current_process_id() (__readgsdword(0x40))
 #   define _Py_get_current_thread_id()  (__readgsdword(0x48))
 #   elif defined(MS_WIN32)
 #   pragma intrinsic(__readfsdword)
 #   define _Py_get_current_process_id() (__readfsdword(0x20))
 #   define _Py_get_current_thread_id()  (__readfsdword(0x24))
 
 That exploits the fact that Windows uses the FS/GS registers to
 store thread/process metadata.  Could I use a similar approach on
 FreeBSD to get the thread ID without the need for syscalls?
The layout of the per-thread structure used by libthr is private and
is not guaranteed to be stable even on the stable branches.

Yes, you could obtain the tid this way, but note explicitely that using
it makes your application not binary compatible with any version of
the FreeBSD except the one you compiled on.

You could read the _thread_off_tid integer variable and use the value
as offset from the %fs base to the long containing the unique thread id.
But don't use this in anything except the private code.

 
 (I technically don't need the thread ID, I just need to get some
  form of unique identifier for the current thread such that I can
  compare it to a known global value that's been set to the main
  thread, in order to determine if I'm currently that thread or not.
  As long as it's unique for each thread, and static for the lifetime
  of the thread, that's fine.)
 
 The am I the main thread? comparison is made every ~50-100 opcodes,
 which is why it needs to have the lowest overhead possible.
On newer CPUs in amd64 mode, there is getfsbase instruction which reads
the %fs register base. System guarantees that %fs base is unique among
live threads.


pgpsp_lioKIEy.pgp
Description: PGP signature


Re: Getting the current thread ID without a syscall?

2013-01-15 Thread Konstantin Belousov
On Tue, Jan 15, 2013 at 04:35:14PM -0500, Trent Nelson wrote:
 On Tue, Jan 15, 2013 at 01:16:41PM -0800, Konstantin Belousov wrote:
  On Tue, Jan 15, 2013 at 03:54:03PM -0500, Trent Nelson wrote:
   Howdy,
   
   I have an unusual requirement: I need to get the current thread ID
   in as few instructions as possible.  On Windows, I managed to come
   up with this glorious hack:
   
   #ifdef WITH_INTRINSICS
   #   ifdef MS_WINDOWS
   #   include intrin.h
   #   if defined(MS_WIN64)
   #   pragma intrinsic(__readgsdword)
   #   define _Py_get_current_process_id() (__readgsdword(0x40))
   #   define _Py_get_current_thread_id()  (__readgsdword(0x48))
   #   elif defined(MS_WIN32)
   #   pragma intrinsic(__readfsdword)
   #   define _Py_get_current_process_id() (__readfsdword(0x20))
   #   define _Py_get_current_thread_id()  (__readfsdword(0x24))
   
   That exploits the fact that Windows uses the FS/GS registers to
   store thread/process metadata.  Could I use a similar approach on
   FreeBSD to get the thread ID without the need for syscalls?
  The layout of the per-thread structure used by libthr is private and
  is not guaranteed to be stable even on the stable branches.
  
  Yes, you could obtain the tid this way, but note explicitely that using
  it makes your application not binary compatible with any version of
  the FreeBSD except the one you compiled on.
 
 Luckily it's for an open source project (Python), so recompilation
 isn't a big deal.  (I also check the intrinsic result versus the
 syscall result during startup to verify the same ID is returned,
 falling back to the syscall by default.)
For you, may be. For your users, it definitely will be a problem.
And worse, the problem will be blamed on the operating system and not
to the broken application.

 
  You could read the _thread_off_tid integer variable and use the value
  as offset from the %fs base to the long containing the unique thread id.
  But don't use this in anything except the private code.
 
 Ah, thanks, that's what I was interested in knowing.
 
   
   (I technically don't need the thread ID, I just need to get some
form of unique identifier for the current thread such that I can
compare it to a known global value that's been set to the main
thread, in order to determine if I'm currently that thread or not.
As long as it's unique for each thread, and static for the lifetime
of the thread, that's fine.)
   
   The am I the main thread? comparison is made every ~50-100 opcodes,
   which is why it needs to have the lowest overhead possible.
 
  On newer CPUs in amd64 mode, there is getfsbase instruction which reads
  the %fs register base. System guarantees that %fs base is unique among
  live threads.
 
 Interesting.  I was aware of those instructions, but never assessed
 them in detail once I'd figured out the readgsdword approach.  I
 definitely didn't realize they return unique values per thread
 (although it makes sense now that I think about it).
 
 Thanks Konstantin, very helpful.
 
 Trent.


pgpm7aloC53Ee.pgp
Description: PGP signature


Re: Getting the current thread ID without a syscall?

2013-01-15 Thread Konstantin Belousov
On Tue, Jan 15, 2013 at 06:03:30PM -0500, Trent Nelson wrote:
 On Tue, Jan 15, 2013 at 02:33:41PM -0800, Ian Lepore wrote:
  On Tue, 2013-01-15 at 14:29 -0800, Alfred Perlstein wrote:
   On 1/15/13 1:43 PM, Konstantin Belousov wrote:
On Tue, Jan 15, 2013 at 04:35:14PM -0500, Trent Nelson wrote:
   
 Luckily it's for an open source project (Python), so recompilation
 isn't a big deal.  (I also check the intrinsic result versus the
 syscall result during startup to verify the same ID is returned,
 falling back to the syscall by default.)
For you, may be. For your users, it definitely will be a problem.
And worse, the problem will be blamed on the operating system and not
to the broken application.
   
   Anything we can do to avoid this would be best.
   
   The reason is that we are still dealing with an optimization that perl 
   did, it reached inside of the opaque struct FILE to do nasty things.  
   Now it is very difficult for us to fix struct FILE.
   
   We are still paying for this years later.
   
   Any way we can make this a supported interface?
   
   -Alfred
  
  Re-reading the original question, I've got to ask why pthread_self()
  isn't the right answer?  The requirement wasn't I need to know what the
  OS calls me it was I need a unique ID per thread within a process.
 
 The identity check is performed hundreds of times per second.  The
 overhead of (Py_MainThreadId == __readgsdword(0x48) ? A() : B()) is
 negligible -- I can't say the same for a system/function call.
 
 (I'm experimenting with an idea I had to parallelize Python such
  that it can exploit all cores without impeding the performance
  of normal single-threaded execution (like previous-GIL-removal
  attempts and STM).  It's very promising so far -- presuming we
  can get the current thread ID in a couple of instructions.  If
  not, single-threaded performance suffers too much.)

If the only thing you ever need is to get a unique handle for the
current thread, without the requirement that it corresponds to any
other identifier, everything becomes much easier.

On amd64, use 'movq %fs:0,%register', on i386 'movl %gs:0,%register'.
This instruction is guaranteed to return the thread-unique address
of the tcb. See an article about ELF TLS for more details.

Even better, this instruction is portable among all ELF Unixes which
support TLS.


pgpnI6yuTRWOO.pgp
Description: PGP signature


Re: malloc+utrace, tracking memory leaks in a running program.

2013-01-10 Thread Konstantin Belousov
On Thu, Jan 10, 2013 at 10:16:46AM -0500, Alfred Perlstein wrote:
 On 1/10/13 2:38 AM, Konstantin Belousov wrote:
  On Thu, Jan 10, 2013 at 01:56:48AM -0500, Alfred Perlstein wrote:
  Here are more convenient links that give diffs against FreeBSD and
  jemalloc for the proposed changes:
 
  FreeBSD:
  https://github.com/alfredperlstein/freebsd/compare/13e7228d5b83c8fcfc63a0803a374212018f6b68~1...utrace2
 
  Why  do you need to expedite the records through the ktrace at all ?
  Wouldn't direct write(2)s to a file allow for better performance
  due to not stressing kernel memory allocator and single writing thread ?
  Also, the malloc coupling to the single-system interface would be
  prevented.
 
  I believe that other usermode tracers also behave in the similar way,
  using writes and not private kernel interface.
 
  Also, what /proc issues did you mentioned ? There is
  sysctl kern.proc.vmmap which is much more convenient than /proc/pid/map
  and does not require /proc mounted.
 
  jemalloc:
  https://github.com/alfredperlstein/jemalloc/compare/master...utrace2
 
 
 Konstantin, you are right, it is a strange thing this utrace.  I am not 
 sure why it was done this way.
 
 You are correct in that much more efficient system could be made using 
 writes gathered into a single write(2).
Even without writes gathering, non-coalesced writes should be faster than
utrace.

 
 Do you think there is any reason they may have re-used the kernel paths 
 for ktrace even at the cost of efficiency?
I can only speculate. The utracing of the malloc calls in the context
of the ktrace stream is useful for the human reading the trace. Instead
of seeing the sequence of unexplanaible calls allocating and freeing
memory, you would see something more penetrable. For example, you would
see accept/malloc/read/write/free, which could be usefully interpreted
as network server serving the client.

This context is not needed for a leak detector.
 
 About kern.proc.vmmap I will look into that.
 
 -Alfred


pgpMOJFXTDekU.pgp
Description: PGP signature


Re: malloc+utrace, tracking memory leaks in a running program.

2013-01-10 Thread Konstantin Belousov
On Thu, Jan 10, 2013 at 01:29:38PM -0500, Alfred Perlstein wrote:
 On 1/10/13 1:05 PM, Konstantin Belousov wrote:
  On Thu, Jan 10, 2013 at 10:16:46AM -0500, Alfred Perlstein wrote:
  On 1/10/13 2:38 AM, Konstantin Belousov wrote:
  On Thu, Jan 10, 2013 at 01:56:48AM -0500, Alfred Perlstein wrote:
  Here are more convenient links that give diffs against FreeBSD and
  jemalloc for the proposed changes:
 
  FreeBSD:
  https://github.com/alfredperlstein/freebsd/compare/13e7228d5b83c8fcfc63a0803a374212018f6b68~1...utrace2
 
  Why  do you need to expedite the records through the ktrace at all ?
  Wouldn't direct write(2)s to a file allow for better performance
  due to not stressing kernel memory allocator and single writing thread ?
  Also, the malloc coupling to the single-system interface would be
  prevented.
 
  I believe that other usermode tracers also behave in the similar way,
  using writes and not private kernel interface.
 
  Also, what /proc issues did you mentioned ? There is
  sysctl kern.proc.vmmap which is much more convenient than /proc/pid/map
  and does not require /proc mounted.
 
  jemalloc:
  https://github.com/alfredperlstein/jemalloc/compare/master...utrace2
 
  Konstantin, you are right, it is a strange thing this utrace.  I am not
  sure why it was done this way.
 
  You are correct in that much more efficient system could be made using
  writes gathered into a single write(2).
  Even without writes gathering, non-coalesced writes should be faster than
  utrace.
 
  Do you think there is any reason they may have re-used the kernel paths
  for ktrace even at the cost of efficiency?
  I can only speculate. The utracing of the malloc calls in the context
  of the ktrace stream is useful for the human reading the trace. Instead
  of seeing the sequence of unexplanaible calls allocating and freeing
  memory, you would see something more penetrable. For example, you would
  see accept/malloc/read/write/free, which could be usefully interpreted
  as network server serving the client.
 
  This context is not needed for a leak detector.
 Now I may be wrong here, but I think it's an artifact of someone 
 noticing how useful fitting this into the ktrace system and leveraging 
 existing code.
 
 Even though there are significant performance deficiencies, the actual 
 utility of the existing framework may have been such a stepping stool 
 towards tracing that it was just used.
 
 Right now the code already exists, however it logs just {operation, 
 size, ptr}, example:
 malloc, 512, - 0xdeadbeef
 free, 0, 0xdeadbeef
 realloc, 512, 0 - 0xdeadc0de
 realloc, 1024, 0xdeadc0de - 0x
 free, 0, 0x
 
 What do you think of just adding the address of the caller of 
 malloc/free/realloc to these already existing tracepoints?

In most real-world applications I saw, malloc() was not a function called
to do the allocation. Usually, there is either an app-specific wrapper,
or the language runtime system which calls malloc(), e.g. the new operator
for the C++ code. Than, the caller address becomes constant for the whole
duration of the program run.

What would be useful is the full backtrace of each allocation. The tools
like libunwind are indeed optimized for this usage pattern.

From this POV, the libc malloc(3) might be better offering a set of the
well-defined hooks for a pluggable tracer to utilize. I am on the fence
there, you could override the malloc/free without hooks, by the ELF symbol
interposing technique, but hooks would also offer features not easily
implementable with the interposing.


pgp3bpBpdU_be.pgp
Description: PGP signature


Re: malloc+utrace, tracking memory leaks in a running program.

2013-01-09 Thread Konstantin Belousov
On Thu, Jan 10, 2013 at 01:56:48AM -0500, Alfred Perlstein wrote:
 Here are more convenient links that give diffs against FreeBSD and 
 jemalloc for the proposed changes:
 
 FreeBSD:
 https://github.com/alfredperlstein/freebsd/compare/13e7228d5b83c8fcfc63a0803a374212018f6b68~1...utrace2
 
Why  do you need to expedite the records through the ktrace at all ?
Wouldn't direct write(2)s to a file allow for better performance
due to not stressing kernel memory allocator and single writing thread ?
Also, the malloc coupling to the single-system interface would be
prevented.

I believe that other usermode tracers also behave in the similar way,
using writes and not private kernel interface.

Also, what /proc issues did you mentioned ? There is
sysctl kern.proc.vmmap which is much more convenient than /proc/pid/map
and does not require /proc mounted.

 jemalloc:
 https://github.com/alfredperlstein/jemalloc/compare/master...utrace2
 
 -Alfred
 
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


pgp3u3h6Pt05o.pgp
Description: PGP signature


Re: cpufreq not working as module on i386/amd64

2013-01-08 Thread Konstantin Belousov
On Tue, Jan 08, 2013 at 01:15:59PM +0800, Jia-Shiun Li wrote:
 Hi all,
 
 move to -hackers as it seems a better place to discuss.
 
 Turns out, at  acpi_cpu_attach(), the bus probe will query
 ACPI_GET_FEATURES() to get cpu_features flags from each cpufreq
 drivers. Since it is only executed once at booting, subsequent module
 loading after booting will not be called get_features() and will not
 be able to express interests in terms of ACPI_CAP_* flags. And if
 these flags were not set first, acpi_perf won't get attached. Later
 when loading cpufreq.ko, est_attach(), etc. will not be able to find
 acpi_perf and thus failed to attach.
 
 After referred to Intel doc. #302223 mentioned in
 dev/acpica/acpivar.h, I got confused by the meaning of these bits.
 According to the document, capability bits probably should be set as
 long as OSPM is *capable* of using them, no matter when they will be
 used. I am not familiar with ACPI, but I suppose it will expose
 states/methods to OSPM according to these bits?
 
 To confirm this, I add a line to simply OR'd (ACPI_CAP_PERF_MSRS |
 ACPI_CAP_C1_IO_HALT) to sc-cpu_features after the loop calling
 ACPI_GET_FEATURES(). And then I am able to load cpufreq.ko after boot.
 powerd also works fine with it.
 
 If this is true, then probably we don't need get_features() for acpi
 interface/drivers. It is sufficient to just turn on proper bits
 directly for acpi_cpu when related code is added to repository.
 
 Any comments?
 
 -8-8-8-
 # svn diff
 Index: sys/dev/acpica/acpi_cpu.c
 ===
 --- sys/dev/acpica/acpi_cpu.c   (revision 245148)
 +++ sys/dev/acpica/acpi_cpu.c   (working copy)
 @@ -350,6 +350,7 @@
 sc-cpu_features |= features;
 }
 free(drivers, M_TEMP);
 +   sc-cpu_features |= ACPI_CAP_PERF_MSRS | ACPI_CAP_C1_IO_HALT;
  }
I had almost word-to-word identical conversation with Andrey Gapon,
might be a year ago. I Cc:ed him.

 
  /*
 
 -8-8-8-
 
 
 Jia-Shiun
 
 On Wed, Dec 26, 2012 at 2:10 AM, Jia-Shiun Li jiash...@gmail.com wrote:
  I was cleaning up hard drive and found these old logs. Anyway I added
  some printf()
  and saw the process failed at device_find_child(..., acpi_perf, ...)
  of est_acpi_info() i.e. it cannot find acpi_perf device.
 
  devinfo did confirmed the absence of acpi_perf. Comparing the dmesgs
  revealed the main difference: IST (i-state?) OEM tables in SSDT seems
  not loaded if cpufreq was not compiled into kernel, as it shows below.
 
  Before I diving into the ACPI part, can anyone familiar with how ACPI
  works shed me some light?
 
  -8-8-8-
  % diff -bu cpufreq-nb.log cpufreq-no.log
  ...
  @@ -158,17 +158,11 @@
   acpi0: Power Button (fixed)
   cpu0: Processor \\_PR_.CPU0 (ACPI ID 1) - APIC ID 0
   cpu0: ACPI CPU on acpi0
  -ACPI: SSDT 0xbfd8dc98 00223 (v01  PmRef  Cpu0Ist 3000 INTL 20051117)
  -ACPI: Dynamic OEM Table Load:
  -ACPI: SSDT 0 00223 (v01  PmRef  Cpu0Ist 3000 INTL 20051117)
   ACPI: SSDT 0xbfd8b598 00537 (v01  PmRef  Cpu0Cst 3001 INTL 20051117)
   ACPI: Dynamic OEM Table Load:
   ACPI: SSDT 0 00537 (v01  PmRef  Cpu0Cst 3001 INTL 20051117)
   cpu1: Processor \\_PR_.CPU1 (ACPI ID 2) - APIC ID 1
   cpu1: ACPI CPU on acpi0
  -ACPI: SSDT 0xbfd8ce18 001CF (v01  PmRefApIst 3000 INTL 20051117)
  -ACPI: Dynamic OEM Table Load:
  -ACPI: SSDT 0 001CF (v01  PmRefApIst 3000 INTL 20051117)
   ACPI: SSDT 0xbfd8df18 0008D (v01  PmRefApCst 3000 INTL 20051117)
   ACPI: Dynamic OEM Table Load:
   ACPI: SSDT 0 0008D (v01  PmRefApCst 3000 INTL 20051117
 
  On Sat, Jan 29, 2011 at 4:41 PM, Alexander Best arun...@freebsd.org wrote:
  On Sat Jan 29 11, Jia-Shiun Li wrote:
  Hi all,
 
  I found that cpufreq driver failed to attach when compiled as module
  and loaded, but it works fine when compiled into kernel. I am
  wondering if this is due to some kind of limitation, or can be fixed?
 
  that's rather odd. for me neither the module nor the kernel code works, 
  since
  my cpu isn't supported by sys/x86/cpufreq/est.c. actually only pentium 
  mobile
  cpus seem to be supported.
 
  maybe you can add some printf's to est.c:est_get_info() to identify at 
  which
  point error gets set. also you might want to make
 
  est: cpu_vendor %s, msr %0jx\n, cpu_vendor, msr);
 
  non-conditional. maybe the output differy in kernel/module mode.
 
  cheers.
  alex
 
 
  Tested on a Pentium E5200 desktop (i386) and a Pentium T4200 laptop
  (amd64). Both got the same result. dmesg of T4200 attached.
 
  kldload module:
  -8-8-8-
  est0: Enhanced SpeedStep Frequency Control on cpu0
  est: CPU supports Enhanced Speedstep, but is not recognized.
  est: cpu_vendor GenuineIntel, msr 6194c1a06004c1a
  device_attach: est0 attach returned 6
  est1: Enhanced SpeedStep Frequency Control on cpu1
  est: CPU supports Enhanced Speedstep, but is not recognized.
  est: 

Re: About QUOTA support in stock kernel (resent)

2012-12-25 Thread Konstantin Belousov
On Tue, Dec 25, 2012 at 09:34:30PM +0800, Patrick Dung wrote:
 Hi,
 
 I would like to know why quota is not enabled in the stock kernel..
 
 I remembered that it is not enabled since freebsd 3.5 or freebsd 4 generation.
 Now in freebsd 9.0, it still neeed a kernel rebuild.
 
 I have heard it has performance issue (GIANT lock) about quota.

Enabling quota by default would cause small overhead, like one mutex acquire,
for each inode and block alloc/dealloc, even for mount without quotas enabled.

Might be, it is reasonable to just enable it now. Unless somebody provide
valid objections and I do not forget, I will do it in a week for HEAD.


pgp9ipGxCUQ5A.pgp
Description: PGP signature


Re: About QUOTA support in stock kernel (resent)

2012-12-25 Thread Konstantin Belousov
On Tue, Dec 25, 2012 at 10:23:26AM -0500, Eitan Adler wrote:
 On 25 December 2012 10:07, Konstantin Belousov kostik...@gmail.com wrote:
  Enabling quota by default would cause small overhead, like one mutex 
  acquire,
  for each inode and block alloc/dealloc, even for mount without quotas 
  enabled.
 
 Why is this, and can it be avoided (for mounts without quotas)?
Because system should check whether quota is enabled to do the accounting.

 
  Might be, it is reasonable to just enable it now. Unless somebody provide
  valid objections and I do not forget, I will do it in a week for HEAD.
 
 
 
 -- 
 Eitan Adler


pgpGnpuB7k1Hj.pgp
Description: PGP signature


Re: Fix overlinking in base aka import pkgconf

2012-12-16 Thread Konstantin Belousov
On Sun, Dec 16, 2012 at 02:01:00PM +0100, Jeremie Le Hen wrote:
 Hi bapt, kib,
 
 On Sat, Dec 15, 2012 at 11:56:43AM +0100, Baptiste Daroussin wrote:
  On Sat, Dec 15, 2012 at 03:22:34AM +0200, Konstantin Belousov wrote:
   On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote:
Hi,

Some of our binary are overlinked, the way we handle the linking 
doesn't help
for that.
   What do you mean there ? Do you mean that some libraries specified for the
   linking stage of the final binary are not needed for the execution ?
  
  I mean some library are registrer in the binary with DT_NEEDED tag where 
  they
  don't need to.
  
   

On proposition could be to use pkgconf 
https://github.com/pkgconf/pkgconf which
is BSD license pkg-config implementation 100% compatible with 
pkg-config.

What I propose is to create a new PCADD variable for the Makefiles.

PCADD will invoke pkgconf to gather the libraries and the cflags for a 
given
project.

The second thing would be to create .pc files for all of our libraries.

for example:
usr.bin/fstat dynamic build is overlinked
   And how this is better than just removing the unneeded library from
   the Makefile ?
  
  In that case to statically build you need -lkvm -lutil -lprocstat but if you
  build dynamically you will only need -lproctast meaning you don't need to be
  directly linked to libutil and libkvm. This means a breakage of libutil will
  only have inpact on procstat and no more on fstat for example.
 
 I think this could be solved by an implicit linker script contained in
 .so and .a files, pointing to the real libraries.
 
 We have already the SHLIB_LDSCRIPT variable to accomplish this for .so
 files.  It may be possible to do the same for .a files, though this
 would require renaming the real archives to something like .a.0 (here
 I'm just taking a similar scheme to the one used for DSO).
 
 As a matter of fact, libprocstat.a would contain:
 
   GROUP ( /usr/lib/libprocstat.a.0 /usr/lib/libkvm.a /usr/lib/libutil.a )
 
 Note that libkvm.a and libutil.a could be linker scripts as well.
 
 Kib, do you see any problem to this proposition?

Wouldn't you need to completely rewrite the handling of the .a files
in the share/mk ? I somewhat dislike the mere thought that .a is not
an archive any longer.

Does it make sense from the overhead and complexity POV, for such small
goal ?


pgpxuDm949pH3.pgp
Description: PGP signature


Re: Fix overlinking in base aka import pkgconf

2012-12-15 Thread Konstantin Belousov
On Sat, Dec 15, 2012 at 08:58:14AM +0100, Bernhard Fr?hlich wrote:
 On Sa., 15. Dez. 2012 02:22:34 CET, Konstantin Belousov kostik...@gmail.com 
 wrote:
 
  On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote:
   Hi,
   
   Some of our binary are overlinked, the way we handle the linking
   doesn't help for that.
  What do you mean there ? Do you mean that some libraries specified for
  the linking stage of the final binary are not needed for the execution ?
  
   
   On proposition could be to use pkgconf
   https://github.com/pkgconf/pkgconf which is BSD license pkg-config
   implementation 100% compatible with pkg-config.
   
   What I propose is to create a new PCADD variable for the Makefiles.
   
   PCADD will invoke pkgconf to gather the libraries and the cflags for a
   given project.
   
   The second thing would be to create .pc files for all of our libraries.
   
   for example:
   usr.bin/fstat dynamic build is overlinked
  And how this is better than just removing the unneeded library from
  the Makefile ?
  
  For the port consumption, I believe that the better solution is to
  provide a pack of the .pc files describing base libraries, most likely
  as port.
 
 We should definitely generate some pc files for our base libraries.
 We already have quite a few ports that need to hack around because of
 missing pc files for ssl for example.

This is fine. But making the base build depend on the pkgconf does not
give us anything except complications and overhead, in my opinion.


pgptYtmqz9fXe.pgp
Description: PGP signature


Re: postgres, initdb, FreeBSD bug?

2012-12-15 Thread Konstantin Belousov
On Fri, Dec 14, 2012 at 10:43:00PM -0600, David Noel wrote:
 I've been fighting with a bug I can't quite seem to figure out and was
 told that this might be the place to come. I'm running
 postgresql-9.2.2 on FreeBSD 8.3-RELEASE-p5 and am having things break
 down when I try to run initdb. I got in contact with the pgsql-general
 mailing list and we debugged the issue to the point where it seemed
 that this might be a FreeBSD-related error. Relevant excerpts from
 several email are below that piece together the error:
 
 I'm running into the following error message when running initdb (FreeBSD 
 host):
 
ygg# /usr/local/etc/rc.d/postgresql initdb -D /zdb/pgsql/data --debug
The files belonging to this database system will be owned by user pgsql.
This user must also own the server process.
 
The database cluster will be initialized with locales
  COLLATE:  C
  CTYPE:en_US.UTF-8
  MESSAGES: en_US.UTF-8
  MONETARY: en_US.UTF-8
  NUMERIC:  en_US.UTF-8
  TIME: en_US.UTF-8
The default text search configuration will be set to english.
 
creating directory /zdb/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database in /zdb/pgsql/data/base/1 ... FATAL:
could not open file pg_xlog/00010001 (log file 0,
segment 1): No such file or directory
child process exited with exit code 1
initdb: removing data directory /zdb/pgsql/data
 
  ...
 
  Interestingly, I have a second--virtually identical--server that I
   just tried initdb on. FreeBSD 8.3-RELEASE-p5, postgresql-server-9.2.2.
   Exact same FATAL: could not open file pg_xlog error. So it is
   reproducible.
 
  ...
 
  The relevant part of the ktrace output is
 
71502 postgres CALL  unlink(0x7fffc130)
71502 postgres NAMI  pg_xlog/xlogtemp.71502
71502 postgres RET   unlink -1 errno 2 No such file or directory
71502 postgres CALL
  open(0x7fffc130,O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR)
71502 postgres NAMI  pg_xlog/xlogtemp.71502
71502 postgres RET   open 3
71502 postgres CALL  write(0x3,0x801a56030,0x2000)
71502 postgres GIO   fd 3 wrote 4096 bytes
 a lot of uninteresting write() calls snipped ...
71502 postgres RET   write 8192/0x2000
71502 postgres CALL  close(0x3)
71502 postgres RET   close 0
71502 postgres CALL  unlink(0x7fffbc60)
71502 postgres NAMI  pg_xlog/00010001
71502 postgres RET   unlink -1 errno 2 No such file or directory
71502 postgres CALL  link(0x7fffc130,0x7fffbc60)
71502 postgres NAMI  pg_xlog/xlogtemp.71502
71502 postgres NAMI  pg_xlog/00010001
71502 postgres RET   link -1 errno 1 Operation not permitted
71502 postgres CALL  unlink(0x7fffc130)
71502 postgres NAMI  pg_xlog/xlogtemp.71502
71502 postgres RET   unlink 0
71502 postgres CALL  open(0x7fffc530,O_RDWR,unused0x180)
71502 postgres NAMI  pg_xlog/00010001
71502 postgres RET   open -1 errno 2 No such file or directory
 
   This corresponds to the execution of XLogFileInit(), and what's
   evidently happening is that we successfully create and zero-fill
   the first xlog segment file under a temporary name, but then
   the attempt to rename it into place with link() fails with EPERM.
 
   This is really a WTF kind of failure, I think.  The directory is
   certainly writable --- it was made under our own UID, and what's
   more we just managed to create the file there under its temp name.
   So how can we get an EPERM failure from link()?
 
   I think this is a kernel bug.
 
   regards, tom lane
 
   PS: one odd thing here is that the ereport(LOG) in
   InstallXLogFileSegment isn't doing anything; otherwise we'd have gotten
   a much more helpful error report about could not link file.  I don't
   think we run the bootstrap mode with log_min_messages set high enough to
   disable LOG messages, so why isn't it printing?  Nonetheless, this error
   shouldn't have occurred.
 
  ...
 
  Where to from here? The freebsd-datab...@freebsd.org mailing list? The
  postgresql port maintainer? Who should I be in touch with?
 
  ...
 
  You need to talk to some FreeBSD kernel hackers about why link()
   might be failing here.  Since you see it on UFS too, we can probably
   exonerate the ZFS filesystem-specific code.
 
   I did some googling and found that EPERM can be issued if the filesystem
   doesn't support hard links (which shouldn't apply to ZFS I trust).
   Also, Linux has a protected_hardlinks option that causes certain
   attempts at creating hard links to fail --- but our use-case here
   doesn't fall foul of any of those restrictions AFAICS, and of course
   FreeBSD isn't Linux.  Still, I wonder if you're running into some
   misdesigned or misimplemented 

Re: Fix overlinking in base aka import pkgconf

2012-12-15 Thread Konstantin Belousov
On Sat, Dec 15, 2012 at 11:56:43AM +0100, Baptiste Daroussin wrote:
 On Sat, Dec 15, 2012 at 03:22:34AM +0200, Konstantin Belousov wrote:
  On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote:
   Hi,
   
   Some of our binary are overlinked, the way we handle the linking doesn't 
   help
   for that.
  What do you mean there ? Do you mean that some libraries specified for the
  linking stage of the final binary are not needed for the execution ?
 
 I mean some library are registrer in the binary with DT_NEEDED tag where they
 don't need to.
 
  
   
   On proposition could be to use pkgconf https://github.com/pkgconf/pkgconf 
   which
   is BSD license pkg-config implementation 100% compatible with pkg-config.
   
   What I propose is to create a new PCADD variable for the Makefiles.
   
   PCADD will invoke pkgconf to gather the libraries and the cflags for a 
   given
   project.
   
   The second thing would be to create .pc files for all of our libraries.
   
   for example:
   usr.bin/fstat dynamic build is overlinked
  And how this is better than just removing the unneeded library from
  the Makefile ?
 
 In that case to statically build you need -lkvm -lutil -lprocstat but if you
 build dynamically you will only need -lproctast meaning you don't need to be
 directly linked to libutil and libkvm. This means a breakage of libutil will
 only have inpact on procstat and no more on fstat for example.
 
  
  For the port consumption, I believe that the better solution is to provide
  a pack of the .pc files describing base libraries, most likely as port.
 
 Yeah the port is another thing which yes can probably be done that way.
 
  
  Using .pc for the base system build is overkill, it does not add anything
  that cannot be accomplished by our existing build system. IMO.
 
 Probably.
 The thing is with pkgconf, fstat does not need to know that procstat
 needs libkvm and libutil for static link, it just has to know it needs 
 procstat
 and pkgconf does all the magic. and pkgconf is really small (only 48k on an
 amd64 box)
 
 Other solution would be to reinvent the same thing using our framework?
 Maybe a LDSADD (LD STATIC ADD) to differenciate both?

First, there is a linker flag, --as-needed, which does exactly what you
want: it only registers the dso as a dependency using DT_NEEDED dynamic
tag when there are real references from the linked object to the symbols
in the dso.

On the other hand, I am not a big fun of this feature. Dynamic linker
is permissive for the main binary to reference a symbols in implicit
dependencies of the explicit dependencies. If sudden change causes the
main binary to start using a symbol from 'second order' dsos, you cause
unneccessary complications there. More, recently there was a trend in the
Linux world to disable the static linker from resolving symbols from the
second order dsos. See the --no-add-needed and its recent switch in the
newer ld.

Overall, I like the fact that we explicitely list all depended base libraries
in the link command, for the base system.


pgphsm2ISdNX9.pgp
Description: PGP signature


Re: postgres, initdb, FreeBSD bug?

2012-12-15 Thread Konstantin Belousov
On Sat, Dec 15, 2012 at 09:02:34AM -0600, David Noel wrote:
 Ahh... the security.bsd. sysctl output had the answer: I had
 security.bsd.hardlink_check_gid and security.bsd.hardlink_check_uid
 set to 1 in sysctl.conf. Removing that fixed the problem.
Then, why did you set it ?

The sysctl indeed has an interesting interaction with the fact that the
created inode inherits gid from the parent directory. If you use it, you
must ensure that the directory group is in the group set of the process.

 
 Many thanks,
 
 -David
 
 
 
  Show the ktrace from the same error on UFS.
 
   5179 postgres CALL  unlink(0x7fffe570)
   5179 postgres NAMI  pg_notify/
   5179 postgres RET   unlink 0
   5179 postgres CALL  getdirentries(0x3,0x801a4b000,0x1000,0x801a4a068)
   5179 postgres RET   getdirentries 0
   5179 postgres CALL  lseek(0x3,0,SEEK_SET)
   5179 postgres RET   lseek 0
   5179 postgres CALL  close(0x3)
   5179 postgres RET   close 0
   5179 postgres CALL  open(0x7fffe580,O_RDWR|O_CREAT,S_IRUSR|S_IWUSR)
   5179 postgres NAMI  pg_notify/
   5179 postgres RET   open 3
   5179 postgres CALL  lseek(0x3,0,SEEK_SET)
   5179 postgres RET   lseek 0
   5179 postgres CALL  write(0x3,0x8041c1b40,0x2000)
   5179 postgres GIO   fd 3 wrote 4096 bytes
 ...
   5179 postgres RET   write 8192/0x2000
   5179 postgres CALL  close(0x3)
   5179 postgres RET   close 0
   5179 postgres CALL  unlink(0x7fffbc60)
   5179 postgres NAMI  pg_xlog/00010001
   5179 postgres RET   unlink -1 errno 2 No such file or directory
   5179 postgres CALL  link(0x7fffc130,0x7fffbc60)
   5179 postgres NAMI  pg_xlog/xlogtemp.5179
   5179 postgres NAMI  pg_xlog/00010001
   5179 postgres RET   link -1 errno 1 Operation not permitted
   5179 postgres CALL  unlink(0x7fffc130)
   5179 postgres NAMI  pg_xlog/xlogtemp.5179
   5179 postgres RET   unlink 0
   5179 postgres CALL  open(0x7fffc530,O_RDWR,unused0x180)
   5179 postgres NAMI  pg_xlog/00010001
   5179 postgres RET   open -1 errno 2 No such file or directory
 
 
  Show the security.bsd sysctl settings, in particular,
  harlink_check_{u,g}id.ygg# sysctl security.bsd.
 
 security.bsd.map_at_zero: 0
 security.bsd.suser_enabled: 1
 security.bsd.unprivileged_proc_debug: 0
 security.bsd.conservative_signals: 1
 security.bsd.see_other_gids: 0
 security.bsd.see_other_uids: 0
 security.bsd.unprivileged_idprio: 0
 security.bsd.unprivileged_read_msgbuf: 0
 security.bsd.hardlink_check_gid: 1
 security.bsd.hardlink_check_uid: 1
 security.bsd.unprivileged_get_quota: 0
 security.bsd.stack_guard_page: 0
 
  Show the ls -la output for the pg_xlog directory.
 
 ygg:~ ls -la /zdb/pgsql/data/pg_xlog/
 total 5
 drwx--   3 pgsql  wheel   3 Dec 15 08:39 .
 drwx--  14 pgsql  wheel  18 Dec 15 08:39 ..
 drwx--   2 pgsql  wheel   2 Dec 15 08:39 archive_status


pgp1QMVmZ4ACT.pgp
Description: PGP signature


Re: Fix overlinking in base aka import pkgconf

2012-12-14 Thread Konstantin Belousov
On Sat, Dec 15, 2012 at 12:54:19AM +0100, Baptiste Daroussin wrote:
 Hi,
 
 Some of our binary are overlinked, the way we handle the linking doesn't help
 for that.
What do you mean there ? Do you mean that some libraries specified for the
linking stage of the final binary are not needed for the execution ?

 
 On proposition could be to use pkgconf https://github.com/pkgconf/pkgconf 
 which
 is BSD license pkg-config implementation 100% compatible with pkg-config.
 
 What I propose is to create a new PCADD variable for the Makefiles.
 
 PCADD will invoke pkgconf to gather the libraries and the cflags for a given
 project.
 
 The second thing would be to create .pc files for all of our libraries.
 
 for example:
 usr.bin/fstat dynamic build is overlinked
And how this is better than just removing the unneeded library from
the Makefile ?

For the port consumption, I believe that the better solution is to provide
a pack of the .pc files describing base libraries, most likely as port.

Using .pc for the base system build is overkill, it does not add anything
that cannot be accomplished by our existing build system. IMO.


pgpTUPcXFQuN1.pgp
Description: PGP signature


Re: [PATCH] Shared library debug .symbols files

2012-12-13 Thread Konstantin Belousov
On Thu, Dec 13, 2012 at 04:08:47PM +, Ed Maste wrote:
 I've been working generating userland debugging symbols, with the goal
 that we'll build them for each release.  The user could install them
 along with the system, or later on when needed for debugging.  The
 symbols files will also be useful for profiling and tools such as
 Valgrind, without needing to build and install world first.
 
 This patch enables .symbols files for shared libraries when DEBUG_FLAGS
 is set.  Future changes will be needed to address static libraries and
 base system binaries, and the release build bits.
You cannot strip static libraries, at least if you want the result
to be useful for the consequent use.


pgp2AYosISFKr.pgp
Description: PGP signature


Re: ELF symtab and ddbsymtab difference

2012-12-06 Thread Konstantin Belousov
On Wed, Dec 05, 2012 at 12:13:24PM +0530, Shrikanth Kamath wrote:
 This is regarding the fields in the structure elf_file_t in link_elf.c.
 For some kernel modules the symtab field is different from the ddbsymtab
 field for some it is the same, would like to know what is the difference
 between the two and how to enable ddbsymtab?
Assuming we are talking about the link_elf.c and not about link_elf_obj.c.
The symtab and ddbsymtab are first initialized from the dynamic symbol
table in the module, and later, in the process of loading the module, if
the non-dynamic symbol table is present, ddbsymtab is rewritten to point
to the table.

 
 Does enabling -g in CFLAGS make the binary build the ddbsymtab different
 from symtab?
No. It is strip done on the module which could result in the removal of the
non-dynamic symtab.

 
 The problem is lookup for some symbols in the kernel module that I built
 returns with undefined, on inspecting it was getting a ENOENT from the
 function
 link_elf_lookup_symbol()
 {
  ...
  /* If we have not found it, look at the full table (if loaded) */
  if (ef-symtab == ef-ddbsymtab)
  return (ENOENT);
  ...
 }

It is not the problem. It just means that you dynamic symbol table does
not contain the needed symbol, which is the problem. As a coincident,
you also stripped your module, making the problem to be exposed.

I guess that you should look at the EXPORT_SYMS feature of the module
Makefiles. But I also remember there were some bugs.


pgpYNFsprvOlA.pgp
Description: PGP signature


Re: naming a .h file for kernel use only

2012-12-04 Thread Konstantin Belousov
On Tue, Dec 04, 2012 at 03:28:37PM -0500, Rick Macklem wrote:
 Hi,
 
 For my NFSv4.1 client work, I've taken a few definitions out of a
 kernel rpc .c file and put them in a .h file so that they can
 be included in other sys/rpc .c files.
 
 I've currently named the file _krpc.h. I thought I'd check if
 this is a reasonable name before doing the big commit of the
 NFSv4.1 stuff to head. (I have a vague notion that a leading _
 would indicate not for public use, but I am not sure?)
 
 Thanks in advance for naming suggestions for this file, rick

I believe that _thing.h is the convention for the headers that are
never directly included by the .c files. It is there to hide some
shared parts of other includes, for consumption by includes.

Why not move the stuff to sys/nfs/krpc.h ? BTW, this file lacks the
include guards.


pgpN0KluGObMT.pgp
Description: PGP signature


Re: Two Intel E31220L 9.0-Stable systems, 'kern.random' missing on one?

2012-12-03 Thread Konstantin Belousov
On Mon, Dec 03, 2012 at 03:11:21PM +, Karl Pielorz wrote:
 
 Hi,
 
 I have two SuperMicro E31220L based systems - both had identical 
 /etc/sysctl.conf - I then shifted them from 9.0-R to 9.0-Stable (as of 
 2012/12/03).
 
 Now I've noticed of them complains at boot time that a bunch of OID's are 
 missing - and sure enough:
 
 
 sysctl kern.random
 sysctl: unknown oid 'kern.random'
 
 
 But the other system returns these fine.
 
 One system ID's as 'Intel E31220L @ 2.20Ghz' - the other ID's as 'Intel 
 E31220L V2 @ 2.30Ghz' the V2 system is apparently 'missing' the kern.random 
 stuff.
 
 The only reason I can think for the OID's not being present is if one 
 system is using hardware RNG? - Though 'man 4 random' states:
 
 The only hardware implementation currently is for the
  VIA C3 Nehemiah (stepping 3 or greater) CPU.  More will be added in the
  future.
 
 Is there any other reason why they would have 'disappeared' on the non V2 
 system? (in fact, looking at the Feature2 line I can see 'RDRAND' on the V2 
 system, hmm so I'm guessing that's it?!)

You noted that RDRAND is supported by your hardware. And indeed, FreeBSD
got the RDRAND (aka Bull Mountain) hardware random number generator support
recently.

From my reading of the sys/dev/random code, the kern.random OID is only
instantiated if the software implementation is selected. If you prefer
software, you can disable RDRAND with hw.ivy_rng_enable=0 tunable.


pgpiF9CoXmpgC.pgp
Description: PGP signature


Re: FreeBSD needs Git to ensure repo integrity [was: 2012 incident]

2012-11-17 Thread Konstantin Belousov
On Sat, Nov 17, 2012 at 11:05:40PM -0800, Perry Hutchison wrote:
 [trimmed some of the lists]
 
 Chris Rees utis...@gmail.com wrote:
  ... git doesn't work with our workflow.
 
 I'm sure the workflow itself is documented somewhere, but is
 there a good writeup of _how_ git doesn't work with it, e.g. what
 capabilit{y,ies} is/are missing?  Seems this might be of interest
 to the git developers, not because they necessarily want to support
 FreeBSD as such, but as an example of a real-world workflow that git
 currently does not handle well.

Git would work well with our workflow. It supports the centralized
repository model, which the project employs right now.

The biggest issues, assuming the project indeed decides to move to Git
right now, are the migration costs, both in the term of the technical
efforts needed, and the learning curve for the most population of the
committers.

Relatively minor problem, at least with the current rate of the commits,
would be a commit race, when the shared repo head forwarded due to the
parallel commit. The issue should be somewhat mitigated by the Git
allowance to push a set of changes in one push.


pgpOinOOg50ux.pgp
Description: PGP signature


Re: Memory reserves or lack thereof

2012-11-15 Thread Konstantin Belousov
On Thu, Nov 15, 2012 at 11:32:18AM -0600, Alan Cox wrote:
 On 11/13/2012 05:54, Konstantin Belousov wrote:
  On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
  On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
  On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
  This patch still doesn't address the issue of M_NOWAIT calls driving
  the memory the all the way down to 2 pages, right ? It would be nice to
  have M_NOWAIT just do non-sleep version of M_WAITOK and M_USE_RESERVE
  flag to dig deep.
  This is out of scope of the change. But it is required for any further
  adjustements.
  I would suggest a somewhat different response:
 
  The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and 
  does reintroduce M_USE_RESERVE as a way to specify dig deep.
 
  Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory 
  down to two pages.  The effect of the patch is to stop M_NOWAIT at two 
  pages rather than allowing it to continue to zero pages.
 
  When you say, This is out of scope ..., I believe that you are 
  referring to changing two pages into something larger.  I agree that 
  this is out of scope for the current change.
  I referred exactly to the difference between M_USE_RESERVE set or not.
  IMO this is what was asked by the question author. So yes, my mean of
  the 'out of scope' is about tweaking the 'two pages reserve' in some
  way.
 
 Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed
 man page update to malloc(9):
 
 Index: share/man/man9/malloc.9
 ===
 --- share/man/man9/malloc.9 (revision 243091)
 +++ share/man/man9/malloc.9 (working copy)
 @@ -29,7 +29,7 @@
  .\ $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $
  .\ $FreeBSD$
  .\
 -.Dd January 28, 2012
 +.Dd November 15, 2012
  .Dt MALLOC 9
  .Os
  .Sh NAME
 @@ -153,13 +153,12 @@ if
  .Dv M_WAITOK
  is specified.
  .It Dv M_USE_RESERVE
 -Indicates that the system can dig into its reserve in order to obtain the
 -requested memory.
 -This option used to be called
 -.Dv M_KERNEL
 -but has been renamed to something more obvious.
 -This option has been deprecated and is slowly being removed from the
 kernel,
 -and so should not be used with any new programming.
 +Indicates that the system can use its reserve of memory to satisfy the
 +request.
 +This option should only be used in combination with
 +.Dv M_NOWAIT
 +when an allocation failure cannot be tolerated by the caller without
 +catastrophic effects on the system.
  .El
  .Pp
  Exactly one of either

The text looks fine. Shouldn't the requirement for M_USE_RESERVE be also
expressed in KASSERT, like this:

diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index d9e4692..f8a4f70 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -353,6 +351,9 @@ malloc2vm_flags(int malloc_flags)
 {
int pflags;
 
+   KASSERT((malloc_flags  M_USE_RESERVE) == 0 ||
+   (malloc_flags  M_NOWAIT) != 0,
+   (M_USE_RESERVE requires M_NOWAIT));
pflags = (malloc_flags  M_USE_RESERVE) != 0 ? VM_ALLOC_INTERRUPT :
VM_ALLOC_SYSTEM;
if ((malloc_flags  M_ZERO) != 0)

I understand that this could be added to places of the allocator's entries,
but I think that the page allocations are fine too.


pgptBhkylD1fK.pgp
Description: PGP signature


Re: Memory reserves or lack thereof

2012-11-13 Thread Konstantin Belousov
On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
 On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
  On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
  This patch still doesn't address the issue of M_NOWAIT calls driving
  the memory the all the way down to 2 pages, right ? It would be nice to
  have M_NOWAIT just do non-sleep version of M_WAITOK and M_USE_RESERVE
  flag to dig deep.
  This is out of scope of the change. But it is required for any further
  adjustements.
 
 I would suggest a somewhat different response:
 
 The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and 
 does reintroduce M_USE_RESERVE as a way to specify dig deep.
 
 Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory 
 down to two pages.  The effect of the patch is to stop M_NOWAIT at two 
 pages rather than allowing it to continue to zero pages.
 
 When you say, This is out of scope ..., I believe that you are 
 referring to changing two pages into something larger.  I agree that 
 this is out of scope for the current change.

I referred exactly to the difference between M_USE_RESERVE set or not.
IMO this is what was asked by the question author. So yes, my mean of
the 'out of scope' is about tweaking the 'two pages reserve' in some
way.


pgpAl2UTJQyEa.pgp
Description: PGP signature


Re: Memory reserves or lack thereof

2012-11-12 Thread Konstantin Belousov
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
 On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov 
 kostik...@gmail.comwrote:
 
  On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:
   I have a memory subsystem design question that I'm hoping someone can
  answer.
  
   I've been looking at a machine that is completely out of memory, as in
  
v_free_count = 0,
v_cache_count = 0,
  
   I wondered how a machine could completely run out of memory like this,
  especially after finding a lack of interrupt storms or other pathologies
  that would tend to overcommit memory. So I started investigating.
  
   Most allocators come down to vm_page_alloc(), which has this guard:
  
 if ((curproc == pageproc)  (page_req != VM_ALLOC_INTERRUPT)) {
 page_req = VM_ALLOC_SYSTEM;
 };
  
 if (cnt.v_free_count + cnt.v_cache_count  cnt.v_free_reserved ||
 (page_req == VM_ALLOC_SYSTEM 
 cnt.v_free_count + cnt.v_cache_count 
  cnt.v_interrupt_free_min) ||
 (page_req == VM_ALLOC_INTERRUPT 
 cnt.v_free_count + cnt.v_cache_count  0)) {
  
   The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate
  every last page.
  
   From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
  perhaps only used from interrupt threads. Not so, see kmem_malloc() or
  uma_small_alloc() which both contain this mapping:
  
 if ((flags  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
 pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
 else
 pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
  
   Note that M_USE_RESERVE has been deprecated and is used in just a
  handful of places. Also note that lots of code paths come through these
  routines.
  
   What this means is essentially _any_ allocation using M_NOWAIT will
  bypass whatever reserves have been held back and will take every last page
  available.
  
   There is no documentation stating M_NOWAIT has this side effect of
  essentially being privileged, so any innocuous piece of code that can't
  block will use it. And of course M_NOWAIT is literally used all over.
  
   It looks to me like the design goal of the BSD allocators is on
  recovery; it will give all pages away knowing it can recover.
  
   Am I missing anything? I would have expected some small number of pages
  to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
  sort of back door for grabbing memory.
  
 
  Your analysis is right, there is nothing to add or correct.
  This is the reason to strongly prefer M_WAITOK.
 
 
 Agreed.  Once upon time, before SMPng, M_NOWAIT was rarely used.  It was
 well understand that it should only be used by interrupt handlers.
 
 The trouble is that M_NOWAIT conflates two orthogonal things.  The obvious
 being that the allocation shouldn't sleep.  The other being how far we're
 willing to deplete the cache/free page queues.
 
 When fine-grained locking got sprinkled throughout the kernel, we all to
 often found ourselves wanting to do allocations without the possibility of
 blocking.  So, M_NOWAIT became commonplace, where it wasn't before.
 
 This had the unintended consequence of introducing a lot of memory
 allocations in the top-half of the kernel, i.e., non-interrupt handling
 code, that were digging deep into the cache/free page queues.
 
 Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE
 allocation is less likely to succeed than an M_NOWAIT allocation.
 However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
 could only allocate a free page.  M_USE_RESERVE said that it ok to allocate
 a cached page even though M_NOWAIT was specified.  Consequently, the system
 wouldn't dig as far into the free page queue if M_USE_RESERVE was
 specified, because it was allowed to reclaim a cached page.
 
 In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
 dig any deeper into the cache/free page queues than M_WAITOK does and
 reintroduce a M_USE_RESERVE-like flag that says dig deep into the
 cache/free page queues.  The trouble is that we then need to identify all
 of those places that are implicitly depending on the current behavior of
 M_NOWAIT also digging deep into the cache/free page queues so that we can
 add an explicit M_USE_RESERVE.
 
 Alan
 
 P.S. I suspect that we should also increase the size of the page reserve
 that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
 many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
 could actually be satisfied by two pages?

I am almost sure that most of people who put the M_NOWAIT flag, do not
know the 'allow the deeper drain of free queue' effect. As such, I believe
we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectations
of the problematic places would be in the swapout path.

I found a single explicit use of M_USE_RESERVE in the kernel,
so the flip

Re: Memory reserves or lack thereof

2012-11-12 Thread Konstantin Belousov
On Mon, Nov 12, 2012 at 11:35:42AM -0600, Alan Cox wrote:
 Agreed.  Most recently I eliminated several uses from the arm pmap
 implementations.  There is, however, one other use:
 
 ofed/include/linux/gfp.h:#defineGFP_ATOMIC  (M_NOWAIT |
 M_USE_RESERVE)
Yes, I forgot to mention this. I have no idea about semantic  of
GFP_ATOMIC compat flag.

Below is the updated patch with two your notes applied.

diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
index dc9c307..ab1e869 100644
--- a/sys/amd64/amd64/uma_machdep.c
+++ b/sys/amd64/amd64/uma_machdep.c
@@ -29,6 +29,7 @@ __FBSDID($FreeBSD$);
 
 #include sys/param.h
 #include sys/lock.h
+#include sys/malloc.h
 #include sys/mutex.h
 #include sys/systm.h
 #include vm/vm.h
@@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, 
int wait)
int pflags;
 
*flags = UMA_SLAB_PRIV;
-   if ((wait  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-   pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-   else
-   pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-   if (wait  M_ZERO)
-   pflags |= VM_ALLOC_ZERO;
+   pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags);
if (m == NULL) {
diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
index f60cdb1..75366e3 100644
--- a/sys/arm/arm/vm_machdep.c
+++ b/sys/arm/arm/vm_machdep.c
@@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t 
*flags, int wait)
ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
return (ret);
}
-   if ((wait  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-   pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-   else
-   pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-   if (wait  M_ZERO)
-   pflags |= VM_ALLOC_ZERO;
+   pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
if (m == NULL) {
diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
index 71caa29..2ce1ca6 100644
--- a/sys/fs/devfs/devfs_devs.c
+++ b/sys/fs/devfs/devfs_devs.c
@@ -121,7 +121,7 @@ devfs_alloc(int flags)
struct cdev *cdev;
struct timespec ts;
 
-   cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
+   cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
((flags  MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
if (cdp == NULL)
return (NULL);
diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
index 37353ff..9f77762 100644
--- a/sys/ia64/ia64/uma_machdep.c
+++ b/sys/ia64/ia64/uma_machdep.c
@@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, 
int wait)
int pflags;
 
*flags = UMA_SLAB_PRIV;
-   if ((wait  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-   pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-   else
-   pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-   if (wait  M_ZERO)
-   pflags |= VM_ALLOC_ZERO;
+   pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
index 798e632..24baef0 100644
--- a/sys/mips/mips/uma_machdep.c
+++ b/sys/mips/mips/uma_machdep.c
@@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, 
int wait)
void *va;
 
*flags = UMA_SLAB_PRIV;
-
-   if ((wait  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-   pflags = VM_ALLOC_INTERRUPT;
-   else
-   pflags = VM_ALLOC_SYSTEM;
+   pflags = m2vm_flags(wait, 0);
 
for (;;) {
m = pmap_alloc_direct_page(0, pflags);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index a491680..3e320b9 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, 
u_int8_t *flags, int wait)
*flags = UMA_SLAB_PRIV;
needed_lock = !PMAP_LOCKED(kernel_pmap);
 
-if ((wait  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-else
-pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-if (wait  M_ZERO)
-pflags |= VM_ALLOC_ZERO;
+   pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 
 for (;;) {
 m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
index 162c7fb..3882bfa 100644
--- a/sys/powerpc/aim/slb.c
+++ b/sys/powerpc/aim/slb.c
@@ -483,12 +483,7 @@ 

Re: Memory reserves or lack thereof

2012-11-12 Thread Konstantin Belousov
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
 This patch still doesn't address the issue of M_NOWAIT calls driving
 the memory the all the way down to 2 pages, right ? It would be nice to
 have M_NOWAIT just do non-sleep version of M_WAITOK and M_USE_RESERVE
 flag to dig deep.

This is out of scope of the change. But it is required for any further
adjustements.


pgpHI7rQOhvFP.pgp
Description: PGP signature


Re: Memory reserves or lack thereof

2012-11-10 Thread Konstantin Belousov
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:
 I have a memory subsystem design question that I'm hoping someone can answer.
 
 I've been looking at a machine that is completely out of memory, as in
 
  v_free_count = 0, 
  v_cache_count = 0, 
 
 I wondered how a machine could completely run out of memory like this, 
 especially after finding a lack of interrupt storms or other pathologies that 
 would tend to overcommit memory. So I started investigating.
 
 Most allocators come down to vm_page_alloc(), which has this guard:
 
   if ((curproc == pageproc)  (page_req != VM_ALLOC_INTERRUPT)) {
   page_req = VM_ALLOC_SYSTEM;
   };
 
   if (cnt.v_free_count + cnt.v_cache_count  cnt.v_free_reserved ||
   (page_req == VM_ALLOC_SYSTEM  
   cnt.v_free_count + cnt.v_cache_count  cnt.v_interrupt_free_min) ||
   (page_req == VM_ALLOC_INTERRUPT 
   cnt.v_free_count + cnt.v_cache_count  0)) {
 
 The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every 
 last page.
 
 From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, 
 perhaps only used from interrupt threads. Not so, see kmem_malloc() or 
 uma_small_alloc() which both contain this mapping:
 
   if ((flags  (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
   pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
   else
   pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
 
 Note that M_USE_RESERVE has been deprecated and is used in just a handful of 
 places. Also note that lots of code paths come through these routines.
 
 What this means is essentially _any_ allocation using M_NOWAIT will bypass 
 whatever reserves have been held back and will take every last page available.
 
 There is no documentation stating M_NOWAIT has this side effect of 
 essentially being privileged, so any innocuous piece of code that can't block 
 will use it. And of course M_NOWAIT is literally used all over.
 
 It looks to me like the design goal of the BSD allocators is on recovery; it 
 will give all pages away knowing it can recover.
 
 Am I missing anything? I would have expected some small number of pages to be 
 held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of 
 back door for grabbing memory.
 

Your analysis is right, there is nothing to add or correct.
This is the reason to strongly prefer M_WAITOK.


pgpXUAix5bcxa.pgp
Description: PGP signature


Re: [patch] rtld: fix fd leak with parallel dlopen and fork/exec

2012-11-04 Thread Konstantin Belousov
On Sun, Nov 04, 2012 at 03:37:27PM +0100, Jilles Tjoelker wrote:
 Rtld does not set FD_CLOEXEC on its internal file descriptors;
 therefore, such a file descriptor may be passed to a process created by
 another thread running in parallel to dlopen() or fdlopen().
 
 The race is easy to trigger with the below dlopen_exec_race.c that
 performs the two in parallel repeatedly, for example
 ./dlopen_exec_race /lib/libedit.so.7 | grep lib
 
 No other threads are expected to be running during parsing of the hints
 and libmap files but the file descriptors need not be passed to child
 processes so I added O_CLOEXEC there as well.
 
 The O_CLOEXEC flag is ignored by older kernels so it will not cause
 breakage when people try the unsupported action of running new rtld on
 old kernel. However, the fcntl(F_DUPFD_CLOEXEC) will fail with [EINVAL]
 on an old kernel so this patch will break fdlopen() with old kernels. I
 suppose this is OK because fdlopen() is not used in the vital parts of
 buildworld and the like.
The patch should be fine. We definitely do not support running newer
binaries on older kernels, so the worries about the fdlopen(3) are
not sustained.

Please commit.

 
 Index: libexec/rtld-elf/libmap.c
 ===
 --- libexec/rtld-elf/libmap.c (revision 240561)
 +++ libexec/rtld-elf/libmap.c (working copy)
 @@ -121,7 +121,7 @@
   }
   }
  
 - fd = open(rpath, O_RDONLY);
 + fd = open(rpath, O_RDONLY | O_CLOEXEC);
   if (fd == -1) {
   dbg(lm_parse_file: open(\%s\) failed, %s, rpath,
   rtld_strerror(errno));
 Index: libexec/rtld-elf/rtld.c
 ===
 --- libexec/rtld-elf/rtld.c   (revision 240561)
 +++ libexec/rtld-elf/rtld.c   (working copy)
 @@ -1598,7 +1598,7 @@
   /* Keep from trying again in case the hints file is bad. */
   hints = ;
  
 - if ((fd = open(ld_elf_hints_path, O_RDONLY)) == -1)
 + if ((fd = open(ld_elf_hints_path, O_RDONLY | O_CLOEXEC)) == -1)
   return (NULL);
   if (read(fd, hdr, sizeof hdr) != sizeof hdr ||
   hdr.magic != ELFHINTS_MAGIC ||
 @@ -2046,13 +2046,13 @@
   */
  fd = -1;
  if (fd_u == -1) {
 - if ((fd = open(path, O_RDONLY)) == -1) {
 + if ((fd = open(path, O_RDONLY | O_CLOEXEC)) == -1) {
   _rtld_error(Cannot open \%s\, path);
   free(path);
   return (NULL);
   }
  } else {
 - fd = dup(fd_u);
 + fd = fcntl(fd_u, F_DUPFD_CLOEXEC, 0);
   if (fd == -1) {
   _rtld_error(Cannot dup fd);
   free(path);
 
 
 dlopen_exec_race.c:
 
 #include sys/types.h
 #include sys/wait.h
 
 #include dlfcn.h
 #include err.h
 #include errno.h
 #include pthread.h
 #include spawn.h
 #include stdio.h
 
 extern char **environ;
 
 static void *
 dlopen_thread(void *arg)
 {
   const char *path = arg;
   void *handle;
 
   for (;;) {
   handle = dlopen(path, RTLD_LOCAL | RTLD_NOW);
   if (handle == NULL)
   continue;
   dlclose(handle);
   }
   return NULL;
 }
 
 static void
 filestat_loop(void)
 {
   int error, status;
   pid_t pid, wpid;
 
   for (;;) {
   error = posix_spawnp(pid, sh, NULL, NULL,
   (char *[]){ sh, -c, procstat -f $$, NULL },
   environ);
   if (error != 0)
   errc(1, error, posix_spawnp);
   do
   wpid = waitpid(pid, status, 0);
   while (wpid == -1  errno == EINTR);
   if (wpid == -1)
   err(1, waitpid);
   if (status != 0)
   errx(1, sh -c failed);
   }
 }
 
 int
 main(int argc, char *argv[])
 {
   pthread_t td;
   int error;
 
   if (argc != 2)
   {
   fprintf(stderr, Usage: %s dso\n, argv[0]);
   return 1;
   }
 
   error = pthread_create(td, NULL, dlopen_thread, argv[1]);
   if (error != 0)
   errc(1, error, pthread_create);
 
   filestat_loop();
 
   return 0;
 }
 
 fdlopen_exec_race.c:
 
 #include sys/types.h
 #include sys/wait.h
 
 #include dlfcn.h
 #include err.h
 #include errno.h
 #include fcntl.h
 #include pthread.h
 #include spawn.h
 #include stdio.h
 #include unistd.h
 
 extern char **environ;
 
 static void *
 dlopen_thread(void *arg)
 {
   const char *path = arg;
   int fd;
   void *handle;
 
   for (;;) {
   fd = open(path, O_RDONLY | O_CLOEXEC);
   if (fd == -1)
   err(1, open %s, path);
   handle = fdlopen(fd, RTLD_LOCAL | RTLD_NOW);
   close(fd);
   if (handle == NULL)
   continue;
   dlclose(handle);
   }
   return NULL;
 }
 
 static void
 

Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..

2012-11-03 Thread Konstantin Belousov
On Sat, Nov 03, 2012 at 01:11:17PM -0500, Alan Cox wrote:
 On Wed, Oct 31, 2012 at 2:06 PM, Konstantin Belousov 
 kostik...@gmail.comwrote:
 
  On Wed, Oct 31, 2012 at 11:52:06AM -0700, Adrian Chadd wrote:
   On 31 October 2012 11:20, Ian Lepore free...@damnhippie.dyndns.org
  wrote:
I think there are some things we should be investigating about the
growth of memory usage.  I just noticed this:
   
Freebsd 6.2 on an arm processor:
   
  369 root 1   8  -88  1752K   748K nanslp   3:00  0.00% watchdogd
   
Freebsd 10.0 on the same system:
   
  367 root 1 -52   r0 10232K 10160K nanslp  10:04  0.00% watchdogd
   
The 10.0 system is built with MALLOC_PRODUCTION (without that defined
the system won't even boot, it only has 64MB of ram).  That's a crazy
amount of growth for a relatively simple daemon.
  
   Would you please, _please_ do some digging into this?
  
   It's quite possible there's something in the libraries that are
   allocating some memory upon first call invocation - yes, that's
   jemalloc, but it could also be other things like stdio.
  
   We really, really need to fix this userland bloat; it's terribly
   ridiculous at this point. There's no reason a watchdog daemon should
   take 10megabytes of RAM.
  Watchdogd was recently changed to mlock its memory. This is the cause
  of the RSS increase.
 
 
 Is it also statically linked?

No. I do not think that it is reasonable to statically link watchdogd.
It might result in some memory saving, but I dislike the whole idea
of static linkage on Tier 1 platforms.


pgpNDz1ZE5tte.pgp
Description: PGP signature


Re: watchdogd, jemalloc, and mlockall

2012-11-03 Thread Konstantin Belousov
On Sat, Nov 03, 2012 at 12:38:39PM -0600, Ian Lepore wrote:
 In an attempt to un-hijack the thread about memory usage increase
 between 6.4 and 9.x, I'm starting a new thread here related to my recent
 discovery that watchdogd uses a lot more memory since it began using
 mlockall(2).
 
 I tried statically linking watchdogd and it made a small difference in
 RSS, presumably because it doesn't wire down all of libc and libm.
 
  VSZ   RSS
 10236 10164  Dynamic
  8624  8636  Static
 
 Those numbers are from ps -u on an arm platform.  I just updated the PR
 (bin/173332) with some procstat -v output comparing with/without
 mlockall().
 
 It appears that the bulk of the new RSS bloat comes from jemalloc
 allocating vmspace in 8MB chunks.  With mlockall(MCL_FUTURE) in effect
 that leads to wiring 8MB to satisfy what probably amounts to a few
 hundred bytes of malloc'd memory.
 
 It would probably also be a good idea to remove the floating point from
 watchdogd to avoid wiring all of libm.  The floating point is used just
 to turn the timeout-in-seconds into a power-of-two-nanoseconds value.
 There's probably a reasonably efficient way to do that without calling
 log(), considering that it only happens once at program startup.

No, I propose to add a switch to turn on/off the mlockall() call.
I have no opinion on the default value of the suggested switch.


pgp7Jg9fdTcDk.pgp
Description: PGP signature


Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..

2012-10-31 Thread Konstantin Belousov
On Wed, Oct 31, 2012 at 09:49:21AM +, Karl Pielorz wrote:
 
 --On 30 October 2012 19:51 +0200 Konstantin Belousov kostik...@gmail.com 
 wrote:
 
  I suggest to take a look at where the actual memory goes.
 
  Start with procstat -v.
 
 Ok, running that for the milter PID I get seem to be able to see smallish 
 chunks used for things like 'libmilter.so', and 'libthr.so' / 'libc.so' - 
 e.g.
Since you neglected to provide the verbatim output of procstat, nothing
conclusive can be said. Obviously, you can make an investigation on your
own.

 
 2010   0x40   0x413000 r-x   190   1   0 CN-- vn 
 /usr/local/sbin/milter
 2010   0x613000   0x80 rw-   150   1   0  df
 20100x8006130000x80062b000 r-x   240  97   0 CN-- vn 
 /libexec/ld-elf.so.1
 20100x80062b0000x800634000 rw-90   1   0  df
 20100x8006340000x80065f000 rw-   330   1   0  df
 20100x80082b0000x80082d000 rw-20   1   0  df
 20100x80082d0000x800839000 r-x   12   12   2   1 CN-- vn 
 /usr/lib/libmilter.so.5
 20100x8008390000x800a38000 ---00   1   0  df
 20100x800a380000x800a39000 rw-10   1   0 C--- vn 
 /usr/lib/libmilter.so.5
 20100x800a390000x800a3c000 rw-00   0   0  --
 
 Then there's a bunch of 'large' blocks e.g..
 
  PID  STARTEND PRT  RES PRES REF SHD  FL TP PATH
 20100x801c00x80280 rw- 28690   4   0  df
 20100x802800x80340 rw- 18800   1   0  df
 20100x803400x80380 rw-  9200   1   0  df
 20100x803800x80400 rw-  8650   1   0  df
 20100x804000x80440 rw- 10240   4   0  df
 20100x804400x804c0 rw-  6270   1   0  df
 20100x804c00x80500 rw- 10210   4   0  df
Most likely, these are malloc arenas.

 
 Then lots of 'little' blocks,
 
 2010 0x70161000 0x70181000 rw-   160   1   0 ---D df
 2010 0x70362000 0x70382000 rw-   160   1   0 ---D df
 2010 0x70563000 0x70583000 rw-   160   1   0 ---D df
 2010 0x70764000 0x70784000 rw-   160   1   0 ---D df
 2010 0x70965000 0x70985000 rw-   160   1   0 ---D df
 2010 0x70b66000 0x70b86000 rw-   160   1   0 ---D df
 2010 0x70d67000 0x70d87000 rw-   160   1   0 ---D df
 2010 0x70f68000 0x70f88000 rw-   160   1   0 ---D df
 2010 0x71169000 0x71189000 rw-   160   1   0 ---D df
And those are thread stacks.

 
 
 The memory usage figures seem to have 'stabilized' now - SIZE/RES seem to 
 track around 9 times the size of the 6.4 system, for a comparative load.
 
 We can probably live with this (as they have come back down overnight as 
 load fell off) - i.e. they don't appear to be continuously growing (the 
 binaries were heavily profiled under 6.x and found to have no internal 
 leaks).
 
 -Karl


pgpM3vONjl8vE.pgp
Description: PGP signature


Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..

2012-10-31 Thread Konstantin Belousov
On Wed, Oct 31, 2012 at 02:44:05PM +, Karl Pielorz wrote:
 
 
 --On 31 October 2012 16:06 +0200 Konstantin Belousov kostik...@gmail.com 
 wrote:
 
  Since you neglected to provide the verbatim output of procstat, nothing
  conclusive can be said. Obviously, you can make an investigation on your
  own.
 
 Sorry - when I ran it this morning the output was several hundred lines - I 
 didn't want to post all of that to the list 99% of the lines are very 
 similar. I can email it you off-list if having the whole lot will help?
Just put is somewhere on http server and provide the URL.

 
  Then there's a bunch of 'large' blocks e.g..
 
   PID  STARTEND PRT  RES PRES REF SHD  FL TP
   PATH 20100x801c00x80280 rw- 28690   4   0
   df 20100x802800x80340 rw- 18800   1   0
 
  Most likely, these are malloc arenas.
 
 Ok, that's the heaviest usage.
So it is, most likely, either app doing more malloc allocations, or the
difference is due to more aggressive jemalloc behaviour. If this indeed
bothers you, you could try to use plug-in replacements for malloc.
They are usually LD_PRELOAD'ed, search the ports.  But I would not bother.

 
  Then lots of 'little' blocks,
 
  2010 0x70161000 0x70181000 rw-   160   1   0 ---D df
 
  And those are thread stacks.
 
 Ok, lots of those (lots of threads going on) - but they're all pretty small.
 
 My code only has a single call to malloc, which allocates around 20k per 
 thread.
You should look both at the virtual address space allocation for the block,
which is specified by (start, end), and at the actual resident page count.

 
 Obviously there's other libraries and stuff running with the code - so 
 would I be correct in guessing that they are more than likely for most of 
 these large blocks?
No idea.



pgpvR0EStB2PG.pgp
Description: PGP signature


Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..

2012-10-31 Thread Konstantin Belousov
On Wed, Oct 31, 2012 at 11:52:06AM -0700, Adrian Chadd wrote:
 On 31 October 2012 11:20, Ian Lepore free...@damnhippie.dyndns.org wrote:
  I think there are some things we should be investigating about the
  growth of memory usage.  I just noticed this:
 
  Freebsd 6.2 on an arm processor:
 
369 root 1   8  -88  1752K   748K nanslp   3:00  0.00% watchdogd
 
  Freebsd 10.0 on the same system:
 
367 root 1 -52   r0 10232K 10160K nanslp  10:04  0.00% watchdogd
 
  The 10.0 system is built with MALLOC_PRODUCTION (without that defined
  the system won't even boot, it only has 64MB of ram).  That's a crazy
  amount of growth for a relatively simple daemon.
 
 Would you please, _please_ do some digging into this?
 
 It's quite possible there's something in the libraries that are
 allocating some memory upon first call invocation - yes, that's
 jemalloc, but it could also be other things like stdio.
 
 We really, really need to fix this userland bloat; it's terribly
 ridiculous at this point. There's no reason a watchdog daemon should
 take 10megabytes of RAM.
Watchdogd was recently changed to mlock its memory. This is the cause
of the RSS increase.

If not wired, swapout might cause a delay of the next pat, leading to
panic.


pgp7hqYdVijEv.pgp
Description: PGP signature


Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..

2012-10-30 Thread Konstantin Belousov
On Tue, Oct 30, 2012 at 10:48:03AM -0700, Alfred Perlstein wrote:
 Some suggestions here, jemalloc, kernel threads are good ones.
 
 Another issue may just be some change for default thread stack size.  
 This would explain why the RESIDENT set is the same, but the VIRTUAL grew.
I suggest to take a look at where the actual memory goes.

Start with procstat -v.
 
 -Alfred
 
 On 10/30/12 9:56 AM, Karl Pielorz wrote:
 
 
  --On 30 October 2012 19:43 +0700 Erich Dollansky 
  erichfreebsdl...@ovitrap.com wrote:
 
  Depends how you mean 'the same' - on the 6.4 system it shows:
 
 cc (GCC) 3.4.6 [FreeBSD] 20060305
 
  And, on the 9.0-S it shows:
 
 cc (GCC) 4.2.1 20070831 patched [FreeBSD]
 
  So 'same' - but different versions.
 
  did you check the default data sizes?
 
  How do you mean?
 
  Now they've been running for an hour or so - they've gotten a little
  larger 552M/154M and 703M/75M.
 
  If it's not harmful I can live with it - it was just a bit of a
  surprise.
 
  And a reason to spend more money on memory. Knowing the real reason
  would be better.
 
  I can understand your surprise.
 
  Hehe, more 'concern' than surprise I guess now...
 
  The sendmail milter has grown to a SIZE/RES of 1045M / 454M under 9.0. 
  The original 6.4 machine under heaver load (more connections) shows a 
  SIZE/RES of 85M/52M.
 
  The TCP listener code is now showing a SIZE/REZ of 815M/80M under 9.0 
  with the original 6.4 box showing 44M/9.5M
 
  The 9.0 box says it has 185M active, 472M inactive, 693M wired, 543M 
  buf, and 4554M free.
 
  At this stage I'm just a bit concerned that at least the milter code 
  is going to grow, and grow - and die.
 
  I would think it would last over night so I'll see what the figures 
  are in the morning.
 
  Thanks for the replies...
 
  -Karl
  ___
  freebsd-hackers@freebsd.org mailing list
  http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
  To unsubscribe, send any mail to 
  freebsd-hackers-unsubscr...@freebsd.org
 
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


pgpGsStsnYyLq.pgp
Description: PGP signature


Re: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program

2012-10-26 Thread Konstantin Belousov
On Fri, Oct 26, 2012 at 11:12:53AM -0700, Simon J. Gerraty wrote:
 
 On Fri, 26 Oct 2012 11:41:46 -0600, Warner Losh writes:
 It's called a transition period for a reason.  The historical use has =
 permeated itself into many places, not all of which are obvious.
 
 It would seem that leaving FreeBSD make as make, for the transition
 period and installing bmake as bmake, would cause the least disruption
 to everyone.  This was the original proposal presented at BSDCan in 2011.
 
 FreeBSD make already grok's the :tl and :tu modifiers,
 so it is quite simple for the two to coexist for some period.
 
 The only reason we are talking about having to frob ports etc now, 
 is a new requirement introduced this year (by yourself I think)
 that bmake replace make in base rather than allow coexistence.
 
 If we are all happy to go back to the original plan, we can ease the
 concerns of the folk you speak of?
 
 The only downside is we wait a few more years for major build improvments.

Can system build, initiated by make, call bmake immediately ?
I suppose it could be fine even to error out if make is typed instead of
bmake for src/.


pgp1H8SYjf8vh.pgp
Description: PGP signature


  1   2   >