Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

2018-04-04 Thread Mark Millard
On 2018-Apr-4, at 10:16 AM, Andriy Gapon  wrote:

> On 01/04/2018 05:31, Mark Millard wrote:
>> I have a hypothesis for part of what top is
>> counting in the process/thread SWAP column
>> that might not be what one would expect.
>> 
>> It appears to me that vnode-backed pages are
>> being re-classified sometimes for inactive
>> processes, and this classification leads to
>> top classifying the pages as not-resident
>> but swapped (in that a "VN PAGER in" would
>> be required, in systat -vmstat terms).
> 
> Not sure.
> To me it seems that top just uses wrong statistics to calculate the value.
> 
> /* swap usage */
> #define ki_swap(kip) \
>   ((kip)->ki_swrss > (kip)->ki_rssize ? (kip)->ki_swrss - (kip)->ki_rssize : 0)
> 
> ki_rssize is the resident size of a process.
> ki_swrss is resident set size before last swap.
> 
> Their difference is... exactly what?
> I cannot even meaningfully describe this value.
> But it is certainly _not_ the current swap utilization by the process.
> 
> Here is my attempt at obtaining a more reasonable approximation of the
> process swap use.  But it is still wildly inaccurate.
> 
> . . .

If I get time this weekend, I'll try the patch. Thanks.

I've classically seen things like (picking on java here):
(no patch yet, so SWAP 0K shows)

  PID USERNAME   THR PRI NICE   SIZE    RES   SWAP STATE   C   TIME    CPU COMMAND
78694 root        44  52    0 14779M 92720K     0K uwait  22   0:06  9.91% [java]

when Swap: . . . 0 Used . . . (or some figure much
smaller than SIZE-RES) showed. (SIZE is ki_size and
RES is ki_rssize as I remember.) It suggests some
form of reserved-but-not-allocated contribution to
ki_size (SIZE), space not resident nor swapped out
to a swap partition. Possibly vnode-backed (potential
"VN PAGER in and out" contributions instead of "SWAP
PAGER" ones, in systat -vmstat terms)?

Are such cases examples of what you were counting
as "wildly inaccurate"? Or do you count vnode-backed
but not resident as perfectly good examples of SWAP
in use?

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Can't load linux64.ko module

2018-04-04 Thread Steve Kargl
On Wed, Apr 04, 2018 at 02:13:15PM -0700, Steve Kargl wrote:
> 
> OK, so where is elf64_linux_vdso_fixup supposed to come from?
> 

The answer is compat/linux/linux_vdso.c where we find

#if defined(__i386__) || (defined(__amd64__) && defined(COMPAT_LINUX32))
#define __ELF_WORD_SIZE 32
#else
#define __ELF_WORD_SIZE 64
#endif

Having COMPAT_LINUX32 in my kernel config file gives me
elf32_linux_vdso_fixup.  It seems that one cannot have
a kernel that supports both 32-bit and 64-bit Linux software.
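
The effect of that #if can be seen in a minimal sketch. It assumes, as the nm output in this thread suggests, that the fixup symbol's name is built from the word size by macro pasting. The SKETCH_* names and the two helper functions are made up for illustration and are not the kernel's macros.

```c
/* Same selection logic as the snippet quoted from linux_vdso.c. */
#if defined(__i386__) || (defined(__amd64__) && defined(COMPAT_LINUX32))
#define SKETCH_ELF_WORD_SIZE 32
#else
#define SKETCH_ELF_WORD_SIZE 64
#endif

/* Two-level stringification so the macro value is expanded first. */
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x) SKETCH_STR2(x)

/* Hypothetical: symbol name derived from the word size. */
#define SKETCH_VDSO_FIXUP_NAME \
	"elf" SKETCH_STR(SKETCH_ELF_WORD_SIZE) "_linux_vdso_fixup"

static int
vdso_elf_word_size(void)
{
	return (SKETCH_ELF_WORD_SIZE);
}

static const char *
vdso_fixup_symbol_name(void)
{
	return (SKETCH_VDSO_FIXUP_NAME);
}
```

Only one of the two names can come out of a given compilation, so a kernel built with COMPAT_LINUX32 ends up with the elf32 variant and nothing for linux64.ko to link against.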

linux(4) states

 for an amd64 kernel use:

   options COMPAT_LINUX32

 Alternatively, to load the ABI as a module at boot time, place the
 following line in loader.conf(5):

   linux_load="YES"

It turns out that I have linux_load="YES" in /etc/loader.conf.
Booting a kernel built with COMPAT_LINUX32 prevents
the kldload of linux64.ko.

Oh well, learn something new every day.

-- 
Steve


Re: Can't load linux64.ko module

2018-04-04 Thread Steve Kargl
On Wed, Apr 04, 2018 at 01:19:55PM -0700, Steve Kargl wrote:
> On Wed, Apr 04, 2018 at 12:09:02PM -0700, Steve Kargl wrote:
> > 
> > kernel config file contains
> > 
> > options COMPAT_LINUX32
> > options COMPAT_LINUXKPI
> > options LINPROCFS
> > 
> > When booting, the kernel tries to load the module.  A manual
> > loading of the module results in
> > 
> > % kldload /boot/kernel/linux64.ko
> > kldload: an error occurred while loading module /boot/kernel/linux64.ko.
> > Please check dmesg(8) for more details.
> > sleepdirt:fvwm:root[203] dmesg | tail -2
> > link_elf_obj: symbol elf64_linux_vdso_fixup undefined
> > linker_load_file: /boot/kernel/linux64.ko - unsupported file type
> > 
> > Now, that I look at /sys/amd64/conf/NOTES again, I find that
> > there is a COMPAT_LINUX as well as the COMPAT_LINUX32.  I must
> > have conflated those two options into being the same thing.
> > 

OK, so where is elf64_linux_vdso_fixup supposed to come from?

cd /boot/kernel
foreach i (*.ko)
  nm $i | grep linux_vdso_fixup
end

00018f40 t elf32_linux_vdso_fixup
00017cd0 t elf32_linux_vdso_fixup
 U elf64_linux_vdso_fixup

nm kernel | grep linux_vdso
80f3cb88 d __set_sysinit_set_sym_elf_linux_vdso_init_sys_init
80f3e140 d __set_sysuninit_set_sym_elf_linux_vdso_uninit_sys_uninit
80a3eae0 T elf32_linux_vdso_fixup
80a3ebe0 T elf32_linux_vdso_reloc
80a3e9e0 T elf32_linux_vdso_sym_init
81180e70 B elf32_linux_vdso_syms
80f32ae0 d elf_linux_vdso_init_sys_init
80f32af8 d elf_linux_vdso_uninit_sys_uninit
80a292d0 t linux_vdso_deinstall
80a29210 t linux_vdso_install

-- 
Steve


Re: Can't load linux64.ko module

2018-04-04 Thread Steve Kargl
On Wed, Apr 04, 2018 at 12:09:02PM -0700, Steve Kargl wrote:
> 
> kernel config file contains
> 
> options COMPAT_LINUX32
> options COMPAT_LINUXKPI
> options LINPROCFS
> 
> When booting, the kernel tries to load the module.  A manual
> loading of the module results in
> 
> % kldload /boot/kernel/linux64.ko
> kldload: an error occurred while loading module /boot/kernel/linux64.ko.
> Please check dmesg(8) for more details.
> sleepdirt:fvwm:root[203] dmesg | tail -2
> link_elf_obj: symbol elf64_linux_vdso_fixup undefined
> linker_load_file: /boot/kernel/linux64.ko - unsupported file type
> 
> Now, that I look at /sys/amd64/conf/NOTES again, I find that
> there is a COMPAT_LINUX as well as the COMPAT_LINUX32.  I must
> have conflated those two options into being the same thing.
> 

Hmmm, this is interesting.  /sys/amd64/conf/NOTES contains

Lines 270-271
# To enable Linuxulator support, one must also include COMPAT_LINUX in the
# config as well.  The other option is to load both as modules.

And then lines 636-637

# Enable Linux ABI emulation
#XXX#options COMPAT_LINUX

with no explanation of the #XXX notation.  So, building the
kernel with COMPAT_LINUX gives

===> SLEEPDIRT
mkdir -p /usr/obj/usr/src/amd64.amd64/sys
--
>>> stage 1: configuring the kernel
--
cd /usr/src/sys/amd64/conf;  
PATH=/usr/obj/usr/src/amd64.amd64/tmp/legacy/usr/sbin:/usr/obj/usr/src/amd64.amd64/tmp/legacy/usr/bin:/usr/obj/usr/src/amd64.amd64/tmp/legacy/bin:/usr/obj/usr/src/amd64.amd64/tmp/usr/sbin:/usr/obj/usr/src/amd64.amd64/tmp/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin
  config  -d /usr/obj/usr/src/amd64.amd64/sys/SLEEPDIRT  -I 
'/usr/src/sys/amd64/conf' '/usr/src/sys/amd64/conf/SLEEPDIRT'
/usr/src/sys/amd64/conf/SLEEPDIRT: unknown option "COMPAT_LINUX"
*** [buildkernel] Error code 1

make[1]: stopped in /usr/src
1 error

I guess XXX means Linux emulation isn't supported.  Bummer.


-- 
Steve


Re: Can't load linux64.ko module

2018-04-04 Thread Steve Kargl
On Wed, Apr 04, 2018 at 02:41:35PM -0400, Ed Maste wrote:
> On 3 April 2018 at 12:26, Steve Kargl wrote:
> > Booting a kernel from
> > %  uname -a
> > FreeBSD sleepdirt 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r331370: \
> >   Thu Mar 22 13:41:30 AKDT 2018 \
> >   kargl@sleepdirt:/usr/obj/usr/src/amd64.amd64/sys/SLEEPDIRT  amd64
> >
> > gives the following from dmesg
> >
> > % dmesg | grep linux
> > link_elf_obj: symbol elf64_linux_vdso_fixup undefined
> > linker_load_file: /boot/kernel/linux64.ko - unsupported file type
> 
> Are you loading the linuxulator bits from modules or trying to compile
> it into the kernel? Did your case work in the past but break recently?
> 
> As a point of reference, my laptop is at r331538+0a541b719b64 (my WIP
> branch), and loading linux.ko and linux64.ko is successful.

kernel and world are from r331370.

kernel config file contains

options COMPAT_LINUX32
options COMPAT_LINUXKPI
options LINPROCFS

When booting, the kernel tries to load the module.  A manual
loading of the module results in

% kldload /boot/kernel/linux64.ko
kldload: an error occurred while loading module /boot/kernel/linux64.ko.
Please check dmesg(8) for more details.
sleepdirt:fvwm:root[203] dmesg | tail -2
link_elf_obj: symbol elf64_linux_vdso_fixup undefined
linker_load_file: /boot/kernel/linux64.ko - unsupported file type

Now, that I look at /sys/amd64/conf/NOTES again, I find that
there is a COMPAT_LINUX as well as the COMPAT_LINUX32.  I must
have conflated those two options into being the same thing.

I have to update the system for the recent security announcement,
so I'll update everything and change my kernel config file.


-- 
Steve


Re: Can't load linux64.ko module

2018-04-04 Thread Ed Maste
On 3 April 2018 at 12:26, Steve Kargl  wrote:
> Booting a kernel from
> %  uname -a
> FreeBSD sleepdirt 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r331370: \
>   Thu Mar 22 13:41:30 AKDT 2018 \
>   kargl@sleepdirt:/usr/obj/usr/src/amd64.amd64/sys/SLEEPDIRT  amd64
>
> gives the following from dmesg
>
> % dmesg | grep linux
> link_elf_obj: symbol elf64_linux_vdso_fixup undefined
> linker_load_file: /boot/kernel/linux64.ko - unsupported file type

Are you loading the linuxulator bits from modules or trying to compile
it into the kernel? Did your case work in the past but break recently?

As a point of reference, my laptop is at r331538+0a541b719b64 (my WIP
branch), and loading linux.ko and linux64.ko is successful.


Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

2018-04-04 Thread Don Lewis
On  4 Apr, Mark Johnston wrote:
> On Tue, Apr 03, 2018 at 09:42:48PM -0700, Don Lewis wrote:
>> On  3 Apr, Don Lewis wrote:
>> > I reconfigured my Ryzen box to be more similar to my default package
>> > builder by disabling SMT and half of the RAM, to limit it to 8 cores
>> > and 32 GB and then started bisecting to try to track down the problem.
>> > For each test, I first filled ARC by tarring /usr/ports/distfiles to
>> > /dev/null.  The commit range that I was searching was r329844 to
>> > r331716.  I narrowed the range to r329844 to r329904.  With r329904
>> > and newer, ARC is totally unresponsive to memory pressure and the
>> > machine pages heavily.  I see ARC sizes of 28-29GB and 30GB of wired
>> > RAM, so there is not much leftover for getting useful work done.  Active
>> > memory and free memory both hover under 1GB each.  Looking at the
>> > commit logs over this range, the most likely culprit is:
>> > 
>> > r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines
>> > 
>> > Add a generic Proportional Integral Derivative (PID) controller algorithm
>> > and use it to regulate page daemon output.
>> > 
>> > This provides much smoother and more responsive page daemon output,
>> > anticipating demand and avoiding pageout stalls by increasing the number
>> > of pages to match the workload.  This is a reimplementation of work done
>> > by myself and mlaier at Isilon.
>> > 
>> > 
>> > It is quite possible that the recent fixes to the PID controller will
>> > fix the problem.  Not that r329844 was trouble free ... I left tar
>> > running over lunchtime to fill ARC and the OOM killer nuked top, tar,
>> > ntpd, both of my ssh sessions into the machine, and multiple instances
>> > of getty while I was away.  I was able to log in again and successfully
>> > run poudriere, and ARC did respond to the memory pressure and cranked
>> > itself down to about 5 GB by the end of the run.  I did not see the same
>> > problem with tar when I did the same with r329904.
>> 
>> I just tried r331966 and see no improvement.  No OOM process kills
>> during the tar run to fill ARC, but with ARC filled, the machine is
>> thrashing itself at the start of the poudriere run while trying to build
>> ports-mgmt/pkg (39 minutes so far).  ARC appears to be unresponsive to
>> memory demand.  I've seen no decrease in ARC size or wired memory since
>> starting poudriere.
> 
> Re-reading the ARC reclaim code, I see a couple of issues which might be
> at the root of the behaviour you're seeing.
> 
> 1. zfs_arc_free_target is too low now. It is initialized to the page
>    daemon wakeup threshold, which is slightly above v_free_min. With the
>    PID controller, the page daemon uses a setpoint of v_free_target.
>    Moreover, it now wakes up regularly rather than having wakeups be
>    synchronized by a mutex, so it will respond quickly if the free page
>    count dips below v_free_target. The free page count will dip below
>    zfs_arc_free_target only in the face of sudden and extreme memory
>    pressure now, so the FMR_LOTSFREE case probably isn't getting
>    exercised. Try initializing zfs_arc_free_target to v_free_target.

Changing zfs_arc_free_target definitely helps.  My previous poudriere
run failed when poudriere timed out the ports-mgmt/pkg build after two
hours.  After changing this setting, poudriere seems to be running
properly and ARC has dropped from 29GB to 26GB ten minutes into the run
and I'm not seeing processes in the swread state.
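
For anyone following along, the reason the old initial value never triggers can be sketched with a toy model. The helper names, the page size constant and the page counts in the note below are hypothetical; the only thing taken from the discussion is that the ARC starts shrinking once the free page count falls below zfs_arc_free_target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Signed headroom in bytes between the free page count and the ARC's
 * free target.  A negative value is what would make the ARC shrink.
 */
static int64_t
arc_free_headroom(int64_t free_pages, int64_t arc_free_target)
{
	return ((free_pages - arc_free_target) * SKETCH_PAGE_SIZE);
}

/* True when the free page count has dropped below the ARC's target. */
static bool
arc_would_shrink(int64_t free_pages, int64_t arc_free_target)
{
	return (arc_free_headroom(free_pages, arc_free_target) < 0);
}
```

If the page daemon holds the free count just under a v_free_target of, say, 40000 pages, a zfs_arc_free_target of 13000 pages is never crossed, while a target of 40000 is crossed almost immediately.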

> 2. In the inactive queue scan, we used to compute the shortage after
>    running uma_reclaim() and the lowmem handlers (which includes a
>    synchronous call to arc_lowmem()). Now it's computed before, so we're
>    not taking into account the pages that get freed by the ARC and UMA.
>    The following rather hacky patch may help. I note that the lowmem
>    logic is now somewhat broken when multiple NUMA domains are
>    configured, however, since it fires only when domain 0 has a free
>    page shortage.

I will try this next.
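
The ordering problem in point 2 can be shown with a toy calculation. The function names and page counts are invented for illustration; the real scan derives its shortage from more state than this.

```c
/* Pages still needed to reach the free target (never negative). */
static long
page_shortage(long free_target, long free_count)
{
	return (free_target > free_count ? free_target - free_count : 0);
}

/* New order: the shortage is fixed before the caches give anything back. */
static long
shortage_computed_before(long free_target, long free_count,
    long freed_by_caches)
{
	(void)freed_by_caches;	/* not yet known at this point */
	return (page_shortage(free_target, free_count));
}

/* Old order: the lowmem handlers and uma_reclaim() run first. */
static long
shortage_computed_after(long free_target, long free_count,
    long freed_by_caches)
{
	return (page_shortage(free_target, free_count + freed_by_caches));
}
```

With a target of 1000 pages, 400 free and 500 returned by the caches, the first variant still asks the scan for 600 pages while the second asks for only 100.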

> Index: sys/vm/vm_pageout.c
> ===
> --- sys/vm/vm_pageout.c   (revision 331933)
> +++ sys/vm/vm_pageout.c   (working copy)
> @@ -1114,25 +1114,6 @@
>   boolean_t queue_locked;
>  
>   /*
> -  * If we need to reclaim memory ask kernel caches to return
> -  * some.  We rate limit to avoid thrashing.
> -  */
> - if (vmd == VM_DOMAIN(0) && pass > 0 &&
> - (time_uptime - lowmem_uptime) >= lowmem_period) {
> - /*
> -  * Decrease registered cache sizes.
> -  */
> - SDT_PROBE0(vm, , , vm__lowmem_scan);
> - EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> - /*
> -  * We do this explicitly after the caches have been
> -  * drained above.
> -  */
> - uma_reclaim();
> - lowmem_uptime = time_uptime;
> 

Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

2018-04-04 Thread Mark Johnston
On Tue, Apr 03, 2018 at 09:42:48PM -0700, Don Lewis wrote:
> On  3 Apr, Don Lewis wrote:
> > I reconfigured my Ryzen box to be more similar to my default package
> > builder by disabling SMT and half of the RAM, to limit it to 8 cores
> > and 32 GB and then started bisecting to try to track down the problem.
> > For each test, I first filled ARC by tarring /usr/ports/distfiles to
> > /dev/null.  The commit range that I was searching was r329844 to
> > r331716.  I narrowed the range to r329844 to r329904.  With r329904
> > and newer, ARC is totally unresponsive to memory pressure and the
> > machine pages heavily.  I see ARC sizes of 28-29GB and 30GB of wired
> > RAM, so there is not much leftover for getting useful work done.  Active
> > memory and free memory both hover under 1GB each.  Looking at the
> > commit logs over this range, the most likely culprit is:
> > 
> > r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines
> > 
> > Add a generic Proportional Integral Derivative (PID) controller algorithm
> > and use it to regulate page daemon output.
> > 
> > This provides much smoother and more responsive page daemon output,
> > anticipating demand and avoiding pageout stalls by increasing the number
> > of pages to match the workload.  This is a reimplementation of work done
> > by myself and mlaier at Isilon.
> > 
> > 
> > It is quite possible that the recent fixes to the PID controller will
> > fix the problem.  Not that r329844 was trouble free ... I left tar
> > running over lunchtime to fill ARC and the OOM killer nuked top, tar,
> > ntpd, both of my ssh sessions into the machine, and multiple instances
> > of getty while I was away.  I was able to log in again and successfully
> > run poudriere, and ARC did respond to the memory pressure and cranked
> > itself down to about 5 GB by the end of the run.  I did not see the same
> > problem with tar when I did the same with r329904.
> 
> I just tried r331966 and see no improvement.  No OOM process kills
> during the tar run to fill ARC, but with ARC filled, the machine is
> thrashing itself at the start of the poudriere run while trying to build
> ports-mgmt/pkg (39 minutes so far).  ARC appears to be unresponsive to
> memory demand.  I've seen no decrease in ARC size or wired memory since
> starting poudriere.

Re-reading the ARC reclaim code, I see a couple of issues which might be
at the root of the behaviour you're seeing.

1. zfs_arc_free_target is too low now. It is initialized to the page
   daemon wakeup threshold, which is slightly above v_free_min. With the
   PID controller, the page daemon uses a setpoint of v_free_target.
   Moreover, it now wakes up regularly rather than having wakeups be
   synchronized by a mutex, so it will respond quickly if the free page
   count dips below v_free_target. The free page count will dip below
   zfs_arc_free_target only in the face of sudden and extreme memory
   pressure now, so the FMR_LOTSFREE case probably isn't getting
   exercised. Try initializing zfs_arc_free_target to v_free_target.

2. In the inactive queue scan, we used to compute the shortage after
   running uma_reclaim() and the lowmem handlers (which includes a
   synchronous call to arc_lowmem()). Now it's computed before, so we're
   not taking into account the pages that get freed by the ARC and UMA.
   The following rather hacky patch may help. I note that the lowmem
   logic is now somewhat broken when multiple NUMA domains are
   configured, however, since it fires only when domain 0 has a free
   page shortage.

Index: sys/vm/vm_pageout.c
===
--- sys/vm/vm_pageout.c (revision 331933)
+++ sys/vm/vm_pageout.c (working copy)
@@ -1114,25 +1114,6 @@
 	boolean_t queue_locked;
 
 	/*
-	 * If we need to reclaim memory ask kernel caches to return
-	 * some.  We rate limit to avoid thrashing.
-	 */
-	if (vmd == VM_DOMAIN(0) && pass > 0 &&
-	    (time_uptime - lowmem_uptime) >= lowmem_period) {
-		/*
-		 * Decrease registered cache sizes.
-		 */
-		SDT_PROBE0(vm, , , vm__lowmem_scan);
-		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
-		/*
-		 * We do this explicitly after the caches have been
-		 * drained above.
-		 */
-		uma_reclaim();
-		lowmem_uptime = time_uptime;
-	}
-
-	/*
 	 * The addl_page_shortage is the number of temporarily
 	 * stuck pages in the inactive queue.  In other words, the
 	 * number of pages from the inactive count that should be
@@ -1824,6 +1805,26 @@
atomic_store_int(&vmd->vmd_pageout_wanted, 1);
 
/*
+* If we need to reclaim memory ask kernel caches to return
+* some.  We rate limit to avoid thrashing.
+*/
+   if 

Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

2018-04-04 Thread Andriy Gapon
On 01/04/2018 05:31, Mark Millard wrote:
> I have a hypothesis for part of what top is
> counting in the process/thread SWAP column
> that might not be what one would expect.
> 
> It appears to me that vnode-backed pages are
> being re-classified sometimes for inactive
> processes, and this classification leads to
> top classifying the pages as not-resident
> but swapped (in that a "VN PAGER in" would
> be required, in systat -vmstat terms).

Not sure.
To me it seems that top just uses wrong statistics to calculate the value.

 /* swap usage */
 #define ki_swap(kip) \
   ((kip)->ki_swrss > (kip)->ki_rssize ? (kip)->ki_swrss - (kip)->ki_rssize : 0)

ki_rssize is the resident size of a process.
ki_swrss is resident set size before last swap.

Their difference is... exactly what?
I cannot even meaningfully describe this value.
But it is certainly _not_ the current swap utilization by the process.

Here is my attempt at obtaining a more reasonable approximation of the process
swap use.  But it is still wildly inaccurate.

diff --git a/usr.bin/top/machine.c b/usr.bin/top/machine.c
index 2d97d7f867f36..361a1542e6e16 100644
--- a/usr.bin/top/machine.c
+++ b/usr.bin/top/machine.c
@@ -233,12 +233,13 @@ static int carc_enabled;
 static int pageshift;  /* log base 2 of the pagesize */

 /* define pagetok in terms of pageshift */
-
-#define pagetok(size) ((size) << pageshift)
+#define pagetok(size) ((size) << (pageshift - LOG1024))
+#define btopage(size) ((size) >> pageshift)

 /* swap usage */
 #define ki_swap(kip) \
-((kip)->ki_swrss > (kip)->ki_rssize ? (kip)->ki_swrss - (kip)->ki_rssize : 0)
+(btopage((kip)->ki_size) > (kip)->ki_rssize ? \
+btopage((kip)->ki_size) - (kip)->ki_rssize : 0)

 /* useful externals */
 long percentages(int cnt, int *out, long *new, long *old, long *diffs);
@@ -384,9 +385,6 @@ machine_init(struct statics *statics, char do_unames)
pagesize >>= 1;
}

-   /* we only need the amount of log(2)1024 for our conversion */
-   pageshift -= LOG1024;
-
/* fill in the statics information */
statics->procstate_names = procstatenames;
statics->cpustate_names = cpustatenames;
@@ -1374,7 +1372,7 @@ static int sorted_state[] = {
 } while (0)

 #define ORDERKEY_SWAP(a, b) do { \
-   int diff = (int)ki_swap(b) - (int)ki_swap(a); \
+   int diff = (long)ki_swap(b) - (long)ki_swap(a); \
if (diff != 0) \
return (diff > 0 ? 1 : -1); \
 } while (0)
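
To make the difference between the old and new ki_swap definitions concrete, here is a small stand-alone sketch. struct kinfo_sketch is a hypothetical stand-in for the three struct kinfo_proc fields involved, with types chosen for the example, and the two functions simply restate the macros from the patch. It assumes 4 KB pages and that ki_swrss is 0 for a process that was never swapped out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the struct kinfo_proc fields used here. */
struct kinfo_sketch {
	uint64_t ki_size;	/* virtual size, in bytes */
	int64_t ki_rssize;	/* resident set size, in pages */
	int64_t ki_swrss;	/* resident set size before last swap, pages */
};

/* Old definition: pages by which ki_swrss exceeds the current RSS. */
static int64_t
ki_swap_old(const struct kinfo_sketch *kip)
{
	return (kip->ki_swrss > kip->ki_rssize ?
	    kip->ki_swrss - kip->ki_rssize : 0);
}

/* Patched definition: virtual size in pages minus the current RSS. */
static int64_t
ki_swap_new(const struct kinfo_sketch *kip, int pageshift)
{
	int64_t size_pages = (int64_t)(kip->ki_size >> pageshift);

	return (size_pages > kip->ki_rssize ?
	    size_pages - kip->ki_rssize : 0);
}
```

For a process with SIZE 14779M and RES 92720K, the old form yields 0 pages under these assumptions, while the new form yields every non-resident page of the address space, including pages that were never touched or are vnode-backed. That over-count is why the approximation is still far from the real swap use.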

-- 
Andriy Gapon


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-04-04 Thread Andriy Gapon
On 04/04/2018 16:19, Stefan Esser wrote:
> I have identified the cause of the extremely low I/O performance (2 to 6 read
> operations scheduled per second).
> 
> The default value of kern.sched.preempt_thresh=0 does not give any CPU to the
> I/O bound process unless a (long) time slice expires (kern.sched.quantum=94488
> on my system with HZ=1000) or one of the CPU bound processes voluntarily gives
> up the CPU (or exits).
> 
> Any non-zero value of preempt_thresh lets the system perform I/O in parallel
> with the CPU bound processes, again.

Let me guess... you have a custom kernel configuration and, unlike GENERIC
(assuming x86), it does not have 'options PREEMPTION'?
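
The question matters because, as far as I recall, sched_ule.c picks the compiled-in default for preempt_thresh from the PREEMPTION and FULL_PREEMPTION kernel options. A rough sketch of that selection follows. The SKETCH_* priority values are assumptions from memory of sys/priority.h and should be checked against the tree, and default_preempt_thresh() is a made-up helper.

```c
/* Assumed priority bounds; verify against sys/priority.h. */
#define SKETCH_PRI_MIN_KERN 80
#define SKETCH_PRI_MAX_IDLE 255

/* Default threshold depends on which kernel options were compiled in. */
#ifdef PREEMPTION
#ifdef FULL_PREEMPTION
#define SKETCH_DEFAULT_PREEMPT_THRESH SKETCH_PRI_MAX_IDLE
#else
#define SKETCH_DEFAULT_PREEMPT_THRESH SKETCH_PRI_MIN_KERN
#endif
#else
#define SKETCH_DEFAULT_PREEMPT_THRESH 0
#endif

static int
default_preempt_thresh(void)
{
	return (SKETCH_DEFAULT_PREEMPT_THRESH);
}
```

Built without -DPREEMPTION, the sketch yields 0, which would match the value reported for the custom kernel.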

-- 
Andriy Gapon


Re: Call for Testing: UEFI Changes

2018-04-04 Thread Kyle Evans
On Wed, Mar 21, 2018 at 7:45 PM, Kyle Evans  wrote:
> Hello!
>
> A number of changes have gone in recently pertaining to UEFI booting
> and UEFI runtime services. The changes with the most damaging
> potential are:
>
> We now put UEFI runtime services into virtual address mode, fixing
> runtime services with U-Boot/UEFI as well as the firmware
> implementation in many Lenovos. The previously observed behavior was a
> kernel panic upon invocation of efibootmgr/efivar, or a kernel panic
> just loading efirt.ko or compiling EFIRT into the kernel.
>
> Graphics mode selection is now done differently to avoid regression
> caused by r327058 while still achieving the same effect. The observed
> regression was that the kernel would usually end up drawing
> incorrectly at the old resolution on a subset of the screen, due to
> incorrect framebuffer information.
>
> Explicit testing of these changes, the latest of which happened in
> r331326, and any feedback from this testing would be greatly
> appreciated. Testing should be done with either `options EFIRT` in
> your kernel config or efirt.ko loaded along with updated bootloader
> bits.
>
> I otherwise plan to MFC commits involved with the above-mentioned
> changes by sometime in the first week of April, likely no earlier than
> two (2) weeks from now on April 4th.
>
> Thanks,
>
> Kyle Evans

As partially promised, the non-graphics related changes have been
MFC'd to stable/11 today as r332028.

The graphics related changes are going to simmer longer and probably
get ripped out, because we're bad at this.

Thanks,

Kyle Evans


[Bug 227259] accept()/poll() and shutdown()/close() - not work as in FreeBSD10, may broke many apps

2018-04-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227259

Conrad Meyer  changed:

   What|Removed |Added

 CC|freebsd-current@FreeBSD.org |c...@freebsd.org

-- 
You are receiving this mail because:
You are on the CC list for the bug.


[Bug 227259] accept()/poll() and shutdown()/close() - not work as in FreeBSD10, may broke many apps

2018-04-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227259

--- Comment #3 from rozhuk...@gmail.com ---
Why does close() not wake a thread that sleeps on accept()?



[Bug 227259] accept()/poll() and shutdown()/close() - not work as in FreeBSD10, may broke many apps

2018-04-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227259

--- Comment #2 from rozhuk...@gmail.com ---
I do not understand why shutdown() does not generate POLLHUP/EV_EOF (or EV_ERROR
when adding a shut-down socket) for poll() and kqueue().



[Bug 227259] accept() does not wakeup on shutdown()/close()

2018-04-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227259

rozhuk...@gmail.com changed:

   What|Removed |Added

 CC||freebsd-current@FreeBSD.org



[Bug 227259] accept()/poll() and shutdown()/close() - not work as in FreeBSD10, may broke many apps

2018-04-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227259

rozhuk...@gmail.com changed:

           What    |Removed                     |Added

            Summary|accept() does not wakeup on |accept()/poll() and
                   |shutdown()/close()          |shutdown()/close() - not
                   |                            |work as in FreeBSD10, may
                   |                            |broke many apps



Is kern.sched.preempt_thresh=0 a sensible default? (was: Re: Extremely low disk throughput under high compute load)

2018-04-04 Thread Stefan Esser
Am 02.04.18 um 00:18 schrieb Stefan Esser:
> Am 01.04.18 um 18:33 schrieb Warner Losh:
>> On Sun, Apr 1, 2018 at 9:18 AM, Stefan Esser wrote:
>>
>> My i7-2600K based system with 24 GB RAM was in the midst of a buildworld
>> -j8 (starting from a clean state) which caused a load average of 12 for
>> more than 1 hour, when I decided to move a directory structure holding
>> some 10 GB to its own ZFS file system. File sizes varied, but were mostly
>> in the range of 500KB.
>>
>> I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus
>> there was nearly no disk activity caused by the buildworld.
>>
>> The copying proceeded at a rate of at most 10 MB/s, but most of the time
>> less than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and
>> thus a much better priority than the compute bound compiler processes, but
>> it got just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was
>> scheduled at such a low rate, that it only managed to issue a few
>> controller writes per second.
>>
>> The system is healthy and does not show any problems or anomalies under
>> normal use (e.g., file copies are fast, without the high compute load).
>>
>> This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
>>
>> Is this a regression in -CURRENT?
>>
>> Does 'sync' push a lot of I/O to the disk?
> 
> Each sync takes 0.7 to 1.5 seconds to complete, but since reading is so
> slow, not much is written.
> 
> Normal gstat output for the 3 drives the RAIDZ1 consists of:
> 
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      2      2     84   39.1      0      0    0.0     7.8  ada0
>     0      4      4     92   66.6      0      0    0.0    26.6  ada1
>     0      6      6    259   66.9      0      0    0.0    36.2  ada3
> dT: 1.058s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      1      1     60   70.6      0      0    0.0     6.7  ada0
>     0      3      3     68   71.3      0      0    0.0    20.2  ada1
>     0      6      6    242   65.5      0      0    0.0    28.8  ada3
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      5      5    192   44.8      0      0    0.0    22.4  ada0
>     0      6      6    160   61.9      0      0    0.0    26.5  ada1
>     0      6      6    172   43.7      0      0    0.0    26.2  ada3
> 
> This includes the copy process and the reads caused by "make -j 8 world"
> (but I assume that all the source files are already cached in ARC).

I have identified the cause of the extremely low I/O performance (2 to 6 read
operations scheduled per second).

The default value of kern.sched.preempt_thresh=0 does not give any CPU to the
I/O bound process unless a (long) time slice expires (kern.sched.quantum=94488
on my system with HZ=1000) or one of the CPU bound processes voluntarily gives
up the CPU (or exits).

Any non-zero value of preempt_thresh lets the system perform I/O in parallel
with the CPU bound processes, again.
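
The behaviour described above is consistent with a preemption check of roughly the following shape. This is a simplified sketch, not the scheduler's code: should_preempt() is a hypothetical helper, it ignores idle threads and remote CPUs, and it only assumes the usual convention that a numerically lower priority value is more urgent.

```c
#include <stdbool.h>

static bool
should_preempt(int new_pri, int cur_pri, int preempt_thresh)
{
	/* Never preempt for a thread that is not more urgent. */
	if (new_pri >= cur_pri)
		return (false);
	/* A threshold of 0 disables this kind of preemption entirely. */
	if (preempt_thresh == 0)
		return (false);
	/* Otherwise only sufficiently urgent threads may preempt. */
	return (new_pri <= preempt_thresh);
}
```

With preempt_thresh=0 the newly runnable I/O bound thread has to wait for the running thread's time slice to end or for it to yield, however urgent it is.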

I'm not sure about the bias relative to the PRI values displayed by top, but
for me a process with PRI above 72 (in top) should be eligible for preemption.

What value of preempt_thresh should I use to get that behavior?


And, more important: Is preempt_thresh=0 a reasonable default???

This prevents I/O bound processes from making reasonable progress if all CPU
cores/threads are busy. In my case, performance dropped from > 10 MB/s to just
a few hundred KB per second, i.e. by a factor of 30. (The %busy values in my
previous mail are misleading: At 10 MB/s the disk was about 70% busy ...)


Should preempt_thresh be set to some (possibly high, to only preempt long
running processes) value?

Regards, STefan