RAID5->RAID6 reshape remains stuck at 0% (does nothing, not even start)

2020-09-29 Thread David Madore
Dear list,

I'm trying to reshape a 3-disk RAID5 array to a 4-disk RAID6 array (of
the same total size and per-device size) using linux kernel 4.9.237 on
x86_64.  I understand that this reshaping operation is supposed to be
supported.  But it appears perpetually stuck at 0% with no operation
taking place whatsoever (the slices are unchanged apart from their
metadata, the backup file contains only zeroes, and nothing happens).
I wonder if this is a known kernel bug, or what else could explain it,
and I have no idea how to debug this sort of thing.

Here are some details on exactly what I've been doing.  I'll be using
loopbacks to illustrate, but I've done this on real partitions and
there was no difference.

## Create some empty loop devices:
for i in 0 1 2 3 ; do dd if=/dev/zero of=test-${i} bs=1024k count=16 ; done
for i in 0 1 2 3 ; do losetup /dev/loop${i} test-${i} ; done
## Make a RAID array out of the first three:
mdadm --create /dev/md/test --level=raid5 --chunk=256 --name=test \
  --metadata=1.0 --raid-devices=3 /dev/loop{0,1,2}
## Populate it with some content, just to see what's going on:
for i in $(seq 0 63) ; do printf "This is chunk %d (0x%x).\n" $i $i \
  | dd of=/dev/md/test bs=256k seek=$i ; done
## Now try to reshape the array from 3-way RAID5 to 4-way RAID6:
mdadm --manage /dev/md/test --add-spare /dev/loop3
mdadm --grow /dev/md/test --level=6 --raid-devices=4 \
  --backup-file=test-reshape.backup

...and then nothing happens.  /proc/mdstat reports no progress
whatsoever:

md112 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      32256 blocks super 1.0 level 6, 256k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/16128) finish=1.0min speed=244K/sec

The loop file contents are unchanged except for the metadata
superblock, the backup file is entirely empty, and no activity
whatsoever is happening.

Actually, further investigation shows that the array is in fact
operational as a RAID6 array, but one where the Q-syndrome is stuck in
the last device: writing data to the md device (e.g., by repopulating
it with the same command as above) does cause loop3 to be updated as
expected for such a layout.  It's just the reshaping which doesn't
take place (or indeed begin).

For completeness, here's what mdadm --detail /dev/md/test looks like
before the reshape, in my example:

/dev/md/test:
        Version : 1.0
  Creation Time : Wed Sep 30 02:42:30 2020
     Raid Level : raid5
     Array Size : 32256 (31.50 MiB 33.03 MB)
  Used Dev Size : 16128 (15.75 MiB 16.52 MB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Sep 30 02:44:21 2020
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 256K

           Name : vega.stars:test  (local to host vega.stars)
           UUID : 30f40e34:b9a52ff0:75c8b063:77234832
         Events : 20

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       3       7        2        2      active sync   /dev/loop2

       4       7        3        -      spare   /dev/loop3

- and here's what it looks like after the attempted reshape has
started (or rather, refused to start):

/dev/md/test:
        Version : 1.0
  Creation Time : Wed Sep 30 02:42:30 2020
     Raid Level : raid6
     Array Size : 32256 (31.50 MiB 33.03 MB)
  Used Dev Size : 16128 (15.75 MiB 16.52 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Sep 30 02:44:54 2020
          State : clean, degraded, reshaping
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric-6
     Chunk Size : 256K

 Reshape Status : 0% complete
     New Layout : left-symmetric

           Name : vega.stars:test  (local to host vega.stars)
           UUID : 30f40e34:b9a52ff0:75c8b063:77234832
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       3       7        2        2      active sync   /dev/loop2
       4       7        3        3      spare rebuilding   /dev/loop3

I also tried writing "frozen" and then "resync" to the
/sys/block/md112/md/sync_action file with no further results.
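
For anyone wanting to look at the same state, here is a minimal sketch
of how the md sysfs attributes could be dumped in one go.  The function
name md_sync_state is hypothetical; the attribute names are the
standard md ones, and md112 is simply the device number from the
example above.  A sync_max of 0 (rather than "max") is one thing that
might be worth ruling out, though nothing above confirms that this is
the cause.

```shell
# Dump the md sync/reshape state files found in an array's sysfs directory.
# Usage: md_sync_state /sys/block/md112/md
md_sync_state() {
  dir=$1
  for f in sync_action sync_completed sync_min sync_max reshape_position; do
    if [ -r "$dir/$f" ]; then
      printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
    else
      # Attribute missing or unreadable: say so rather than fail.
      printf '%s: (not readable)\n' "$f"
    fi
  done
}
```

It would be called as: md_sync_state /sys/block/md112/md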

I welcome any suggestions on how to investigate, work around, or fix
this problem.

Happy hacking,

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


Re: iTCO_wdt watchdog on Asus P10S-WS motherboard FREEZES MOTHERBOARD COMPLETELY

2016-09-21 Thread David Madore
On Tue, Sep 20, 2016 at 03:50:09PM +0300, Mika Westerberg wrote:
> Does the machine have WDAT ACPI table (see /sys/firmware/acpi/tables/*)?
> If it does, you can try the new WDAT watchdog driver instead [1]. It
> still uses the same hardware, though but via set of instructions
> provided by the BIOS that should work (given the vendor has tested
> it on Windows).
> 
> [1] http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1230607.html

Thanks for pointing this out.  My motherboard's BIOS does not have
this ACPI table, unfortunately, but it's at least good to know that
some do, and take the hardware watchdog seriously.
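
For others who want to check their own machine, a minimal sketch of
the test suggested above: it only looks for a file named WDAT in the
ACPI tables directory.  The function name has_wdat and its optional
directory argument are hypothetical, added for illustration.

```shell
# Report whether the firmware exposes a WDAT ACPI table.
# The optional argument overrides the tables directory (useful for testing).
has_wdat() {
  [ -e "${1:-/sys/firmware/acpi/tables}/WDAT" ]
}

if has_wdat; then
  echo "WDAT table present"
else
  echo "no WDAT table"
fi
```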

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


motherboards with recent Intel chipsets: please test this (iTCO-wdt)

2016-09-10 Thread David Madore
TL;DR: On some motherboards with an Intel chipset, at least from Asus
and Asrock, the hardware watchdog (linux driver iTCO-wdt) fails to
reboot the system correctly (POST fails and leaves system unusable).
Looking for people willing to test, in order to pinpoint the problem.


Background:

I am looking for users of a desktop with a fairly recent Intel
chipset, especially if one or several of the following conditions are
satisfied: (1) the BIOS is written by AMI (American Megatrends), (2) the
chipset is of the Intel 100 series or C230 series (a.k.a. "Sunrise
Point", used for "Skylake" processors with an LGA1151 socket), and
(3) the system is booting under UEFI (as opposed to legacy BIOS).

The point of this test is to check whether the hardware watchdog
included in these chipsets (known in Intel parlance as the "TCO
watchdog", where "TCO" stands for "Total Cost of Ownership") reboots
the system properly or, as on my motherboard, places it in a broken
state (POST fails, even when the reset button is later pressed, or
even if the power button is pressed twice; the power supply needs to
be disconnected for a few minutes to restore the system to a working
state).  This is a very serious bug, which could be due to the BIOS,
the hardware, or Linux (I suspect the BIOS, but it is conceivable that
Linux could work around it).

Do not perform this test unless you can disconnect the power supply!


How to test:

Boot a recent Linux kernel.  Load the i2c-i801 and i2c-smbus modules.
Then load the iTCO-wdt module.  This should cause lines such as the
following to appear in the kernel log (dmesg), indicating that Linux
has detected the device:

iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400)
iTCO_wdt: initialized. heartbeat=120 sec (nowayout=0)

Make sure all your filesystems are unmounted or mounted read-only (on
systemd, e.g.: systemctl isolate emergency.target ; sync ; echo u >>
/proc/sysrq-trigger ; sync (and make sure "Emergency Remount complete"
appears at the end of dmesg)).  A /dev/watchdog device should have
appeared.  Then run

cat >> /dev/watchdog

and press enter twice.  Do not interrupt (do not press control-C or
control-D), just wait for a few minutes.  After a certain time (twice
the "heartbeat" value indicated by the kernel), the system will try to
reboot.  What interests me is whether the reboot succeeds (POST
proceeds as normal, and OS restarts) or whether the system locks up
(in which case you will need to power cycle it at the power supply
unit level in order to restore it to normal).

Please report (to me, to avoid spamming this list - I will post a
summary) results along with information as to the hardware used:
motherboard brand and model, BIOS vendor and date (dmidecode should
give this information), UEFI or legacy boot, and any extension cards
that might be used on the system (in particular, whether the system
uses an integrated GPU or a separate graphics card).  I am interested
in both positive and negative results.
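
As a convenience, the hardware details requested above could be
gathered along the following lines.  This is only a sketch: the
function names boot_mode and watchdog_report are hypothetical, the
UEFI check relies on /sys/firmware/efi existing only when the kernel
was booted through UEFI, and dmidecode normally has to be run as
root.

```shell
# Print "UEFI" or "legacy" depending on whether the kernel exposes EFI
# runtime data; the optional argument overrides the path (for testing).
boot_mode() {
  if [ -d "${1:-/sys/firmware/efi}" ]; then echo UEFI; else echo legacy; fi
}

# Print boot mode, motherboard brand/model and BIOS vendor/version/date.
watchdog_report() {
  echo "Boot mode: $(boot_mode)"
  for key in baseboard-manufacturer baseboard-product-name \
             bios-vendor bios-version bios-release-date; do
    printf '%s: %s\n' "$key" "$(dmidecode -s "$key")"
  done
}
```

Running watchdog_report as root before the test gives output that can
be pasted into the report; extension cards still have to be listed by
hand.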

Thanks in advance to all who are willing to test this!


Xref:

https://lkml.org/lkml/2016/9/8/641

https://www.reddit.com/r/linuxquestions/comments/51xad5/users_of_a_desktop_with_an_intel_chipset_could/


-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


iTCO_wdt watchdog on Asus P10S-WS motherboard FREEZES MOTHERBOARD COMPLETELY

2016-09-08 Thread David Madore
TL;DR: the iTCO_wdt watchdog on the Asus P10S-WS motherboard, instead
of rebooting the machine, places the motherboard in a completely
nonfunctional state, from which it can be revived only by a hard power
cycle.  I suspect this is a BIOS bug: seeking advice on how/where to
report this, and what to do generally.  Maybe Linux can work around it?


Dear list,

I have an Asus P10S-WS motherboard (Intel C236 chipset).  I have been
trying to get the iTCO_wdt hardware watchdog to work (I have been
successfully using this driver with similar Intel chipset based Asus
motherboards before, and I know it to work reliably).  I am using
Linux 4.7.3.

I trigger a reboot by killing (with kill -9) the wd_keepalive daemon
once it has opened the watchdog device.

Sadly, it appears that on this motherboard, the watchdog does not
reboot the machine (or at least, does not successfully reboot it).
Instead, the machine enters a "frozen" state (fans spinning, screen
black, all peripherals unresponsive) from which it cannot be woken up
by pressing the reset button, or even the power button twice (the
first press does turn the machine off, but it returns to the same
nonfunctional state after power on).  Instead, power has to be cut
completely, at the power supply level.

In this nonfunctional state, the Asus POST status display shows the
number "62", which according to the motherboard manual is the code for
"installation of the PCH runtime services" (I have no idea what that
means).

I suspect that this is a BIOS ^W UEFI bug and in no way Linux's fault.
It could also be a hardware problem, a chipset bug, or something else.
And even if it is a firmware bug, it is conceivable that there is a
way to work around the problem from Linux.  So I ask for guidance from
the wisdom of this list:

* Is there something Linux can do about the problem?

* Is there a chance some kernel developer knows someone at Asus and
  can bring this problem to their attention?

* Can someone report success using the iTCO_wdt watchdog with other
  motherboards having the same Intel C236 chipset?  (Note: for it to
  work, the i2c_smbus module needs to be loaded: it took me a long
  time to figure out.)

* Is all hope lost for my motherboard?  (I badly need a hardware
  watchdog: if there is no way to get it to work on this motherboard,
  I will need to buy a new one.)

Any suggestions are welcome (or even words of comfort :-).

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


Many unexplainable OOMs after upgrading to 4.7.x kernel

2016-08-25 Thread David Madore
TL;DR: Why is Firefox getting OOM-killed while I have 24GB free swap?

Dear list,

A few days ago I upgraded the kernel on my desktop PC from 4.5.5 to
4.7.2 and, since then, I've witnessed a huge number of cases where
various processes (typically Firefox) got OOM-killed by the kernel.
Before this kernel upgrade, I had never seen a single OOM event in
normal use; now I've had dozens in a couple of days.  Nothing has
changed in my config apart from the kernel.  Clearly, something has
changed for the worse!

In fact, this morning, the problem was so bad that I was simply unable
to start Firefox (any attempt to do so would result in it getting
killed immediately), even though I had about 24GB of free swap
available (and the system was idle).  I had to kill a few unrelated
processes to be able to launch Firefox; and even then, at some point,
running "sync" in a different window caused Firefox to be
OOM-killed.

(Note: the problem IS NOT the choice of the process being killed:
Firefox is a reasonable target.  The problem is that the OOM-killer is
being invoked at all, whereas there is plenty of free swap, and,
before this kernel upgrade, all seemed to work perfectly.)

Unfortunately, all of this is very hard to reproduce: the problem now
seems to have disappeared completely, so I can't really test anything
any more.  All I can offer is some sample output (below) from the
kernel's logs.  Any suggestions as to what I should do if and when the
problem returns are welcome, either to debug or to work around these
OOMs.

The computer in question is an x86_64 (Intel Core 2 Quad Q6600) box
with 8GB RAM and 24GB swap (4*6GB swap across four different disks): I
can of course offer any additional details as to its hardware, kernel
or userland config.
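
Since the failing allocation in the log below is an order=2 request,
one thing that could be captured if the problem returns is how many
free blocks of order 2 and above each zone still has, as reported by
/proc/buddyinfo.  This is a sketch only: the function name
high_order_free is hypothetical, and whether fragmentation is actually
involved here is an assumption, not something the log establishes.

```shell
# Sum the free-block counts of order >= MIN (default 2) for each zone.
# Usage: high_order_free [FILE] [MIN]   (FILE defaults to /proc/buddyinfo)
high_order_free() {
  awk -v min="${2:-2}" '{
    n = 0
    # Fields 5, 6, 7, ... hold the free-block counts for orders 0, 1, 2, ...
    for (i = 5 + min; i <= NF; i++) n += $i
    printf "%s %s %s order>=%d free blocks: %d\n", $1, $2, $4, min, n
  }' "${1:-/proc/buddyinfo}"
}
```

Called without arguments, it prints one line per memory zone.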

Sample log output from OOM-killer:

### cut after ###
Aug 25 11:46:12 vega kernel: [115461.357412] firefox invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
Aug 25 11:46:12 vega kernel: [115461.357421] CPU: 2 PID: 11342 Comm: firefox Not tainted 4.7.2-vega #1
Aug 25 11:46:12 vega kernel: [115461.357424] Hardware name: System manufacturer System Product Name/P5W64 WS Pro, BIOS 0903 07/31/2007
Aug 25 11:46:12 vega kernel: [115461.357427]   880068897bc8 812e4a7f
Aug 25 11:46:12 vega kernel: [115461.357433]   880068897c28 8117cd40 818a9840
Aug 25 11:46:12 vega kernel: [115461.357438]  0002810e98a5 88023fd196a8 0002 0206
Aug 25 11:46:12 vega kernel: [115461.357442] Call Trace:
Aug 25 11:46:12 vega kernel: [115461.357451]  [] dump_stack+0x4d/0x6e
Aug 25 11:46:12 vega kernel: [115461.357456]  [] dump_header.isra.16+0x51/0x191
Aug 25 11:46:12 vega kernel: [115461.357462]  [] oom_kill_process+0x33b/0x430
Aug 25 11:46:12 vega kernel: [115461.357465]  [] out_of_memory+0x258/0x2a0
Aug 25 11:46:12 vega kernel: [115461.357470]  [] __alloc_pages_nodemask+0xaf9/0xc90
Aug 25 11:46:12 vega kernel: [115461.357474]  [] alloc_kmem_pages_node+0x16/0x20
Aug 25 11:46:12 vega kernel: [115461.357479]  [] copy_process.part.42+0xfe/0x17a0
Aug 25 11:46:12 vega kernel: [115461.357483]  [] ? lru_cache_add_active_or_unevictable+0x30/0xa0
Aug 25 11:46:12 vega kernel: [115461.357487]  [] ? handle_mm_fault+0x172a/0x1910
Aug 25 11:46:12 vega kernel: [115461.357491]  [] _do_fork+0xdc/0x350
Aug 25 11:46:12 vega kernel: [115461.357496]  [] ? __do_page_fault+0x19d/0x4a0
Aug 25 11:46:12 vega kernel: [115461.357499]  [] SyS_clone+0x14/0x20
Aug 25 11:46:12 vega kernel: [115461.357503]  [] do_syscall_64+0x4b/0xa0
Aug 25 11:46:12 vega kernel: [115461.357507]  [] entry_SYSCALL64_slow_path+0x25/0x25
Aug 25 11:46:12 vega kernel: [115461.357510] Mem-Info:
Aug 25 11:46:12 vega kernel: [115461.357516] active_anon:44076 inactive_anon:99299 isolated_anon:0
Aug 25 11:46:12 vega kernel: [115461.357516]  active_file:392644 inactive_file:215736 isolated_file:0
Aug 25 11:46:12 vega kernel: [115461.357516]  unevictable:560 dirty:85 writeback:0 unstable:0
Aug 25 11:46:12 vega kernel: [115461.357516]  slab_reclaimable:1033111 slab_unreclaimable:31255
Aug 25 11:46:12 vega kernel: [115461.357516]  mapped:34613 shmem:10947 pagetables:4477 bounce:0
Aug 25 11:46:12 vega kernel: [115461.357516]  free:77874 free_pcp:0 free_cma:0
Aug 25 11:46:12 vega kernel: [115461.357529] DMA free:15028kB min:260kB low:324kB high:388kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Aug 25 11:46:12 vega kernel: [115461.357531] lowmem_reserve[]: 0 2960 7954 7954
Aug 25 11:46:12 vega kernel: [115461.357544] DMA32 free:187232kB 

"hw csum failure" error on skge driver with 4.3 kernel upon receiving ICMPv6 multicast listener discovery packets

2015-11-15 Thread David Madore
The skge driver in the 4.3 kernel reports hardware checksum errors
upon receiving (certain?) IPv6 multicast packets containing ICMPv6
multicast listener discovery messages.  This is a regression since 4.1
(I believe it appeared between 4.1 and 4.2).  The e1000e driver on a
different Ethernet port of the same machine is not affected, and
disabling rx checksum offload on the skge interface suppresses the
errors.  Not all IPv6 multicast packets are affected, either: for some
reason, it seems that only those containing ICMPv6 multicast listener
discovery messages trigger the problem.
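
For reference, a minimal sketch of how rx checksum offload can be
inspected and switched off with ethtool.  The wrapper function names
and the ETHTOOL override variable are hypothetical, added only so that
the commands can be dry-run; eth1 is the interface name used in this
report.

```shell
# ETHTOOL can be overridden (e.g. with a stub) to dry-run the commands.
ETHTOOL=${ETHTOOL:-ethtool}

# Show the current rx checksum offload setting of an interface.
show_rx_csum() {
  "$ETHTOOL" -k "$1" | grep '^rx-checksumming'
}

# Turn receive checksum offload off, so checksums are verified in software.
rx_csum_off() {
  "$ETHTOOL" -K "$1" rx off
}
```

With these defined, rx_csum_off eth1 (as root) applies the workaround,
and show_rx_csum eth1 confirms the resulting setting.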

In case it also matters, the skge interface in question (eth1 in what
follows) is part of a bridge that contains another Ethernet interface
and a Wifi card.

Here is a frame, with its link-level headers, that caused an error
when received by skge:

0000   33 33 ff 62 30 d8 60 fb 42 f1 b1 36 86 dd 60 00  33.b0.`.B..6..`.
0010   00 00 00 20 00 01 fe 80 00 00 00 00 00 00 62 fb  ... ..........b.
0020   42 ff fe f1 b1 36 ff 02 00 00 00 00 00 00 00 00  B....6..........
0030   00 01 ff 62 30 d8 3a 00 01 00 05 02 00 00 83 00  ...b0.:.........
0040   c9 8a 00 00 00 00 ff 02 00 00 00 00 00 00 00 00  ................
0050   00 01 ff 62 30 d8                                ...b0.

(Network dumps performed on another network device suggest that the
checksum is, indeed, correct.)

And here is the syslog produced upon receiving the above packet:

Nov 15 17:52:13 pleiades kernel: [  661.393163] eth1: hw csum failure
Nov 15 17:52:13 pleiades kernel: [  661.394203] CPU: 0 PID: 0 Comm: swapper/0 
Tainted: GW   4.3.0-pleiades #1
Nov 15 17:52:13 pleiades kernel: [  661.395192] Hardware name: System 
manufacturer System Product Name/P5WD2-Premium, BIOS 0709 03/31/2006
Nov 15 17:52:13 pleiades kernel: [  661.395192]  88013a9d5d00 
88013fc03aa8 8129a186 88013afe
Nov 15 17:52:13 pleiades kernel: [  661.395192]  88013fc03ac0 
81436425  88013fc03af0
Nov 15 17:52:13 pleiades kernel: [  661.395192]  8142b87a 
1027316b3fc03b30 88013a9d5d00 0030
Nov 15 17:52:13 pleiades kernel: [  661.395192] Call Trace:
Nov 15 17:52:13 pleiades kernel: [  661.395192][] 
dump_stack+0x44/0x5e
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
netdev_rx_csum_fault+0x35/0x40
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
__skb_checksum_complete+0xca/0xd0
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
ipv6_mc_validate_checksum+0xab/0x140
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
skb_checksum_trimmed+0x8f/0x180
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
ipv6_mc_check_mld+0x105/0x330
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
br_multicast_rcv+0x8c/0xce0 [bridge]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
__netif_receive_skb+0x13/0x60
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
netif_receive_skb_internal+0x2e/0x90
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
br_handle_frame_finish+0x28c/0x5b0 [bridge]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
usb_hcd_submit_urb+0xa4/0x960 [usbcore]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
br_handle_frame+0x151/0x270 [bridge]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
usb_submit_urb+0x2d2/0x510 [usbcore]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
__netif_receive_skb_core+0x1c2/0x990
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
__usb_hcd_giveback_urb+0x82/0xe0 [usbcore]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
__netif_receive_skb+0x13/0x60
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
netif_receive_skb_internal+0x2e/0x90
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
napi_gro_receive+0xa0/0xd0
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
skge_poll+0x380/0x7a0 [skge]
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
lapic_next_event+0x18/0x20
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
net_rx_action+0x13c/0x300
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
__do_softirq+0xc7/0x240
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
irq_exit+0x70/0x90
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
do_IRQ+0x51/0xd0
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
common_interrupt+0x7c/0x7c
Nov 15 17:52:13 pleiades kernel: [  661.395192][] ? 
mwait_idle+0x87/0x140
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
arch_cpu_idle+0xa/0x10
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
default_idle_call+0x25/0x30
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
cpu_startup_entry+0x29c/0x310
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
rest_init+0x72/0x80
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
start_kernel+0x471/0x47e
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] ? 
set_init_arg+0x55/0x55
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
x86_64_start_reservations+0x2a/0x2c
Nov 15 17:52:13 pleiades kernel: [  661.395192]  [] 
x86_64_start_kernel+0xe5/0xe8
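
As a cross-check on the remark that the checksum is in fact correct, the ICMPv6 checksum of the frame quoted in this message can be recomputed in software. What follows is a minimal Python sketch, not anything taken from the kernel: the helper names are invented for illustration, and it only copes with the single hop-by-hop extension header that this particular frame carries.

```python
import struct

# The frame from this message, link-level header included.
FRAME = bytes.fromhex("".join("""
33 33 ff 62 30 d8 60 fb 42 f1 b1 36 86 dd 60 00
00 00 00 20 00 01 fe 80 00 00 00 00 00 00 62 fb
42 ff fe f1 b1 36 ff 02 00 00 00 00 00 00 00 00
00 01 ff 62 30 d8 3a 00 01 00 05 02 00 00 83 00
c9 8a 00 00 00 00 ff 02 00 00 00 00 00 00 00 00
00 01 ff 62 30 d8
""".split()))


def ones_complement_sum(data):
    """16-bit one's-complement sum (RFC 1071 style), carries folded back in."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def icmpv6_checksum_ok(frame):
    """True if the ICMPv6 checksum matches the IPv6 pseudo-header and payload."""
    ipv6 = frame[14:]                      # skip the 14-byte Ethernet header
    src, dst = ipv6[8:24], ipv6[24:40]
    next_header = ipv6[6]
    payload = ipv6[40:]
    if next_header == 0:                   # hop-by-hop options header
        next_header = payload[0]
        payload = payload[(payload[1] + 1) * 8:]
    if next_header != 58:                  # 58 = ICMPv6
        raise ValueError("not an ICMPv6 packet")
    pseudo = (src + dst + struct.pack("!I", len(payload))
              + b"\x00\x00\x00" + bytes([next_header]))
    # With the transmitted checksum included, a valid packet sums to 0xffff.
    return ones_complement_sum(pseudo + payload) == 0xFFFF


if __name__ == "__main__":
    print(icmpv6_checksum_ok(FRAME))
```

For this frame the folded sum comes out as 0xffff, so the checksum field c9 8a is consistent with the packet contents, in agreement with the dump taken on the other network device.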


I can 

Re: XFS freeze (xfsaild blocked) with 4.2.5 (also 4.3)

2015-11-08 Thread David Madore
On Sun, Nov 08, 2015 at 08:49:41AM -0800, Christoph Hellwig wrote:
> can you try the patch at:
> 
> http://article.gmane.org/gmane.comp.file-systems.xfs.general/70984
> 
> The symptoms sound surprisingly similar.

Thanks for the pointer.  I'm running with the patch now, and will
follow up if the problem recurs.

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


XFS freeze (xfsaild blocked) with 4.2.5 (also 4.3)

2015-11-08 Thread David Madore
Compiled a 4.2.5 kernel a few days ago.  This morning, my machine was
essentially unresponsive (couldn't log on) so I used
alt-sysrq-{s,t,s,u,b} to reboot it, and after reboot it appears that
the first suspicious message concerns xfsaild blocked for more than
300 seconds.  The disks and filesystems are good as far as I know, and
I never had any problem with my previous (4.1.4) kernel, so I guess
this is a regression in 4.2.  (I also had a similar issue with 4.3,
which is probably due to the same cause and also XFS-related:
<https://lkml.org/lkml/2015/11/4/104>.)

Please contact me for further information, or just to make me feel
less lonely and helpless when facing this kind of bug. :-(

Here are the first few and last lines of log, with links to the full
logs and full config below:

Nov  8 08:08:14 vega kernel: [313800.046048] INFO: task xfsaild/md117:14100 
blocked for more than 300 seconds.
Nov  8 08:08:14 vega kernel: [313800.046054]   Not tainted 4.2.5-vega #1
Nov  8 08:08:14 vega kernel: [313800.046057] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 08:08:14 vega kernel: [313800.046060] xfsaild/md117   D 88023fc14a00 
0 14100  2 0x
Nov  8 08:08:14 vega kernel: [313800.046066]  8800bbb07d18 0046 
817244c0 8802345126c0
Nov  8 08:08:14 vega kernel: [313800.046071]  8800bbb07d08 810a35ba 
8800bbb07d58 8800bbb08000
Nov  8 08:08:14 vega kernel: [313800.046075]  880231235528  
880231235400 880234460e00
Nov  8 08:08:14 vega kernel: [313800.046080] Call Trace:
Nov  8 08:08:14 vega kernel: [313800.046090]  [] ? 
try_to_del_timer_sync+0x4a/0x60
Nov  8 08:08:14 vega kernel: [313800.046095]  [] 
schedule+0x32/0x80
Nov  8 08:08:14 vega kernel: [313800.046123]  [] 
_xfs_log_force+0x154/0x240 [xfs]
Nov  8 08:08:14 vega kernel: [313800.046128]  [] ? 
wake_up_q+0x70/0x70
Nov  8 08:08:14 vega kernel: [313800.046143]  [] 
xfs_log_force+0x25/0x80 [xfs]
Nov  8 08:08:14 vega kernel: [313800.046159]  [] ? 
xfs_trans_ail_cursor_done+0x11/0x30 [xfs]
Nov  8 08:08:14 vega kernel: [313800.046173]  [] 
xfsaild+0x138/0x560 [xfs]
Nov  8 08:08:14 vega kernel: [313800.046188]  [] ? 
xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
Nov  8 08:08:14 vega kernel: [313800.046203]  [] ? 
xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
Nov  8 08:08:14 vega kernel: [313800.046207]  [] 
kthread+0xc4/0xe0
Nov  8 08:08:14 vega kernel: [313800.046211]  [] ? 
kthread_create_on_node+0x180/0x180
Nov  8 08:08:14 vega kernel: [313800.046215]  [] 
ret_from_fork+0x3f/0x70
Nov  8 08:08:14 vega kernel: [313800.046219]  [] ? 
kthread_create_on_node+0x180/0x180

Nov  8 11:18:01 vega kernel: [325187.487839] Showing busy workqueues and worker 
pools:
Nov  8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md117: flags=0x14
Nov  8 11:18:01 vega kernel: [325187.487839]   pwq 3: cpus=1 node=0 flags=0x0 
nice=-20 active=2/256
Nov  8 11:18:01 vega kernel: [325187.487839] pending: xfs_buf_ioend_work 
[xfs], xfs_log_worker [xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md118: flags=0x14
Nov  8 11:18:01 vega kernel: [325187.487839]   pwq 3: cpus=1 node=0 flags=0x0 
nice=-20 active=1/256
Nov  8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md115: flags=0x14
Nov  8 11:18:01 vega kernel: [325187.487839]   pwq 3: cpus=1 node=0 flags=0x0 
nice=-20 active=1/256
Nov  8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md120: flags=0x14
Nov  8 11:18:01 vega kernel: [325187.487839]   pwq 3: cpus=1 node=0 flags=0x0 
nice=-20 active=1/256
Nov  8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md113: flags=0x14
Nov  8 11:18:01 vega kernel: [325187.487839]   pwq 3: cpus=1 node=0 flags=0x0 
nice=-20 active=2/256
Nov  8 11:18:01 vega kernel: [325187.487839] in-flight: 
31546:xfs_log_worker [xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] pending: xfs_buf_ioend_work 
[xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md110: flags=0x14
Nov  8 11:18:01 vega kernel: [325187.487839]   pwq 3: cpus=1 node=0 flags=0x0 
nice=-20 active=1/256
Nov  8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov  8 11:18:01 vega kernel: [325187.487839] pool 0: cpus=0 node=0 flags=0x0 
nice=0 workers=2 manager: 31448 idle: 31569
Nov  8 11:18:01 vega kernel: [325187.487839] pool 2: cpus=1 node=0 flags=0x0 
nice=0 workers=2 manager: 31616 idle: 31510
Nov  8 11:18:01 vega kernel: [325187.487839] pool 3: cpus=1 node=0 flags=0x0 
nice=-20 workers=2 manager: 31499
Nov  8 11:18:01 vega kernel: [325187.487839] pool 4: cpus=2 node=0 flags=0x0 
nice=0 workers=2 manager: 31503 idle: 31571
Nov  8 11:18:01 vega kernel: [325187.487839] pool 8: cpus=0-3 
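
To see at a glance which tasks the hung-task detector complained about in a log like the one above, the reports can be pulled out with a short script. This is only a sketch: `hung_tasks` is a hypothetical helper, and it assumes the log has the same line-wrapped syslog layout as the excerpt.

```python
import re

# Matches the kernel's hung-task report once line wraps have been flattened.
HUNG_RE = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(log_text):
    """Return (task name, pid, timeout in seconds) for each hung-task report."""
    flat = " ".join(log_text.split())      # undo mail-client line wrapping
    return [(name, int(pid), int(secs))
            for name, pid, secs in HUNG_RE.findall(flat)]


SAMPLE = """\
Nov  8 08:08:14 vega kernel: [313800.046048] INFO: task xfsaild/md117:14100
blocked for more than 300 seconds.
Nov  8 08:08:14 vega kernel: [313800.046054]   Not tainted 4.2.5-vega #1
"""

if __name__ == "__main__":
    for name, pid, secs in hung_tasks(SAMPLE):
        print(name, pid, secs)
```

Run over the full log, this gives one entry per report, which makes it easier to see whether only XFS threads are affected.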

tasks hung forever (khugepaged blocked) with 4.3 kernel

2015-11-04 Thread David Madore
With a 4.3 kernel I compiled two days ago, I found various processes
stuck in 'D' state this morning.  I tried to unmount filesystems, which
made things worse and froze everything.  In case this means anything,
reading from the hard drives with dd (e.g., dd if=/dev/sda bs=4096k
count=1) worked (data was not in cache), but not for large amounts of
data (count=64 hung forever).  If this is of any use, I did an
alt-sysrq-t after an emergency sync, and I have the logs: full kernel
log is at link below, here are the beginning and end of it:

Nov  4 08:10:10 vega kernel: [141900.078018] INFO: task khugepaged:173 blocked 
for more than 300 seconds.
Nov  4 08:10:10 vega kernel: [141900.078024]   Not tainted 4.3.0-vega #1
Nov  4 08:10:10 vega kernel: [141900.078026] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  4 08:10:10 vega kernel: [141900.078029] khugepaged  D 88023fc94f00 
0   173  2 0x
Nov  4 08:10:10 vega kernel: [141900.078035]  88023eb97748 0046 
88023e9a6d00 880236e38d40
Nov  4 08:10:10 vega kernel: [141900.078040]  88023fc1ec88 88023eb97728 
88023eb98000 8800bba5e800
Nov  4 08:10:10 vega kernel: [141900.078044]  880234607518  
880236e38d40 88023eb97760
Nov  4 08:10:10 vega kernel: [141900.078048] Call Trace:
Nov  4 08:10:10 vega kernel: [141900.078057]  [] 
schedule+0x2e/0x70
Nov  4 08:10:10 vega kernel: [141900.078086]  [] 
_xfs_log_force_lsn+0x155/0x2b0 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078091]  [] ? 
wake_up_q+0x70/0x70
Nov  4 08:10:10 vega kernel: [141900.078106]  [] 
xfs_log_force_lsn+0x29/0x80 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078123]  [] ? 
xfs_iunpin_wait+0x14/0x20 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078140]  [] 
__xfs_iunpin_wait+0x88/0x120 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078145]  [] ? 
autoremove_wake_function+0x30/0x30
Nov  4 08:10:10 vega kernel: [141900.078161]  [] 
xfs_iunpin_wait+0x14/0x20 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078178]  [] 
xfs_reclaim_inode+0x5d/0x320 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078196]  [] 
xfs_reclaim_inodes_ag+0x243/0x360 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078214]  [] 
xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078229]  [] 
xfs_fs_free_cached_objects+0x14/0x20 [xfs]
Nov  4 08:10:10 vega kernel: [141900.078233]  [] 
super_cache_scan+0x179/0x180
Nov  4 08:10:10 vega kernel: [141900.078239]  [] 
shrink_slab.part.62.constprop.72+0x1c5/0x340
Nov  4 08:10:10 vega kernel: [141900.078243]  [] 
shrink_zone+0x166/0x170
Nov  4 08:10:10 vega kernel: [141900.078246]  [] 
do_try_to_free_pages+0x173/0x350
Nov  4 08:10:10 vega kernel: [141900.078249]  [] 
try_to_free_pages+0xa0/0x130
Nov  4 08:10:10 vega kernel: [141900.078253]  [] 
__alloc_pages_nodemask+0x40e/0x790
Nov  4 08:10:10 vega kernel: [141900.078258]  [] 
khugepaged+0x14a/0x11c0
Nov  4 08:10:10 vega kernel: [141900.078261]  [] ? 
wait_woken+0x80/0x80
Nov  4 08:10:10 vega kernel: [141900.078265]  [] ? 
use_zero_page_show+0x30/0x30
Nov  4 08:10:10 vega kernel: [141900.078269]  [] 
kthread+0xc4/0xe0
Nov  4 08:10:10 vega kernel: [141900.078273]  [] ? 
kthread_create_on_node+0x170/0x170
Nov  4 08:10:10 vega kernel: [141900.078277]  [] 
ret_from_fork+0x3f/0x70
Nov  4 08:10:10 vega kernel: [141900.078280]  [] ? 
kthread_create_on_node+0x170/0x170
Nov  4 08:10:10 vega kernel: [141900.078285] INFO: task kswapd0:443 blocked for 
more than 300 seconds.
Nov  4 08:10:10 vega kernel: [141900.078287]   Not tainted 4.3.0-vega #1

Nov  4 10:10:41 vega kernel: [149114.106817] 
Nov  4 10:10:41 vega kernel: [149114.106817] Showing busy workqueues and worker 
pools:
Nov  4 10:10:41 vega kernel: [149114.106817] workqueue events: flags=0x0
Nov  4 10:10:41 vega kernel: [149114.106817]   pwq 4: cpus=2 node=0 flags=0x0 
nice=0 active=9/256
Nov  4 10:10:41 vega kernel: [149114.106817] in-flight: 19153:do_sync_work
Nov  4 10:10:41 vega kernel: [149114.106817] pending: console_callback, 
dbs_timer, sysrq_reinject_alt_sysrq, cache_reap, vmstat_update, flush_to_ldisc, 
push_to_pool, kernfs_notify_workfn
Nov  4 10:10:41 vega kernel: [149114.106817] workqueue events_power_efficient: 
flags=0x80
Nov  4 10:10:41 vega kernel: [149114.106817]   pwq 4: cpus=2 node=0 flags=0x0 
nice=0 active=1/256
Nov  4 10:10:41 vega kernel: [149114.106817] pending: neigh_periodic_work
Nov  4 10:10:41 vega kernel: [149114.106817] workqueue writeback: flags=0x4e
Nov  4 10:10:41 vega kernel: [149114.106817]   pwq 8: cpus=0-3 flags=0x4 nice=0 
active=2/256
Nov  4 10:10:41 vega kernel: [149114.106817] in-flight: 14424:wb_workfn, 
169(RESCUER):wb_workfn
Nov  4 10:10:41 vega kernel: [149114.106817] workqueue xfs-reclaim/md115: 
flags=0x4
Nov  4 10:10:41 vega kernel: [149114.106817]   pwq 4: cpus=2 node=0 flags=0x0 
nice=0 active=1/256
Nov  4 10:10:41 vega kernel: [149114.106817] pending: xfs_reclaim_worker 
[xfs]
Nov  4 10:10:41 vega kernel: 
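
Processes stuck in 'D' (uninterruptible sleep) state, as described at the top of this message, can be listed straight from /proc without relying on ps. A rough sketch, assuming a Linux /proc; `parse_stat` and `d_state_tasks` are illustrative names, not an existing tool.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    lpar = line.index("(")
    rpar = line.rindex(")")                # comm may itself contain ')'
    pid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    if not os.path.isdir(proc):
        return []
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat"),
                      errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue                       # process exited during the scan
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

The state is read as the first field after the closing parenthesis because the command name is the only field in that file that can contain spaces.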

xhci-hcd does not detect USB3 host controller: how to debug/diagnose?

2015-11-02 Thread David Madore
Dear list,

I have a USB 3.0 controller card connected to a PCI-E bus that appears
on lspci as

06:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 03) (prog-if 30 [XHCI])
    Subsystem: ASUSTeK Computer Inc. P8P67 Deluxe Motherboard [1043:8413]
    Flags: fast devsel, IRQ 10
    Memory at efdfe000 (64-bit, non-prefetchable) [size=8K]

My problem is that a 4.3 kernel that I compiled [see link below for
config] does not see this peripheral at all: loading the xhci-hcd
module produces no diagnostic of any kind, nothing appears in
dmesg (even when I plug something into a USB port), the module is not
marked as in use, and lspci -v shows no driver for the device (there is
no link /sys/bus/pci/devices/0000:06:00.0/driver).  No error, no
warning, no message, nothing: the module simply behaves as a no-op.

(I made sure to load xhci-hcd before ehci-hcd and uhci-hcd, in case
this is important.  I also tried loading xhci-hcd last, but none of
this made any difference.)

This is almost certainly a config problem, because when I boot a live
Ubuntu 14.04 LTS, the controller is recognized by xhci-hcd, as it
should be:

[    4.949337] xhci_hcd 0000:06:00.0: xHCI Host Controller
[    4.949344] xhci_hcd 0000:06:00.0: new USB bus registered, assigned bus number 1
[    4.949653] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    4.949655] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    4.949658] usb usb1: Product: xHCI Host Controller
[    4.949660] usb usb1: Manufacturer: Linux 3.19.0-25-generic xhci-hcd
[    4.949662] usb usb1: SerialNumber: 0000:06:00.0
[    4.949790] hub 1-0:1.0: USB hub found
[    4.949801] hub 1-0:1.0: 2 ports detected

My question is: how can I diagnose this problem?  What can cause the
driver to fail to see the peripheral and ignore it without any kind of
diagnostic?  Is there any way I can debug this, or ask the xhci-hcd
module "why are you ignoring 06:00.0?"?
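
One way to narrow this down is to look at what sysfs reports for the device itself. The sketch below is an illustration under stated assumptions rather than an official diagnostic: `pci_binding` is a hypothetical helper, and it assumes the usual sysfs layout under /sys/bus/pci/devices with the full domain-qualified address (0000:06:00.0 here). It returns the device's class code and the name of the driver bound to it, if any.

```python
import os


def pci_binding(addr, sysfs="/sys"):
    """Return {'class': ..., 'driver': ...} for a PCI device, or None if absent."""
    dev = os.path.join(sysfs, "bus", "pci", "devices", addr)
    if not os.path.isdir(dev):
        return None
    with open(os.path.join(dev, "class")) as f:
        pci_class = f.read().strip()
    link = os.path.join(dev, "driver")     # symlink exists only when bound
    driver = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
    return {"class": pci_class, "driver": driver}


if __name__ == "__main__":
    print(pci_binding("0000:06:00.0"))
```

An xHCI controller has class 0x0c0330, matching the [0c03] and prog-if 30 in the lspci output. If the class is right and no driver is bound, one thing worth checking (an assumption, not something established in this thread) is whether the kernel config enables the PCI glue for xHCI, which kernels of this era can build as a separate xhci-pci module.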


Links to more information:

Full config of custom 4.3 kernel:
http://www.madore.org/~david/.tmp/config.20151102.broken

dmesg output of said custom kernel:
http://www.madore.org/~david/.tmp/dmesg.20151102.broken

lspci -vvv output under custom kernel:
http://www.madore.org/~david/.tmp/lspci.20151102.broken

dmesg output of live Ubuntu where the module works:
http://www.madore.org/~david/.tmp/dmesg.20151102.ubuntu

lspci -vvv output under live Ubuntu:
http://www.madore.org/~david/.tmp/lspci.20151102.ubuntu

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


Re: feature suggestion: implement SO_PEERCRED on local AF_INET/AF_INET6 sockets (allow uid-based identification on localhost)

2014-10-15 Thread David Madore
On Wed, Oct 15, 2014 at 03:54:08PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 15, 2014 at 3:30 PM, David Madore wrote:
> > Note that since the possibility of using SO_PEERCRED on AF_INET
> > sockets does not hitherto exist on Linux, we can be sure that nobody
> > uses it, so it's not like it might open vulnerabilities in existing
> > code.  If you think it's insecure, it can be documented as such (by
> > comparing it with identd): I still think it's better than having no
> > control at all when binding to localhost, which is the present
> > situation (causing, e.g., CVE-2014-2914).
> 
> This doesn't follow.  *Everybody* uses connect on AF_INET.
> 
> IMO anything that sends a caller's credentials needs to be explicit and opt-in.

I'm confused as to whether you mean "opt-in" on the side of the caller
(=process requesting the endpoint's credentials), or on that of the
endpoint (=authenticated process).  On the one hand I don't understand
what it could mean on the caller side, on the other hand you mention
explicit support in OpenSSH, which would be the caller in my scenario.

So, in case I haven't been clear enough, the situation I have in mind
is: on "thishost", I run "ssh -L 14321:remotehost:4321 somehost" to
forward connections from the local port 14321 of thishost (where ssh
listens on the loopback) to port 4321 of remotehost.
Unfortunately, now everyone with an account on thishost can connect
to port 14321 and effectively emit a connection from somehost to
remotehost on my behalf.  I think everyone agrees that this is a huge
problem.  But I don't understand how you propose to remedy this.

Patching ssh is an option, but I don't see how to do it (ssh needs to
make sure that the connections it receives on 14321 are from the same
uid, and this seems impossible without the feature I'm discussing).
Patching the kernel is an option.  Patching clients that connect to
14321, on the other hand, is not, because there are many different
ones, and their protocol is defined by immutable Internet standards,
so we have no latitude there (for example, we can't ask a Web browser
to connect to Unix domain sockets: there simply isn't a URL scheme to
refer to them).  Adding iptables rules is not an option if I'm not the
system administrator on thishost.

So, how can we solve this problem securely?

> I believe that there is no secure way to authenticate clients that
> currently don't authenticate themselves without changing the clients.
> That's the whole point: currently-secure are written under the
> assumption that they are not exercising their credentials.  You can't
> safely change that without making it opt-in.

Then what are we to do, given that modifying the clients is
impossible?

What about my proposal that user credentials be returned only if they
refer to the same user as the caller and the caller is permitted to
ptrace the endpoint?  This answers your objection about leaking
credentials: the caller could do anything at all with the other side,
since it could ptrace it, so we're just permitting a user to
authenticate their own sockets.  A further sysctl could enable the use
of the call in more general cases, for those administrators who think
it should be allowed.
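
To make the proposed rule concrete, here is a rough userland model of
the decision, not kernel code.  Every name in it (Ucred, NO_CRED,
visible_peer_cred, allow_all) is hypothetical, and the ptrace
permission is simply passed in as a boolean rather than computed.

```python
from typing import NamedTuple


class Ucred(NamedTuple):
    pid: int
    uid: int
    gid: int


# What SO_PEERCRED reports on AF_INET today: "no data".
NO_CRED = Ucred(pid=0, uid=-1, gid=-1)


def visible_peer_cred(caller_uid, peer, caller_may_ptrace_peer, allow_all=False):
    """Model of the proposed rule for which credentials the caller sees.

    allow_all stands in for the suggested sysctl that would enable the
    call in more general cases.
    """
    if allow_all:
        return peer
    # Same user on both ends, and the caller could ptrace the peer anyway.
    if peer.uid == caller_uid and caller_may_ptrace_peer:
        return peer
    return NO_CRED
```

Under this model a process of another user, or one the caller may not
ptrace, still yields the present "no data" answer, so existing
behaviour is the fallback.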

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


Re: feature suggestion: implement SO_PEERCRED on local AF_INET/AF_INET6 sockets (allow uid-based identification on localhost)

2014-10-15 Thread David Madore
On Wed, Oct 15, 2014 at 07:41:48AM -0700, Andy Lutomirski wrote:
> On 10/15/2014 06:35 AM, David Madore wrote:
> > Given an AF_UNIX socket, the getsockopt(, SOL_SOCKET, SO_PEERCRED,,)
> > call allows one endpoint to authenticate the other endpoint's pid, uid
> > and gid.
> > 
> > The call is valid on AF_INET and AF_INET6 sockets but returns no data
> > (pid=0, uid=-1, gid=-1).  Obviously it is meaningless to try to get
> > such credentials from an INET/INET6 socket in general, but there is one
> > case where it would make sense: namely, when the endpoint is local
> > (i.e., when the socket is a connection to the same machine, e.g., when
> > connecting to 127.0.0.0/8 or ::1/128).
> 
> I will object to adding it as described, for the same reason that I
> object to anything that extends the current model of socket-based
> credential passing.  Ideally, credentials would *never* be implicitly
> captured by socket syscalls.  We live in the real world, and SO_CRED
> exists, so I think the best we can do is to try to minimize its use.
> 
> I can elaborate further, or you can IIRC search the archives for
> SCM_IDENTITY, and you can also look at CVE-2013-1979 for a nasty example
> of why this model is broken.

From what I understand, what was broken is mainly that the credentials
were evaluated when the write() system call took place rather than
when socket() or bind() was called: this violates the Unix security
model (privilege control occurs when the file descriptor is created,
not when it is used).  By contrast, it conforms to Unix security
principles that credentials are checked implicitly when binding a
socket (this happens when permissions are checked on the path when
binding or connecting on a Unix domain socket, when allowing a bind to
secure ports in the INET domain, and so on).  It seems to me that a
suid program that is willing to create or bind a socket on behalf of
its caller without knowing exactly what it will be connecting to
should intrinsically be treated as a security vulnerability, even when
it is not obviously exploitable.

Also, to continue with real-world examples, identd exists and is used
for identification on local networks (e.g. localhost), so the capture
of credentials already takes place.  Unix programmers are aware of
this, and know that a privileged program should not bind a socket if
it does not want to leak privileges.  (Another example is the use of
-m owner in iptables.)

And, of course, if Solaris already has this feature, there is some
experience with it.  Has there been any documented vulnerability
relating to the fact that Solaris allows getpeerucred() to
authenticate locally connected AF_INET sockets?

Note that since the possibility of using SO_PEERCRED on AF_INET
sockets does not hitherto exist on Linux, we can be sure that nobody
uses it, so it's not like it might open vulnerabilities in existing
code.  If you think it's insecure, it can be documented as such (by
comparing it with identd): I still think it's better than having no
control at all when binding to localhost, which is the present
situation (causing, e.g., CVE-2014-2914).

Because SO_PEERCRED currently returns {pid=0,uid=-1,gid=-1} on
AF_INET, we might still return this value if there is any risk that
the endpoint would be unwilling to share its credentials: for example,
this value might be returned if the other endpoint is not ptraceable
by the caller - this would still cover the essential use case, which
is for unprivileged users to authenticate the connections from their
own processes.  Would this limitation assuage your worries about the
proposed feature?

The thing is, I don't see any other way the ssh port forwarding mess
can ever be improved.  Do you have another solution in mind?

Any attempt to have some kind of authentication of local sockets that
requires participation on the client (authenticatee)'s part is doomed:
if modifying the protocol and/or client code is an option, we might as
well use some form of crypto / TLS.  Or Unix-domain sockets.  But what
are we supposed to do when modifying the client (to make it send
credentials, use crypto or connect on AF_UNIX) is not an option?

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


feature suggestion: implement SO_PEERCRED on local AF_INET/AF_INET6 sockets (allow uid-based identification on localhost)

2014-10-15 Thread David Madore
Given an AF_UNIX socket, the getsockopt(, SOL_SOCKET, SO_PEERCRED,,)
call allows one endpoint to authenticate the other endpoint's pid, uid
and gid.

The call is valid on AF_INET and AF_INET6 sockets but returns no data
(pid=0, uid=-1, gid=-1).  Obviously it is meaningless to try to get
such credentials from an INET/INET6 socket in general, but there is one
case where it would make sense: namely, when the endpoint is local
(i.e., when the socket is a connection to the same machine, e.g., when
connecting to 127.0.0.0/8 or ::1/128).
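
For readers who have not used the call, here is a minimal sketch of
what it looks like from userland on Linux today, using an AF_UNIX
socket pair.  The helper name peer_credentials is made up; the layout
of struct ucred as three ints (pid, uid, gid) is the usual Linux one
and is the only assumption.

```python
import os
import socket
import struct


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer as reported by SO_PEERCRED."""
    size = struct.calcsize("3i")  # struct ucred: pid, uid, gid
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, size)
    # Signed unpacking, so the "no data" answer reads as uid=-1, gid=-1.
    return struct.unpack("3i", raw)


if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b:
        pid, uid, gid = peer_credentials(a)
        # Both ends belong to this process, so these should all match.
        print(pid == os.getpid(), uid == os.geteuid(), gid == os.getegid())
```

The same helper can be pointed at a connected AF_INET socket, where,
as described above, it currently reports no data.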

Being able to authenticate local INET/INET6 sockets would be immensely
valuable for a number of programs, to provide some kind of access
control to local sockets.  For example, ssh allows port forwarding
using the -L and -D options: by default or by option (cf. the
GatewayPorts option of ssh), these port-forwarding sockets can be
restricted to localhost, but of course they cannot be restricted to
the user running ssh, which makes them a huge security problem.  Many
programs suffer from the same problem (they restrict some kind of
connection to localhost, but they of course cannot make a restriction
on which user will be able to connect).

One cannot simply retort "these programs should be using Unix-domain
sockets instead": I don't think many browsers support using a SOCKS
proxy or an HTTP proxy over a Unix-domain socket, and in the latter
case I'm not even sure it would make sense (protocol-wise).

If I believe <URL: http://www.lehman.cuny.edu/cgi-bin/man-cgi?getpeerucred+3 >
("The system currently supports both sides of connection end-points
for local AF_UNIX, AF_INET, and AF_INET6 sockets"), Solaris, or at
least some version thereof, supports authentication of local AF_INET
and AF_INET6 sockets.

I think it would be wonderful if Linux had this.  I'm willing to work
on the implementation if it is considered *a priori* acceptable for
inclusion.

The data seems to be available, since it is exposed in /proc/net/tcp,
/proc/net/tcp6 and the like (implementation details left aside, it is
merely a question of finding the line whose endpoints are the reverse
of the current socket's and returning it).
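
As an illustration of that matching, here is a rough userland sketch
for IPv4 only.  The function names are made up, and it assumes the
usual /proc/net/tcp layout in which the second and third columns are
the local and remote endpoints in hex and the eighth column is the
owning uid; it is not a proposal for how the kernel should do it.

```python
import socket
import struct


def encode_endpoint(ip, port):
    """Format an IPv4 endpoint the way /proc/net/tcp prints it."""
    # The kernel prints the network-order address as a native integer,
    # which is why 127.0.0.1 shows up as 0100007F on little-endian hosts.
    (addr,) = struct.unpack("=I", socket.inet_aton(ip))
    return "%08X:%04X" % (addr, port)


def find_peer_uid(table_text, local, remote):
    """Return the uid owning the opposite end of our connection, or None.

    local and remote are (ip, port) pairs as seen from our own socket;
    table_text is the content of /proc/net/tcp.
    """
    want_local = encode_endpoint(*remote)   # the peer's local end is our remote
    want_remote = encode_endpoint(*local)   # and its remote end is our local
    for line in table_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 8:
            continue
        if fields[1] == want_local and fields[2] == want_remote:
            return int(fields[7])  # uid column
    return None
```

A server would call it with the text of /proc/net/tcp and the results
of getsockname() and getpeername() on the accepted socket.  Note that
this only yields a uid, with no pid or gid, and it inherits all the
fragility of parsing a text file.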

[In principle, a userland program can parse /proc/net/tcp so it does
not need the feature I am suggesting, but in practice parsing a text
file to communicate with the kernel is yucky at best, and probably not
very robust (e.g., /proc might not be mounted), and it would be very
difficult to convince, say, the OpenSSH authors to include code that
parses the Linux /proc/net/tcp format (or even link with a library
which does this) in order to add access-control on ssh port-forwards:
having this under a more standard getsockopt() interface is cleaner
and opens at least some kind of hope that programs would agree to use
it.]

Question number 1: If this feature were implemented, would it be
considered acceptable for inclusion in the kernel?  (If there is some
reason why it can't be accepted, I'd like to know in advance, to avoid
working in vain.)

Question number 2: A priori, how difficult would it be to implement
this?  (As mentioned above, it seems trivial in principle to merely go
through the local endpoints to find a matching connection, but maybe
there are locking issues that I don't understand that make it much
more difficult than it would seem.)  Any guidelines on implementation?
(I imagine one should try to fill sk->sk_peer_cred at connect time,
but I don't really know how difficult this might turn out.)

Any comments on the matter are welcome.

Happy hacking,

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()

2012-09-02 Thread David Madore
Since I had a rare occasion to physically access the machine, I did
the following experiment: connect another machine to the serial
console, run

while true ; do date ; cat /proc/slabinfo ; echo '***' ; sleep 3 ; done

and generate lots of IPv6 traffic through the box (as I mentioned, for
some reason, a Firefox compilation through ssh seems particularly
effective).  So I now have lots of slabinfo data and, beyond the
initial WARNING, I also got messages along the lines of "swapper: page
allocation failure: order:10, mode:0x4020".

I put the full log in <URL: http://www.madore.org/~david/.tmp/pollux-dump.0 >
(unfortunately a bit garbled, because sometimes the cat slabinfo output
was interspersed with printk output, but there are still plenty of
usable lines of each sort).

For completeness, here's a sample message from a page allocation
failure, and a copy of /proc/slabinfo from just about that time (I
have no idea how to read this, but one thing I can say is that there
is no extraordinarily large number in this):

[  567.757489] swapper: page allocation failure: order:10, mode:0x4020
[  567.763815] [<c000d728>] (unwind_backtrace+0x0/0xf0) from [<c009a87c>] (warn_alloc_failed+0xcc/0x10c)
[  567.773119] [<c009a87c>] (warn_alloc_failed+0xcc/0x10c) from [<c009ce48>] (__alloc_pages_nodemask+0x530/0x68c)
[  567.783184] [<c009ce48>] (__alloc_pages_nodemask+0x530/0x68c) from [<c009cfb4>] (__get_free_pages+0x10/0x3c)
[  567.793084] [<c009cfb4>] (__get_free_pages+0x10/0x3c) from [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc)
[  567.802547] [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc) from [<c038d638>] (pskb_expand_head+0x68/0x298)
[  567.812317] [<c038d638>] (pskb_expand_head+0x68/0x298) from [<bf0e93ec>] (ip6_forward+0x4d4/0x7bc [ipv6])
[  567.822056] [<bf0e93ec>] (ip6_forward+0x4d4/0x7bc [ipv6]) from [<bf0ebebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6])
[  567.831751] [<bf0ebebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [<c0394870>] (__netif_receive_skb+0x544/0x66c)
[  567.841521] [<c0394870>] (__netif_receive_skb+0x544/0x66c) from [<bf1d9054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
[  567.853306] [<bf1d9054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) from [<bf1d9ae8>] (br_nf_pre_routing+0x59c/0x67c [bridge])
[  567.865673] [<bf1d9ae8>] (br_nf_pre_routing+0x59c/0x67c [bridge]) from [<c03bd2a4>] (nf_iterate+0x8c/0xb4)
[  567.875387] [<c03bd2a4>] (nf_iterate+0x8c/0xb4) from [<c03bd328>] (nf_hook_slow+0x5c/0x118)
[  567.883800] [<c03bd328>] (nf_hook_slow+0x5c/0x118) from [<bf1d3fa4>] (br_handle_frame+0x1b8/0x290 [bridge])
[  567.893624] [<bf1d3fa4>] (br_handle_frame+0x1b8/0x290 [bridge]) from [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c)
[  567.904137] [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c) from [<c031e254>] (mv643xx_eth_poll+0x540/0x734)
[  567.913928] [<c031e254>] (mv643xx_eth_poll+0x540/0x734) from [<c0397390>] (net_rx_action+0x118/0x314)
[  567.923215] [<c0397390>] (net_rx_action+0x118/0x314) from [<c0029924>] (__do_softirq+0xac/0x234)
[  567.932058] [<c0029924>] (__do_softirq+0xac/0x234) from [<c0029f00>] (irq_exit+0x94/0x9c)
[  567.940421] [<c0029f00>] (irq_exit+0x94/0x9c) from [<c00094b0>] (handle_IRQ+0x34/0x84)
[  567.948392] [<c00094b0>] (handle_IRQ+0x34/0x84) from [<c04398d4>] (__irq_svc+0x34/0x98)
[  567.956454] [<c04398d4>] (__irq_svc+0x34/0x98) from [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94)
[  567.965299] [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94) from [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c)
[  567.974925] [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c) from [<c0009764>] (cpu_idle+0x88/0xdc)
[  567.983581] [<c0009764>] (cpu_idle+0x88/0xdc) from [<c05d8720>] (start_kernel+0x2a0/0x2f0)
[  567.991893] Mem-info:
[  567.994185] Normal per-cpu:
[  567.996995] CPU0: hi:  186, btch:  31 usd:  84
[  568.001815] active_anon:5592 inactive_anon:34 isolated_anon:0
[  568.001820]  active_file:2845 inactive_file:6118 isolated_file:0
[  568.001825]  unevictable:418 dirty:13 writeback:0 unstable:0
[  568.001829]  free:12507 slab_reclaimable:632 slab_unreclaimable:1124
[  568.001835]  mapped:2546 shmem:47 pagetables:152 bounce:0
[  568.031126] Normal free:50028kB min:2884kB low:3604kB high:4324kB active_anon:22368kB inactive_anon:136kB active_file:11380kB inactive_file:24472kB unevictable:1672kB isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:1672kB dirty:52kB writeback:0kB mapped:10184kB shmem:188kB slab_reclaimable:2528kB slab_unreclaimable:4496kB kernel_stack:584kB pagetables:608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  568.071329] lowmem_reserve[]: 0 0
[  568.074696] Normal: 1*4kB 1*8kB 8*16kB 7*32kB 0*64kB 2*128kB 1*256kB 4*512kB 20*1024kB 11*2048kB 1*4096kB = 50028kB
[  568.085282] 9350 total pagecache pages
[  568.089039] 0 pages in swap cache
[  568.092363] Swap cache stats: add 0, delete 0, find 0/0
[  568.097621] Free swap  = 0kB
[  568.100506] Total swap = 0kB
[  568.117927] 131072 pages of RAM
[  568.121087] 12771 free pages
[  568.123972] 2839 reserved pages
[  568.127140] 1361 slab pages
[  568.129945] 8003 pages shared
[  568.132919] 0 pages swap cached
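
As an aid to reading the numbers above, here is a small sketch that
parses the "Normal:" free-list line and works out how large an
order:10 request is.  The 4 kB page size is an assumption (it is
consistent with "131072 pages of RAM" on this box), and the helper
names are made up.  It only does the arithmetic; it does not explain
why the allocation failed.

```python
import re


def parse_buddy_line(line):
    """Return {block_size_kB: count} from a 'Normal: 1*4kB 1*8kB ...' line."""
    # The trailing '= 50028kB' total has no '*', so it is not matched.
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def order_size_kb(order, page_kb=4):
    """Size in kB of one contiguous block of 2**order pages."""
    return (2 ** order) * page_kb


line = ("Normal: 1*4kB 1*8kB 8*16kB 7*32kB 0*64kB 2*128kB 1*256kB 4*512kB "
        "20*1024kB 11*2048kB 1*4096kB = 50028kB")
free = parse_buddy_line(line)

print(sum(size * count for size, count in free.items()))  # 50028, the reported total
print(order_size_kb(10))               # 4096: an order:10 request is 4 MB contiguous
print(free.get(order_size_kb(10), 0))  # 1 free block of that size in this dump
```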

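slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect      0      0    176   23    1 : tunables    0    0    0 : slabdata      0      0      0
nf_conntrack_c06d1258    128    128    248   16    1 : tunables    0    0    0 : slabdata      8      8      0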
ip6_dst_cache 72 

Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()

2012-09-02 Thread David Madore
Since I had a rare occasion to physically access the machine, I did
the following experiment: connect another machine to the serial
console, run

while true ; do date ; cat /proc/slabinfo ; echo '***' ; sleep 3 ; done

and generate lots of IPv6 traffic through the box (as I mentioned, for
some reason, a Firefox compilation through ssh seems particularly
effective).  So I now have lots of slabinfo data and, beyond the
initial WARNING, I also got messages along the lines of "swapper: page
allocation failure: order:10, mode:0x4020".

I put the full log in <URL: http://www.madore.org/~david/.tmp/pollux-dump.0 >
(unfortunately a bit garbled, because sometimes the cat slabinfo output
was interspersed with printk output, but there are still plenty of
usable lines of each sort).

For completeness, here's a sample message from a page allocation
failure, and a copy of /proc/slabinfo from just about that time (I
have no idea how to read this, but one thing I can say is that there
is no extraordinarily large number in this):

[  567.757489] swapper: page allocation failure: order:10, mode:0x4020
[  567.763815] [c000d728] (unwind_backtrace+0x0/0xf0) from [c009a87c] 
(warn_alloc_failed+0xcc/0x10c)
[  567.773119] [c009a87c] (warn_alloc_failed+0xcc/0x10c) from [c009ce48] 
(__alloc_pages_nodemask+0x530/0x68c)
[  567.783184] [c009ce48] (__alloc_pages_nodemask+0x530/0x68c) from 
[c009cfb4] (__get_free_pages+0x10/0x3c)
[  567.793084] [c009cfb4] (__get_free_pages+0x10/0x3c) from [c00c9fd0] 
(kmalloc_order_trace+0x24/0xdc)
[  567.802547] [c00c9fd0] (kmalloc_order_trace+0x24/0xdc) from [c038d638] 
(pskb_expand_head+0x68/0x298)
[  567.812317] [c038d638] (pskb_expand_head+0x68/0x298) from [bf0e93ec] 
(ip6_forward+0x4d4/0x7bc [ipv6])
[  567.822056] [bf0e93ec] (ip6_forward+0x4d4/0x7bc [ipv6]) from [bf0ebebc] 
(ipv6_rcv+0x2bc/0x3dc [ipv6])
[  567.831751] [bf0ebebc] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [c0394870] 
(__netif_receive_skb+0x544/0x66c)
[  567.841521] [c0394870] (__netif_receive_skb+0x544/0x66c) from [bf1d9054] 
(br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
[  567.853306] [bf1d9054] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 
[bridge]) from [bf1d9ae8] (br_nf_pre_routing+0x59c/0x67c [bridge])
[  567.865673] [bf1d9ae8] (br_nf_pre_routing+0x59c/0x67c [bridge]) from 
[c03bd2a4] (nf_iterate+0x8c/0xb4)
[  567.875387] [c03bd2a4] (nf_iterate+0x8c/0xb4) from [c03bd328] 
(nf_hook_slow+0x5c/0x118)
[  567.883800] [c03bd328] (nf_hook_slow+0x5c/0x118) from [bf1d3fa4] 
(br_handle_frame+0x1b8/0x290 [bridge])
[  567.893624] [bf1d3fa4] (br_handle_frame+0x1b8/0x290 [bridge]) from 
[c03946f8] (__netif_receive_skb+0x3cc/0x66c)
[  567.904137] [c03946f8] (__netif_receive_skb+0x3cc/0x66c) from [c031e254] 
(mv643xx_eth_poll+0x540/0x734)
[  567.913928] [c031e254] (mv643xx_eth_poll+0x540/0x734) from [c0397390] 
(net_rx_action+0x118/0x314)
[  567.923215] [c0397390] (net_rx_action+0x118/0x314) from [c0029924] 
(__do_softirq+0xac/0x234)
[  567.932058] [c0029924] (__do_softirq+0xac/0x234) from [c0029f00] 
(irq_exit+0x94/0x9c)
[  567.940421] [c0029f00] (irq_exit+0x94/0x9c) from [c00094b0] 
(handle_IRQ+0x34/0x84)
[  567.948392] [c00094b0] (handle_IRQ+0x34/0x84) from [c04398d4] 
(__irq_svc+0x34/0x98)
[  567.956454] [c04398d4] (__irq_svc+0x34/0x98) from [c0011d6c] 
(kirkwood_enter_idle+0x4c/0x94)
[  567.965299] [c0011d6c] (kirkwood_enter_idle+0x4c/0x94) from [c0357a00] 
(cpuidle_idle_call+0xc8/0x35c)
[  567.974925] [c0357a00] (cpuidle_idle_call+0xc8/0x35c) from [c0009764] 
(cpu_idle+0x88/0xdc)
[  567.983581] [c0009764] (cpu_idle+0x88/0xdc) from [c05d8720] 
(start_kernel+0x2a0/0x2f0)
[  567.991893] Mem-info:
[  567.994185] Normal per-cpu:
[  567.996995] CPU0: hi:  186, btch:  31 usd:  84
[  568.001815] active_anon:5592 inactive_anon:34 isolated_anon:0
[  568.001820]  active_file:2845 inactive_file:6118 isolated_file:0
[  568.001825]  unevictable:418 dirty:13 writeback:0 unstable:0
[  568.001829]  free:12507 slab_reclaimable:632 slab_unreclaimable:1124
[  568.001835]  mapped:2546 shmem:47 pagetables:152 bounce:0
[  568.031126] Normal free:50028kB min:2884kB low:3604kB high:4324kB 
active_anon:22368kB inactive_anon:136kB active_file:11380kB 
inactive_file:24472kB unevictable:1672kB isolated(anon):0kB isolated(file):0kB 
present:520192kB mlocked:1672kB dirty:52kB writeback:0kB mapped:10184kB 
shmem:188kB slab_reclaimable:2528kB slab_unreclaimable:4496kB 
kernel_stack:584kB pagetables:608kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? no
[  568.071329] lowmem_reserve[]: 0 0
[  568.074696] Normal: 1*4kB 1*8kB 8*16kB 7*32kB 0*64kB 2*128kB 1*256kB 4*512kB 
20*1024kB 11*2048kB 1*4096kB = 50028kB
[  568.085282] 9350 total pagecache pages
[  568.089039] 0 pages in swap cache
[  568.092363] Swap cache stats: add 0, delete 0, find 0/0
[  568.097621] Free swap  = 0kB
[  568.100506] Total swap = 0kB
[  568.117927] 131072 pages of RAM
[  568.121087] 12771 free pages
[  568.123972] 2839 reserved pages
[  568.127140] 1361 slab pages
[  
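As an aid to reading the dump above: the "Normal:" line lists free
blocks per buddy-allocator order, where an order-n block is 2^n
contiguous pages.  With 4kB pages, the failed order:10 request needs
one contiguous 4096kB block.  The sketch below is only arithmetic on
the numbers printed in the log; the array and function names are made
up for illustration, and it does not model the kernel's watermark
checks.

```c
#include <assert.h>
#include <stddef.h>

/* Free-block counts per order, copied from the "Normal:" line of the
   dump (index = order, block size = 4kB << order). */
const unsigned long normal_free[] = {1, 1, 8, 7, 0, 2, 1, 4, 20, 11, 1};
#define NORMAL_ORDERS (sizeof normal_free / sizeof normal_free[0])

/* Size in kB of one block of the given order, for a given page size. */
unsigned long order_kb(unsigned int order, unsigned long page_kb)
{
    assert(order < 8 * sizeof(unsigned long));
    return page_kb << order;
}

/* Total free memory in kB described by a per-order free list. */
unsigned long total_free_kb(const unsigned long *counts, size_t n,
                            unsigned long page_kb)
{
    unsigned long sum = 0;
    for (size_t o = 0; o < n; o++)
        sum += counts[o] * order_kb((unsigned int)o, page_kb);
    return sum;
}

/* Number of free blocks at least as large as the requested order. */
unsigned long blocks_at_least(const unsigned long *counts, size_t n,
                              unsigned int order)
{
    unsigned long blocks = 0;
    for (size_t o = order; o < n; o++)
        blocks += counts[o];
    return blocks;
}
```

With the values above, total_free_kb(normal_free, NORMAL_ORDERS, 4)
gives 50028, matching the log, while blocks_at_least(normal_free,
NORMAL_ORDERS, 10) gives 1: plenty of memory is free in total, but
almost none of it in a piece large enough for an order-10 request.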

Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()

2012-08-31 Thread David Madore
On Fri, Aug 31, 2012 at 12:59:36PM +0200, David Madore wrote:
> On Wed, Aug 29, 2012 at 09:32:20PM +0200, Francois Romieu wrote:
> > David Madore <david...@madore.org> :
> > [...]
> > > I imagine it being somehow related to the fact that it operates a
> > > network bridge (I imagine this because I have another identical
> > > machine with exactly the same kernel and a very similar config but not
> > > running a bridge, and the warning never pops up).
> > 
> > Could it not be a genuine allocation failure ?
> 
> I have no idea.  How can I tell?  In any case, if having 512MB RAM
> isn't enough for the kernel in the router of a small home's network,
> that's a bug somewhere, isn't it?

PS: I'm also getting the following kind of messages from a wlan
interface that's on the bridge:

[  268.976317] ieee80211 phy0: failed to reallocate TX buffer
[  716.880515] ieee80211 phy0: failed to reallocate TX buffer
[ 1160.877677] ieee80211 phy0: failed to reallocate TX buffer

Could they be related?

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()

2012-08-31 Thread David Madore
On Wed, Aug 29, 2012 at 09:32:20PM +0200, Francois Romieu wrote:
> David Madore <david...@madore.org> :
> [...]
> > I imagine it being somehow related to the fact that it operates a
> > network bridge (I imagine this because I have another identical
> > machine with exactly the same kernel and a very similar config but not
> > running a bridge, and the warning never pops up).
> 
> Could it not be a genuine allocation failure ?

I have no idea.  How can I tell?  In any case, if having 512MB RAM
isn't enough for the kernel in the router of a small home's network,
that's a bug somewhere, isn't it?

Also:

On Wed, Aug 29, 2012 at 02:25:48AM +0200, David Madore wrote:
> Is this worth investigating?  (I will, of course, provide the config
> file and any other relevant data if the answer is "yes".)  Is this
> potentially serious?  (I'm getting hard lockups on this machine which
> I suspect are due to hardware and unrelated to this, but if someone
> tells me it could be the cause, I'd be more than happy to believe it.)

I'm now inclined to believe the hard lockups are indeed related to
this (I can semi-reproducibly make them happen with only network
traffic - actually, with the messages of a compilation taking place on
another machine being routed through this box (over IPv6)).

So how can I help debug this?  (One difficulty is that I have only
remote access to this box, and it's not meant for experimenting with.)

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()

2012-08-28 Thread David Madore
Dear all,

I hope this is the right place to send this sort of backtrace dump.

I'm getting the following sort of dumps (below) on a 3.2.27 kernel on
an arm/kirkwood (actually DreamPlug) machine that's used as a router.

I imagine it being somehow related to the fact that it operates a
network bridge (I imagine this because I have another identical
machine with exactly the same kernel and a very similar config but not
running a bridge, and the warning never pops up).

Is this worth investigating?  (I will, of course, provide the config
file and any other relevant data if the answer is "yes".)  Is this
potentially serious?  (I'm getting hard lockups on this machine which
I suspect are due to hardware and unrelated to this, but if someone
tells me it could be the cause, I'd be more than happy to believe it.)

[24711.204492] ------------[ cut here ]------------
[24711.209151] WARNING: at mm/page_alloc.c:2109 
__alloc_pages_nodemask+0x1d4/0x68c()
[24711.216667] Modules linked in: 8021q ath9k_htc mac80211 ath9k_common 
ath9k_hw ath cfg80211 bnep rfcomm sit tunnel4 sch_ingress cls_fw cls_u32 
sch_sfq sch_htb pppoe pppox ppp_generic slhc bridge stp llc ip6t_REJECT 
ip6table_filter ip6table_mangle xt_NOTRACK ip6table_raw ip6_tables 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp ipt_REJECT 
xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat xt_TCPMSS 
xt_tcpudp xt_mark iptable_mangle ip_tables x_tables nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 orion_wdt ipv6 snd_usb_audio snd_pcm snd_page_alloc 
snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_rawmidi 
btmrvl_sdio btmrvl snd_seq snd_timer snd_seq_device snd bluetooth soundcore
[24711.280663] [c000d728] (unwind_backtrace+0x0/0xf0) from [c0022f74] 
(warn_slowpath_common+0x50/0x68)
[24711.290124] [c0022f74] (warn_slowpath_common+0x50/0x68) from [c0022fa8] 
(warn_slowpath_null+0x1c/0x24)
[24711.299845] [c0022fa8] (warn_slowpath_null+0x1c/0x24) from [c009caec] 
(__alloc_pages_nodemask+0x1d4/0x68c)
[24711.309914] [c009caec] (__alloc_pages_nodemask+0x1d4/0x68c) from 
[c009cfb4] (__get_free_pages+0x10/0x3c)
[24711.319805] [c009cfb4] (__get_free_pages+0x10/0x3c) from [c00c9fd0] 
(kmalloc_order_trace+0x24/0xdc)
[24711.329269] [c00c9fd0] (kmalloc_order_trace+0x24/0xdc) from [c038d638] 
(pskb_expand_head+0x68/0x298)
[24711.338901] [c038d638] (pskb_expand_head+0x68/0x298) from [bf0dd3ec] 
(ip6_forward+0x4d4/0x7bc [ipv6])
[24711.348638] [bf0dd3ec] (ip6_forward+0x4d4/0x7bc [ipv6]) from [bf0dfebc] 
(ipv6_rcv+0x2bc/0x3dc [ipv6])
[24711.358333] [bf0dfebc] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [c0394870] 
(__netif_receive_skb+0x544/0x66c)
[24711.368106] [c0394870] (__netif_receive_skb+0x544/0x66c) from [bf1cd054] 
(br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
[24711.379899] [bf1cd054] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 
[bridge]) from [bf1cdae8] (br_nf_pre_routing+0x59c/0x67c [bridge])
[24711.392271] [bf1cdae8] (br_nf_pre_routing+0x59c/0x67c [bridge]) from 
[c03bd2a4] (nf_iterate+0x8c/0xb4)
[24711.401988] [c03bd2a4] (nf_iterate+0x8c/0xb4) from [c03bd328] 
(nf_hook_slow+0x5c/0x118)
[24711.410540] [c03bd328] (nf_hook_slow+0x5c/0x118) from [bf1c7fa4] 
(br_handle_frame+0x1b8/0x290 [bridge])
[24711.420367] [bf1c7fa4] (br_handle_frame+0x1b8/0x290 [bridge]) from 
[c03946f8] (__netif_receive_skb+0x3cc/0x66c)
[24711.430872] [c03946f8] (__netif_receive_skb+0x3cc/0x66c) from [c031e254] 
(mv643xx_eth_poll+0x540/0x734)
[24711.440680] [c031e254] (mv643xx_eth_poll+0x540/0x734) from [c0397390] 
(net_rx_action+0x118/0x314)
[24711.449970] [c0397390] (net_rx_action+0x118/0x314) from [c0029924] 
(__do_softirq+0xac/0x234)
[24711.458817] [c0029924] (__do_softirq+0xac/0x234) from [c0029f00] 
(irq_exit+0x94/0x9c)
[24711.467046] [c0029f00] (irq_exit+0x94/0x9c) from [c00094b0] 
(handle_IRQ+0x34/0x84)
[24711.475007] [c00094b0] (handle_IRQ+0x34/0x84) from [c04398d4] 
(__irq_svc+0x34/0x98)
[24711.483068] [c04398d4] (__irq_svc+0x34/0x98) from [c0011d6c] 
(kirkwood_enter_idle+0x4c/0x94)
[24711.491908] [c0011d6c] (kirkwood_enter_idle+0x4c/0x94) from [c0357a00] 
(cpuidle_idle_call+0xc8/0x35c)
[24711.501532] [c0357a00] (cpuidle_idle_call+0xc8/0x35c) from [c0009764] 
(cpu_idle+0x88/0xdc)
[24711.510201] [c0009764] (cpu_idle+0x88/0xdc) from [c05d8720] 
(start_kernel+0x2a0/0x2f0)
[24711.518512] ---[ end trace e1776fbe32468909 ]---

-- 
 David A. Madore
   ( http://www.madore.org/~david/ )


setting the init process's personality?

2007-11-25 Thread David Madore
Hi,

Is there a simple way (via a kernel boot option or config setting or -
if really necessary - a patch or something like that) to set the
personality for the init process?  I'm running an x86_64 kernel on a
system whose userland is almost entirely 32-bit (but needs an
occasional 64-bit process to be run, hence the choice of kernel), and
I'd like `uname -m` to be i686 unless I take special action.  So I
think that means letting init (which is indeed a 32-bit process) have
the PER_LINUX32 personality (in case I'm wrong about this, the output
of uname -m is essentially what matters to me).

So, where does the default come from?
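
A userland workaround that does not depend on the kernel default is
sketched below, on the assumption that whatever starts the 32-bit
processes can call personality(2) first: the personality is inherited
across fork() and normally kept across execve(), which is how the
setarch/linux32 wrappers work.  The helper name machine_under is
hypothetical; it forks a child, switches that child to the requested
personality, and reports the machine string uname() returns there.

```c
#include <assert.h>
#include <string.h>
#include <sys/personality.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

/* Store in buf the uname machine string seen by a child process that
   has switched to the given personality.  Returns 0 on success. */
int machine_under(unsigned long persona, char *buf, size_t buflen)
{
    int fds[2];
    int status;
    ssize_t n;
    pid_t pid;

    assert(buf != NULL && buflen > 1);
    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: switch personality, then ask the kernel who we are. */
        struct utsname u;
        close(fds[0]);
        if (personality(persona) == -1 || uname(&u) != 0)
            _exit(1);
        if (write(fds[1], u.machine, strlen(u.machine)) < 0)
            _exit(1);
        _exit(0);
    }
    /* Parent: collect the child's answer and exit status. */
    close(fds[1]);
    n = read(fds[0], buf, buflen - 1);
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n <= 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Calling machine_under(PER_LINUX32, buf, sizeof buf) on an x86_64
kernel would be expected to yield i686, where PER_LINUX yields x86_64.
A small wrapper doing personality(PER_LINUX32) followed by execve() of
the real init is one way to apply this to the whole process tree.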

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


any help on RTC-related config? (and "rtc_cmos: probe of 00:03 failed with error -16" error message)

2007-11-09 Thread David Madore
Hi all,

I'm extremely confused as to what all the RTC-related config variables
in the kernel mean and what I'm supposed to do with them, and I wonder
if someone can help me or point me to some doc besides rtc.txt (which
I've read, of course).

I understand (from reading Documentation/rtc.txt) that there are two
different RTC driver systems for Linux: an "old" one, supporting only
one PC-AT-compatible RTC source, which drives /dev/rtc, and a "new"
one, supporting different sources, which drives /dev/rtc[0123...].
But I'm not sure which configuration variables enable which, whether
they should be enabled together or whether I should choose between the
twain, and what I should be using on my system anyway.

I sort of gathered (I hope not too incorrectly) that the "genrtc"
module is brought by the CONFIG_GEN_RTC configuration choice and that
it contains the "old" driver, whereas the "new" driver is split
between modules such as "rtc", "rtc_lib", "rtc_core" and actual
drivers like "rtc_cmos" - right? - and configured by such switches as
CONFIG_RTC_CLASS, CONFIG_RTC_LIB and CONFIG_RTC_DRV_CMOS.  There might
also be a CONFIG_RTC variable, about which I'm not sure.

I'm also very confused about how HPETs tie into this, and what
CONFIG_HPET_EMULATE_RTC does, for example.

Now how do I know what's on my system?  (It's an ASUS P5W64 WS Pro
based x86_64.)  I certainly have some kind of CMOS clock that I can
configure in my BIOS, but I don't know about HPETs or other kinds of
RTC sources.

I tried using the following config (this is all with 2.6.22.10):

CONFIG_RTC=m
CONFIG_GEN_RTC=m
CONFIG_GEN_RTC_X=y
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
CONFIG_RTC_DRV_TEST=m
CONFIG_RTC_DRV_CMOS=m
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_V3020 is not set

Now if I load the genrtc module (to use the "old" driver?), I get a
/dev/rtc which may or may not be satisfactory but the
dev.rtc.max-user-freq sysctl does not exist and ALSA does not use
snd_rtctimer.  If I try unloading genrtc and instead loading the rtc,
rtc_lib, rtc_core and rtc_cmos modules (to use the "new" driver?), I
get the following error in dmesg:

Real Time Clock Driver v1.12ac
rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
rtc_cmos: probe of 00:03 failed with error -16

After that, attempts to, e.g., play a MIDI file with ALSA fail (only a
single note is played) and the following error occurs in dmesg:

rtc: lost some interrupts at 1024Hz.

So, why does rtc_cmos fail that way?  And how am I supposed to
configure RTC as a whole?  (I will, of course, gladly provide more
information if requested.)
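
On the error code itself: kernel messages print failures as negated
errno values, and errno 16 is EBUSY ("Device or resource busy"), so the
probe apparently failed because something it needed was already
claimed.  A guess, not confirmed anywhere in this thread, is that the
legacy rtc module (the one printing "Real Time Clock Driver v1.12ac")
had already taken the CMOS RTC's resources before rtc_cmos was loaded.
The helper names below are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Kernel log messages report failures as negated errno values, so
   "failed with error -16" carries errno 16. */
int errno_from_kernel_code(int kernel_code)
{
    assert(kernel_code <= 0);
    return -kernel_code;
}

/* Human-readable text for a kernel-style negative error code, as
   worded by the C library's strerror(). */
const char *describe_kernel_code(int kernel_code)
{
    return strerror(errno_from_kernel_code(kernel_code));
}
```

Here errno_from_kernel_code(-16) equals EBUSY on Linux, and
describe_kernel_code(-16) returns the C library's text for it.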

Thanks for any help!

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


Re: patch/option to wipe memory at boot?

2007-09-17 Thread David Madore
On Mon, Sep 17, 2007 at 11:11:52AM -0700, Jeremy Fitzhardinge wrote:
> Boot memtest86 for a little while before booting the kernel?  And if you
> haven't already run it for a while, then that would be your first step
> anyway.

Indeed, that does the trick, thanks for the suggestion.  So I can be
quite confident, now, that my RAM is sane and it's just that the BIOS
doesn't initialize it properly.

But I'd still like some way of filling the RAM when Linux starts (or
perhaps in the bootloader), because letting memtest86 run after every
cold reboot isn't a very satisfactory solution.

Happy hacking,

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


patch/option to wipe memory at boot?

2007-09-17 Thread David Madore
Hi,

Is there a patch or a boot option or something which wipes all
available (physical) RAM at boot (or better, fills it with a fixed
signature like 0xdeadbeef)?  I'm getting phony ECC errors and I'd like
to test whether they go away when the RAM is properly initialized.
Also, I'd like to know exactly which parts of RAM are being used and
which are untouched since boot (hence the 0xdeadbeef signature).

If this patch/option doesn't exist, can anyone give me a hint as to
where and how it would be best to add this?  (I'm afraid I'm very
ignorant as to how Linux sets up its RAM mapping.)  I'm concerned
about x86 and x86_64.

PS: I'm not finicky: it's all right if a couple of megabytes at the
bottom of RAM are not scrubbed (I'm more interested in the top
gigabyte-or-so), especially if they're guaranteed to be used by the
kernel.
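
To make the signature idea concrete, here is a userspace sketch only:
it fills an ordinary buffer with 0xdeadbeef and later counts how many
words still hold the signature.  It does not touch physical RAM at
boot; doing that would need code in the bootloader or early kernel,
and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIPE_SIGNATURE UINT32_C(0xdeadbeef)

/* Overwrite every 32-bit word of the region with the signature. */
void fill_signature(uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        words[i] = WIPE_SIGNATURE;
}

/* Count the words that still hold the signature, i.e. that nothing
   has written a different value to since the fill. */
size_t count_untouched(const uint32_t *words, size_t count)
{
    size_t untouched = 0;
    for (size_t i = 0; i < count; i++)
        if (words[i] == WIPE_SIGNATURE)
            untouched++;
    return untouched;
}
```

After fill_signature() on an 8-word buffer, count_untouched() reports
8; overwrite one word and it reports 7, which is the kind of "used
versus untouched" accounting the signature is meant to allow.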

Happy hacking,

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


setsystz utility: set the kernel's system time zone

2007-02-19 Thread David Madore
// Hi.  I felt the need to write the following little utility (which
// is mostly comments, really), to prevent my digital camera's image
// files from having incorrect modification times when I mount them
// under Linux.  Comments are welcome.

// Enjoy!

/// cut after ///
/* setsystz: set the Linux kernel's idea of the time zone */
/* David A. Madore <[EMAIL PROTECTED]>, 2007-02-19.  Public Domain */

/* Rationale: the Linux kernel needs to have some idea of time zone,
   notably because some filesystems (e.g. FAT) store file
   modification/access times in local time rather than UTC(=GMT)
   (which Unix uses internally for all timestamps).  This kernel
   (system) time zone is set through the settimeofday() system call;
   unfortunately, there does not seem to be a practical way to do it,
   and some (all?) Linux distributions get it wrong: e.g., simply
   because my CMOS clock is set to GMT (as recommended), my Debian
   init scripts apparently assume that any FAT filesystems I'll be
   mounting will have GMT timestamps (uh?).  Note: IMHO, the whole
   idea of having a per-system global time zone is probably wrong, and
   FAT mounts should probably better use an adhoc option to specify
   GMT offset (defaulting to the libc time zone for the mount
   process), and CMOS clock thingies should be kept separate. */

/* What this does: called without arguments, setsystz sets the
   kernel's time zone to the userland's time zone (typically from the
   /etc/localtime file, overridden by the TZ environment variable if
   it exists).  With an explicit argument, setsystz sets the kernel's
   time zone to that many minutes west of GMT (see settimeofday(2) man
   page for explanations).  This program takes care _not_ to
   change/warp the system clock while changing the time zone: see
   comments on avoid_linux_braindeadness() below. */

/* How to use: probably just call "setsystz" (as root) before mounting
   a FAT filesystem, if the files it contains are in your usual system
   time zone.  If they are, e.g., from the Shanghai time zone, then
   use "TZ=Asia/Shanghai setsystz" before mounting.  Note: it's
   probably wiser not to do this while there are existing mounted FAT
   filesystems. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

int
auto_minutes (void)
     /* Determine localtime GMT offset and return it in minutes west
        of GMT (as expected by a struct timezone).  This will
        typically use the TZ environment variable if it is defined or,
        as a fallback, the contents of /etc/localtime (see libc
        documentation for more details). */
{
  time_t now = time(NULL);
  struct tm *lt = localtime (&now);
  long int gmtoff = lt->tm_gmtoff;
  fprintf (stderr, "GMT offset=%lds\n", gmtoff);
  if ( gmtoff%60 )
    fprintf (stderr, "warning: GMT offset %lds "
             "is not an integer number of minutes\n", gmtoff);
  gmtoff /= 60;
  return -gmtoff;
}

void
avoid_linux_braindeadness (void)
     /* We ___DO NOT___ want to change the system time, only the
        system time zone!  Since Linux does something special
        (warp_clock() semantics) the very first time settimeofday() is
        called with tz!=NULL, we call it once with tz pointing to a
        GMT-filled structure, i.e., tz->tz_minuteswest==0 (so the
        clock won't be warped).  The settimeofday(2) man page claims
        that tz->tz_minuteswest==0 will not count toward cancelling
        the warp_clock() semantics, i.e., that our trick does not
        work: fortunately, it is wrong (at least under 2.6.19 and
        whereabouts) and our trick works.  Note however that this
        still resets the time interpolator the first time:
        unfortunately there does not seem to be a way around this
        problem.  See /usr/src/linux/kernel/time.c for details
        about the whole mess.  -- David A. Madore 2007-02-19 */
{
  struct timezone tz;
  memset (&tz, 0, sizeof(struct timezone));
  tz.tz_minuteswest = 0;
  tz.tz_dsttime = 0;
  settimeofday (NULL, &tz);
}

int
main (int argc, char *argv[])
{
  int minuteswest;
  if ( argc == 1 )
    minuteswest = auto_minutes();
  else if ( argc == 2 )
    {
      if ( sscanf (argv[1], "%d", &minuteswest) != 1 )
        {
          fprintf (stderr, "invalid argument: %s\n", argv[1]);
          exit (2);
        }
    }
  else
    {
      fprintf (stderr, "wrong number of arguments\n");
      exit (2);
    }
  struct timezone tz;
  memset (&tz, 0, sizeof(struct timezone));
  tz.tz_minuteswest = minuteswest;
  tz.tz_dsttime = 0;
  fprintf (stderr, "setting system time zone to tz_minuteswest=%d\n",
           minuteswest);
#if 1
  avoid_linux_braindeadness ();
  if ( settimeofday (NULL, &tz) == -1 )
    {
      perror ("settimeofday()");
      exit (EXIT_FAILURE);
    }
#endif
  return 0;
}
/// cut before ///
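
The sign convention is the easy thing to get backwards here: tm_gmtoff
counts seconds east of GMT, while tz_minuteswest counts minutes west.
The helper below isolates that conversion so it can be checked by hand;
its name is hypothetical and it is not part of setsystz itself.

```c
/* Convert a GMT offset in seconds east of GMT (the convention of
   tm_gmtoff) into minutes west of GMT (the convention of
   tz_minuteswest).  Any leftover seconds are truncated, as in
   auto_minutes() above. */
int
minutes_west_from_gmtoff (long gmtoff)
{
  return (int) -(gmtoff / 60);
}
```

For instance, Asia/Shanghai is 8 hours east of GMT (gmtoff 28800), so
the kernel would be given tz_minuteswest = -480; a zone 5 hours west of
GMT (gmtoff -18000) gives 300.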

Re: [patch] netfilter: implement TCPMSS target for IPv6

2007-01-14 Thread David Madore
On Sun, Jan 14, 2007 at 09:10:45PM +0100, Jan Engelhardt wrote:
> On Jan 14 2007 20:20, David Madore wrote:
> >Implement TCPMSS target for IPv6 by shamelessly copying from
> >Marc Boucher's IPv4 implementation.
> 
> Would not it be worthwhile to merge ipt_TCPMSS and
> ip6t_TCPMSS to xt_TCPMSS instead?

It may be, but I'm afraid that's outside my competence.  I happened to
need ip6t_TCPMSS badly and soon, so I went for the quickest solution.
Of course, I'd appreciate it if someone were to do it in a better way.

Happy hacking,

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


[patch] netfilter: implement TCPMSS target for IPv6

2007-01-14 Thread David Madore
Implement TCPMSS target for IPv6 by shamelessly copying from
Marc Boucher's IPv4 implementation.

Signed-off-by: David A. Madore <[EMAIL PROTECTED]>

---

 Note: The patch for ip6tables to make use of this module can be
 obtained from <URL: ftp://quatramaran.ens.fr/pub/madore/misc/ip6t-TCPMSS/ >
 (also contains a version of this same patch for 2.6.19.2).
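
 As a reading aid only, here is a userspace sketch of the two things a
 TCPMSS-style target has to do with a SYN packet: find the MSS option
 in the TCP option block, and clamp it to the path MTU minus 60 (40
 bytes of IPv6 header plus 20 bytes of TCP header).  This is not the
 code from the patch: the names find_mss, clamp_mss and the OPT_*
 macros are invented here, and unlike the patch's optlen() helper the
 sketch simply stops at a malformed option instead of stepping over it.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL     0   /* end of option list */
#define OPT_NOP     1   /* one-byte padding option */
#define OPT_MSS     2   /* maximum segment size option */
#define OPT_MSS_LEN 4   /* kind + length + 16-bit value */

/* Return the MSS advertised in a TCP option block of len bytes,
   or 0 if no well-formed MSS option is found. */
unsigned int
find_mss (const uint8_t *opt, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      if (opt[i] == OPT_EOL)
        break;
      if (opt[i] == OPT_NOP)
        {
          i++;
          continue;
        }
      /* Every other option carries a length byte of at least 2. */
      if (i + 1 >= len || opt[i + 1] < 2)
        break;
      if (opt[i] == OPT_MSS && opt[i + 1] == OPT_MSS_LEN
          && i + OPT_MSS_LEN <= len)
        return ((unsigned int) opt[i + 2] << 8) | opt[i + 3];
      i += opt[i + 1];
    }
  return 0;
}

/* Clamp an advertised MSS to what fits in one packet of size mtu. */
unsigned int
clamp_mss (unsigned int mss, unsigned int mtu)
{
  if (mtu <= 60)
    return mss;             /* nonsensical MTU: leave the MSS alone */
  unsigned int limit = mtu - 60;
  return mss > limit ? limit : mss;
}
```

 With an MTU of 1500, an advertised MSS of 1460 would be clamped to
 1440, while an MSS of 1200 would be left untouched.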

 include/linux/netfilter_ipv6/ip6t_TCPMSS.h |   10 ++
 net/ipv6/netfilter/Kconfig                 |   26 
 net/ipv6/netfilter/Makefile                |    1 +
 net/ipv6/netfilter/ip6t_TCPMSS.c           |  225 
 4 files changed, 262 insertions(+), 0 deletions(-)

diff --git a/include/linux/netfilter_ipv6/ip6t_TCPMSS.h b/include/linux/netfilter_ipv6/ip6t_TCPMSS.h
new file mode 100644
index 000..412d1cb
--- /dev/null
+++ b/include/linux/netfilter_ipv6/ip6t_TCPMSS.h
@@ -0,0 +1,10 @@
+#ifndef _IP6T_TCPMSS_H
+#define _IP6T_TCPMSS_H
+
+struct ip6t_tcpmss_info {
+   u_int16_t mss;
+};
+
+#define IP6T_TCPMSS_CLAMP_PMTU 0x
+
+#endif /*_IP6T_TCPMSS_H*/
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index adcd613..3890a59 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -154,6 +154,32 @@ config IP6_NF_TARGET_REJECT
 
  To compile it as a module, choose M here.  If unsure, say N.
 
+config IP6_NF_TARGET_TCPMSS
+   tristate "TCPMSS target support"
+   depends on IP6_NF_IPTABLES
+   ---help---
+ This option adds a `TCPMSS' target, which allows you to alter the
+ MSS value of TCP SYN packets, to control the maximum size for that
+ connection (usually limiting it to your outgoing interface's MTU
+ minus 60).
+
+ This is used to overcome criminally braindead ISPs or servers which
+ block ICMPv6 Packet Too Big packets.  The symptoms of this
+ problem are that everything works fine from your Linux
+ firewall/router, but machines behind it can never exchange large
+ packets:
+   1) Web browsers connect, then hang with no data received.
+   2) Small mail works fine, but large emails hang.
+   3) ssh works fine, but scp hangs after initial handshaking.
+
+ Workaround: activate this option and add a rule to your firewall
+ configuration like:
+
+ ip6tables -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
+-j TCPMSS --clamp-mss-to-pmtu
+
+ To compile it as a module, choose M here.  If unsure, say N.
+
 config IP6_NF_MANGLE
tristate "Packet mangling"
depends on IP6_NF_IPTABLES
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index ac1dfeb..616a006 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_IP6_NF_TARGET_LOG) += ip6t_LOG.o
 obj-$(CONFIG_IP6_NF_RAW) += ip6table_raw.o
 obj-$(CONFIG_IP6_NF_MATCH_HL) += ip6t_hl.o
 obj-$(CONFIG_IP6_NF_TARGET_REJECT) += ip6t_REJECT.o
+obj-$(CONFIG_IP6_NF_TARGET_TCPMSS) += ip6t_TCPMSS.o
 
 # objects for l3 independent conntrack
 nf_conntrack_ipv6-objs  :=  nf_conntrack_l3proto_ipv6.o nf_conntrack_proto_icmpv6.o nf_conntrack_reasm.o
diff --git a/net/ipv6/netfilter/ip6t_TCPMSS.c b/net/ipv6/netfilter/ip6t_TCPMSS.c
new file mode 100644
index 000..ab492c3
--- /dev/null
+++ b/net/ipv6/netfilter/ip6t_TCPMSS.c
@@ -0,0 +1,225 @@
+/*
+ * This is a module which is used for setting the MSS option in TCP packets.
+ *
+ * Copyright (C) 2007 David Madore <[EMAIL PROTECTED]>
+ *
+ * Shamelessly based on net/ipv4/netfilter/ipt_TCPMSS.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
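+#include <linux/module.h>
+#include <linux/skbuff.h>
+
+#include <net/ipv6.h>
+#include <net/tcp.h>
+
+#include <linux/netfilter_ipv6/ip6_tables.h>
+#include <linux/netfilter_ipv6/ip6t_TCPMSS.h>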
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("David Madore <[EMAIL PROTECTED]>");
+MODULE_DESCRIPTION("ip6tables TCP MSS modification module");
+
+static inline unsigned int
+optlen(const u_int8_t *opt, unsigned int offset)
+{
+   /* Beware zero-length options: make finite progress */
+   if (opt[offset] <= TCPOPT_NOP || opt[offset+1] == 0)
+   return 1;
+   else
+   return opt[offset+1];
+}
+
+static unsigned int
+ip6t_tcpmss_target(struct sk_buff **pskb,
+  const struct net_device *in,
+  const struct net_device *out,
+  unsigned int hooknum,
+  const struct xt_target *target,
+  const void *targinfo)
+{
+   const struct ip6t_tcpmss_info *tcpmssinfo = targinfo;
+   struct tcphdr *tcph;
+   struct ipv6hdr *ipv6h;
+   u_int8_t nexthdr;
+   int tcphoff;
+   u_int16_t tcplen, newmss;
+   __be16 newiplen, oldval;
+   unsigned int i;
+   u_int8_t *opt;
+
+   if (!skb_make_writable(psk

Re: [Patch] Support UTF-8 scripts

2005-08-16 Thread David Madore
On Sun, Aug 14, 2005 at 08:00:31PM +, Lee Revell wrote:
>  We write code in ASCII, dammit.

<URL: http://www.madore.org/~david/weblog/2004-12.html#d.2004-12-03.0813 >

:-)

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


[slightly OT] what's in RAM at 0x3ffe5000 ?

2005-08-12 Thread David Madore
Hi.

I have ECC RAM on my system and I wanted to check it, so (because
there doesn't seem to be any Linux ECC support for my P5WD2
motherboard) I wrote my own kernel module[#] to interrogate the
northbridge.  I was a little annoyed to find that the northbridge had
reported an ECC error (and a multi-bit uncorrectable error at that!)
at memory location 0x3ffe5000.  I cleared the error flag and ran
multiple checks and couldn't find any other error, so I started
thinking about this address, and I realized that it was very near the
top of memory (I have 1GB RAM).  In fact, it is reported as "reserved"
by Linux:

BIOS-provided physical RAM map:
 BIOS-e820:  - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000e4000 - 0010 (reserved)
 BIOS-e820: 0010 - 3ff8 (usable)
 BIOS-e820: 3ff8 - 3ff8e000 (ACPI data)
 BIOS-e820: 3ff8e000 - 3ffe (ACPI NVS)
 BIOS-e820: 3ffe - 4000 (reserved)
 BIOS-e820: ffb0 - 0001 (reserved)

Now /dev/mem won't work that far so I can't read what's there, but I
suspect there's something very strange in that place and the ECC error
reported by the northbridge is not really an error.  Interestingly
enough, I always get an error at 0x3ffe5000 when I boot, and then
later on I get an error at 0x3fff0580.  This is consistent: I always
get those "errors" at the same memory locations, and they're always
multiple-bit errors.

So here are my questions:

* What does "reserved" mean in the BIOS physical RAM table?  Reserved
by whom?  Who owns my memory?  Do all my base are belong to him?

* What's the simplest way, under Linux (whether in userspace or in
kernel), to read the contents of a _physical_ memory location, given
that /dev/mem won't do it:

vega david ~ $ sudo dd if=/dev/mem bs=4096 count=1 skip=262117 of=/tmp/page
dd: reading `/dev/mem': Bad address
0+0 records in
0+0 records out
0 bytes transferred in 0.000118 seconds (0 bytes/sec)

* Why am I getting ECC errors in this strange place, and only there?
Do I need to worry about them?  (I mean, if it's something strange
like memory-mapped I/O I would expect the northbridge to know about it
and not report an error!)
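
For anyone checking the dd invocation above: with bs=4096, the skip
count is just the physical address divided by the block size.  The
sketch below spells out that arithmetic; the function names and the
BLOCK_SIZE_BYTES macro are made up for the illustration.

```c
#include <stdint.h>

/* Block size used in the dd command (bs=4096); hypothetical name. */
#define BLOCK_SIZE_BYTES 4096u

/* Index of the 4096-byte block containing a physical address,
   i.e. the value to pass to dd as skip=. */
uint64_t
block_index (uint64_t phys_addr)
{
  return phys_addr / BLOCK_SIZE_BYTES;
}

/* Offset of a physical address within its block. */
uint64_t
block_offset (uint64_t phys_addr)
{
  return phys_addr % BLOCK_SIZE_BYTES;
}
```

Address 0x3ffe5000 falls at the start of block 262117, which matches
the skip=262117 used above; the second address, 0x3fff0580, is at
offset 0x580 in block 262128.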

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )

[#] Source available on demand - it's pretty damn ugly, I wouldn't
want Mr. Torvalds to see it!


how do I read CPU temperature in ACPI? (w/ P5WD2 motherboard)

2005-08-10 Thread David Madore
Hi.  I apologize for what is surely a stupid question: I understand
that ACPI should be able to tell me what my CPU's temperature is (I
have a severe overheating problem and I am trying to solve it by
underclocking somewhat, but I need to be able to read the temperature
to do anything worthwhile), but no matter what ACPI modules I load, I
can't find any hint of a CPU temperature reading anywhere below
/proc/acpi (the /proc/acpi/thermal_zone/ directory, for example,
remains empty).

That's with the "thermal", "processor" and "fan" modules loaded (and a
few others; full listing follows signature).  I tried to load the
asus_acpi module also, since I have an ASUS motherboard (a P5WD2
Premium - precise details are given below signature), but I got a "No
such device" error.  Does that mean my motherboard is unsupported and
I cannot read my CPU temperature at all?  (But I thought the whole
_point_ of ACPI was that it was an abstraction away from the hardware:
so why is there such a thing as "Asus" ACPI?)  Or else, what am I
doing wrong?

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )

Full details:

hardware config:

Asus P5WD2 Premium motherboard
Intel 955X chipset
Intel Pentium 4 550 @3.4GHz processor

lsmod output:

Module                  Size  Used by
ac                      5892  0
container               5504  0
fan                     5636  0
video                  17156  0
thermal                14600  0
processor              24328  1 thermal
nvidia               3714948  12
agpgart                37580  1 nvidia
ide_cd                 44804  0
cdrom                  42528  1 ide_cd
af_packet              25480  4
ip6table_filter         3840  1
ip6_tables             21120  1 ip6table_filter
ipt_REJECT              6656  8
ipt_TOS                 3584  1
reiserfs              281716  7
snd_emu10k1_synth       8960  0
snd_emu10k1           120068  1 snd_emu10k1_synth
snd_ac97_codec         84216  1 snd_emu10k1
snd_pcm_oss            54688  0
snd_mixer_oss          20736  1 snd_pcm_oss
snd_pcm                96900  3 snd_emu10k1,snd_ac97_codec,snd_pcm_oss
snd_page_alloc         11140  2 snd_emu10k1,snd_pcm
snd_emux_synth         39680  1 snd_emu10k1_synth
snd_seq_virmidi         9088  1 snd_emux_synth
snd_seq_midi_emul       8576  1 snd_emux_synth
snd_seq_dummy           4740  0
snd_seq_oss            36992  0
snd_seq_midi           10144  0
snd_rawmidi            27040  3 snd_emu10k1,snd_seq_virmidi,snd_seq_midi
snd_seq_midi_event      8960  3 snd_seq_virmidi,snd_seq_oss,snd_seq_midi
snd_seq                57360  9 snd_emux_synth,snd_seq_virmidi,snd_seq_midi_emul,snd_seq_dummy,snd_seq_oss,snd_seq_midi,snd_seq_midi_event
snd_timer              27012  3 snd_emu10k1,snd_pcm,snd_seq
snd_seq_device          9740  8 snd_emu10k1_synth,snd_emu10k1,snd_emux_synth,snd_seq_dummy,snd_seq_oss,snd_seq_midi,snd_rawmidi,snd_seq
snd_hwdep              10400  2 snd_emu10k1,snd_emux_synth
snd                    57956  13 snd_emu10k1,snd_ac97_codec,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_emux_synth,snd_seq_virmidi,snd_seq_oss,snd_rawmidi,snd_seq,snd_timer,snd_seq_device,snd_hwdep
soundcore              11232  1 snd
snd_util_mem            5632  2 snd_emu10k1,snd_emux_synth
ipv6                  272544  24
mousedev               13220  2
iptable_mangle          3968  1
iptable_nat            25268  0
ip_conntrack           47208  1 iptable_nat
iptable_filter          4096  1
ip_tables              23296  5 ipt_REJECT,ipt_TOS,iptable_mangle,iptable_nat,iptable_filter
capability              5896  0
commoncap               8064  1 capability
ext2                   72328  6
ext3                  146952  0
jbd                    64920  1 ext3
mbcache                11268  2 ext2,ext3
ppp_deflate             7424  0
zlib_deflate           23704  1 ppp_deflate
bsd_comp                7168  0
tun                    13056  1
ppp_async              13312  1
ppp_generic            31892  7 ppp_deflate,bsd_comp,ppp_async
slhc                    8064  1 ppp_generic
crc_ccitt               3072  1 ppp_async
dummy                   4100  0
dm_mod                 62368  0
ohci1394               37300  0
ieee1394              107704  1 ohci1394
usbhid                 36832  0
ohci_hcd               23172  0
uhci_hcd               34960  0
usbcore               125820  4 usbhid,ohci_hcd,uhci_hcd
e1000                 108724  0
rtc                    14664  0
unix                   31248  364

lspci output:

:00:00.0 Host bridge: Intel Corp.: Unknown device 2774 (rev 81)
:00:01.0 PCI bridge: Intel Corp.: Unknown device 2775 (rev 81)
:00:1b.0 0403: Intel Corp.: Unknown device 27d8 (rev 01)
:00:1c.0 PCI bridge: Intel Corp.: Unknown device 27d0 (rev 01)
:00:1c.1 PCI bridge: Intel Corp.: Unknown device 27d2 (rev 01)
:00:1c.2 PCI bridge: Intel Corp.: Unknown device 27d4 (rev 01)
:00:1c.3 PCI bridge: Intel Corp.: Unknown device 27d6 (rev 01)

how do I read CPU temperature in ACPI? (w/ P5WD2 motherboard)

2005-08-10 Thread David Madore
Hi.  I apologize for what is surely a stupid question: I understand
that ACPI should be able to tell me what my CPU's temperature is (I
have a severe overheating problem and I am trying to solve it by
underclocking somewhat, but I need to be able to read the temperature
to do anything worthwhile), but no matter what ACPI modules I load, I
can't find any hint of a CPU temperature reading anywhere below
/proc/acpi (the /proc/acpi/thermal_zone/ directory, for example,
remains empty).

That's with the "thermal", "processor" and "fan" modules loaded (and a
few others; full listing follows signature).  I tried to load the
asus_acpi module also, since I have an ASUS motherboard (a P5WD2
Premium - precise details are given below signature), but I got a "No
such device" error.  Does that mean my motherboard is unsupported and
I cannot read my CPU temperature at all?  (But I thought the whole
_point_ of ACPI was that it was an abstraction away from the hardware:
so why is there such a thing as "Asus" ACPI?)  Or else, what am I
doing wrong?

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )

Full details:

hardware config:

Asus P5WD2 Premium motherboard
Intel 995X chipset
Intel Pentium 4 550 @3.4GHz processor

lsmod output:

Module  Size  Used by
ac  5892  0 
container   5504  0 
fan 5636  0 
video  17156  0 
thermal14600  0 
processor  24328  1 thermal
nvidia   3714948  12 
agpgart37580  1 nvidia
ide_cd 44804  0 
cdrom  42528  1 ide_cd
af_packet  25480  4 
ip6table_filter 3840  1 
ip6_tables 21120  1 ip6table_filter
ipt_REJECT  6656  8 
ipt_TOS 3584  1 
reiserfs  281716  7 
snd_emu10k1_synth   8960  0 
snd_emu10k1   120068  1 snd_emu10k1_synth
snd_ac97_codec 84216  1 snd_emu10k1
snd_pcm_oss54688  0 
snd_mixer_oss  20736  1 snd_pcm_oss
snd_pcm96900  3 snd_emu10k1,snd_ac97_codec,snd_pcm_oss
snd_page_alloc 11140  2 snd_emu10k1,snd_pcm
snd_emux_synth 39680  1 snd_emu10k1_synth
snd_seq_virmidi 9088  1 snd_emux_synth
snd_seq_midi_emul   8576  1 snd_emux_synth
snd_seq_dummy   4740  0 
snd_seq_oss36992  0 
snd_seq_midi   10144  0 
snd_rawmidi27040  3 snd_emu10k1,snd_seq_virmidi,snd_seq_midi
snd_seq_midi_event  8960  3 snd_seq_virmidi,snd_seq_oss,snd_seq_midi
snd_seq57360  9 
snd_emux_synth,snd_seq_virmidi,snd_seq_midi_emul,snd_seq_dummy,snd_seq_oss,snd_seq_midi,snd_seq_midi_event
snd_timer  27012  3 snd_emu10k1,snd_pcm,snd_seq
snd_seq_device  9740  8 
snd_emu10k1_synth,snd_emu10k1,snd_emux_synth,snd_seq_dummy,snd_seq_oss,snd_seq_midi,snd_rawmidi,snd_seq
snd_hwdep  10400  2 snd_emu10k1,snd_emux_synth
snd57956  13 
snd_emu10k1,snd_ac97_codec,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_emux_synth,snd_seq_virmidi,snd_seq_oss,snd_rawmidi,snd_seq,snd_timer,snd_seq_device,snd_hwdep
soundcore  11232  1 snd
snd_util_mem5632  2 snd_emu10k1,snd_emux_synth
ipv6  272544  24 
mousedev   13220  2 
iptable_mangle  3968  1 
iptable_nat25268  0 
ip_conntrack   47208  1 iptable_nat
iptable_filter  4096  1 
ip_tables  23296  5 
ipt_REJECT,ipt_TOS,iptable_mangle,iptable_nat,iptable_filter
capability  5896  0 
commoncap   8064  1 capability
ext2   72328  6 
ext3  146952  0 
jbd64920  1 ext3
mbcache11268  2 ext2,ext3
ppp_deflate 7424  0 
zlib_deflate   23704  1 ppp_deflate
bsd_comp7168  0 
tun13056  1 
ppp_async  13312  1 
ppp_generic31892  7 ppp_deflate,bsd_comp,ppp_async
slhc8064  1 ppp_generic
crc_ccitt   3072  1 ppp_async
dummy   4100  0 
dm_mod 62368  0 
ohci1394   37300  0 
ieee1394  107704  1 ohci1394
usbhid 36832  0 
ohci_hcd   23172  0 
uhci_hcd   34960  0 
usbcore   125820  4 usbhid,ohci_hcd,uhci_hcd
e1000 108724  0 
rtc14664  0 
unix   31248  364 

lspci output:

:00:00.0 Host bridge: Intel Corp.: Unknown device 2774 (rev 81)
:00:01.0 PCI bridge: Intel Corp.: Unknown device 2775 (rev 81)
:00:1b.0 0403: Intel Corp.: Unknown device 27d8 (rev 01)
:00:1c.0 PCI bridge: Intel Corp.: Unknown device 27d0 (rev 01)
:00:1c.1 PCI bridge: Intel Corp.: Unknown device 27d2 (rev 01)
:00:1c.2 PCI bridge: Intel Corp.: Unknown device 27d4 (rev 01)
:00:1c.3 PCI bridge: Intel Corp.: Unknown device 27d6 (rev 01)

Re: capabilities patch (v 0.1)

2005-08-09 Thread David Madore
On Tue, Aug 09, 2005 at 11:36:00PM +0200, Bodo Eggert wrote:
> 1) I wouldn't want an exploited service to gain any privileges, even by
>    chaining userspace exploits (e.g. exec sendmail < exploitstring).  For
>    most services, I'd like CAP_EXEC being unset (but it doesn't exist).

I intend to add a couple of capabilities which are normally available
to all user processes, including capability to exec(), capability to
fork() and a couple of others (maybe a capability to perform any kind
of write operation, but that seems a bit more difficult to implement).
So keep an eye open[#] for future versions of my patch.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )

[#] On the other hand, I have a strong tendency not to finish anything
I start :-( so maybe this is all just vaporware.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: capabilities patch (v 0.1)

2005-08-09 Thread David Madore
On Tue, Aug 09, 2005 at 01:52:06PM -0700, Chris Wright wrote:
> * Bodo Eggert ([EMAIL PROTECTED]) wrote:
> > How are you going to tell processes that may exec suid (or set-capability-)
> > programs from those that aren't supposed to gain certain capabilities?
> 
> typically you'd expect exec suid will reset to full caps.

suid exec _must_ reset to full caps or we have the sendmail disaster
again.  However, that is _if_ execve() succeeds.  It is quite possible
that execve() should fail, and that is precisely what my patch does:
if a process has bounded capabilities, it _may not_ exec suid.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


Re: capabilities patch (v 0.1)

2005-08-09 Thread David Madore
On Tue, Aug 09, 2005 at 04:28:31PM -0400, [EMAIL PROTECTED] wrote:
> On Tue, 09 Aug 2005 07:26:21 +0200, David Madore said:
> > * Second, a much more extensive change, the patch introduces a third
> > set of capabilities for every process, the "bounding" set.  Normally
> > the bounding set has every capability in it
> 
> How is this different in semantics from the existing 'permitted' capset?

The permitted set is the set of capabilities actually available to the
process (though they may be temporarily dropped by removing them from
the effective set, they are still available to take back).  In
contrast, the bounding set capabilities are not readily available to
the process; it just means that the capabilities in question *might*
be acquired by running a suid program (or setcap program if filesystem
support for capabilities ever comes to Linux).

Currently this is more or less an all-or-nothing process: since
capabilities can only be acquired by running a suid program, removing
any capability from the bounding set means the program will never be
permitted to execute a suid program any more (execve() will fail with
EPERM).  But maybe I'll reinstate the CAP_SETPCAP thing in some future
version of the patch (I'm still waiting for someone to tell me what
was wrong with CAP_SETPCAP and why it was removed), and then the
bounding set should also prohibit capabilities being given through
that interface.

The bottom line is: if you have some untrusted process, it might be
wise to empty its bounding set, making it incapable of
executing a suid root program and thus acquiring new capabilities.  (I
also plan to add some normally-available-to-all capabilities such as
"permission to fork()", "permission to exec()" and so on, and then it
will also be useful to remove these from a process's permitted set.)
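
The distinction between the sets can be sketched as a small user-space
model.  This is only an illustration of the semantics described above,
not code from the patch: the struct, the type name capset_t and the
three function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit capability set, one bit per capability. */
typedef uint32_t capset_t;
#define CAPSET_FULL ((capset_t)~(capset_t)0)

struct caps {
    capset_t effective, permitted, inheritable, bounding;
};

/* Temporarily drop capabilities: only the effective set changes. */
void drop_effective(struct caps *c, capset_t mask)
{
    assert(c != NULL);
    c->effective &= ~mask;
}

/* Take capabilities back: allowed only for bits still in the permitted set.
 * Returns 0 on success, -1 if any requested bit is not permitted. */
int raise_effective(struct caps *c, capset_t mask)
{
    assert(c != NULL);
    if (mask & ~c->permitted)
        return -1;
    c->effective |= mask;
    return 0;
}

/* The bounding set grants nothing by itself; in this all-or-nothing model
 * it only decides whether a suid root exec is allowed at all. */
int may_exec_suid_root(const struct caps *c)
{
    assert(c != NULL);
    return c->bounding == CAPSET_FULL;
}
```

In this model, a process that clears a single bit from its bounding set
keeps whatever it currently has in its permitted set, but
may_exec_suid_root() turns false for it from then on.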

> include/linux/capabilities.h:
> 
> typedef struct __user_cap_data_struct {
>         __u32 effective;
>         __u32 permitted;
>         __u32 inheritable;
> } __user *cap_user_data_t;
> 

And my patch adds a __u32 bounding to that structure.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


Re: capabilities patch (v 0.1)

2005-08-09 Thread David Madore
On Tue, Aug 09, 2005 at 05:37:56AM +0000, Chris Wright wrote:
> * David Madore ([EMAIL PROTECTED]) wrote:
> > * Second, a much more extensive change, the patch introduces a third
> > set of capabilities for every process, the "bounding" set.  Normally
> 
> this is not a good idea.  don't add more sets.

Could you elaborate?  Why is adding sets bad?  From what I read of the
June 2000 discussions on the linux-privs-discuss mailing-list (<URL:
http://sourceforge.net/mailarchive/forum.php?forum_id=25120&max_rows=25&style=flat&viewmonth=200006
>), a rather large consensus had formed around the idea that some
kind of bounding set was a useful idea (as a matter of fact, the
sendmail problem came essentially from the fact that some people
wanted an inheritable set and other people wanted a bounding set, and
the code was some mixture of the two); and it had been argued
convincingly that it could be made POSIX compliant if that is the
issue.  Plus, Solaris privileges also come in four sets.

If it's compatibility you're worried about, it seems to me that the
user interface can be made so that it will still work with the old
libcap and merely ignore the bounding set.  So full binary
compatibility will be achieved, at least on the user level.
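
One way such compatibility could work is sketched here, purely as an
assumption: the kernel side looks at the version number in the header
and copies only the first three words for an old caller.  The two
version constants are the ones that appear in the patch; the struct
names and cap_data_words() are hypothetical and not part of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_VERSION_OLD 0x19980330u  /* value removed by the patch */
#define CAP_VERSION_NEW 0x20050809u  /* value introduced by the patch */

/* Layout known to an unmodified libcap. */
struct cap_data_old {
    uint32_t effective, permitted, inheritable;
};

/* Layout with the extra bounding word appended at the end. */
struct cap_data_new {
    uint32_t effective, permitted, inheritable, bounding;
};

/* Hypothetical helper: how many 32-bit words to exchange with a caller
 * that announced the given version, or -1 for an unknown version. */
int cap_data_words(uint32_t version)
{
    /* The scheme relies on the old layout being a prefix of the new one. */
    assert(offsetof(struct cap_data_new, bounding) == sizeof(struct cap_data_old));
    if (version == CAP_VERSION_NEW)
        return 4;
    if (version == CAP_VERSION_OLD)
        return 3;   /* old caller: the bounding set is simply not transferred */
    return -1;
}
```

Appending the new field rather than inserting it is what keeps the
first three words at the same offsets for an old binary.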

Finally, if it's a matter of kernel policy, I seem to understand that
my patch has a snowball's chance in hell of ever being accepted in the
mainstream kernel (I mean, it's not as though this were new: patches
to make capabilities work have been available ever since the sendmail
exploit, and in five years they haven't ever been accepted, so I
suppose there's a reason for this), so adding a fourth set of
capabilities of my own initiative isn't going to change a thing there.

So what's the problem?

>if you really want to
> work on this i'll give you all the patches that have been done thus far,
> plus a set of tests that look at all the execve, ptrace, setuid type of
> corner cases.

Yes, I'm very interested in the test suite.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


capabilities patch (v 0.1)

2005-08-08 Thread David Madore
Well, I wasn't sleepy tonight, so I produced the following patch for
Linux capabilities, which attempts to make them useful.  It is
supposed to do the following (which may or may not conform with the
POSIX semantics, I don't think it matters much):

* First, and most importantly, capabilities are carried across
execve().  More precisely, on execve(), the inheritable set of the
process (which is always a subset of the permitted set) is copied to
the permitted set, and ANDed on the effective set; except when a suid
root binary is executed, in which case the permitted, inheritable and
effective sets are fully set.

* Second, a much more extensive change, the patch introduces a third
set of capabilities for every process, the "bounding" set.  Normally
the bounding set has every capability in it.  If a capability is
removed from it, that means the process is never allowed to gain it on
exec.  In the current state of affairs, since the only way of gaining
capabilities is through suid root programs, the bounding set is
essentially an all-or-nothing affair: if you do not have every
capability in your bounding set, you may not run a suid root program
(execve() will fail with EPERM).  This can still be very useful on
untrusted programs.
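
The two rules above can be condensed into a user-space model of what
happens to the sets on execve().  This is a sketch of the description
only, not an excerpt from the patch; struct caps, capset_t and
model_execve() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical 32-bit capability set, one bit per capability. */
typedef uint32_t capset_t;
#define CAPSET_FULL ((capset_t)~(capset_t)0)

struct caps {
    capset_t effective, permitted, inheritable, bounding;
};

/* Model of the exec-time transformation.  Returns 0 on success or
 * -EPERM when a suid root exec is refused; on refusal nothing changes. */
int model_execve(struct caps *c, int suid_root)
{
    /* Stated invariant: inheritable is always a subset of permitted. */
    assert((c->inheritable & ~c->permitted) == 0);

    if (suid_root) {
        /* All-or-nothing: a reduced bounding set forbids the exec. */
        if (c->bounding != CAPSET_FULL)
            return -EPERM;
        c->permitted = CAPSET_FULL;
        c->inheritable = CAPSET_FULL;
        c->effective = CAPSET_FULL;
        return 0;
    }

    /* Ordinary binary: inheritable is copied to permitted and
     * ANDed onto effective. */
    c->permitted = c->inheritable;
    c->effective &= c->inheritable;
    return 0;
}
```

The point of returning an error rather than proceeding is that a suid
root program never starts with a partial set of capabilities.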

This patch hasn't been tested very much; in fact, it has been hardly
tested at all (I just ran the kernel in a qemu and made a few basic
checks).  Since it adds a whole new set of capabilities to every
process, it also requires a specially modified version of libcap (the
one I have right now is pretty buggy, so I'm not posting the patch
here).

Consider this more a "proof of concept" than a serious patch, but I'm
interested in any comments.

### cut after ###
--- linux-2.6.12.4/fs/proc/array.c  2005-08-05 09:04:37.000000000 +0200
+++ linux-2.6.12.4.caps/fs/proc/array.c  2005-08-09 01:32:07.000000000 +0200
@@ -281,10 +281,12 @@ static inline char *task_cap(struct task
 {
 return buffer + sprintf(buffer, "CapInh:\t%016x\n"
"CapPrm:\t%016x\n"
-   "CapEff:\t%016x\n",
+   "CapEff:\t%016x\n"
+   "CapBnd:\t%016x\n",
cap_t(p->cap_inheritable),
cap_t(p->cap_permitted),
-   cap_t(p->cap_effective));
+   cap_t(p->cap_effective),
+   cap_t(p->cap_bounding));
 }
 
 int proc_pid_status(struct task_struct *task, char * buffer)
--- linux-2.6.12.4/include/linux/binfmts.h  2005-08-05 09:04:37.000000000 +0200
+++ linux-2.6.12.4.caps/include/linux/binfmts.h  2005-08-09 01:41:03.000000000 +0200
@@ -28,7 +28,7 @@ struct linux_binprm{
int sh_bang;
struct file * file;
int e_uid, e_gid;
-   kernel_cap_t cap_inheritable, cap_permitted, cap_effective;
+   kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bounding;
void *security;
int argc, envc;
char * filename;/* Name of binary as seen by procps */
--- linux-2.6.12.4/include/linux/capability.h  2005-08-05 09:04:37.000000000 +0200
+++ linux-2.6.12.4.caps/include/linux/capability.h  2005-08-09 03:14:36.000000000 +0200
@@ -27,7 +27,7 @@
library since the draft standard requires the use of malloc/free
etc.. */
  
-#define _LINUX_CAPABILITY_VERSION  0x19980330
+#define _LINUX_CAPABILITY_VERSION  0x20050809
 
 typedef struct __user_cap_header_struct {
__u32 version;
@@ -38,6 +38,7 @@ typedef struct __user_cap_data_struct {
 __u32 effective;
 __u32 permitted;
 __u32 inheritable;
+__u32 bounding;
 } __user *cap_user_data_t;
   
 #ifdef __KERNEL__
@@ -311,7 +312,7 @@ extern kernel_cap_t cap_bset;
 #define CAP_EMPTY_SET       to_cap_t(0)
 #define CAP_FULL_SET        to_cap_t(~0)
 #define CAP_INIT_EFF_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
-#define CAP_INIT_INH_SET    to_cap_t(0)
+#define CAP_INIT_INH_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
 
 #define CAP_TO_MASK(x) (1 << (x))
 #define cap_raise(c, flag)   (cap_t(c) |=  CAP_TO_MASK(flag))
--- linux-2.6.12.4/include/linux/init_task.h  2005-08-05 09:04:37.000000000 +0200
+++ linux-2.6.12.4.caps/include/linux/init_task.h  2005-08-09 05:19:32.000000000 +0200
@@ -94,6 +94,7 @@ extern struct group_info init_groups;
.cap_effective  = CAP_INIT_EFF_SET, \
.cap_inheritable = CAP_INIT_INH_SET,\
.cap_permitted  = CAP_FULL_SET, \
+   .cap_bounding   = CAP_FULL_SET, \
.keep_capabilities = 0, \
.user   = INIT_USER,\
.comm   = "swapper",\
--- linux-2.6.12.4/include/linux/sched.h  2005-08-05 09:04:37.000000000 +0200
+++ 

Re: understanding Linux capabilities brokenness

2005-08-08 Thread David Madore
On Tue, Aug 09, 2005 at 01:53:50AM +0000, Theodore Ts'o wrote:
> The POSIX specification for capabilities requires filesystem support,
> so that each executables can be marked with three capability sets ---
> which indicate which capabilities are asserted when the executable
> starts, which capabilities the executable is allowed to request, and
> which capabilities the executable is allowed to inherit from its
> parent process.  This effectively takes a single setuid bit and splits
> it into a hundred-odd bits.  

You point out various reasons why the POSIX (draft-)specification is
problematic.  But nobody says Linux has to abide by it, especially as it
is a mere withdrawn draft.  Solaris 10 has capabilities (except that
they're called "privileges") which are similar, but not identical, to
the POSIX ones.

And even capabilities with no filesystem support can be useful.  In
fact, as far as I see it, the main interest in capabilities lies in
the "process management" part.  For example, I might like to run this
or that binary, which claims it needs to be run as root, with a
limited set of capabilities: the current Linux kernels make this quite
impossible.  Conversely, I might wish to give a particular capability
to a given user; in association with sudo, this might be quite useful:
instead of telling sudo to let the user run a given command as root,
just let him run a capability-aware wrapper which drops every
capability except the required ones and then calls the actual program
- so even if the latter is not secure, damage is more limited.  I can
think of thousands of other uses not requiring any kind of filesystem
support.
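
The bookkeeping such a wrapper would do before calling the real
program can be sketched as follows.  This only computes the masks: a
real wrapper would still have to install them (through capset(2) or
libcap) and then exec, which is exactly the part that needs
capabilities to survive execve().  The numeric values are those of
<linux/capability.h>; the MODEL_ names and both functions are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit capability set, one bit per capability. */
typedef uint32_t capset_t;

/* Local copies of a few capability numbers, prefixed to avoid clashing
 * with the real header. */
enum {
    MODEL_CAP_SETUID = 7,
    MODEL_CAP_NET_BIND_SERVICE = 10,
    MODEL_CAP_NET_RAW = 13
};

/* Build the mask of capabilities the wrapped program is allowed to keep. */
capset_t keep_mask(const int *caps, size_t n)
{
    capset_t mask = 0;
    for (size_t i = 0; i < n; i++) {
        assert(caps[i] >= 0 && caps[i] < 32);
        mask |= (capset_t)1 << caps[i];
    }
    return mask;
}

/* Drop everything that is not explicitly kept. */
capset_t restrict_to(capset_t current, capset_t keep)
{
    return current & keep;
}
```

For example, a wrapper for a program that only needs to bind a low
port and open raw sockets would keep two bits and, in particular, lose
CAP_SETUID.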

>   Note that many some setuid
> programs don't necessarily check error returns, and sometimes turning
> off permissions can sometimes open up vulnerabilities.

Yes, the sendmail vulnerability proved this quite clearly.  So
certainly a luser should not be permitted to run a suid root program
with anything in between the empty set and the full set of
capabilities.

> Another problem with the POSIX capabilities is that most of the
> programs that system administrators run to look for setuid programs
> will miss programs that have capabilities encoded in extended
> attributes.  This problem could be fixed by requiring the setuid bit
> to be set before paying attention to the capability EA's; but this
> could lead to surprising results if the filesystem is mounted on a
> system that doesn't use filesystem capabilities at all.

I might suggest encoding the presence of capabilities by a sgid bit
for a specific group (say, wheel) on top of the extended attributes.
So the careful sysadmin will notice the programs (because sgid wheel
is significant enough to be noted) but it will not cause total
disaster if mounted on a non-capability-capable ;-) filesystem.

> Yet another issue is that the POSIX capabilities model means that a
> default executable, such as gcc for example, is not allowed to inherit
> _any_ capabilities, even if it is run from a setuid root shell.  This
> is good from a security point of view, since it means that people
> can't get in trouble by doing silly things like typing
> "./configure;make" as root and expect any of the build tools to have
> override arbitrary file controls.  The bad news is that system
> administrators aren't particularly happy when their own private tools
> have to especially marked to allow them to run with elevated
> privileges.

Yes, this seems like a reason to deviate from the POSIX model under
Linux.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


Re: understanding Linux capabilities brokenness

2005-08-08 Thread David Madore
Sorry for replying to myself...

On Mon, Aug 08, 2005 at 09:13:06PM +0000, David Madore wrote:
> However, what I do not understand is precisely _how_ one gets a
> sendmail process without CAP_SETUID: for that is the heart of the
> problem, and that is where the bug really was.  But [#3] and [#4] are
> very obscure (and I found nothing conclusive in lkml archives).  I
> understand that the problem lies in some combination of the
> inheritable capability set and the CAP_SETPCAP capability, but I don't
> see what that combination is.  Certainly removing capabilities from
> the inheritable set should not prevent suid root programs from having
> them reinstated (in the language of [#6], the suid root bit should
> correspond to a full forced set of capabilities), so I don't see what
> that has to do with it, and CAP_SETPCAP indeed allows to remove
> capabilities from a given process but I don't see how the user could
> gain that capability (and indeed if he can then we can expect him to
> gain all capabilities very rapidly).

After some more intensive Googling, I found the answer in the archives
of the linux-privs-discuss mailing-list (whose existence I did not
know of):

<URL: http://sourceforge.net/mailarchive/forum.php?thread_id=1588083&forum_id=25120 >

The explanation from the sendmail team was incorrect: CAP_SETPCAP is a
red herring, it's only about CAP_SETUID, the implementation of the
inheritable set was broken in that it controlled not only capabilities
automatically passed across execve() but also those _gained_ by suid
root programs (contrary to the claim in the sendmail analysis) and,
worse, instead of failing on execve() when the program could not gain
privileges, it proceeded with the capabilities missing.  Hence the
catastrophic failure.
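
The difference between the behaviour described as broken and a
fail-closed alternative can be put side by side in a small model.
This is a simplification of the explanation above, not a
reconstruction of the actual kernel code of the time; the function
names and MODEL_CAP_SETUID are hypothetical (7 is the value of
CAP_SETUID in <linux/capability.h>).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit capability set, one bit per capability. */
typedef uint32_t capset_t;
#define CAPSET_FULL ((capset_t)~(capset_t)0)
#define MODEL_CAP_SETUID 7
#define CAP_BIT(n) ((capset_t)1 << (n))

/* Behaviour described as broken: the caller's inheritable set also masks
 * what a suid root program gains, and the exec goes ahead regardless. */
capset_t lenient_suid_exec(capset_t inheritable)
{
    return CAPSET_FULL & inheritable;
}

/* Fail-closed alternative: either the program gets everything, or the
 * exec is refused and *gained is left untouched. */
int strict_suid_exec(capset_t limit, capset_t *gained)
{
    assert(gained != NULL);
    if (limit != CAPSET_FULL)
        return -EPERM;
    *gained = CAPSET_FULL;
    return 0;
}
```

With CAP_SETUID removed from the limiting set, the lenient variant
starts the suid root program without that capability, while the strict
variant never starts it.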

This does not tell me, then, why CAP_SETPCAP was globally disabled by
default, nor why passing of capabilities across execve() was entirely
removed instead of being fixed.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


understanding Linux capabilities brokenness

2005-08-08 Thread David Madore
Hi.

Like many people[#1][#2], I have found out that the Linux capability
handling utilities are non-functional, and cannot be repaired because
the kernel deliberately cripples capabilities (they are reset on every
call to execve()).  I have found that various people[#1][#2] have
proposed patches to restore working capabilities.  However, the matter
seems rather complicated and I would like to understand the full story.
Hours of Google-grepping through the lkml archives have not helped me
very much, so I hope someone can get the history straight.

I understand that Linux capabilities first appeared seven or eight
years ago, and in 2000-06 there was a serious fault discovered which
caused a local root exploit through the use of the sendmail program.
Reading [#3] and [#4], I understand that the problem was this:

  When sendmail is invoked by a non-root user, it attempts to drop its
  root privileges (which it has because the binary is installed suid)
  by calling setuid(getuid()), which, due to the stupidity of
  traditional Unix semantics enshrined in the POSIX/SUS standards,
  operates differently according to whether the process has
  "appropriate privileges" (in which case it sets all its UIDs to its
  real UID) or not (in which case it preserves the saved UID); now
  under Linux, "appropriate privileges" is defined[#5] as possessing
  the CAP_SETUID capability.  So if a non-root user manages to execute
  sendmail without the CAP_SETUID capability, the setuid(getuid())
  call will fail (or rather, not perform as expected), and the genie
  is out of the bottle.
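
The two behaviours of setuid() described above can be modelled in a
few lines.  This is a user-space sketch of the semantics only, with
UIDs simplified to plain ints; struct ids and model_setuid() are
hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Real, effective and saved user IDs of a process. */
struct ids {
    int ruid, euid, suid;
};

/* Model of setuid(uid).  With CAP_SETUID all three IDs are set; without
 * it only the effective ID changes, and only to the real or saved ID.
 * Returns 0 on success, -EPERM otherwise. */
int model_setuid(struct ids *p, int uid, int has_cap_setuid)
{
    assert(p != NULL);
    if (has_cap_setuid) {
        p->ruid = p->euid = p->suid = uid;
        return 0;
    }
    if (uid == p->ruid || uid == p->suid) {
        p->euid = uid;          /* the saved UID is preserved */
        return 0;
    }
    return -EPERM;
}
```

Starting from a suid root process run by user 1000 (real 1000,
effective 0, saved 0), setuid(1000) without CAP_SETUID leaves the
saved UID at 0, so a later setuid(0) succeeds and the process is root
again.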

However, what I do not understand is precisely _how_ one gets a
sendmail process without CAP_SETUID: for that is the heart of the
problem, and that is where the bug really was.  But [#3] and [#4] are
very obscure (and I found nothing conclusive in lkml archives).  I
understand that the problem lies in some combination of the
inheritable capability set and the CAP_SETPCAP capability, but I don't
see what that combination is.  Certainly removing capabilities from
the inheritable set should not prevent suid root programs from having
them reinstated (in the language of [#6], the suid root bit should
correspond to a full forced set of capabilities), so I don't see what
that has to do with it, and CAP_SETPCAP indeed allows to remove
capabilities from a given process but I don't see how the user could
gain that capability (and indeed if he can then we can expect him to
gain all capabilities very rapidly).

Can someone describe very accurately what the problem was?  And why
was it "fixed"[#7] by completely disabling capability inheritance and
also by disabling the CAP_SETPCAP capability?  In other words, suppose
I restore CAP_SETPCAP on my system and/or make capabilities fully
inheritable on execve() (that is, just take the logical AND of the
permitted set with the inheritable set, except if the executed program
is suid root, in which case all three sets - permitted, effective and
inheritable - are set to full): what is the security problem in this?

Assuming I want to make capabilities inheritable, is there a
recommended patch for doing so?  Alexander Nyberg's patch in [#1]
looks good to me (at least, it seems to do exactly what I want), but
how well has it been tested?  Is this something that might eventually
make its way into the official kernel, or is this a no-goer?  Also, if
the author happens to read this, I'd like an explanation on the "Is
this a root task that did seteuid before execve? if so it wanted its
effective permissions dropped" comment in cap_bprm_apply_creds().

Thanks!

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )

[#1] <URL: http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/f76dcb9447a77c34 >

[#2] <URL: http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/4366e557a75a933d >

[#3] <URL: http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf >

[#4] <URL: http://www.sendmail.org/sendmail.8.10.1.LINUX-SECURITY.txt >

[#5] I tend to think that the behavior of setuid() is wrong in the
first place, that is, setuid(getuid()) should also change the saved
UID as soon as the effective UID is zero, even if CAP_SETUID is not
set, to make sure that traditional Unix semantics are observed.  (More
recent, capability-aware, programs will use setresuid() anyway.)  But
that is rather beside the point.

[#6] <URL: http://ftp.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.4/capfaq-0.2.txt >

[#7] I wanted to find exactly on which kernel version the changes took
place.  Unfortunately, http://lxr.linux.no/ only has major
versions, the 2.2.15->2.2.16 patch is very hard to read, and I have
neither the patience nor the bandwidth to unpack entire kernel trees
on my PC to unravel the full history...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please 

understanding Linux capabilities brokenness

2005-08-08 Thread David Madore
Hi.

Like many people[#1][#2], I have found out that the Linux capability
handling utilities are non-functional, and cannot be repaired because
the kernel deliberately cripples capabilities (they are reset on every
call to execve()).  I have found that various people[#1][#2] have
proposed patches to restore working capabilities.  However, the matter
seems rather complicated and I would like to understand the full story.
Hours of Google-grepping through the lkml archives have not helped me
very much, so I hope someone can get the history straight.

I understand that Linux capabilities first appeared seven or eight
years ago, and in 2000-06 there was a serious fault discovered which
caused a local root exploit through the use of the sendmail program.
Reading [#3] and [#4], I understand that the problem was this:

  When sendmail is invoked by a non-root user, it attempts to drop its
  root privileges (which it has because the binary is installed suid)
  by calling setuid(getuid()), which, due to the stupidity of
  traditional Unix semantics enshrined in the POSIX/SUS standards,
  operates differently according to whether the process has
  appropriate privileges (in which case it sets all its UIDs to its
  real UID) or not (in which case it preserves the saved UID); now
  under Linux, "appropriate privileges" is defined[#5] as possessing
  the CAP_SETUID capability.  So if a non-root user manages to execute
  sendmail without the CAP_SETUID capability, the setuid(getuid())
  call will fail (or rather, not perform as expected), and the genie
  is out of the bottle.
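
The failure mode described above can be sketched as a small user-space
model.  This is only an illustration, not kernel code: the struct
uid_triple type and the model_setuid() and demo_sendmail_drop() names
are hypothetical, and the model ignores the filesystem UID and
everything else setuid() touches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model type: the real, effective and saved UIDs of a process. */
struct uid_triple {
    unsigned real, effective, saved;
};

/*
 * Simplified model of Linux setuid(): returns 0 on success, -1 for EPERM.
 * has_cap_setuid stands for "the process holds CAP_SETUID".
 */
static int model_setuid(struct uid_triple *p, unsigned uid, bool has_cap_setuid)
{
    if (has_cap_setuid) {
        /* Privileged case: all three IDs change, so the drop is permanent. */
        p->real = p->effective = p->saved = uid;
        return 0;
    }
    if (uid == p->real || uid == p->saved) {
        /* Unprivileged case: only the effective UID changes. */
        p->effective = uid;
        return 0;
    }
    return -1;
}

/* Walk through the sendmail scenario for an invoking user with UID 1000. */
static void demo_sendmail_drop(void)
{
    /* suid root binary run by UID 1000, but stripped of CAP_SETUID */
    struct uid_triple p = { 1000, 0, 0 };

    assert(model_setuid(&p, p.real, false) == 0); /* setuid(getuid()) "succeeds" */
    assert(p.saved == 0);                         /* ...yet the saved UID is still root */
    assert(model_setuid(&p, 0, false) == 0);      /* so root can be taken back */
    assert(p.effective == 0);
}
```

With has_cap_setuid set to true, the same first call would set all
three UIDs to 1000 and the later model_setuid(&p, 0, false) would
return -1.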

However, what I do not understand is precisely _how_ one gets a
sendmail process without CAP_SETUID: for that is the heart of the
problem, and that is where the bug really was.  But [#3] and [#4] are
very obscure (and I found nothing conclusive in lkml archives).  I
understand that the problem lies in some combination of the
inheritable capability set and the CAP_SETPCAP capability, but I don't
see what that combination is.  Certainly removing capabilities from
the inheritable set should not prevent suid root programs from having
them reinstated (in the language of [#6], the suid root bit should
correspond to a full forced set of capabilities), so I don't see what
that has to do with it, and CAP_SETPCAP indeed allows to remove
capabilities from a given process but I don't see how the user could
gain that capability (and indeed if he can then we can expect him to
gain all capabilities very rapidly).

Can someone describe very accurately what the problem was?  And why
was it "fixed"[#7] by completely disabling capability inheritance and
also by disabling the CAP_SETPCAP capability?  In other words, suppose
I restore CAP_SETPCAP on my system and/or make capabilities fully
inheritable on execve() (that is, just take the logical AND of the
permitted set with the inheritable set, except if the executed program
is suid root, in which case all three sets - permitted, effective and
inheritable - are set to full): what is the security problem in this?

Assuming I want to make capabilities inheritable, is there a
recommended patch for doing so?  Alexander Nyberg's patch in [#1]
looks good to me (at least, it seems to do exactly what I want), but
how well has it been tested?  Is this something that might eventually
make its way into the official kernel, or is this a no-goer?  Also, if
the author happens to read this, I'd like an explanation on the "Is
this a root task that did seteuid before execve? if so it wanted its
effective permissions dropped" comment in cap_bprm_apply_creds().

Thanks!

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )

[#1] http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/f76dcb9447a77c34

[#2] http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/4366e557a75a933d

[#3] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf

[#4] http://www.sendmail.org/sendmail.8.10.1.LINUX-SECURITY.txt

[#5] I tend to think that the behavior of setuid() is wrong in the
first place, that is, setuid(getuid()) should also change the saved
UID as soon as the effective UID is zero, even if CAP_SETUID is not
set, to make sure that traditional Unix semantics are observed.  (More
recent, capability-aware, programs will use setresuid() anyway.)  But
that is rather beside the point.

[#6] http://ftp.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.4/capfaq-0.2.txt

[#7] I wanted to find exactly on which kernel version the changes took
place.  Unfortunately, http://lxr.linux.no/ only has major
versions, the 2.2.15->2.2.16 patch is very hard to read, and I have
neither the patience nor the bandwidth to unpack entire kernel trees
on my PC to unravel the full history...

Re: understanding Linux capabilities brokenness

2005-08-08 Thread David Madore
Sorry for replying to myself...

On Mon, Aug 08, 2005 at 09:13:06PM +, David Madore wrote:
> However, what I do not understand is precisely _how_ one gets a
> sendmail process without CAP_SETUID: for that is the heart of the
> problem, and that is where the bug really was.  But [#3] and [#4] are
> very obscure (and I found nothing conclusive in lkml archives).  I
> understand that the problem lies in some combination of the
> inheritable capability set and the CAP_SETPCAP capability, but I don't
> see what that combination is.  Certainly removing capabilities from
> the inheritable set should not prevent suid root programs from having
> them reinstated (in the language of [#6], the suid root bit should
> correspond to a full forced set of capabilities), so I don't see what
> that has to do with it, and CAP_SETPCAP indeed allows to remove
> capabilities from a given process but I don't see how the user could
> gain that capability (and indeed if he can then we can expect him to
> gain all capabilities very rapidly).

After some more intensive Googling, I found the answer in the archives
of the linux-privs-discuss mailing-list (whose existence I did not
know of):

http://sourceforge.net/mailarchive/forum.php?thread_id=1588083&forum_id=25120

The explanation from the sendmail team was incorrect: CAP_SETPCAP is a
red herring, and the problem is only about CAP_SETUID.  The
implementation of the inheritable set was broken in that it controlled
not only the capabilities automatically passed across execve() but
also those _gained_ by suid root programs (contrary to the claim in
the sendmail analysis).  Worse, instead of failing on execve() when
the program could not gain privileges, it proceeded with the
capabilities missing.  Hence the catastrophic failure.
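
To make the difference concrete, here is a rough model of the two
behaviours for a suid root binary, written from the description above
rather than from the 2.2.15 source.  The capmask_t type and the
function names are made up for illustration; the only fact taken from
the kernel headers is that CAP_SETUID is capability number 7.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t capmask_t;                 /* hypothetical 32-bit capability mask */
#define CAPMASK_FULL   ((capmask_t)~0u)
#define CAPMASK_SETUID ((capmask_t)1u << 7) /* CAP_SETUID is capability number 7 */

/* Expected behaviour: a suid root binary is granted every capability. */
static capmask_t suid_root_expected(capmask_t caller_inheritable)
{
    (void)caller_inheritable;   /* the caller's inheritable set is irrelevant */
    return CAPMASK_FULL;
}

/*
 * Broken behaviour as described above: the capabilities gained are masked
 * by the caller's inheritable set, and the exec goes ahead regardless.
 */
static capmask_t suid_root_broken(capmask_t caller_inheritable)
{
    return CAPMASK_FULL & caller_inheritable;
}

/* A caller that cleared CAP_SETUID from its inheritable set before exec. */
static void demo_missing_setuid(void)
{
    capmask_t inh = CAPMASK_FULL & ~CAPMASK_SETUID;

    assert(suid_root_expected(inh) & CAPMASK_SETUID);  /* capability present */
    assert(!(suid_root_broken(inh) & CAPMASK_SETUID)); /* silently missing */
}
```

In the broken case the program starts with UID 0 but without
CAP_SETUID, which is exactly the state in which setuid(getuid()) no
longer resets the saved UID.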

This does not tell me, then, why CAP_SETPCAP was globally disabled by
default, nor why passing of capabilities across execve() was entirely
removed instead of being fixed.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


Re: understanding Linux capabilities brokenness

2005-08-08 Thread David Madore
On Tue, Aug 09, 2005 at 01:53:50AM +, Theodore Ts'o wrote:
> The POSIX specification for capabilities requires filesystem support,
> so that each executables can be marked with three capability sets ---
> which indicate which capabilities are asserted when the executable
> starts, which capabilities the executable is allowed to request, and
> which capabilities the executable is allowed to inherit from its
> parent process.  This effectively takes a single setuid bit and splits
> it into a hundred-odd bits.

You point out various reasons why the POSIX (draft-)specification is
problematic.  But nobody says Linux has to abide by it, especially as it
is a mere withdrawn draft.  Solaris 10 has capabilities (except that
they're called privileges) which are similar, but not identical, to
the POSIX ones.

And even capabilities with no filesystem support can be useful.  In
fact, as far as I see it, the main interest in capabilities lies in
the process management part.  For example, I might like to run this
or that binary, which claims it needs to be run as root, with a
limited set of capabilities: the current Linux kernels make this quite
impossible.  Conversely, I might wish to give a particular capability
to a given user; in association with sudo, this might be quite useful:
instead of telling sudo to let the user run a given command as root,
just let him run a capability-aware wrapper which drops every
capability except the required ones and then calls the actual program
- so even if the latter is not secure, damage is more limited.  I can
think of thousands of other uses not requiring any kind of filesystem
support.

> Note that many some setuid
> programs don't necessarily check error returns, and sometimes turning
> off permissions can sometimes open up vulnerabilities.

Yes, the sendmail vulnerability proved this quite clearly.  So
certainly a luser should not be permitted to run a suid root program
with anything in between the empty set and the full set of
capabilities.

> Another problem with the POSIX capabilities is that most of the
> programs that system administrators run to look for setuid programs
> will miss programs that have capabilities encoded in extended
> attributes.  This problem could be fixed by requiring the setuid bit
> to be set before paying attention to the capability EA's; but this
> could lead to surprising results if the filesystem is mounted on a
> system that doesn't use filesystem capabilities at all.

I might suggest encoding the presence of capabilities by a sgid bit
for a specific group (say, wheel) on top of the extended attributes.
So the careful sysadmin will notice the programs (because sgid wheel
is significant enough to be noted) but it will not cause total
disaster if mounted on a non-capability-capable ;-) filesystem.
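
As a sketch of what an audit tool would then have to test for, the
check below flags a file whose stat() data carries the sgid bit for
the marker group.  The function name and the marker_gid parameter are
hypothetical, and the numeric GID used in the demo is an arbitrary
stand-in for "wheel".

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/* True if the file is sgid to the group chosen as the capability marker. */
static bool has_capability_marker(const struct stat *st, gid_t marker_gid)
{
    return (st->st_mode & S_ISGID) != 0 && st->st_gid == marker_gid;
}

/* Example with a hand-built stat record; GID 10 is an assumed "wheel". */
static void demo_marker_check(void)
{
    struct stat st = { 0 };

    st.st_mode = S_ISGID | 0755;
    st.st_gid = 10;
    assert(has_capability_marker(&st, 10));
}
```

A real scanner would fill the struct stat with lstat() while walking
the tree, in the same pass that already looks for suid and sgid files.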

> Yet another issue is that the POSIX capabilities model means that a
> default executable, such as gcc for example, is not allowed to inherit
> _any_ capabilities, even if it is run from a setuid root shell.  This
> is good from a security point of view, since it means that people
> can't get in trouble by doing silly things like typing
> ./configure;make as root and expect any of the build tools to have
> override arbitrary file controls.  The bad news is that system
> administrators aren't particularly happy when their own private tools
> have to especially marked to allow them to run with elevated
> privileges.

Yes, this seems like a reason to deviate from the POSIX model under
Linux.

-- 
 David A. Madore
([EMAIL PROTECTED],
 http://www.madore.org/~david/ )


capabilities patch (v 0.1)

2005-08-08 Thread David Madore
Well, I wasn't sleepy tonight, so I produced the following patch for
Linux capabilities, which attempts to make them useful.  It is
supposed to do the following (which may or may not conform with the
POSIX semantics, I don't think it matters much):

* First, and most importantly, capabilities are carried across
execve().  More precisely, on execve(), the inheritable set of the
process (which is always a subset of the permitted set) is copied to
the permitted set, and ANDed with the effective set; except when a suid
root binary is executed, in which case the permitted, inheritable and
effective sets are fully set.

* Second, a much more extensive change, the patch introduces a fourth
set of capabilities for every process, the bounding set.  Normally
the bounding set has every capability in it.  If a capability is
removed from it, that means the process is never allowed to gain it on
exec.  In the current state of affairs, since the only way of gaining
capabilities is through suid root programs, the bounding set is
essentially an all-or-nothing affair: if you do not have every
capability in your bounding set, you may not run a suid root program
(execve() will fail with EPERM).  This can still be very useful on
untrusted programs.
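
The two rules above can be written down as a small user-space model.
This is a sketch of the intended semantics only, not the code in the
patch: struct proc_caps, model_execve() and the mask macros are
hypothetical names, and the model uses a plain 32-bit mask per set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t capmask_t;             /* hypothetical 32-bit capability mask */
#define CAPMASK_FULL ((capmask_t)~0u)

/* Hypothetical model of the four per-process sets after the patch. */
struct proc_caps {
    capmask_t permitted, inheritable, effective, bounding;
};

/* Model of execve() under the rules above: 0 on success, -1 for EPERM. */
static int model_execve(struct proc_caps *c, bool suid_root)
{
    if (suid_root) {
        /* All-or-nothing: a reduced bounding set forbids suid root programs. */
        if (c->bounding != CAPMASK_FULL)
            return -1;
        c->permitted = c->inheritable = c->effective = CAPMASK_FULL;
        return 0;
    }
    /* Ordinary binary: inheritable becomes permitted and masks effective. */
    c->permitted = c->inheritable;
    c->effective &= c->inheritable;
    return 0;
}

/* A process that keeps only capability number 10 across an exec. */
static void demo_inherit_one_cap(void)
{
    capmask_t only = (capmask_t)1u << 10;
    struct proc_caps c = { CAPMASK_FULL, only, CAPMASK_FULL, CAPMASK_FULL };

    assert(model_execve(&c, false) == 0);
    assert(c.permitted == only && c.effective == only);
}
```

Capability number 10 is CAP_NET_BIND_SERVICE in the kernel headers;
any other bit would do for the demonstration.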

This patch hasn't been tested very much; in fact, it has hardly been
tested at all (I just ran the kernel in a qemu and made a few basic
checks).  Since it adds a whole new set of capabilities to every
process, it also requires a specially modified version of libcap (the
one I have right now is pretty buggy, so I'm not posting the patch
here).

Consider this more a proof of concept than a serious patch, but I'm
interested in any comments.

### cut after ###
--- linux-2.6.12.4/fs/proc/array.c  2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/fs/proc/array.c 2005-08-09 01:32:07.0 +0200
@@ -281,10 +281,12 @@ static inline char *task_cap(struct task
 {
     return buffer + sprintf(buffer, "CapInh:\t%016x\n"
                             "CapPrm:\t%016x\n"
-                            "CapEff:\t%016x\n",
+                            "CapEff:\t%016x\n"
+                            "CapBnd:\t%016x\n",
                             cap_t(p->cap_inheritable),
                             cap_t(p->cap_permitted),
-                            cap_t(p->cap_effective));
+                            cap_t(p->cap_effective),
+                            cap_t(p->cap_bounding));
 }
 
 int proc_pid_status(struct task_struct *task, char * buffer)
--- linux-2.6.12.4/include/linux/binfmts.h  2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/binfmts.h 2005-08-09 01:41:03.0 +0200
@@ -28,7 +28,7 @@ struct linux_binprm{
         int sh_bang;
         struct file * file;
         int e_uid, e_gid;
-        kernel_cap_t cap_inheritable, cap_permitted, cap_effective;
+        kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bounding;
         void *security;
         int argc, envc;
         char * filename;        /* Name of binary as seen by procps */
--- linux-2.6.12.4/include/linux/capability.h   2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/capability.h  2005-08-09 03:14:36.0 +0200
@@ -27,7 +27,7 @@
    library since the draft standard requires the use of malloc/free
    etc.. */
  
-#define _LINUX_CAPABILITY_VERSION  0x19980330
+#define _LINUX_CAPABILITY_VERSION  0x20050809
 
 typedef struct __user_cap_header_struct {
         __u32 version;
@@ -38,6 +38,7 @@ typedef struct __user_cap_data_struct {
         __u32 effective;
         __u32 permitted;
         __u32 inheritable;
+        __u32 bounding;
 } __user *cap_user_data_t;
   
 #ifdef __KERNEL__
@@ -311,7 +312,7 @@ extern kernel_cap_t cap_bset;
 #define CAP_EMPTY_SET       to_cap_t(0)
 #define CAP_FULL_SET        to_cap_t(~0)
 #define CAP_INIT_EFF_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
-#define CAP_INIT_INH_SET    to_cap_t(0)
+#define CAP_INIT_INH_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
 
 #define CAP_TO_MASK(x) (1 << (x))
 #define cap_raise(c, flag)   (cap_t(c) |=  CAP_TO_MASK(flag))
--- linux-2.6.12.4/include/linux/init_task.h    2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/init_task.h   2005-08-09 05:19:32.0 +0200
@@ -94,6 +94,7 @@ extern struct group_info init_groups;
         .cap_effective  = CAP_INIT_EFF_SET,                             \
         .cap_inheritable = CAP_INIT_INH_SET,                            \
         .cap_permitted  = CAP_FULL_SET,                                 \
+        .cap_bounding   = CAP_FULL_SET,                                 \
         .keep_capabilities = 0,                                         \
         .user           = INIT_USER,                                    \
         .comm           = "swapper",                                    \
--- linux-2.6.12.4/include/linux/sched.h    2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/sched.h