
Bug#1072573: rpc.idmapd runs out of file descriptors

2024-06-04 Thread Sergio Gelato

Package: nfs-common
Version: 2.6.2-4
Severity: important
Tags: patch upstream

On some of our bookworm systems I've seen what looked like a file descriptor 
leak. Sample lsof output:

[...]
rpc.idmap 675 root  126r  DIR   0,400  10813 
/run/rpc_pipefs/nfs/clnt11e6 (deleted)
rpc.idmap 675 root  127u FIFO   0,40  0t0  10817 
/run/rpc_pipefs/nfs/clnt11e6/idmap (deleted)
rpc.idmap 675 root  128r  DIR   0,400  10834 
/run/rpc_pipefs/nfs/clnt11ef (deleted)
rpc.idmap 675 root  129u FIFO   0,40  0t0  10838 
/run/rpc_pipefs/nfs/clnt11ef/idmap (deleted)
rpc.idmap 675 root  130r  DIR   0,400  10855 
/run/rpc_pipefs/nfs/clnt11f8 (deleted)
rpc.idmap 675 root  131u FIFO   0,40  0t0  10859 
/run/rpc_pipefs/nfs/clnt11f8/idmap (deleted)

Cranking up the verbosity level to 3 showed that dirscancb never reaps stale 
entries in its queue (no "Stale client" lines). The reason turns out to be that 
the scan terminates on the first directory entry that doesn't contain an 
"idmap" file. Applying the attached patch seems to have solved the problem for 
me.

As far as I can tell the bug is still present upstream, and has been for many 
years (that "goto out" is from 2007 and replaced a "return" so the bug is even 
older than that).

Marking "important" since this has actually caused observable problems in our 
environment.

From: Sergio Gelato
Date: Tue, 4 Jun 2024 16:02:59 +0200
Subject: rpc.idmapd: nfsopen() failures should not be fatal

dirscancb() loops over all clnt* subdirectories of /run/rpc_pipefs/nfs/.
Some of these directories contain /idmap files, others don't. nfsopen()
returns -1 for the latter; we then want to skip the directory, not abort
the entire scan.
---
 utils/idmapd/idmapd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/utils/idmapd/idmapd.c b/utils/idmapd/idmapd.c
index e79c124..f3c540d 100644
--- a/utils/idmapd/idmapd.c
+++ b/utils/idmapd/idmapd.c
@@ -556,7 +556,7 @@ dirscancb(int fd, short UNUSED(which), void *data)
 			if (nfsopen(ic) == -1) {
 				close(ic->ic_dirfd);
 				free(ic);
-				goto out;
+				continue;
 			}
 
 			if (verbose > 2)
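
The effect of the one-line change can be illustrated with a small stand-alone sketch. This is not the idmapd code: toy_nfsopen() and toy_scan() are hypothetical names, and the array of flags merely stands in for "this clnt* directory has an idmap file". It only models the point made above, that "goto out" ends the scan at the first failure while "continue" skips just that entry.

```c
#include <stddef.h>

/* Hypothetical stand-in for nfsopen(): returns 0 when the client
 * directory has an idmap file, -1 otherwise. */
static int toy_nfsopen(int has_idmap)
{
    return has_idmap ? 0 : -1;
}

/* Scan n directory entries and return how many were opened.
 * skip_on_failure == 0 models the original "goto out",
 * skip_on_failure != 0 models the patched "continue". */
int toy_scan(const int *has_idmap, size_t n, int skip_on_failure)
{
    int opened = 0;

    for (size_t i = 0; i < n; i++) {
        if (toy_nfsopen(has_idmap[i]) == -1) {
            if (skip_on_failure)
                continue;   /* patched: ignore this entry only */
            goto out;       /* original: abandon the whole scan */
        }
        opened++;
    }
out:
    return opened;
}
```

With entries {1, 0, 1, 1}, the "goto out" variant stops after the first entry, whereas the "continue" variant also reaches the last two.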

Bug#1023240: firmware-iwlwifi: iwlwifi-so-a0-gf-a0-72.ucode possibly missing

2022-11-10 Thread Sergio Gelato

It was added to linux-firmware.git after 2022-10-12, should be in 20221109:

commit 06dbfbc74388a2e9a7228f4215b884a3139ece56
Author: Gregory Greenman 
Date:   Wed Oct 26 12:15:12 2022 +0300

iwlwifi: add new FWs from core69-81 release

Add the -69.ucode firmwares for the currently supported hardware.
This is not the latest core, but we didn't send it before and it still
can be useful.

I'll check tomorrow whether it will cure the errors I'm getting with kernel
5.19.11-1~bpo11+1 on an AX211 with -71.ucode (regression from 5.18.16-1~bpo11+1
which works on the same hardware/firmware combo).

Bug#837907: stat() hangs on a particular file

2021-05-10 Thread Sergio Gelato

* Salvatore Bonaccorso [2021-05-09 22:37:02 +0200]:
> Control: tags -1 + moreinfo
> 
> Can you still reproduce the issue with a recent kernel?

No, I haven't seen it in a while (and never with 4.9 or 4.19 either; nor with
5.10 but that's statistically less significant).

Bug#952446: Please update to latest nfs-utils

2020-05-12 Thread Sergio Gelato

When doing this (or sooner?), please also take over (and absorb)
orphaned package libnfsidmap, the upstream for which was merged
into nfs-utils three years ago.

Bug#867067: nfs-kernel-server: nfsdcltrack fails to init database

2019-12-30 Thread Sergio Gelato

I have empirical reasons to believe that the fix for CVE-2019-3689 (cf. #940848)
will take care of this bug as well.

Bug#924587: please add nfs-idmapd.service to nfs-client.target

2019-12-22 Thread Sergio Gelato

Please don't.

I'm running an NFSv4.1 environment with sec=krb5p and autofs-ldap just fine
without this.

The rpc.idmapd(8) man page says among other things:

"Note that on more recent kernels only the NFSv4 server uses rpc.idmapd.
 The NFSv4 client instead uses nfsidmap(8), and only falls back to
 rpc.idmapd if there was a problem running the nfsidmap(8) program."

(Aside: the libnfsidmap2 package mistakenly has the nfsidmap man page
in section 5 rather than 8, in both stretch and buster (I haven't checked sid).
Minor documentation bug.)

This applies to Debian stretch and later. I'm not starting rpc.idmapd on
client-only machines at all (the nfsidmap mechanism works for me) and
home directories are reliably being automounted. This makes me believe
that your proposed "fix" is incorrect (at least for general use; if you
need to run an old/odd kernel you may need something like this, but then
you should fix it locally in /etc/systemd/system/ and not impose it on
everyone else).

From reading #842199 I get the impression that autofs is starting too soon,
before the LDAP server can be reached. On my systems autofs.service has an
effective
After=nslcd.service 
which seems to be generated on the basis of an
# X-Start-Before: autofs
comment in /etc/init.d/nslcd (so you probably won't have this on systems
without nslcd). There are many ways to influence the order in which systemd
starts services; why pick on rpc.idmapd?

Bug#939498: systemd warns about use of /var/run in rpc-statd.service

2019-09-05 Thread Sergio Gelato

Package: nfs-common
Version: 1:1.3.4-2.5
Severity: minor

After upgrading to buster, systemd issues the following warning:

systemd[1]: /lib/systemd/system/rpc-statd.service:13: PIDFile= references path 
below legacy directory /var/run/, updating /var/run/rpc.statd.pid → 
/run/rpc.statd.pid; please update the unit file accordingly.

Bug#926483: FTBFS: utils/blkmapd/device-discovery.c:156: undefined reference to `major'

2019-04-06 Thread Sergio Gelato

major(dev) is defined as a macro in <sys/sysmacros.h>, which is provided by 
libc6-dev.  I don't see that listed as an explicit build dependency, so it's 
conceivable that it might be present in some build environments but not in 
others. The configure script tests for it, potentially leading to #define 
MAJOR_IN_SYSMACROS 1 (and/or to #define MAJOR_IN_MKDEV 1, but <sys/mkdev.h> 
seems associated with ZFS and the Solaris Porting Layer only).

Would it be wrong to add an explicit build dependency on libc6-dev? Does it 
actually cure the FTBFS?
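
For reference, a minimal sketch of code that needs the macro, assuming a glibc system where major(), minor() and makedev() come from <sys/sysmacros.h>. toy_major_of() is a hypothetical helper, not something from nfs-utils. If the header is not included, the compiler treats major() as an undeclared function and the link fails with the kind of "undefined reference" seen in this report.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>  /* glibc: major(), minor(), makedev() */

/* Build a device id from (maj, min) and return its major number. */
unsigned int toy_major_of(unsigned int maj, unsigned int min)
{
    dev_t dev = makedev(maj, min);

    return major(dev);
}
```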

Bug#923434: regression in kernel 4.9.144-3 on VIA C7

2019-02-28 Thread Sergio Gelato

Package: linux-image-4.9.0-8-686-pae
Version: 4.9.144-3
Severity: normal

We have a VIA CN700-8237R board (with a Centaur C7 CPU). The kernel says
DMI:  /CN700-8237R, BIOS 6.00 PG 11/25/2009

This used to boot successfully with 4.9.130-2, but consistently fails with
4.9.144-3. It also boots successfully with 4.19.16-1~bpo9+1, so I have a
workaround (at least for the time being).

With 4.9.144-3, the boot usually hangs after printing the message
console [tty0] enabled

Sometimes I see one more line on the console:
tsc: Fast TSC calibration using PIT

There seems to be a difference in behaviour between cold and warm boots, with
cold boots proceeding further (but still hanging eventually). Also, booting
in single user mode sometimes succeeds. This makes me think that the hardware
is not being (re)initialised correctly. I only have this one VIA C7 board,
so I cannot rule out a hardware defect.

I'm reporting this bug just in case someone has a use for the information,
but since (a) 4.19.16-1~bpo9+1 works, and (b) it's time for us to replace
this hardware anyway, I don't particularly need a fix.

Bug#895381: Severity

2019-01-20 Thread Sergio Gelato

* micah anderson [2019-01-20 21:03:53 +0100]:
> I'm not disputing this bug exists, I'm just trying to determine why it
> is you set the severity to "Serious". As you are probably aware, this
> severity indicates that this is a severe violation of Debian policy
> (violates a "must" or "required" directive), or in the package
> maintainer's opinion, makes the package unsuitable for release.

Oh. In the package maintainer's opinion. I had missed that part;
my apologies. No, I don't claim a policy violation on this one
and of course I'll defer to the package maintainer's assessment.

I do find the bug scary, though, due to the possibility of silent
corruption, and have been running a privately patched version of the
package for that reason.

Bug#898137: nfsd: increase DRC cache limit

2018-05-07 Thread Sergio Gelato

Source: linux
Version: 4.9.88-1
Severity: wishlist
Tags: patch

I've run into this capacity limitation in stretch, which is addressed
upstream in Linux 4.15 by the following commit:

commit 44d8660d3bb0a1c8363ebcb906af2343ea8e15f6
Author: J. Bruce Fields 
Date:   Tue Sep 19 20:51:31 2017 -0400

nfsd: increase DRC cache limit

which trivially applies to Linux 4.9 (I haven't checked 3.16) and provides
significant relief in my use case. It would save me (and perhaps others)
work if this change could be included in Debian's 4.9 kernel packages;
otherwise I'll have to keep maintaining my own fork. (4.15 has other
issues so I don't want to use it in production yet.)

For the benefit of others who may be running into the same problem, here
is a more detailed description.

Symptom: an NFS server accepts only a limited number of concurrent v4.1+
mounts. Once that limit is reached, new clients get NFS4ERR_DELAY (10008)
replies to CREATE_SESSION. (This can be seen in the server's dmesg after
rpcdebug -m nfsd -s proc.) Increasing the number of nfsd threads has no
impact on the number of mounts allowed. A server with 512MB of RAM
only accepts 7 or 8 concurrent NFSv4.1+ mounts. From the perspective of
an affected client, mount.nfs appears to hang (triggering a kernel backtrace
after 120 seconds); in reality, though, it just keeps reissuing CREATE_SESSION
calls until one of them succeeds.
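
The behaviour can be pictured with a toy model. This is not kernel or mount.nfs code: toy_create_session() and toy_mount() are hypothetical, the session limit is a plain parameter rather than the real DRC accounting, and the retry count is bounded only so that the sketch terminates.

```c
#define NFS4ERR_DELAY 10008  /* error code quoted in the report */

/* Toy server: grants a session while fewer than max_sessions are
 * active, otherwise answers NFS4ERR_DELAY. */
int toy_create_session(unsigned int *active, unsigned int max_sessions)
{
    if (*active >= max_sessions)
        return NFS4ERR_DELAY;
    (*active)++;
    return 0;
}

/* Toy client: reissue CREATE_SESSION up to max_tries times.
 * Returns the number of calls issued on success, or -1 if every
 * call was answered with NFS4ERR_DELAY. */
int toy_mount(unsigned int *active, unsigned int max_sessions, int max_tries)
{
    for (int i = 1; i <= max_tries; i++)
        if (toy_create_session(active, max_sessions) == 0)
            return i;
    return -1;
}
```

In this model a client arriving when the limit is reached never succeeds until some other session goes away, no matter how many nfsd threads exist.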

Pre-v4.1 clients are unaffected by this since sessions are new to NFS v4.1.

The proposed patch just increases the limit by an order of magnitude, at
the cost of using more kernel memory. As noted in comments in the source
code, it would be nice to make this tuneable by the server administrator.

Bug#895404: NFS server stops accepting mount request / mounted NFS directories became inaccessible on client

2018-04-12 Thread Sergio Gelato

control: severity -1 normal
control: tags -1 + moreinfo

Dear reporter,

I'm sorry to hear that you have lost data. However, it doesn't seem very
constructive to make a bug release-critical without providing enough detail
to make a fix possible. NFS is a complex network protocol, and the root cause
of unexpected behaviour isn't always obvious at first glance.

First of all, has this bug been filed against the right package? The nfsd
processes are actually kernel threads (that's one reason "kill -9" doesn't
work on them); the corresponding package is the kernel image.

How many clients are accessing that NFS server when the problem occurs?
I see that you have RPCNFSDCOUNT=8 but the address range for allowed
clients is a /27. If you have 30 clients all trying to write at the same
time, some of them are going to have to wait until a server thread becomes
available. "server not responding, still trying" is a common symptom of
this. Have you tried tuning the server? You can adjust the thread count
without a reboot.
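
The arithmetic behind the "30 clients" figure, as a small sketch. usable_hosts() and waiting_clients() are hypothetical helpers, and the model assumes the worst case where every client has one request outstanding and each nfsd thread serves one request at a time.

```c
/* Usable IPv4 host addresses in a network with the given prefix
 * length (network and broadcast addresses excluded).
 * Meaningful for prefix lengths 1..30. */
unsigned int usable_hosts(unsigned int prefix_len)
{
    return (1u << (32 - prefix_len)) - 2u;
}

/* Clients left waiting when every client has a request outstanding
 * and each server thread handles one request at a time. */
unsigned int waiting_clients(unsigned int clients, unsigned int threads)
{
    return clients > threads ? clients - threads : 0;
}
```

A /27 gives 30 usable addresses; with RPCNFSDCOUNT=8 that leaves up to 22 clients waiting in this worst case.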

I don't see sec=krb5p in your /etc/exports, so NFS traffic on the wire
is probably unencrypted. Have you looked at it with tcpdump or a similar
tool, particularly when the problem occurs? For example it would be nice
to know whether that "Connection timed out" you get from mount.nfs is for
the portmapper (unlikely), for mountd, or for nfsd itself. (strace may
also tell you some of this.)

Are you familiar with rpcdebug? If client traffic is coming in but the
server isn't replying, you could set debugging flags and look at kernel
log output.

Other available debugging tools include the kernel's event tracing
subsystem, as well as nfsstat, nfsiostat and mountstats from package
nfs-common. (The last two are client-side, so maybe not so useful
if your problem really is at the server end.)

I can't help you much more than this: my own environment is NFSv4-only
(and I feel no urge to look back) while yours is anything but. But if
you manage to pinpoint more precisely what's wrong, someone else may
be able to provide better hints (or you may figure it out yourself).

Bug#884284: nfs4_reclaim_open_state: Lock reclaim failed

2018-04-10 Thread Sergio Gelato

Control: retitle -1 nfs4_reclaim_open_state: Lock reclaim failed

For what it's worth, I've seen the same symptoms in jessie (kernel 3.16.36
at the time) and Ubuntu trusty (3.13.0-93). In my experience, NFSv4 in
stretch is no worse than in jessie.

Rate-limiting those "Lock reclaim failed!" messages would be useful. I've
had to add a filter for them in rsyslog to prevent a DoS on my central
logging infrastructure. I don't see them often, but when a client gets stuck
it can emit this message *many* times.

There is definitely more than one trigger for these. I'm under the impression
that network partitioning events generate short bursts of such messages, but
this is usually benign and does not require a reboot for recovery. Not sure
what causes the more severe incidents (I haven't had one in a while, and
my NFS environment is intentionally v4-only).

My troubleshooting checklist for the next incident includes
  echo 1 > /sys/kernel/debug/tracing/events/nfs4_lock_reclaim/enable
but I haven't had a chance to put this into practice yet.

Bug#895384: nfs-utils: debian/watch pattern matches ../

2018-04-10 Thread Sergio Gelato

Source: nfs-utils
Version: 1:1.3.4-2.1
Tags: patch

https://tracker.debian.org/pkg/nfs-utils mentions "Problems while searching
for a new upstream version". The reason turns out to be that the version
string pattern in debian/watch matches .. as well as 2.3.1 etc. and uscan
treats .. as newest. A minimal fix (tested) is to change /([\.\d]+)/ to
/(\d[\.\d]*)/ . There may be better solutions.
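
The difference between the two patterns can be checked with POSIX regular expressions. This is only a sketch: uscan uses Perl regexes, so \d is written as [0-9] here, and the ^ anchor and trailing slash are assumptions about how a directory name such as "../" or "2.3.1/" would be matched. toy_matches() is a hypothetical helper.

```c
#include <stddef.h>
#include <regex.h>

/* POSIX ERE approximations of the two debian/watch patterns. */
#define OLD_PATTERN "^[.0-9]+/$"       /* ([\.\d]+)   */
#define NEW_PATTERN "^[0-9][.0-9]*/$"  /* (\d[\.\d]*) */

/* Return 1 if name matches the POSIX extended regex, else 0. */
int toy_matches(const char *pattern, const char *name)
{
    regex_t re;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, name, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

The old pattern accepts both "../" and "2.3.1/"; requiring a leading digit keeps "2.3.1/" and rejects "../".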

Incidentally, the latest upstream release of nfs-utils is 2.3.1. Debian sid
still has 1.3.4.

Bug#895381: rpc.gssd: WARNING: handle_gssd_upcall: failed to find uid in upcall string 'mech=krb5'

2018-04-10 Thread Sergio Gelato

Package: nfs-common
Version: 1:1.3.4-2.1
Severity: serious
Tags: fixed-upstream patch

One of my systems has logged
rpc.gssd[1168]: WARNING: handle_gssd_upcall: failed to find uid in upcall 
string 'mech=krb5'

This turns out to be a known problem, covered extensively in
https://bugzilla.redhat.com/show_bug.cgi?id=1419280

Please cherry-pick upstream commit 5ae8be8b6af1a0fdf2fa26051a05d4c04d028849
(and possibly 0a4f5e4daccdeba767b9ef36e30efbd7fd9a76d8 as well, although
I'd rate that at a lower severity).

Bug#884871: rpc.svcgssd starts while disabled in /etc/default/nfs-kernel-server

2018-02-16 Thread Sergio Gelato

rpc.svcgssd is also needed on clients in order to support NFSv4.0 callbacks.
It was moved from nfs-kernel-server to nfs-common for this reason. See
Debian bug #651558.

Apparently the task of starting rpc.svcgssd under SysV init is still entrusted
to the nfs-kernel-server package. Maybe something needs to be done about that.

If you are using systemd, you may find the file systemd/README in the source
package to be of interest. The relevant portion reads:

"rpc.gssd and rpc.svcgssd are assumed to be needed if /etc/krb5.keytab
is present.
If a site needs this file present but does not want the gss daemons
running, it should create
   /etc/systemd/system/rpc-gssd.service.d/01-disable.conf
and
   /etc/systemd/system/rpc-svcgssd.service.d/01-disable.conf

containing
   [Unit]
   ConditionNull=false
"

I think this (or equivalent information; I'd have suggested "systemctl disable"
instead of the above approach) should be included somewhere under
/usr/share/doc/nfs-common/.

As for the side question on how to disable version 4.0 but not 4.1:
try passing --no-nfs-version=4.0 to nfsd. (I haven't tested this myself
yet, only read utils/nfsd/nfsd.c. The man page is too terse about this.)

Bug#884094: register_key_type symbol missing, ABI counter not incremented

2017-12-11 Thread Sergio Gelato

* Ben Hutchings [2017-12-11 15:59:38 +]:
> I don't think there's any good way to deal with this
> now, other than to force a rebuild of the module.

Was afraid of that. It's what I did, of course, but it complicates the rollout.
(I ran "dkms remove openafs/1.6.20 -k $(uname -r); dkms install openafs/1.6.20".)

Will there be a jessie backport of that kernel, to replace 4.9.51-1~bpo8+1 ?

Bug#884094: register_key_type symbol missing, ABI counter not incremented

2017-12-11 Thread Sergio Gelato

Package: linux-image-4.9.0-4-amd64
Version: 4.9.65-3
Severity: important
Affects: openafs-modules-dkms

debian/patches/debian/keys-limit-abi-change-in-4.9.59.patch renames
register_key_type to register_key_type_2 without bumping the ABI counter,
breaking OpenAFS kernel modules (package openafs-modules-dkms) built
against 4.9.51.

Symptom:
[   35.911739] openafs: Unknown symbol register_key_type (err 0)
and the module fails to load.

Bug#837907: more on NFS client hangs

2017-03-28 Thread Sergio Gelato

I have some more information about [what I believe to be] this problem.

We've had similar incidents from several clients, running kernels 3.16.{36,39}
and 4.9 (jessie-backports). I think this rules out a client hardware issue.

The trigger (from the client's perspective) seems to be loss of contact with
the NFS server. The incidents are almost always preceded by one or more

nfs: server  not responding, still trying

log entries. Sometimes there is a known server-side explanation (e.g.,
nfsd thread exhaustion), but not always. In any case, the effects persist
well after communication with the server has recovered; "reboot -f" seems
to be necessary for client recovery, as sync() also hangs indefinitely.

Kernel stack traces on the client vary, as do the affected files and
applications; the issue is by no means limited to Firefox or sqlite.
If desired, I can submit a selection of stack traces (as one bug or as several).

I'm looking for suggestions on how to debug this. I'm thinking of turning on
logging with rpcdebug on the most frequently affected clients, to better
understand the trigger. Is there anything else I should be looking at?

Bug#767389: module hpsa no longer detects MSL2024 tape changer

2017-03-03 Thread Sergio Gelato

control: fixed -1 4.9.2-2~bpo8+1

I confirm that the latest kernel in jessie-backports does not suffer from
this problem.

Am building a patched 3.16.39 (it compiles OK) but won't be able to test it
until the next maintenance window for the affected system.

Bug#767389: module hpsa no longer detects MSL2024 tape changer

2017-03-02 Thread Sergio Gelato

control: tags -1 - moreinfo + fixed-upstream
control: found -1 3.16.39-1+deb8u1

There is a report at https://wiki.debian.org/HP/ProLiant to the effect that 
this was no longer an issue in kernel 4.6.3 from jessie-backports.

Upstream commit 
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/hpsa.c?id=9a4178b76a973684750d20b684bae4f57ab9a355
looks highly relevant and trivial to cherry-pick. Will test it ASAP.

Bug#837907: stat() hangs on a particular file

2016-10-11 Thread Sergio Gelato

This has now been seen several times on two different clients (so it
probably isn't a hardware issue such as bad RAM; the second client has
ECC). The backtraces vary slightly but always involve NFS. I haven't found
a way to recover without a reboot.

I'm now upgrading these systems to the kernel from jessie-backports; we'll
see if the problem still occurs with the newer kernel.

I'm wondering if this might be a regression in 3.16.36 since the problem
appeared recently. I haven't seen it on Ubuntu's 3.13 kernels (of which
we have many instances in production) either.

Bisecting is difficult since I still don't have a sure-fire trigger for the bug.

Bug#837907: stat() hangs on a particular file

2016-09-15 Thread Sergio Gelato

Package: linux-image-3.16.0-4-amd64
Version: 3.16.36-1+deb8u1

One of our systems is suddenly unable to stat() a particular file
(cookies.sqlite-wal in a user's Firefox profile). Any attempt to
do so hangs in the system call, as shown by strace. The file resides
on an NFSv4 share (sec=krb5p). Other files in the same directory on
the same share remain accessible. The affected file is normally accessible
on the NFS server and from other NFS clients running the same kernel.

The user reported a similar incident yesterday on some directories
on a different NFS share (also sec=krb5p, but hosted on a different
server). He rebooted to clear up the problem.

I'd like advice on how to troubleshoot this effectively. I've tried
rpcdebug -m {nfs,rpc,nlm} -s all but didn't see any smoking gun; maybe
some information cached by the kernel is suppressing NFS activity
associated with the stat() calls. The log entries I do see say
NFS: nfs_lookup_revalidate(cookies.sqlite-wal) is valid

Some related kernel traces from this system's logs (in chronological order,
with an intervening reboot; the first trace is associated with yesterday's
incident, the second trace is 2-3 minutes newer than the timestamp on
cookies.sqlite-wal):

[97483.663949] INFO: task ls:23767 blocked for more than 120 seconds.
[97483.663951]   Tainted: PW  O  3.16.0-4-amd64 #1
[97483.663952] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[97483.663954] ls  D 8800afd3b808 0 23767  1 0x0004
[97483.663957]  8800afd3b3b0 0086 00012f40 
88039a057fd8
[97483.663960]  00012f40 8800afd3b3b0 88041ea137f0 
88041edcd128
[97483.663962]  0002 8113eb50 88039a057d60 
88039a057e40
[97483.663965] Call Trace:
[97483.663968]  [] ? wait_on_page_read+0x60/0x60
[97483.663971]  [] ? io_schedule+0x99/0x120
[97483.663974]  [] ? sleep_on_page+0xa/0x10
[97483.663977]  [] ? __wait_on_bit+0x5c/0x90
[97483.663980]  [] ? wait_on_page_bit+0xc6/0xd0
[97483.663983]  [] ? autoremove_wake_function+0x30/0x30
[97483.663986]  [] ? pagevec_lookup_tag+0x1d/0x30
[97483.663989]  [] ? filemap_fdatawait_range+0xd0/0x160
[97483.663993]  [] ? filemap_write_and_wait+0x36/0x50
[97483.664002]  [] ? nfs_getattr+0x108/0x220 [nfs]
[97483.664005]  [] ? vfs_fstatat+0x57/0x90
[97483.664009]  [] ? SYSC_newlstat+0x1d/0x40
[97483.664013]  [] ? system_call_fast_compare_end+0x10/0x15

[ 9724.415533] INFO: task mozStorage #5:2748 blocked for more than 120 seconds.
[ 9724.415537]   Tainted: PW  O 3.16.0-4-amd64 #1
[ 9724.415538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 9724.415539] mozStorage #5   D 8803ee153a88 0  2748   2323 0x
[ 9724.415542]  8803ee153630 0082 00012f40 
8803ee2cffd8
[ 9724.415544]  00012f40 8803ee153630 88041ea537f0 
88041edaf7a0
[ 9724.415545]  0002 a0669800 8803ee2cfc30 
88040b1152c0
[ 9724.415547] Call Trace:
[ 9724.415557]  [] ? nfs_pageio_doio+0x50/0x50 [nfs]
[ 9724.415560]  [] ? io_schedule+0x99/0x120
[ 9724.415563]  [] ? nfs_wait_bit_uninterruptible+0xa/0x10 
[nfs]
[ 9724.415566]  [] ? __wait_on_bit+0x5c/0x90
[ 9724.415568]  [] ? internal_add_timer+0x2a/0x70
[ 9724.415571]  [] ? nfs_pageio_doio+0x50/0x50 [nfs]
[ 9724.415573]  [] ? out_of_line_wait_on_bit+0x77/0x90
[ 9724.415575]  [] ? autoremove_wake_function+0x30/0x30
[ 9724.415578]  [] ? nfs_updatepage+0x15e/0x830 [nfs]
[ 9724.415582]  [] ? nfs_write_end+0x57/0x320 [nfs]
[ 9724.415585]  [] ? iov_iter_copy_from_user_atomic+0x75/0x190
[ 9724.415588]  [] ? generic_perform_write+0x11b/0x1c0
[ 9724.415590]  [] ? __generic_file_write_iter+0x158/0x340
[ 9724.415592]  [] ? generic_file_write_iter+0x39/0xa0
[ 9724.415595]  [] ? nfs_file_write+0x83/0x1a0 [nfs]
[ 9724.415598]  [] ? new_sync_write+0x74/0xa0
[ 9724.415600]  [] ? vfs_write+0xb2/0x1f0
[ 9724.415601]  [] ? SyS_write+0x42/0xa0
[ 9724.415603]  [] ? SyS_lseek+0x43/0xa0
[ 9724.415605]  [] ? system_call_fast_compare_end+0x10/0x15

Bug#816621: ioremap error on /sys/firmware/dmi/entries/*/raw is triggered by mcelog

2016-03-05 Thread Sergio Gelato

* Ben Hutchings [2016-03-04 20:16:18 +]:
> On Fri, 2016-03-04 at 11:12 +0100, Sergio Gelato wrote:
> > A workaround may be to teach mcelog (and dmidecode, while we're at it)
> > to use the /sys/firmware/dmi interface when available.
> 
> As another motivating point, I plan to disable /dev/mem by default in
> time for stretch.

Noted. I've filed bug #816825 against mcelog, with a simple patch to
release the mapping right away and a reference to this bug and to your
comment above. (The simple patch solves my immediate problem at work
and should fall below the copyrightability threshold. It doesn't address
the disappearance of /dev/mem; that will require more work.)

libsmbios also depends on the /dev/mem interface at the moment.

Bug#816621: ioremap error on /sys/firmware/dmi/entries/*/raw is triggered by mcelog

2016-03-04 Thread Sergio Gelato

* Sergio Gelato [2016-03-04 09:12:14 +0100]:
> I still think there is a kernel issue here: mcelog shouldn't be able to
> request the wrong page cache mode and spoil things for everyone else.

It turns out that mcelog, just like dmidecode, mmap()s portions of /dev/mem,
which results in the pages being marked WB for the lifetime of the mapping,
which can be short (dmidecode, mcelog --dmi) or long (mcelog --daemon).

Any attempt to read from /sys/firmware/dmi/entries/*/raw while the pages
are marked WB results in EINVAL. This is because dmi_remap() is an alias
for ioremap(), and the latter is currently a wrapper around ioremap_nocache().
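
A toy model of the conflict, for illustration only. It is not the kernel's PAT code: the real compatibility rules between memory types are more involved, and toy_remap() and the enum are hypothetical. It captures just the observation above, that a request for an uncached mapping is refused with EINVAL while the range is tracked as write-back.

```c
#include <errno.h>

enum toy_memtype { TOY_WB, TOY_UC_MINUS };

/* Toy model: mapping a range succeeds only if the range is not
 * tracked yet (tracked_valid == 0) or is already tracked with the
 * same type as the one requested. */
int toy_remap(int tracked_valid, enum toy_memtype tracked,
              enum toy_memtype requested)
{
    if (tracked_valid && tracked != requested)
        return -EINVAL;
    return 0;
}
```

In this model, asking for UC- while mcelog holds a WB mapping fails, asking for UC- on an untracked range succeeds, and asking for WB on a WB range succeeds. Whether the last of these is safe for DMI data on real hardware is exactly the open question.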

> (Or is it dmi_remap() that's asking for the wrong mode? I'm not quite sure:
> if DMI data are non-volatile and read-only (are they always?) why shouldn't
> they be cached?)

In other words: in arch/x86/include/asm/dmi.h (and perhaps in
arch/ia64/include/asm/dmi.h), would it be safe to

#define dmi_remap ioremap_cache

instead of the current definition? If the answer is yes, that should solve
the problem. Otherwise it's the mmap code that may need adjusting.

A workaround may be to teach mcelog (and dmidecode, while we're at it)
to use the /sys/firmware/dmi interface when available.

Bug#816621: ioremap error on /sys/firmware/dmi/entries/*/raw is triggered by mcelog

2016-03-04 Thread Sergio Gelato

A closer look at the events around t=9s (when the page cache mode is switched
to WB) pointed to mcelog as a suspect. Indeed the problem went away after
purging the mcelog package. With mcelog (104-1) installed I was getting the
following message in the logs:

mcelog: failed to prefill DIMM database from DMI data

I still think there is a kernel issue here: mcelog shouldn't be able to
request the wrong page cache mode and spoil things for everyone else.

(Or is it dmi_remap() that's asking for the wrong mode? I'm not quite sure:
if DMI data are non-volatile and read-only (are they always?) why shouldn't
they be cached?)

Bug#816621: ioremap error reading /sys/firmware/dmi/entries/1-0/raw

2016-03-03 Thread Sergio Gelato

Package: linux-image-3.16.0-4-amd64
Version: 3.16.7-ckt20-1+deb8u3

Seen on an ASUS B150M-C motherboard with BIOS version 0806 (the latest as of
this writing), booting in UEFI mode. The problem is hardware-dependent, I'm
not seeing it on my other jessie hosts with the same kernel. (None of the
others support UEFI, though.)

All attempts to read /sys/firmware/dmi/entries/1-0/raw (or indeed any other
/sys/firmware/dmi/entries/*/raw file; 1-0/raw is the one
"/usr/bin/facter virtual" tries to read in order to recognise Google Compute
Engine instances) FAIL with, e.g.,

$ od -t cx1 /sys/firmware/dmi/entries/1-0/raw
od: /sys/firmware/dmi/entries/1-0/raw: read error: Invalid argument
000

The attempt generates the following kernel log entries (when booting with 
debugpat):

[ 3967.651507] Overlap at 0xbfed6000-0xbfed8000
[ 3967.651516] reserve_memtype added [mem 0xbfed6000-0xbfed7fff], track 
write-back, req uncached-minus, ret write-back
[ 3967.651521] ioremap error for 0xbfed6000-0xbfed8000, requested 0x10, got 0x0
[ 3967.651528] free_memtype request [mem 0xbfed6000-0xbfed7fff]

/sys/kernel/debug/x86/pat_memtype_list confirms that the relevant range is WB:
write-back @ 0xbfed6000-0xbfed8000

Early during the boot, that address range is marked UC-:

[0.612932] reserve_memtype added [mem 0xbfed6000-0xbfed7fff], track 
uncached-minus, req uncached-minus, ret uncached-minus
[0.613332] free_memtype request [mem 0xbfed6000-0xbfed7fff]

but soon that changes to WB:

[8.328236] reserve_memtype added [mem 0xbfed6000-0xbfed7fff], track 
write-back, req write-back, ret write-back
[8.328580] Overlap at 0xbfed6000-0xbfed8000
[8.328582] reserve_memtype added [mem 0xbfed6000-0xbfed7fff], track 
write-back, req write-back, ret write-back
[8.328611] free_memtype request [mem 0xbfed6000-0xbfed7fff]

I'm attaching a full dmesg output (obtained after another reboot, so the
timings are slightly different).

dmidecode has no trouble reading the information (also attached).

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) 
(gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt20-1+deb8u3 
(2016-01-17)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 
root=/dev/mapper/xxx--vg-root ro quiet debugpat
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009efff] usable
[0.00] BIOS-e820: [mem 0x0009f000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0xb5ea] usable
[0.00] BIOS-e820: [mem 0xb5eb-0xb5edcfff] ACPI data
[0.00] BIOS-e820: [mem 0xb5edd000-0xb7a90fff] usable
[0.00] BIOS-e820: [mem 0xb7a91000-0xb7a91fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xb7a92000-0xb7adbfff] reserved
[0.00] BIOS-e820: [mem 0xb7adc000-0xb7b82fff] usable
[0.00] BIOS-e820: [mem 0xb7b83000-0xb8203fff] reserved
[0.00] BIOS-e820: [mem 0xb8204000-0xbdadefff] usable
[0.00] BIOS-e820: [mem 0xbdadf000-0xbf217fff] reserved
[0.00] BIOS-e820: [mem 0xbf218000-0xbf224fff] ACPI data
[0.00] BIOS-e820: [mem 0xbf225000-0xbf3d] usable
[0.00] BIOS-e820: [mem 0xbf3e-0xbfa1cfff] ACPI NVS
[0.00] BIOS-e820: [mem 0xbfa1d000-0xbff94fff] reserved
[0.00] BIOS-e820: [mem 0xbff95000-0xbfffefff] type 20
[0.00] BIOS-e820: [mem 0xb000-0xbfff] usable
[0.00] BIOS-e820: [mem 0xc000-0xc00f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00103aff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] efi: EFI v2.40 by American Megatrends
[0.00] efi:  ACPI=0xb5eb  ACPI 2.0=0xb5eb  SMBIOS=0xf05b0  
MPS=0xfc9c0 
[0.00] efi: mem00: type=3, attr=0xf, 
range=[0x-0x8000) (0MB)
[0.00] efi: mem01: type=7, attr=0xf, 
range=[0x8000-0x00058000) (0MB)
[0.00] efi: mem02: type=0, attr=0xf, 
range=[0x00058000-0x000

Bug#754420: NULL pointer dereference in set_nfsv4_acl_one()

2014-07-26 Thread Sergio Gelato

See my report of the same bug(*) in Ubuntu at
https://bugs.launchpad.net/debian/+source/linux/+bug/1348670

(*) identification based on a comparison of the stack traces and on the
fact that it is a regression introduced in 3.2.60.


-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: https://lists.debian.org/20140726083621.ga23...@hanuman.astro.su.se

Bug#728255: Please cherrypick upstream commit c93e8d8eeafec3e3228e24dfebef113e0a79a788

2014-06-18 Thread Sergio Gelato

tags 728255 + security fixed-upstream
thanks

My question was answered in an upstream git commit, sadly post-1.2.8 so it
isn't in any Debian or Ubuntu packages yet. Please cherrypick this commit:

author  Signed-off-by: NeilBrown 
Tue, 28 May 2013 16:59:22 + (12:59 -0400)
committer   Steve Dickson
Tue, 28 May 2013 18:28:38 +0000 (14:28 -0400)
commit  c93e8d8eeafec3e3228e24dfebef113e0a79a788

gssd: Fix recent fix to Avoid DNS reverse resolution in gssd.

The final version for this fix that was committed inverted the test
so makes no change in the important cases.

The documentation didn't really help a naive user know when the new -D
flag should be used.

And the code (once fixed) avoided DNS resolution on non-qualified names too,
which probably isn't a good idea.

This patch fixes all three issues.

Signed-off-by: NeilBrown 
Signed-off-by: Steve Dickson 


-- 
Archive: https://lists.debian.org/20140618091937.gc22...@hanuman.astro.su.se

Bug#728255: rpc.gssd: Cannot determine realm for numeric host address

2013-10-29 Thread Sergio Gelato

Package: nfs-common
Version: 1:1.2.6-4

This appears to be a regression caused by the fix for CVE-2013-1923.

Symptoms (hostnames and IP addresses have been changed but nothing else):

Oct 29 14:29:39 MYHOST rpc.gssd[15905]: ERROR: Cannot determine realm for numeric host address while getting realm(s) for host '192.0.2.34'
Oct 29 14:29:39 MYHOST rpc.gssd[15905]: ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host 192.0.2.34
Oct 29 14:29:39 MYHOST kernel: [1321146.189554] RPC: AUTH_GSS upcall timed out.
Oct 29 14:29:39 MYHOST kernel: [1321146.189557] Please check user daemon is running.
Oct 29 14:29:39 MYHOST rpc.gssd[15905]: ERROR: Cannot determine realm for numeric host address while getting realm(s) for host '192.0.2.34'
Oct 29 14:29:39 MYHOST rpc.gssd[15905]: ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host 192.0.2.34

It doesn't happen particularly often; maybe a couple of times a day on
this (admittedly lightly loaded) system.

What I think is happening here is that:
(a) the kernel (3.2.0-4-686-pae #1 SMP Debian 3.2.51-1 i686 GNU/Linux) 
sometimes publishes a numeric IP address instead of the server name 
in the first line of /var/lib/nfs/rpc_pipefs/nfs/clnt*/info ;
(b) when this happens, utils/gssd/gssd_proc.c:get_servername() 
(with avoid_dns==1 since the security fix) simply returns "192.0.2.34"
instead of calling getnameinfo();
(c) utils/gssd/krb5_util.c:get_full_hostname() only calls getaddrinfo(),
not getnameinfo(), and returns "192.0.2.34" when fed "192.0.2.34" as input;
(d) krb5_get_host_realm() doesn't know the realm name for "192.0.2.34".

I'll try the new -D option, but this just disables the security fix.

I wonder if the security fix has been coded correctly. The associated
comments say that the intent is not to do DNS lookups on server names,
but "[i]f it is an IP address, do the DNS lookup". The logic, however,
seems reversed. Could someone please double-check this? (I'm fairly
confident that I'm not misreading the code; what I'd like a second
opinion on is the coder's intent and the security implications of
reversing the logic.)
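
To make the suspected inversion concrete, here is a minimal Python sketch. It is not the nfs-utils code: the function names, the fake lookup table, and the host name nfs.example.org are all made up for illustration, and the "observed" variant assumes the test is simply inverted, as suspected above.

```python
import ipaddress


def is_numeric_address(name):
    """Return True if name parses as an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def servername_as_observed(name, avoid_dns, reverse_lookup):
    """Assumed model of the observed behaviour: with avoid_dns set,
    a numeric address is returned untouched (step (b) above)."""
    if avoid_dns and is_numeric_address(name):
        return name
    return reverse_lookup(name)


def servername_as_documented(name, avoid_dns, reverse_lookup):
    """What the comments describe: skip DNS for host names,
    but still reverse-resolve numeric addresses."""
    if avoid_dns and not is_numeric_address(name):
        return name
    return reverse_lookup(name)


def fake_reverse_lookup(name):
    """Hypothetical stand-in for getnameinfo(); no resolver is contacted."""
    table = {"192.0.2.34": "nfs.example.org"}
    return table.get(name, name)


if __name__ == "__main__":
    for fn in (servername_as_observed, servername_as_documented):
        print(fn.__name__, fn("192.0.2.34", True, fake_reverse_lookup))
```

Under these assumptions only the "documented" variant hands a host name to the realm lookup in step (d); the "observed" variant passes the bare address through, which matches the error in the log.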

Another question is why the kernel upcall sometimes (not very often)
refers to the server by IP address instead of by name. There may be
a kernel bug lurking here.


-- 
Archive: http://lists.debian.org/20131029230650.ga23...@hanuman.astro.su.se

Bug#701616: shouldn't CVE-2012-4530 fix have bumped ABI revision counter?

2013-02-25 Thread Sergio Gelato

Package: linux-2.6
Version: 2.6.32-48

If I upgrade the linux-image package on a running system from
2.6.32-46 to 2.6.32-48, then run
    modprobe binfmt_misc
before rebooting, the kernel fails to load the module and reports
    binfmt_misc: Unknown symbol bprm_change_interp

That symbol was introduced by
debian/patches/bugfix/all/exec-do-not-leave-bprm-interp-on-stack.patch
(as part of the fix for CVE-2012-4530, says the changelog).

I know this will go away after a reboot, but isn't the point of kernel ABI
revision numbers to prevent this kind of problem? Is there a bug in the tools
the kernel package maintainers use to detect ABI changes?

I've seen hints of a similar issue with the lockd module, by the way. No
new symbols as far as I can tell, but trying to load the 2.6.32-48 module
into a 2.6.32-46 kernel results in
    lockd_up: makesock failed, error=-13
and lots of
    svc: failed to register lockdv1 RPC service (errno 13).
with NFS mounts failing. This also goes away after rebooting into 2.6.32-48.


-- 
Archive: http://lists.debian.org/20130225112418.gb2...@ebisu.astro.su.se

Bug#676360: xen: oops at atomic64_read_cx8+0x4

2012-06-07 Thread Sergio Gelato

* Andrea Arcangeli [2012-06-07 12:33:55 +0200]:
> I guess if Xen can't be updated to handle an atomic64_read on a pmd in
> the guest, 

I'm not sure if it makes a difference, but just in case: I observed the
problem in a dom0.

> we can add a pmd_read paravirt op? Or if we don't want to
> break the paravirt interface a loop like gup_fast with irq disabled
> should also work but looping + local_irq_disable()/enable() sounded
> worse and more complex than an atomic64_read (gup fast already disables
> irqs because it doesn't hold the mmap_sem so it's a different cost
> looping there). AFAIK Xen disables THP during boot, so a check on THP
> being enabled and falling back in the THP=n version of
> pmd_read_atomic, would also be safe, but it's not so nice to do it
> with a runtime check.
> 
> Thanks,
> Andrea



-- 
Archive: http://lists.debian.org/20120607123031.gb2...@hanuman.astro.su.se

Bug#676360: nouveau: PFIFO_CACHE_ERROR - Ch 0/7

2012-06-06 Thread Sergio Gelato

* Jonathan Nieder [2012-06-06 11:30:00 -0500]:
> Please test 3.4.1 from experimental (it should finish building and
> show up at incoming.debian.org soon[1]).
> 
> Hope that helps,

Sorry to disappoint you. That 3.4.1-1~experimental.1 build
(3.4-trunk-686-pae #1 SMP Wed Jun 6 15:11:31 UTC 2012 i686 GNU/Linux)
is even less well-behaved under Xen: I'm getting a kernel OOPS at
EIP: [] atomic64_read_cx8+0x4/0xc SS:ESP e021:ca853c6c
The top of the trace message unfortunately scrolled off the console before I
could see it, and the message doesn't have time to make it to syslog (either
local or remote).

An interesting twist: the trace is timestamped 0.776065 but there is exactly
one more message on the console at 1.344071 before the system hangs:
Refined TSC clocksource calibration: 3191.999 MHz.
(This is indeed a 3.2 GHz CPU.)

Non-Xen boots proceed normally.

Feel free to split this bug report since this new OOPS is clearly unrelated
to the nouveau issue (it happens even when the nouveau module is blacklisted);
I only mention it here because it's preventing me from checking whether the
problem with nouveau is still present.



-- 
Archive: http://lists.debian.org/20120607064017.ga2...@hanuman.astro.su.se

Bug#676360: nouveau: PFIFO_CACHE_ERROR - Ch 0/7

2012-06-06 Thread Sergio Gelato

Package: linux-image-3.2.0-2-686-pae
Version: 3.2.18-1

On a fresh wheezy install I got a flood (~ 50 kHz) of messages of the following
type:

Jun  5 09:54:17 hostname kernel: [   14.410948] [drm] nouveau :01:00.0: 
PFIFO_CACHE_ERROR - Ch 0/7 Mthd 0x1470 Data 0x4956cc77
Jun  5 09:54:17 hostname kernel: [   14.410969] [drm] nouveau :01:00.0: 
PFIFO_CACHE_ERROR - Ch 0/7 Mthd 0x1474 Data 0x204f4544
Jun  5 09:54:17 hostname kernel: [   14.410989] [drm] nouveau :01:00.0: 
PFIFO_CACHE_ERROR - Ch 0/7 Mthd 0x1478 Data 0x000d
Jun  5 09:54:17 hostname kernel: [   14.411009] [drm] nouveau :01:00.0: 
PFIFO_CACHE_ERROR - Ch 0/7 Mthd 0x147c Data 0x0d260138
Jun  5 09:54:17 hostname kernel: [   14.411029] [drm] nouveau :01:00.0: 
PFIFO_CACHE_ERROR - Ch 0/7 Mthd 0x1480 Data 0x4249
Jun  5 09:54:17 hostname kernel: [   14.411049] [drm] nouveau :01:00.0: 
PFIFO_CACHE_ERROR - Ch 0/7 Mthd 0x1484 Data 0x4756204d

when booting as a Xen 4.1 dom0. The console displays old content from a 
previous, non-Xen boot instead of the expected boot messages and login
prompt.

Workaround (successfully tested): blacklist nouveau in /etc/modprobe.d/.

Some hardware information:
# lspci -nnvv -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation NV34 [GeForce FX 
5200] [10de:0322] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:01b9]
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- 

Kernel version:
[0.00] Linux version 3.2.0-2-686-pae (Debian 3.2.18-1) 
(debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-5) ) #1 SMP 
Mon May 21 18:24:12 UTC 2012

Xen version:
(XEN) Xen version 4.1.2 (Debian 4.1.2-6) (wa...@debian.org) (gcc version 4.6.3 
(Debian 4.6.3-5) ) Sun May  6 18:55:17 UTC 2012
(XEN) Bootloader: GRUB 1.99-21
(XEN) Command line: placeholder
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds

nouveau-related messages in a non-Xen boot:
Jun  5 09:05:08 hostname kernel: [7.424742] nouveau :01:00.0: PCI INT A 
-> GSI 16 (level, low) -> IRQ 16
Jun  5 09:05:08 hostname kernel: [7.426597] [drm] nouveau :01:00.0: 
Detected an NV30 generation card (0x034200a2)
Jun  5 09:05:08 hostname kernel: [7.426781] [drm] nouveau :01:00.0: 
Attempting to load BIOS image from PRAMIN
Jun  5 09:05:08 hostname kernel: [7.470839] [drm] nouveau :01:00.0: ... 
appears to be valid
Jun  5 09:05:08 hostname kernel: [7.471109] [drm] nouveau :01:00.0: BMP 
BIOS found
Jun  5 09:05:08 hostname kernel: [7.471112] [drm] nouveau :01:00.0: BMP 
version 5.39
Jun  5 09:05:08 hostname kernel: [7.471115] [drm] nouveau :01:00.0: 
Bios version 04.34.20.22
Jun  5 09:05:08 hostname kernel: [7.471118] [drm] nouveau :01:00.0: 
Found Display Configuration Block version 2.2
Jun  5 09:05:08 hostname kernel: [7.471123] [drm] nouveau :01:00.0: Raw 
DCB entry 0: 01000300 88b8
Jun  5 09:05:08 hostname kernel: [7.471126] [drm] nouveau :01:00.0: Raw 
DCB entry 1: 02010310 88b8
Jun  5 09:05:08 hostname kernel: [7.471129] [drm] nouveau :01:00.0: Raw 
DCB entry 2: 01010312 
Jun  5 09:05:08 hostname kernel: [7.471132] [drm] nouveau :01:00.0: Raw 
DCB entry 3: 02020321 0003
Jun  5 09:05:08 hostname kernel: [7.471319] [drm] nouveau :01:00.0: 
Loading NV17 power sequencing microcode
Jun  5 09:05:08 hostname kernel: [7.471323] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 0 at offset 0xE947
Jun  5 09:05:08 hostname kernel: [7.473347] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 1 at offset 0xEB87
Jun  5 09:05:08 hostname kernel: [7.473356] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 2 at offset 0xECCD
Jun  5 09:05:08 hostname kernel: [7.473392] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 3 at offset 0xEE53
Jun  5 09:05:08 hostname kernel: [7.473396] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 4 at offset 0xEE70
Jun  5 09:05:08 hostname kernel: [7.473401] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 5 at offset 0xEE8D
Jun  5 09:05:08 hostname kernel: [7.493002] [drm] nouveau :01:00.0: 
Parsing VBIOS init table 6 at offset 0xF011
Jun  5 09:05:08 hostname kernel: [7.512553] [drm] nouveau :01:00.0: 1 
available performance level(s)
Jun  5 09:05:08 hostname kernel: [7.512558] [drm] nouveau :01:00.0: 0:
Jun  5 09:05:08 hostname kernel: [7.512566] [drm] nouveau :01:00.0: c: 
core 249MHz memory 405MHz
Jun  5 09:05:08 hostname kernel: [7.512769] [drm] nouveau :01:00.0: 
Detected 128MiB VRAM
Jun  5 09:05:08 hostname kernel: [7.512977] nouveau :01:00.0: putting 
AGP V3 device into 8x mode
Jun  5 09:05:08 hostname kernel: [7.512985] [drm] nou

Bug#606482: no headphone output on ASUS M4A785T-D motherboard

2012-02-27 Thread Sergio Gelato

* Ben Hutchings [2012-02-27 14:50:45 +0000]:
> On Mon, 2012-02-27 at 09:06 +0100, Sergio Gelato wrote:
> > * Jonathan Nieder [2012-02-25 21:19:43 -0600]:
> > > Sergio Gelato wrote:
> > > 
> > > > The problem turned out to be due to an inappropriate BIOS configuration
> > > > setting. The "Front Panel Select" setting needed (for my specific case)
> > > > to be set to "AC97" instead of "HD Audio". Found thanks to some 
> > > > comments in
> > > > https://bugtrack.alsa-project.org/alsa-bug/view.php?id=5309
> > > > (especially note 23449).
> > > >
> > > > No kernel changes needed.
> > > 
> > > That sounds like a workaround rather than a fix.
> > 
> > I don't see it that way: my front panel is in fact of the older AC97 type,
> > so the earlier BIOS setting was incorrect and changing it was the proper
> > thing to do.
> > 
> > When I installed this motherboard I wasn't sure whether I had HDA or AC97,
> > so I decided to try one setting and see if it worked. The fact that the
> > Ubuntu 10.04 kernel appeared to work even with the setting I tried first
> > actually made it harder for me to correctly diagnose the problem.
> 
> AC97 and HDA are specifications for the interface from the PCI(e) bus to
> the sound chip.  Some chips support both interfaces, either at the same
> time or selected by firmware (BIOS setting).
> 
> [...]
> > My preferred approach would be to add this to the troubleshooting guides:
> > if your audio front panel is misbehaving, check that it is of the right type
> > for your motherboard. (Probably with some additional words about AC97 vs.
> > HDA and/or a link to an external reference.)
> [...]
> 
> The connection from the sound chip to any external connectors is
> independent of such specifications; there is no such thing as an 'HDA
> front panel'.  However the two specifications have different ways for
> the chip/board to describe which connectors are wired to it.  The
> problem you're seeing is very likely related to some oddity of the
> description.

If you say so. I don't claim any real expertise in this area. I do note that
http://www.intel.com/support/motherboards/desktop/sb/cs-020642.htm#standards
says in part:
"To identify your front panel audio solution’s audio codec, refer 
 to the specifications or documentation for your PC chassis or 
 front panel module. Note that AC’97 and HD Audio front panel 
 solutions are different and may not be directly compatible 
 or interchangeable."
and goes on to point out a physical wiring difference involving pin 4.



--
Archive: http://lists.debian.org/20120227155030.gg2...@hanuman.astro.su.se

Bug#606482: no headphone output on ASUS M4A785T-D motherboard

2012-02-27 Thread Sergio Gelato

* Jonathan Nieder [2012-02-25 21:19:43 -0600]:
> Sergio Gelato wrote:
> 
> > The problem turned out to be due to an inappropriate BIOS configuration
> > setting. The "Front Panel Select" setting needed (for my specific case)
> > to be set to "AC97" instead of "HD Audio". Found thanks to some comments in
> > https://bugtrack.alsa-project.org/alsa-bug/view.php?id=5309
> > (especially note 23449).
> >
> > No kernel changes needed.
> 
> That sounds like a workaround rather than a fix.

I don't see it that way: my front panel is in fact of the older AC97 type,
so the earlier BIOS setting was incorrect and changing it was the proper
thing to do.

When I installed this motherboard I wasn't sure whether I had HDA or AC97,
so I decided to try one setting and see if it worked. The fact that the
Ubuntu 10.04 kernel appeared to work even with the setting I tried first
actually made it harder for me to correctly diagnose the problem.

> If it is possible to get the headphone jack working in HDA mode as well
> as AC97, we would like to do that, to avoid new users having to learn what
> BIOS knob to change.

That way lies madness. Maybe in this particular instance one can get away with 
it, but in general this approach will add complexity to the software. 
It's already bad enough to have to work around hardware bugs.

My impression so far is that the newer HDA front panels allow better power
management and that the driver change that apparently broke headphone sound
for me was actually an enhancement to make better use of the capabilities of
HDA. So of course one can revert to the older approach, but then one
probably loses some benefits of the newer one. One could add a kernel 
(module) option to control this, but then the user needs to figure out what 
setting is needed and it's just as easy to find out about the BIOS switch 
instead.

My preferred approach would be to add this to the troubleshooting guides:
if your audio front panel is misbehaving, check that it is of the right type
for your motherboard. (Probably with some additional words about AC97 vs.
HDA and/or a link to an external reference.)

Now, if there was a way for the kernel to detect a misconfigured front panel
and issue a warning in dmesg that would be great. I have no idea whether that
is feasible.

>  Based on the upstream report you mentioned it
> seems that 2.6.39 might fix this; could you try 3.2.y from wheezy or
> unstable?  (The only packages needed from outside squeeze for this
> test are the kernel image itself, linux-base, and initramfs-tools.)

The reason I revisited this bug now is that I tried 3.2.0-0.bpo.1 (for
other reasons) and found that the headphone functionality was still
broken. That prompted me to make a new search through the ALSA knowledge 
base, which yielded the hint about the two types of front panel. The
report I found the hint in ended up addressing some other issue.

> Thanks for the update,
> Jonathan

-- 
Archive: http://lists.debian.org/20120227080637.ga2...@hanuman.astro.su.se

Bug#632141: rpc.gssd: ERROR: can't open /var/lib/nfs/rpc_pipefs/nfs/clnt[[:xdigit:]]+: No such file or directory

2011-06-29 Thread Sergio Gelato

Package: nfs-common
Version: 1:1.2.2-4

On a VM where autofs is being kept rather busy mounting and unmounting NFS
shares I get a stream of messages like the following:

rpc.gssd[513]: ERROR: can't open /var/lib/nfs/rpc_pipefs/nfs/clnt1ac6: No such file or directory
rpc.gssd[513]: ERROR: can't open /var/lib/nfs/rpc_pipefs/nfs/clnt1aca: No such file or directory

and so on with the client counter incrementing in uneven steps. From eyeballing
the logs I'd say that about 1-3% of the clnt* directories give rise to such a
message.

The message is generated by utils/gssd/gssd_proc.c:process_clnt_dir(), which
is called from utils/gssd/gssd_proc.c:process_pipedir(). The latter does a
scandir(3) on /var/lib/nfs/rpc_pipefs/nfs, then calls process_clnt_dir()
for every entry whose name begins with "clnt". process_clnt_dir() tries to
open(2) the clnt* directory and sometimes gets ENOENT. My guess is that there
is a race condition: something else removes clnt* directories between the
scandir() and the open(). If so, maybe the message should be downgraded to
a WARNING and not printed at verbosity 0?
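
The suspected race, and the tolerant handling suggested above, can be sketched in Python as follows. This is an illustration only, not the gssd code: process_pipedir() here is a simplified stand-in that borrows the real function's name, the between_scan_and_open hook exists purely to simulate a concurrent removal, and clnt1acb is a made-up entry.

```python
import os
import tempfile


def process_pipedir(pipedir, between_scan_and_open=None):
    """Scan pipedir for clnt* entries, then open each one.

    Returns (opened, vanished). An entry that disappears between the
    scan and the open is recorded as vanished, not treated as an error.
    """
    names = sorted(n for n in os.listdir(pipedir) if n.startswith("clnt"))
    if between_scan_and_open is not None:
        between_scan_and_open()  # hypothetical hook: simulate a concurrent unmount
    opened, vanished = [], []
    for name in names:
        path = os.path.join(pipedir, name)
        try:
            fd = os.open(path, os.O_RDONLY)
        except FileNotFoundError:
            # ENOENT: the directory was removed after the scan; skip it.
            vanished.append(name)
            continue
        os.close(fd)
        opened.append(name)
    return opened, vanished


def demo():
    """Create three clnt* directories and remove one after the scan."""
    with tempfile.TemporaryDirectory() as pipedir:
        for name in ("clnt1ac6", "clnt1aca", "clnt1acb"):
            os.mkdir(os.path.join(pipedir, name))
        doomed = os.path.join(pipedir, "clnt1aca")
        return process_pipedir(pipedir, lambda: os.rmdir(doomed))


if __name__ == "__main__":
    print(demo())
```

In this sketch a vanished directory is simply skipped; whether gssd should log it as a warning at higher verbosity is the open question raised above.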

I haven't been able to discern an actual impact other than the stream of
error messages in the logs. Of course that doesn't mean that there is no 
such impact.



-- 
Archive: http://lists.debian.org/20110629202823.gb14...@astro.su.se

Bug#606482: 2.6.32-5-amd64: no headphone output on ASUS M4A785T-D motherboard

2010-12-14 Thread Sergio Gelato

* Ben Hutchings [2010-12-13 02:15:03 +0000]:
> On Sun, 2010-12-12 at 11:41 +0100, Sergio Gelato wrote:
> > * Ben Hutchings [2010-12-12 03:10:35 +0000]:
> > > This might be a problem with the headphone detection feature.  You could
> > > try to disable this by turning off the 'Jack Detect' switch.
> > 
> > I would, if I knew how. amixer mentions no such switch.
> 
> I'm just reading the code and it looks like it will create that switch,
> but maybe I am wrong.
> 
> Please test Linux 2.6.37-rc5 as packaged in experimental.

I have now done so, and got the same symptoms: no sound through the headphones
(except for a single "pop" during boot, which I assume to be related to 
hardware initialization). Still no 'Jack Detect' switch either.

For completeness I'll mention that I tried with both settings of the
'Independent HP' switch. No headphone output in either case. Not even
if I explicitly select "Analog Headphones" as the output connector in
Sound Preferences (which mutes the rear speaker output).

-- 
Archive: http://lists.debian.org/20101214090246.ga21...@astro.su.se

Bug#606482: 2.6.32-5-amd64: no headphone output on ASUS M4A785T-D motherboard

2010-12-12 Thread Sergio Gelato

* Ben Hutchings [2010-12-12 03:10:35 +0000]:
> This might be a problem with the headphone detection feature.  You could
> try to disable this by turning off the 'Jack Detect' switch.

I would, if I knew how. amixer mentions no such switch.



-- 
Archive: http://lists.debian.org/20101212104113.ga29...@astro.su.se

Bug#606482: 2.6.32-5-amd64: no headphone output on ASUS M4A785T-D motherboard

2010-12-09 Thread Sergio Gelato

Package: linux-image-2.6.32-5-amd64
Version: 2.6.32-28
Severity: normal

On this hardware:

00:14.2 Audio device [0403]: ATI Technologies Inc SBx00 Azalia (Intel HDA) 
[1002:4383]
Subsystem: ASUSTeK Computer Inc. M4A785TD Motherboard [1043:836c]
Kernel driver in use: HDA Intel
01:05.1 Audio device [0403]: ATI Technologies Inc RS880 Audio Device [Radeon HD 
4200] [1002:970f]
Subsystem: ASUSTeK Computer Inc. M4A785TD Motherboard [1043:83a2]
Kernel driver in use: HDA Intel

I get no sound through the headphone output when using the above named kernel.
I do get sound through the rear speaker output.

Using the kernel in Ubuntu 10.04 (2.6.32-26-generic version 2.6.32-26.48)
on the same hardware I do get sound through the headphone output.

Looking at the kernel source code, I see that Debian's kernel includes
many patches backported from newer kernels (there are 19 patches in
debian/patches/features/all/hda-via/).

Codec information collected by alsa-info:

Codec: VIA VT1708S
Address: 0
Function Id: 0x1
Vendor Id: 0x11060397
Subsystem Id: 0x1043836c
Revision Id: 0x10
No Modem Function Group found
Default PCM:
rates [0x0]:
bits [0x0]:
formats [0x0]:
Default Amp-In caps: N/A
Default Amp-Out caps: N/A
GPIO: io=1, o=0, i=0, unsolicited=1, wake=0
  IO[0]: enable=0, dir=0, wake=0, sticky=0, data=0, unsol=0
Node 0x10 [Audio Output] wcaps 0x41d: Stereo Amp-Out
  Amp-Out caps: ofs=0x2a, nsteps=0x2a, stepsize=0x05, mute=0
  Amp-Out vals:  [0x2a 0x2a]
  Converter: stream=0, channel=0
  PCM:
rates [0x5e0]: 44100 48000 88200 96000 192000
bits [0xe]: 16 20 24
formats [0x1]: PCM
  Power: setting=D3, actual=D3
Node 0x11 [Audio Output] wcaps 0x41d: Stereo Amp-Out
  Amp-Out caps: ofs=0x2a, nsteps=0x2a, stepsize=0x05, mute=0
  Amp-Out vals:  [0x2a 0x2a]
  Converter: stream=0, channel=0
  PCM:
rates [0x5e0]: 44100 48000 88200 96000 192000
bits [0xe]: 16 20 24
formats [0x1]: PCM
  Power: setting=D3, actual=D3
Node 0x12 [Audio Output] wcaps 0x611: Stereo Digital
  Converter: stream=0, channel=0
  Digital:
  Digital category: 0x0
  PCM:
rates [0x5e0]: 44100 48000 88200 96000 192000
bits [0xe]: 16 20 24
formats [0x1]: PCM
  Power: setting=D0, actual=D0
Node 0x13 [Audio Input] wcaps 0x10051b: Stereo Amp-In
  Amp-In caps: ofs=0x0b, nsteps=0x1f, stepsize=0x05, mute=1
  Amp-In vals:  [0x00 0x00]
  Converter: stream=0, channel=0
  SDI-Select: 0
  PCM:
rates [0x560]: 44100 48000 96000 192000
bits [0xe]: 16 20 24
formats [0x1]: PCM
  Power: setting=D0, actual=D0
  Connection: 1
 0x17
Node 0x14 [Audio Input] wcaps 0x10051b: Stereo Amp-In
  Amp-In caps: ofs=0x0b, nsteps=0x1f, stepsize=0x05, mute=1
  Amp-In vals:  [0x00 0x00]
  Converter: stream=0, channel=0
  SDI-Select: 0
  PCM:
rates [0x560]: 44100 48000 96000 192000
bits [0xe]: 16 20 24
formats [0x1]: PCM
  Power: setting=D0, actual=D0
  Connection: 1
 0x1e
Node 0x15 [Audio Output] wcaps 0x611: Stereo Digital
  Converter: stream=0, channel=0
  Digital:
  Digital category: 0x0
  PCM:
rates [0x5e0]: 44100 48000 88200 96000 192000
bits [0xe]: 16 20 24
formats [0x1]: PCM
  Power: setting=D0, actual=D0
Node 0x16 [Audio Mixer] wcaps 0x20050b: Stereo Amp-In
  Amp-In caps: ofs=0x17, nsteps=0x1f, stepsize=0x05, mute=1
  Amp-In vals:  [0x1f 0x1f] [0x00 0x00] [0x00 0x00] [0x00 0x00] [0x00 0x00] 
[0x97 0x97] [0x97 0x97]
  Power: setting=D3, actual=D3
  Connection: 7
 0x10 0x1f 0x1a 0x1b 0x1e 0x1d 0x25
Node 0x17 [Audio Selector] wcaps 0x300501: Stereo
  Power: setting=D0, actual=D0
  Connection: 6
 0x1f 0x1a* 0x1b 0x1e 0x1d 0x16
Node 0x18 [Audio Selector] wcaps 0x30050d: Stereo Amp-Out
  Amp-Out caps: ofs=0x00, nsteps=0x00, stepsize=0x00, mute=1
  Amp-Out vals:  [0x00 0x00]
  Power: setting=D3, actual=D3
  Connection: 1
 0x11
Node 0x19 [Pin Complex] wcaps 0x400581: Stereo
  Pincap 0x0014: OUT Detect
  Pin Default 0x01011012: [Jack] Line Out at Ext Rear
Conn = 1/8, Color = Black
DefAssociation = 0x1, Sequence = 0x2
  Pin-ctls: 0x40: OUT
  Unsolicited: tag=04, enabled=1
  Power: setting=D3, actual=D3
  Connection: 1
 0x18
Node 0x1a [Pin Complex] wcaps 0x400581: Stereo
  Pincap 0x2334: IN OUT Detect
Vref caps: HIZ 50 100
  Pin Default 0x01a19036: [Jack] Mic at Ext Rear
Conn = 1/8, Color = Pink
DefAssociation = 0x3, Sequence = 0x6
  Pin-ctls: 0x21: IN VREF_50
  Unsolicited: tag=04, enabled=1
  Power: setting=D3, actual=D3
  Connection: 1
 0x26
Node 0x1b [Pin Complex] wcaps 0x400581: Stereo
  Pincap 0x2334: IN OUT Detect
Vref caps: HIZ 50 100
  Pin Default 0x0181303e: [Jack] Line In at Ext Rear
Conn = 1/8, Color = Blue
DefAssociation = 0x3, Sequence = 0xe
  Pin-ctls: 0x20: IN VREF_HIZ
  Unsolicited: tag=04, enabled=1
  Power: setting=D0, actual=D0
  Connection: 1
 0x18
Node 0x1c [Pin Complex] wcaps 0x40058d: Stereo Amp-Out
  Amp-Out caps: ofs=0x00, nsteps=0x00, stepsize=0x00, mute=1
  Amp-Out vals:  [0x00 0x00]
  Pin

Bug#577199: linux-image-2.6.26-2-686: NULL pointer dereference at xfs:xfs_vn_getattr+0x16/0x1cf

2010-04-10 Thread Sergio Gelato

Package: linux-image-2.6.26-2-686
Version: 2.6.26-21lenny4

I periodically walk through local filesystems compiling file ownership
statistics. On one large XFS filesystem on one server, the process has
started failing after about 42 days uptime. This is the second time I
see these symptoms; the first time, a reboot cleared them. The failure
always occurs stat()ing a particular file, and once it starts happening
it is 100% reproducible. My guess is that some kernel-internal data
structure got corrupted.
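
For context, the kind of walk involved looks roughly like the Python sketch below. It is not the actual uid-stats script; the function name and the choice to count entries per uid are assumptions made for illustration. The relevant point is that every entry is lstat()ed, which is the system call that appears in the trace further down.

```python
import os
from collections import Counter


def ownership_stats(root):
    """Return a Counter mapping uid -> number of entries owned under root."""
    totals = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                # Entry vanished or is unreadable; skip it and carry on.
                continue
            totals[st.st_uid] += 1
    return totals


if __name__ == "__main__":
    for uid, count in sorted(ownership_stats(".").items()):
        print(uid, count)
```

A user-space try/except like this only covers ordinary errors such as ENOENT or EACCES; it offers no protection against the kernel oops reported here.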

Here are the relevant kernel logs (I've obfuscated the actual host name):

Apr  9 22:03:40 HOST kernel: [6289152.198741] BUG: unable to handle kernel NULL 
pointer dereference at 0008
Apr  9 22:03:40 HOST kernel: [6289152.198809] IP: [] 
:xfs:xfs_vn_getattr+0x16/0x1cf
Apr  9 22:03:40 HOST kernel: [6289152.198873] *pde = 
Apr  9 22:03:40 HOST kernel: [6289152.198903] Oops:  [#1] SMP
Apr  9 22:03:40 HOST kernel: [6289152.198933] Modules linked in: openafs(P) 
inet_diag ppdev parport_pc lp parport autofs4 ipv6 microcode firmware_class 
nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc xt_tcpudp nf_conntrack_ipv4 
xt_state nf_conntrack iptable_filter ip_tables x_tables xfs battery ac 
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore evdev snd_page_alloc 
serio_raw pcspkr psmouse i2c_piix4 button i2c_core sworks_agp agpgart dcdbas 
ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod sg sd_mod ide_cd_mod cdrom 
ide_pci_generic serverworks ide_core mptspi ata_generic mptscsih floppy 
ohci_hcd mptbase scsi_transport_spi libata tg3 usbcore scsi_mod dock thermal 
processor fan thermal_sys [last unloaded: scsi_wait_scan]
Apr  9 22:03:40 HOST kernel: [6289152.199382]
Apr  9 22:03:40 HOST kernel: [6289152.199405] Pid: 26505, comm: uid-stats 
Tainted: P  (2.6.26-2-686 #1)
Apr  9 22:03:40 HOST kernel: [6289152.199453] EIP: 0060:[] EFLAGS: 
00010246 CPU: 0
Apr  9 22:03:40 HOST kernel: [6289152.199502] EIP is at 
xfs_vn_getattr+0x16/0x1cf [xfs]
Apr  9 22:03:40 HOST kernel: [6289152.199531] EAX: df41dd80 EBX:  ECX: 
 EDX: 
Apr  9 22:03:40 HOST kernel: [6289152.199563] ESI: d1e05f64 EDI: e0c0d5a0 EBP: 
d9f26080 ESP: d1e05ed0
Apr  9 22:03:40 HOST kernel: [6289152.199594]  DS: 007b ES: 007b FS: 00d8 GS: 
0033 SS: 0068
Apr  9 22:03:40 HOST kernel: [6289152.199624] Process uid-stats (pid: 26505, 
ti=d1e04000 task=deadaca0 task.ti=d1e04000)
Apr  9 22:03:40 HOST kernel: [6289152.199658] Stack: 082efb10  c1d055b8 
e0c0d5a0 d9f26080 c0177452 d1e05f64 df41dd80
Apr  9 22:03:40 HOST kernel: [6289152.199730] d1e05f64 d1e05f04 
d1e04000 c017752a df41dd80 c1d055b8 03f6
Apr  9 22:03:40 HOST kernel: [6289152.199801]   
0001  1000 0010 
Apr  9 22:03:40 HOST kernel: [6289152.199872] Call Trace:
Apr  9 22:03:40 HOST kernel: [6289152.199919]  [] 
xfs_vn_getattr+0x0/0x1cf [xfs]
Apr  9 22:03:40 HOST kernel: [6289152.199971]  [] 
vfs_getattr+0x36/0x4d
Apr  9 22:03:40 HOST kernel: [6289152.200019]  [] 
vfs_lstat_fd+0x27/0x39
Apr  9 22:03:40 HOST kernel: [6289152.200090]  [] sys_lstat64+0xf/0x23
Apr  9 22:03:40 HOST kernel: [6289152.200154]  [] 
sysenter_past_esp+0x78/0xb1
Apr  9 22:03:40 HOST kernel: [6289152.200203]  [] 
quirk_vt8235_acpi+0x5e/0x7a
Apr  9 22:03:40 HOST kernel: [6289152.200250]  ===
Apr  9 22:03:40 HOST kernel: [6289152.200276] Code: 74 12 8b 42 4c 89 81 e8 00 
00 00 8b 42 50 89 81 ec 00 00 00 c3 55 57 56 89 ce 53 83 ec 04 8b 6a 0c 31 d2 
89 d1 8b 9d 50 01 00 00 <8b> 7b 08 8b 87 04 02 00 00 c7 04 24 05 00 00 00 83 e0 
10 09 c1
Apr  9 22:03:40 HOST kernel: [6289152.200474] EIP: [] 
xfs_vn_getattr+0x16/0x1cf [xfs] SS:ESP 0068:d1e05ed0
Apr  9 22:03:40 HOST kernel: [6289152.201029] ---[ end trace 33600cae36aa9ba8 
]---

Note: although the openafs module is loaded this problem doesn't appear to
involve OpenAFS in any way. It happens on an XFS filesystem that contains
static, archived data and is not currently being exported to clients. The
only activity on that filesystem should be my nightly uid-stats script.
Of course I can't rule out that the problem may arise as a side effect of
other system activity.

This is a production server and the problem isn't severe enough to warrant
much fiddling with the system. I can, however, peek at selected portions of
kernel memory and of the filesystem on request. I should mention that
xfs_check didn't find any fault with the filesystem, and that unmounting
the filesystem and mounting it again doesn't help. Rebooting, as I said,
did clear the problem but it has now recurred.



-- 
Archive: http://lists.debian.org/20100410120054.ga6...@astro.su.se

Bug#534978: clock drift in Xen domU with clocksource=xen

2009-10-04 Thread Sergio Gelato

severity 534978 normal
thanks

I've made some more progress in understanding this behaviour, and have now
figured out a workaround. I find the documentation at 
http://wiki.debian.org/Xen very misleading in several respects.

The domU kernel is receiving time info from the hypervisor as it should. My
earlier suspicion that the vcpu_info wasn't being updated turned out to be
both unfounded (I had missed the significance of "vcpu_info placement") and
irrelevant (even without the vcpu_info updates the time as computed by
pvclock_read_wallclock() drifts several orders of magnitude more slowly 
than observed).

Linux's generic time code (in kernel/time/timekeeping.c), however, doesn't
use the value from pvclock_read_wallclock() directly; instead, it uses the
clocksource (xen_clocksource in this case) to compute an increment to the
xtime kernel variable. For some reason I haven't fully worked out, this
results in the drift I observed. I still suspect there is a bug here (the
accuracy of the time calculation ought to be better than this) so I'm
downgrading the severity to "normal" rather than "wishlist".
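
For illustration only, here is a minimal Python sketch (not kernel code) of the
fixed-point arithmetic a clocksource uses to turn a cycle count into
nanoseconds: ns = (cycles * mult) >> shift. The helper names and the sample
numbers are assumptions made for this example; the sketch does not claim to
identify the cause of the drift. It only shows how any relative error in the
effective multiplier appears as a steady drift rate in parts per million.

```python
NSEC_PER_SEC = 1_000_000_000


def hz_to_mult(hz, shift):
    """Rounded fixed-point multiplier such that
    (cycles * mult) >> shift is approximately nanoseconds."""
    return ((NSEC_PER_SEC << shift) + hz // 2) // hz


def cycles_to_ns(cycles, mult, shift):
    """Convert a cycle delta to nanoseconds using mult/shift."""
    return (cycles * mult) >> shift


def drift_ppm(mult_used, mult_ideal):
    """Relative rate error, in parts per million, caused by using
    mult_used where mult_ideal would have been exact."""
    return (mult_used - mult_ideal) / mult_ideal * 1e6


if __name__ == "__main__":
    shift = 22                      # hypothetical shift value
    hz = NSEC_PER_SEC               # a counter that already ticks in ns
    mult = hz_to_mult(hz, shift)
    print(cycles_to_ns(12345, mult, shift))
    # A hypothetical multiplier that is 42 units too large:
    print(drift_ppm(mult + 42, mult))
```

A clock disciplined by NTP is corrected by steering this effective rate, which
is consistent with the workaround described below.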

The good news is that the NTP support code can correct for this. My workaround
is therefore to run NTP in the domU. It is neither possible nor necessary to
set xen.independent_wallclock=1 (that parameter is only supported by the
featureset=xen kernels, which include the SuSE patch), and it is neither 
necessary nor desirable to change the domU clocksource to jiffies (I tried
that and found that the time accuracy got much worse).

I had been led to expect (again, by the wiki page, but also by common sense) 
that the domU was meant to get its clock from Xen and didn't need to run NTP. 
This appears to only be the case when the domU kernel includes the SuSE patch, 
not for the pv_ops-based approach. 

I would very much appreciate being relieved of the need to run NTP in each and
every domU: that seems wasteful.




Bug#534978: clock drift in Xen domU with clocksource=xen

2009-09-27 Thread Sergio Gelato

I think I've made some progress towards figuring out what's going on here.

First I looked at the Xen mini-os kernel, which keeps time correctly.
I added a few printk()s to gettimeofday() and saw that of the values
in the HYPERVISOR_shared_info structure, the vcpu_info data change often
(never more than a handful of seconds between version increments) while
the wallclock timestamp is updated more rarely.

Then I hacked together a Linux kernel module that adds support for
/proc/xeninfo, exposing (if I did it right) the contents of the
shared_info structure. What I'm seeing is the same occasional updating
of the wallclock timestamp (the values are consistent with what I see
in the mini-os domU) but the vcpu_info (for virtual CPU 0; data for the
other VCPUs are all zeros throughout, as I believe is normal for a
single-processor VM) remains stuck at version 2.
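
As a hedged illustration of why the version field matters: the sketch below
assumes the seqlock-like convention of Xen's vcpu_time_info, in which the
version is incremented before and after every update, so that an odd value
means an update is in progress and a changed value means the data moved under
the reader. The field names follow the Xen public header as I understand it;
the class and the helper functions are hypothetical and exist only for this
example.

```python
from dataclasses import dataclass


@dataclass
class VcpuTimeInfo:
    """Stand-in for the per-VCPU time record in shared_info."""
    version: int = 0
    tsc_timestamp: int = 0
    system_time: int = 0
    tsc_to_system_mul: int = 0
    tsc_shift: int = 0


def read_snapshot(info, max_tries=1000):
    """Return (version, fields) once a consistent copy has been read.

    A copy is consistent when the version is even and identical
    before and after the fields were read."""
    for _ in range(max_tries):
        before = info.version
        fields = (info.tsc_timestamp, info.system_time,
                  info.tsc_to_system_mul, info.tsc_shift)
        after = info.version
        if before == after and not (before & 1):
            return before, fields
    raise RuntimeError("no consistent snapshot obtained")


def updates_between(old_version, new_version):
    """Number of completed updates between two even versions
    (the counter is treated as an unsigned 32-bit value)."""
    return ((new_version - old_version) & 0xFFFFFFFF) // 2
```

Under this convention a version that stays at 2 across many reads means that
no update was published in between, i.e. updates_between(2, 2) is 0.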

A caveat here is that while I'm confident (based on the data; the shift
value is also right, the multiplier is in the right ballpark) that I've 
found a shared_info structure, I'm not sure I got the right one. The kernel 
doesn't seem to export all the symbols needed to find its 
HYPERVISOR_shared_info structure, and it needs to be mapped into memory 
in a special way; it's conceivable that I did something wrong here, even 
though I tried to reuse/imitate existing kernel code as much as I could. 
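
One way to sanity-check a shift value and multiplier of this kind is sketched
below in Python. It assumes the usual pvclock scaling rule (shift the TSC
delta by tsc_shift, then multiply by a 32-bit binary fraction) and derives the
TSC frequency that a given pair implies, which can then be compared with the
CPU clock. The function names and the sample values are hypothetical; they are
not taken from the system discussed here.

```python
def scale_delta(delta, mul_frac, shift):
    """Scale a TSC delta to nanoseconds: apply the signed shift,
    then multiply by mul_frac / 2**32."""
    if shift < 0:
        delta >>= -shift
    else:
        delta <<= shift
    return (delta * mul_frac) >> 32


def implied_tsc_hz(mul_frac, shift):
    """TSC frequency (Hz) implied by a multiplier/shift pair."""
    ns_per_tick = mul_frac / 2**32 * 2.0**shift
    return 1e9 / ns_per_tick


if __name__ == "__main__":
    # Hypothetical pair: 0.5 ns per tick, i.e. a 2 GHz TSC.
    mul_frac, shift = 0x80000000, 0
    print(implied_tsc_hz(mul_frac, shift))
    print(scale_delta(2_000_000_000, mul_frac, shift))
```

If the implied frequency is close to the processor's nominal clock, the pair
is at least plausible, although that alone does not prove the structure is the
right one.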

Anyway, the secular clock drift this bug is about seems consistent with a
failure to receive updates to the vcpu_info data. Is the hypervisor somehow
discriminating against Linux domUs by not updating the data, or does the
domU kernel need to do something more in order to see the updates?

The problem is also reproducible with the 2.6.30-bpo.1 kernel (source code
from backports.org, recompiled locally), by the way.




Bug#534978: clock drift in Xen domU with clocksource=xen

2009-06-28 Thread Sergio Gelato

Package: linux-image-2.6.26-2-686-bigmem
Version: 2.6.26-15lenny3
Severity: important

I'm running this kernel in a Xen domU using the xen clocksource:

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
xen

The dom0 is running linux-image-2.6.26-2-xen-686 (same version 2.6.26-15lenny3),
also with the xen clocksource, and an NTP client.

My understanding of the documentation is that the domU's wall clock should
be based on information passed (in shared memory) by the hypervisor, which
in turn gets clock updates from dom0.

I'm observing that the domU's clock runs fast relative to dom0 and the rest
of the world.

Rebooting the domU causes its clock to be reset to the correct time.
Moreover, I've tried running Xen's mini-os.gz (not in Debian's binary
packages of Xen, I built it from the extras/mini-os directory of the
xen-3 source package) as another domU on the same system, and it
printed correct timestamps. From this I deduce that the hypervisor's
notion of time is correct, and that the problem must lie in how
the domU kernel uses the information from the hypervisor.

So far I haven't observed the 'clocksource/0: Time went backwards' error
message mentioned at http://wiki.debian.org/Xen . I know I could switch
the domU to the "jiffies" clocksource and run NTP in it, but that's only
a workaround.

"xm info" on the dom0 reports:
release                : 2.6.26-2-xen-686
version                : #1 SMP Thu May 28 18:35:28 UTC 2009
machine                : i686
nr_cpus                : 1
nr_nodes               : 1
cores_per_socket       : 1
threads_per_core       : 1
cpu_mhz                : 2399
hw_caps                : bfebfbff:::0080:0400
total_memory           : 2559
free_memory            : 32
node_to_cpu            : node0:0
xen_major              : 3
xen_minor              : 2
xen_extra              : -1
xen_caps               : xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xf580
xen_changeset          : unavailable
cc_compiler            : gcc version 4.3.1 (Debian 4.3.1-2)
cc_compile_by          : waldi
cc_compile_domain      : debian.org
cc_compile_date        : Sat Jun 28 15:25:00 UTC 2008
xend_config_format     : 4




Bug#512617: Flood of warnings from hpet_legacy_next_event()

2009-02-09 Thread Sergio Gelato

tags 512617 + patch
thanks

I'm unfortunately able to reproduce this bug on an HP dc7900 with the
latest available BIOS (V1.11).

The patch in 2.6.28 looks trivial to backport to 2.6.26: it consists simply of
replacing WARN_ON with WARN_ON_ONCE. See
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.28.y.git;a=commitdiff;h=1de5b0854623d30d01d72cd4ea323eb5f39d1f16

There may be other reasons to prefer 2.6.28 on a dc7900, but I'd still
be grateful if this simple fix could make it into lenny.
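
To make the effect of that substitution concrete, here is a small Python
sketch of the "warn once" behaviour. It is an analogy, not the kernel macro:
the factory function and the emit callback are hypothetical names. The idea is
that each call site keeps its own flag, reports the condition the first time
it is true, and stays silent afterwards while still returning the condition to
the caller.

```python
def make_warn_on_once(emit):
    """Create a warn-once checker for a single call site.

    emit is any callable that accepts the warning text."""
    warned = False

    def warn_on_once(condition, message):
        nonlocal warned
        if condition and not warned:
            warned = True       # remember that this site already reported
            emit(message)
        return bool(condition)  # callers can still branch on the result

    return warn_on_once


if __name__ == "__main__":
    check = make_warn_on_once(print)
    for _ in range(1000):
        check(True, "hypothetical warning text")  # printed a single time
```

With a plain warn-on-every-hit check, the same loop would produce a thousand
log entries, which is the flood this bug is about.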




Bug#416393: BUG: warning at kernel/cpu.c:51/unlock_cpu_hotplug()

2008-06-02 Thread Sergio Gelato

found 416393 2.6.18.dfsg.1-18etch5
thanks

Saw the following, at exactly the same time (to the second), on three nodes 
of a 21-node cluster running a LAM/MPI application. In case it matters,
the nodes are all dual-Xeon E5430 running in 64-bit mode.

Jun  2 11:50:36 rama19 kernel: BUG: warning at kernel/cpu.c:51/unlock_cpu_hotplug()
Jun  2 11:50:36 rama19 kernel:
Jun  2 11:50:36 rama19 kernel: Call Trace:
Jun  2 11:50:36 rama19 kernel:  [] unlock_cpu_hotplug+0x3f/0x6c
Jun  2 11:50:36 rama19 kernel:  [] sched_setaffinity+0xf5/0x101
Jun  2 11:50:36 rama19 kernel:  [] sys_sched_setaffinity+0x47/0x54
Jun  2 11:50:36 rama19 kernel:  [] system_call+0x7e/0x83

It's only a warning, so I'm not too worried.


