from:"John Baldwin"

Re: Despite the documentation, "etcupdate extract" handles -D destdir (and its contribution to the default workdir)

2021-04-26 Thread John Baldwin


On 4/24/21 12:22 PM, Mark Millard via freebsd-current wrote:

> # etcupdate -?
> Illegal option -?
>
> usage: etcupdate [-npBF] [-d workdir] [-r | -s source | -t tarball]
>   [-A patterns] [-D destdir] [-I patterns] [-L logfile]
>   [-M options]
> etcupdate build [-B] [-d workdir] [-s source] [-L logfile] [-M options]
> etcupdate diff [-d workdir] [-D destdir] [-I patterns] [-L logfile]
> etcupdate extract [-B] [-d workdir] [-s source | -t tarball] [-L logfile]
>   [-M options]
> etcupdate resolve [-p] [-d workdir] [-D destdir] [-L logfile]
> etcupdate status [-d workdir] [-D destdir]
>
> The "etcupdate extract" material does not show -D destdir as valid.


Thanks, it was a documentation oversight that I've just fixed.  It is
definitely supposed to work and is quite useful for cross-builds (e.g. I use
it frequently to update rootfs images for RISC-V or MIPS that I run under
qemu, or to update the SD card for my RPI, which I cross-build on an x86
host).

--
John Baldwin
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: Suspected mbuf leak with Nginx + sendfile + TLS in 12.2-STABLE

2021-02-04 Thread John Baldwin


On 2/4/21 8:08 AM, GomoR wrote:

> Dear FreeBSD community,
>
> we are encountering a DoS condition on our production machines.
> Our use case is an Nginx reverse proxy serving large files via HTTPS.
> This problem arose when switching kernel and userland from 12.1-RELEASE
> to 12.2-RELEASE. Ports were not upgraded (at first).
>
> Each time a user downloads a file, mbuf and mbuf_cluster usage rises to
> the maximum limit in a matter of seconds. Those values are
> reported by 'netstat -m' as follows:
>
> Normal situation:
>
> mbuf:   256, 26031105,   16767,5974,428087938,   0,   0
> mbuf_cluster:  2048, 8135232,   18408,2704,101644203,   0,   0
>
> Warning situation:
>
> mbuf:   256, 26031105, 2981516,  151205,1109483561,   0,   0
> mbuf_cluster:  2048, 8135232, 2983155,4201,319714617,   0,   0
>
> We have seen a patch related to sendfile + KTLS + mbuf at the link below
> and we updated to -STABLE to apply it:

None of the sendfile or KTLS changes from Netflix are in 12, they are only
in 13 and later.


> Don't transmit mbufs that aren't yet ready on TOE sockets.
> This includes mbufs waiting for data from sendfile() I/O requests, or
> mbufs awaiting encryption for KTLS.
> https://github.com/freebsd/freebsd-src/commit/14c77f30b201bf76119d59678e72051c09c2


This patch only applies to Chelsio T5/T6 NICs when using TOE (TCP offload)
and doesn't affect freeing mbufs.  It just fixes a race in which the NIC
could send random garbage if it transmits the mbuf before the disk I/O
scheduled to populate it with data has completed.


> NIC is:
> ix0: 
>
> What can we do to help you find the root cause?


The first thing I would do, if possible, is bisect between the last
known working version and the version that is known to be broken to
determine which commit introduced the problem.  One thing that could help
here is to see if you can reproduce the problem using a 12.2 kernel on a
12.1 world + ports.  If you can, then you can limit your bisecting to just
building new kernels, which will make that process quicker.
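
As a rough illustration of why bisecting converges quickly, here is a
minimal C sketch of the search itself.  It assumes a single commit
introduced the regression (everything before it is good, everything from it
onward is bad).  The names first_bad() and demo_is_bad() are hypothetical,
and in practice the "callback" is a person building and booting a kernel.

```c
#include <assert.h>

/* Hypothetical callback: returns nonzero if revision `rev` shows the bug. */
typedef int (*is_bad_fn)(long rev);

/*
 * Find the first bad revision in (good, bad], assuming `good` is known
 * good, `bad` is known bad, and the regression came from one commit.
 * Each step halves the range, so about log2(bad - good) builds are needed.
 */
long
first_bad(long good, long bad, is_bad_fn is_bad)
{
	assert(good < bad);
	while (bad - good > 1) {
		long mid = good + (bad - good) / 2;

		if (is_bad(mid))
			bad = mid;	/* regression is at or before mid */
		else
			good = mid;	/* regression is after mid */
	}
	return (bad);
}

/* Illustration only: pretend that made-up revision 1007 broke things. */
int
demo_is_bad(long rev)
{
	return (rev >= 1007);
}
```

With a range of 100 candidate revisions this needs about seven test builds,
which is why restricting the bisect to kernel-only builds pays off.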

You might also see if using a different NIC shows the same problem.  If
not, then it might point to a regression in the NIC driver (or perhaps in
iflib as ix uses iflib I believe).

--
John Baldwin

Re: svn commit: r351246 - in stable: 11/sys/opencrypto 12/sys/opencrypto

2019-08-27 Thread John Baldwin

On 8/26/19 5:25 PM, John Baldwin wrote:
> On 8/26/19 1:59 PM, mike tancsa wrote:
>> On 8/22/2019 6:51 PM, John Baldwin wrote:
>>> On 8/21/19 5:47 PM, Mike Tancsa wrote:
>>>> On 8/21/2019 6:38 PM, John Baldwin wrote:
>>>>> On 8/21/19 9:08 AM, mike tancsa wrote:
>>>>>> On 8/21/2019 12:00 PM, John Baldwin wrote:
>>>>>>> dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>>>>>>> count()'
>>>>>> Thanks, I am not familiar with dtrace at all. This command gives a
>>>>>> syntax error
>>>>>>
>>>>>> 0(cage)# dtrace -n 'fbt::_gone_in:entry {
>>>>>> @counts[curthread->td_proc->p_comm] = count()'
>>>>>> dtrace: invalid probe specifier fbt::_gone_in:entry {
>>>>>> @counts[curthread->td_proc->p_comm] = count(): syntax error near end of
>>>>>> input
>>>>>> 1(cage)#
>>>>> Oops, I forgot the closing }.  First, do "dtrace -l | grep _gone_in" to
>>>>> make sure dtrace is loaded.  You should see something like this:
>>>>>
>>>>> # dtrace -l | grep _gone_in
>>>>> 87003    fbt    kernel  _gone_in entry
>>>>> 87004    fbt    kernel  _gone_in return
>>>>> 98682    fbt    kernel  _gone_in_dev entry
>>>>> 98683    fbt    kernel  _gone_in_dev return
>>>>>
>>>>> Then this should work:
>>>>>
>>>>> # dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>>>>> count() }'
>>>>> dtrace: description 'fbt::_gone_in:entry ' matched 1 probe
>>>>>
>>>> Thanks!
>>>>
>>>> #  dtrace -l | grep _gone_in
>>>> 15632    fbt    kernel  _gone_in entry
>>>> 22693    fbt    kernel  _gone_in_dev entry
>>>>
>>>> # dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] =
>>>> count() }'
>>>> dtrace: description 'fbt::_gone_in:entry ' matched 1 probe
>>>>
>>>> However, it doesn't show anything after that, even as I get the
>>>> deprecation messages in dmesg
>>> Can you hit Ctrl-C after seeing some of the messages?  This trace won't
>>> show any results until you exit dtrace.
>>
>> Hi,
>>
>>     I am still having problems tracking it down via dtrace, but I am
>> able to reproduce the problem on demand with sshd.  What's odd is that if I
>> restrict the list of ciphers in sshd and even specify something like
>> aes-128 on the client, I still get warnings on the server.
>>
>> e.g from a client,
>>
>> % ssh -c aes128-cbc console1 uptime
>>  4:53PM  up  1:02, 3 users, load averages: 0.04, 0.08, 0.08
>>
>> The server shows
> 
> Ok, I was able to reproduce this on an 11.x VM.  It appears to only
> be something that the crypto engine in OpenSSL 1.0.x does (1.1.1 used
> in 12.0 and later has a rewritten /dev/crypto engine).
> 
> I'll see if I can find a way to tone down the warning.  Maybe if
> sshd is only creating sessions and not using them I can restrict
> it to warning the first time a session tries to perform an operation
> using a deprecated algorithm.  (There are separate ioctls for
> creating sessions vs. doing actual crypto ops, and the warning is
> currently in session creation.)

I've committed a fix to head and will MFC it in a few days.  Thanks for tracking
this down!

-- 
John Baldwin

Re: svn commit: r351246 - in stable: 11/sys/opencrypto 12/sys/opencrypto

2019-08-26 Thread John Baldwin

On 8/26/19 1:59 PM, mike tancsa wrote:
> On 8/22/2019 6:51 PM, John Baldwin wrote:
>> On 8/21/19 5:47 PM, Mike Tancsa wrote:
>>> On 8/21/2019 6:38 PM, John Baldwin wrote:
>>>> On 8/21/19 9:08 AM, mike tancsa wrote:
>>>>> On 8/21/2019 12:00 PM, John Baldwin wrote:
>>>>>> dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>>>>>> count()'
>>>>> Thanks, I am not familiar with dtrace at all. This command gives a
>>>>> syntax error
>>>>>
>>>>> 0(cage)# dtrace -n 'fbt::_gone_in:entry {
>>>>> @counts[curthread->td_proc->p_comm] = count()'
>>>>> dtrace: invalid probe specifier fbt::_gone_in:entry {
>>>>> @counts[curthread->td_proc->p_comm] = count(): syntax error near end of
>>>>> input
>>>>> 1(cage)#
>>>> Oops, I forgot the closing }.  First, do "dtrace -l | grep _gone_in" to
>>>> make sure dtrace is loaded.  You should see something like this:
>>>>
>>>> # dtrace -l | grep _gone_in
>>>> 87003    fbt    kernel  _gone_in entry
>>>> 87004    fbt    kernel  _gone_in return
>>>> 98682    fbt    kernel  _gone_in_dev entry
>>>> 98683    fbt    kernel  _gone_in_dev return
>>>>
>>>> Then this should work:
>>>>
>>>> # dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>>>> count() }'
>>>> dtrace: description 'fbt::_gone_in:entry ' matched 1 probe
>>>>
>>> Thanks!
>>>
>>> #  dtrace -l | grep _gone_in
>>> 15632    fbt    kernel  _gone_in entry
>>> 22693    fbt    kernel  _gone_in_dev entry
>>>
>>> # dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] =
>>> count() }'
>>> dtrace: description 'fbt::_gone_in:entry ' matched 1 probe
>>>
>>> However, it doesn't show anything after that, even as I get the
>>> deprecation messages in dmesg
>> Can you hit Ctrl-C after seeing some of the messages?  This trace won't
>> show any results until you exit dtrace.
> 
> Hi,
> 
>     I am still having problems tracking it down via dtrace, but I am
> able to reproduce the problem on demand with sshd.  What's odd is that if I
> restrict the list of ciphers in sshd and even specify something like
> aes-128 on the client, I still get warnings on the server.
> 
> e.g from a client,
> 
> % ssh -c aes128-cbc console1 uptime
>  4:53PM  up  1:02, 3 users, load averages: 0.04, 0.08, 0.08
> 
> The server shows

Ok, I was able to reproduce this on an 11.x VM.  It appears to only
be something that the crypto engine in OpenSSL 1.0.x does (1.1.1 used
in 12.0 and later has a rewritten /dev/crypto engine).

I'll see if I can find a way to tone down the warning.  Maybe if
sshd is only creating sessions and not using them I can restrict
it to warning the first time a session tries to perform an operation
using a deprecated algorithm.  (There are separate ioctls for
creating sessions vs. doing actual crypto ops, and the warning is
currently in session creation.)

> kern.cryptodev_warn_interval=0

I'll try to get this tracked down this week, but this should be a
suitable workaround for now.

-- 
John Baldwin

Re: svn commit: r351246 - in stable: 11/sys/opencrypto 12/sys/opencrypto

2019-08-22 Thread John Baldwin

On 8/21/19 5:47 PM, Mike Tancsa wrote:
> On 8/21/2019 6:38 PM, John Baldwin wrote:
>> On 8/21/19 9:08 AM, mike tancsa wrote:
>>> On 8/21/2019 12:00 PM, John Baldwin wrote:
>>>> dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>>>> count()'
>>> Thanks, I am not familiar with dtrace at all. This command gives a
>>> syntax error
>>>
>>> 0(cage)# dtrace -n 'fbt::_gone_in:entry {
>>> @counts[curthread->td_proc->p_comm] = count()'
>>> dtrace: invalid probe specifier fbt::_gone_in:entry {
>>> @counts[curthread->td_proc->p_comm] = count(): syntax error near end of
>>> input
>>> 1(cage)#
>> Oops, I forgot the closing }.  First, do "dtrace -l | grep _gone_in" to make
>> sure dtrace is loaded.  You should see something like this:
>>
>> # dtrace -l | grep _gone_in
>> 87003    fbt    kernel  _gone_in entry
>> 87004    fbt    kernel  _gone_in return
>> 98682    fbt    kernel  _gone_in_dev entry
>> 98683    fbt    kernel  _gone_in_dev return
>>
>> Then this should work:
>>
>> # dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>> count() }'
>> dtrace: description 'fbt::_gone_in:entry ' matched 1 probe
>>
> Thanks!
> 
> #  dtrace -l | grep _gone_in
> 15632    fbt    kernel  _gone_in entry
> 22693    fbt    kernel  _gone_in_dev entry
> 
> # dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] =
> count() }'
> dtrace: description 'fbt::_gone_in:entry ' matched 1 probe
> 
> However, it doesn't show anything after that, even as I get the
> deprecation messages in dmesg

Can you hit Ctrl-C after seeing some of the messages?  This trace won't
show any results until you exit dtrace.

-- 
John Baldwin

Re: svn commit: r351246 - in stable: 11/sys/opencrypto 12/sys/opencrypto

2019-08-21 Thread John Baldwin

On 8/21/19 9:08 AM, mike tancsa wrote:
> On 8/21/2019 12:00 PM, John Baldwin wrote:
>> dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
>> count()'
> 
> Thanks, I am not familiar with dtrace at all. This command gives a
> syntax error
> 
> 0(cage)# dtrace -n 'fbt::_gone_in:entry {
> @counts[curthread->td_proc->p_comm] = count()'
> dtrace: invalid probe specifier fbt::_gone_in:entry {
> @counts[curthread->td_proc->p_comm] = count(): syntax error near end of
> input
> 1(cage)#

Oops, I forgot the closing }.  First, do "dtrace -l | grep _gone_in" to make
sure dtrace is loaded.  You should see something like this:

# dtrace -l | grep _gone_in
87003    fbt    kernel  _gone_in entry
87004    fbt    kernel  _gone_in return
98682    fbt    kernel  _gone_in_dev entry
98683    fbt    kernel  _gone_in_dev return

Then this should work:

# dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = 
count() }'
dtrace: description 'fbt::_gone_in:entry ' matched 1 probe

-- 
John Baldwin

Re: svn commit: r351246 - in stable: 11/sys/opencrypto 12/sys/opencrypto

2019-08-21 Thread John Baldwin

On 8/21/19 8:21 AM, mike tancsa wrote:
> On a busy server, I am getting a lot of these spewing to dmesg

I have a change staged for MFC that lets you adjust the warning intervals
so you can tone down the spam.
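
For anyone wondering how the rate limiting works: the commit quoted further
down calls ratecheck(9) with a per-algorithm timestamp and a shared
interval.  Below is a simplified userland sketch of that idea.  The name
warn_allowed() is hypothetical, time is passed in as whole seconds to keep
it easy to test, and this is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if a warning may be emitted at time `now` (in seconds), else 0.
 * `*last` holds the time of the previous warning, with 0 meaning "never",
 * so the first call always warns.  `*last` is updated when 1 is returned.
 */
int
warn_allowed(long *last, long now, long interval)
{
	assert(last != NULL && interval >= 0);
	if (*last == 0 || now - *last >= interval) {
		*last = now;
		return (1);
	}
	return (0);
}
```

Keeping one timestamp per algorithm, as the commit does, means a DES
warning does not suppress a Blowfish warning; a larger interval simply
means fewer lines in dmesg.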

> Deprecated code (to be removed in FreeBSD 13): ARC4 cipher via /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): DES cipher via /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): 3DES cipher via /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): Blowfish cipher via
> /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): CAST128 cipher via
> /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): ARC4 cipher via /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): DES cipher via /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): 3DES cipher via /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): Blowfish cipher via
> /dev/crypto
> Deprecated code (to be removed in FreeBSD 13): CAST128 cipher via
> /dev/crypto
> 
> 
> What is the best way to try and track down what apps are triggering that ?

One might be to use 'procstat -af' to see which processes have crypto file
descriptors open (file descriptor type 'c').

The other approach would be to use dtrace with the fbt::_gone_in:entry
trace maybe building a count of process names or some such, something like:

dtrace -n 'fbt::_gone_in:entry { @counts[curthread->td_proc->p_comm] = count()'

Let that run and then Ctrl-C after you see some warnings.

>     ---Mike
> 
> On 8/19/2019 9:30 PM, John Baldwin wrote:
>> Author: jhb
>> Date: Tue Aug 20 01:30:35 2019
>> New Revision: 351246
>> URL: https://svnweb.freebsd.org/changeset/base/351246
>>
>> Log:
>>   MFC 348876: Add warnings to /dev/crypto for deprecated algorithms.
>>   
>>   These algorithms are deprecated algorithms that will have no in-kernel
>>   consumers in FreeBSD 13.  Specifically, deprecate the following
>>   algorithms:
>>   - ARC4
>>   - Blowfish
>>   - CAST128
>>   - DES
>>   - 3DES
>>   - MD5-HMAC
>>   - Skipjack
>>   
>>   Relnotes:  yes
>>
>> Modified:
>>   stable/11/sys/opencrypto/cryptodev.c
>> Directory Properties:
>>   stable/11/   (props changed)
>>
>> Changes in other areas also in this revision:
>> Modified:
>>   stable/12/sys/opencrypto/cryptodev.c
>> Directory Properties:
>>   stable/12/   (props changed)
>>
>> Modified: stable/11/sys/opencrypto/cryptodev.c
>> ==============================================================================
>> --- stable/11/sys/opencrypto/cryptodev.c	Tue Aug 20 01:26:02 2019	(r351245)
>> +++ stable/11/sys/opencrypto/cryptodev.c	Tue Aug 20 01:30:35 2019	(r351246)
>> @@ -388,6 +388,9 @@ cryptof_ioctl(
>>  	struct crypt_op copc;
>>  	struct crypt_kop kopc;
>>  #endif
>> +	static struct timeval arc4warn, blfwarn, castwarn, deswarn, md5warn;
>> +	static struct timeval skipwarn, tdeswarn;
>> +	static struct timeval warninterval = { .tv_sec = 60, .tv_usec = 0 };
>>  
>>  	switch (cmd) {
>>  	case CIOCGSESSION:
>> @@ -408,18 +411,28 @@ cryptof_ioctl(
>>  		case 0:
>>  			break;
>>  		case CRYPTO_DES_CBC:
>> +			if (ratecheck(&deswarn, &warninterval))
>> +				gone_in(13, "DES cipher via /dev/crypto");
>>  			txform = &enc_xform_des;
>>  			break;
>>  		case CRYPTO_3DES_CBC:
>> +			if (ratecheck(&tdeswarn, &warninterval))
>> +				gone_in(13, "3DES cipher via /dev/crypto");
>>  			txform = &enc_xform_3des;
>>  			break;
>>  		case CRYPTO_BLF_CBC:
>> +			if (ratecheck(&blfwarn, &warninterval))
>> +				gone_in(13, "Blowfish cipher via /dev/crypto");
>>  			txform = &enc_xform_blf;
>>  			break;
>>  		case CRYPTO_CAST_CBC:
>> +			if (ratecheck(&castwarn, &warninterval))
>> +				gone_in(13, "CAST128 cipher via /dev/crypto");
>>  			txform = &enc_xform_cast5;
>>  			break;
>>  		case CRYPTO_SKIPJACK_CBC:
>> +			if (ratecheck(&skipwarn, &warninterval))
>> +				gone_in(13, "Skipjack cipher via /dev/crypto");
>>  			txform = &enc_xform_skipjack;
>>  			break;
>>  		case CRYPTO_AES_CBC:
>> @@ -432,6 +445,8 @@ cryptof_ioctl(
>>

Re: /dev/crypto not being used in 12-STABLE

2018-12-07 Thread John Baldwin

On 12/6/18 4:19 PM, Konstantin Belousov wrote:
> On Thu, Dec 06, 2018 at 04:48:35PM -0700, John Nielsen wrote:
>> Is aesni(4) even required if all you want is userland acceleration?
>>
> No, it is not.  Same for rdrand_rng(4), if an application uses hw random
> source directly.

To elaborate further, aesni(4) is only useful to accelerate in-kernel
crypto use (e.g. IPSec or GELI).  The fact that /dev/crypto tries to use it
by default is a bug (IMO) that I'm planning on addressing.

-- 
John Baldwin



Re: /dev/crypto not being used in 12-STABLE

2018-12-06 Thread John Baldwin

On 12/6/18 3:24 PM, John Nielsen wrote:
>> On Dec 6, 2018, at 4:04 PM, Xin LI  wrote:
>>
>> On Thu, Dec 6, 2018 at 11:37 AM John Nielsen  wrote:
>>>
>>> I have upgraded two physical machines from 11-STABLE to 12-STABLE recently 
>>> (one is 12.0-PRERELEASE r341380 and the other is 12.0-PRERELEASE r341391). 
>>> I noticed today that neither machine seems to be utilizing /dev/crypto. 
>>> Typically I see at least ssh/sshd have the device open plus some programs 
>>> from ports. But 'fuser' doesn't list any processes on either machine:
>>>
>>> # fuser /dev/crypto
>>> /dev/crypto:
>>>
>>> Both machines are running custom kernels that include "device crypto" and 
>>> "device cryptodev". One of them additionally has "device aesni".
>>>
>>> Is anyone else seeing this? Any idea what would cause it?
>>
>> Your average OpenSSL applications should not use /dev/crypto, if your
>> goal is to utilize AES-NI (which does not require /dev/crypto).  On
>> capable systems, AES-NI would be used automatically (and it's faster
>> this way).
> 
> Thanks for the response. Is there a way to verify that AES-NI is being used 
> for e.g. ssh? I'm also curious why/when/how the change to not use (or 
> support?) /dev/crypto from base openssl was made.

I suspect it was something we just didn't test in the flurry of other work
during the OpenSSL upgrade.  However, it is much faster to use the AES-NI
instructions in userland than to use a system call that copies the data
into a kernel buffer, uses the same AES-NI instructions, then copies the
data back out again along with the overhead of a pair of user <--> kernel
transitions.  If you have an actual crypto offload device (as in a PCI-e
card or something), then you might be interested in /dev/crypto (and we
should fix that eventually), but AES-NI is just faster software crypto and
is best done directly in userland.
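
On the question of checking whether AES-NI is available to userland at all:
the CPU advertises it through CPUID, which any process can query without
the kernel's help.  A minimal sketch, assuming an x86 build with GCC or
Clang (whose <cpuid.h> provides __get_cpuid()); the function names are
mine.  Note that this only shows the instructions exist, not that a given
ssh binary actually uses them.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper for the CPUID instruction */
#endif

/* CPUID leaf 1 reports AES-NI support in bit 25 of ECX. */
int
ecx_has_aesni(unsigned int ecx)
{
	return ((ecx >> 25) & 1);
}

/*
 * Return 1 if the CPU advertises AES-NI, 0 if it does not, and -1 if
 * this is not an x86 build or CPUID leaf 1 cannot be queried.
 */
int
cpu_has_aesni(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (-1);
	return (ecx_has_aesni(ecx));
#else
	return (-1);
#endif
}
```

No system call is involved here, which is the same property that makes the
userland path cheaper than a round trip through /dev/crypto.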

-- 
John Baldwin


Re: Panic on 11-STABLE with Xen guest

2018-11-26 Thread John Baldwin

On 11/22/18 12:39 PM, Joe Clarke wrote:
> I believe after the commit 340016 for the dynamic IRQ layout, my Xen VM
> started to panic.  I just upgraded the kernel today and saw this:
> 
> xen: unable to map IRQ#2
> panic: Unable to register interrupt override
> cpuid = 0
> KDB: stack backtrace:
> #0 0x8060a4e7 at kdb_backtrace+0x67
> #1 0x805c3787 at vpanic+0x177
> #2 0x805c3603 at panic+0x43
> #3 0x8093a766 at madt_parse_ints+0x96
> #4 0x803353f9 at acpi_walk_subtables+0x29
> #5 0x8093a5e6 at xenpv_register_pirqs+0x56
> #6 0x80928296 at intr_init_sources+0x116
> #7 0x8055eba8 at mi_startup+0x118
> #8 0x8029902c at btext+0x2c
> 
> The following kernel works:
> 
> @(#)FreeBSD 11.2-STABLE #4: Thu Nov  1 02:24:07 EDT 2018
> FreeBSD 11.2-STABLE #4: Thu Nov  1 02:24:07 EDT 2018
> root@creme-brulee:/usr/obj/usr/src/sys/CREME-BRULEE
> 
> The following kernel produces the panic above immediately on boot:
> 
> @(#)FreeBSD 11.2-STABLE #5: Wed Nov 21 11:08:38 EST 2018
> FreeBSD 11.2-STABLE #5: Wed Nov 21 11:08:38 EST 2018
> root@creme-brulee:/usr/obj/usr/src/sys/CREME-BRULEE
> 
> Attached is a screen grab of the console of the panic.

Hmm, I don't see any obvious candidates of Xen changes that weren't included
in the MFC.  I've added royger@ (who maintains Xen in FreeBSD) to the cc to
see if he has an idea.

Roger, the main changes that aren't MFC'd to 11 from 12/head seem to be some
refcounting on event channels and PVHv2 vs PVHv1?

-- 
John Baldwin



Re: Problem with USB <---> UPS management connection

2018-03-19 Thread John Baldwin

On Thursday, March 08, 2018 01:16:46 AM Glen Barber wrote:
> On Wed, Mar 07, 2018 at 08:04:47PM -0500, Mark Saad wrote:
> > > On Mar 7, 2018, at 6:55 AM, wishmaster <artem...@ukr.net> wrote:
> > > 
> > > Hi, colleagues!
> > > 
> > > Something strange happens with a server. I am attempting to connect 
> > > management interface of UPS with server via USB.
> > > In console I see a lot of errors:
> > > 
> > > Mar  7 13:42:04 xxx kernel: ugen2.2:  
> > > at usbus2
> > > Mar  7 13:42:05 xxx kernel: uhid0 on uhub6
> > > Mar  7 13:42:05 xxx kernel: uhid0:  > > class 0/0, rev 1.10/0.02, addr 2> on usbus2
> > > Mar  7 13:42:08 xxx kernel: ugen2.2:  
> > > at usbus2 (disconnected)
> > > Mar  7 13:42:08 xxx kernel: uhid0: at uhub6, port 3, addr 2 (disconnected)
> > > Mar  7 13:42:08 xxx kernel: uhid0: detached
> > > Mar  7 13:42:12 xxx kernel: ugen2.2:  
> > > at usbus2
> > > Mar  7 13:42:12 xxx kernel: uhid0 on uhub6
> > > Mar  7 13:42:12 xxx kernel: uhid0:  > > class 0/0, rev 1.10/0.02, addr 2> on usbus2
> > > Mar  7 13:42:16 xxx kernel: ugen2.2:  
> > > at usbus2 (disconnected)
> > > Mar  7 13:42:16 xxx kernel: uhid0: at uhub6, port 3, addr 2 (disconnected)
> > > Mar  7 13:42:16 xxx kernel: uhid0: detached
> > > 
> > > I have changed USB-cables, USB port on the server - without success.
> > > On another server this problem is absent.
> > > 
> > > FreeBSD version: FreeBSD 11.1-STABLE #1 r329364M:
> > > 
> > > Any ideas?
> > > 
> > All,
> > I lost power at home and noticed that nut didn’t work right. I
> > had a similar dmesg. My box is running 11.1-stable amd64 built
> > from svn 7-8 days ago. When I get power back I’ll post details.
> > 
> 
> This seems suspiciously similar to an issue I am seeing with a USB mouse
> on both stable/11 and a patched build of releng/11.1.  In my case, the dmesg
> shows:
> 
> ugen1.3:  at usbus1 (disconnected)
> ugen1.3:  at usbus1
> ugen1.3:  at usbus1 (disconnected)
> ugen1.3:  at usbus1
> 
> What struck me as "suspiciously similar" is the 'ugen' reference.
> Unfortunately, I do not have more information yet, but have been
> pounding my head on my desk throughout the day.  Then, I saw this
> thread.
> 
> Anyone else seeing at least USB mouse-related issues?  It could entirely
> be a red herring.

I am definitely seeing issues with an APC UPS I have on my desktop.  I have
used this desktop + APC combination for at least 5 years now and have only
seen problems after my most recent upgrade to 11.1-STABLE at r326909.  I did
not have issues on the previous 11.1-STABLE kernel at r321399, so it does
seem like it could be a regression.

-- 
John Baldwin

Re: DDD hangs on start on 11.1-R

2018-03-06 Thread John Baldwin

On Monday, March 05, 2018 08:19:24 AM Daniel Eischen wrote:
> On Mon, 5 Mar 2018, Trond Endrestøl wrote:
> 
> > On Sat, 3 Mar 2018 18:09+0100, Holm Tiffe wrote:
> >
> >> can anyone get ddd to work in 11.1-R or stable?
> >
> > I've more or less given up on devel/ddd, since it relies on the old
> > pty subsystem, now replaced by the new pts subsystem, to communicate
> > with gdb.
> >
> > I build custom kernels containing "device pty", but I'm not sure if
> > that directive is being honoured these days.
> >
> > It's a shame, 'cos ddd is very good at visualizing data structures.
> > Maybe it's possible to patch ddd to use pts instead of pty.
> 
> I used to like ddd also.  You might try devel/gps.  It's more
> than just a debugger, but you can use it just for debugging.
> Note, it's been a while since I've used it, but worked similarly
> to ddd.

I patched ddd to use pts (was a short patch) but it still hangs for me
with both old and new gdb.  I think it is unfortunately abandonware. :(

-- 
John Baldwin


Re: post ino64: lockd no runs?

2017-06-12 Thread John Baldwin

On Sunday, June 11, 2017 11:12:25 AM David Wolfskill wrote:
> On Sun, Jun 04, 2017 at 08:57:44AM -0400, Michael Butler wrote:
> > It seems that {rpc.}lockd no longer runs after the ino64 changes on any
> > of my systems after a full rebuild of src and ports. No log entries
> > offer any insight as to why :-(
> > 
> > imb
> 
> I don't tend to use NFS on my systems that are running head, so I
> haven't had occasion to test this as stated.
> 
> However, I just completed my weekly update of the "production" systems
> here at home, running stable/11.  And I find that lockd seems to be ...
> claiming that all is well, but declining to run (for long).
> 
> To the best of my knowledge, that was not the case until this last
> update, which was from:
> 
> FreeBSD albert.catwhisker.org 11.1-PRERELEASE FreeBSD 11.1-PRERELEASE #316  
> r319566M/319569:1100514: Sun Jun  4 03:54:41 PDT 2017 
> r...@freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT  amd64
> 
> to
> 
> FreeBSD albert.catwhisker.org 11.1-BETA1 FreeBSD 11.1-BETA1 #322  
> r319823M/319823:1100514: Sun Jun 11 03:56:10 PDT 2017 
> r...@freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT  amd64
> 
> The "glaringly obvious" symptom in my case is that I am now unable
> to (directly) save an email message from within mutt(1) by appending
> it to an NFS-resident file.  (Saving it to a local file, then using
> cat(1) to append that to the NFS- resident file & removing the local
> copy works)
> 
> After a few variations on a theme of:
> 
> albert(11.1)[5] sudo service lockd restart
> lockd not running?
> Starting lockd.
> albert(11.1)[6] echo $?
> 0
> albert(11.1)[7] service lockd status
> lockd is not running.
> 
> I finally(!) thought to ask ktrace what's going on (as tailing
> /var/log/messages was completely unproductive, even after enabling
> rc_debug).
> 
> So I tried: "sudo ktrace -di service lockd restart"; upon examination of
> the output of kdump(1), I see that the trace ends with:
> 
>   ...
>   2811 rpc.lockd NAMI  "/var/run/logpriv"
>   2786 sh   CALL  read(0xa,0x627fc0,0x400)
>   2786 sh   GIO   fd 10 read 0 bytes
>""
>   2811 rpc.lockd RET   connect 0
>   2786 sh   RET   read 0
>   2811 rpc.lockd CALL  sendto(0x3,0x7fffe2c0,0x27,0,0,0)
>   2786 sh   CALL  exit(0)
>   2811 rpc.lockd GIO   fd 3 wrote 39 bytes
>"<30>Jun 11 15:43:10 rpc.lockd: Starting"
>   2811 rpc.lockd RET   sendto 39/0x27
>   2811 rpc.lockd CALL  sigaction(SIGALRM,0x7fffec20,0)
>   2811 rpc.lockd RET   sigaction 0
>   2811 rpc.lockd CALL  nlm_syscall(0,0x1e,0x4,0x801015040)
>   2811 rpc.lockd RET   nlm_syscall -1 errno 14 Bad address

This is a really good clue.  nlm_syscall is dying with EFAULT.  The last
argument is a pointer to an array of char * pointers, and the only way
I can see it dying is if it fails to copyin() one of the strings pointed
to by those pointers.  You could try running rpc.lockd under gdb from
ports and setting a breakpoint on 'nlm_syscall' and then printing out
'addr_count' and 'p addrs@(addr_count * 2)'.
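
If gdb is not handy, a similar check can be sketched in C.  This is only an
illustration: it assumes, as the gdb expression above does, that the vector
holds addr_count * 2 string pointers, and first_null_addr() is a
hypothetical helper, not part of rpc.lockd.  It can only catch NULL
entries; a non-NULL pointer to unmapped memory would still fault in
copyin().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scan the addr_count * 2 entries of `addrs` and return the index of
 * the first NULL pointer, or -1 if every entry is non-NULL.
 */
int
first_null_addr(char **addrs, int addr_count)
{
	assert(addrs != NULL && addr_count >= 0);
	for (int i = 0; i < addr_count * 2; i++) {
		if (addrs[i] == NULL)
			return (i);
	}
	return (-1);
}
```

Calling something like this just before the nlm_syscall() invocation and
logging the result would at least rule out the simplest explanation.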

Unfortunately I'm not able to reproduce the failure on a test machine
I have running head post-ino64.

-- 
John Baldwin

Re: if_cxgbev build error on -stable

2016-12-04 Thread John Baldwin

On Sunday, December 04, 2016 03:53:23 PM Konstantin Belousov wrote:
> On Sun, Dec 04, 2016 at 04:23:00PM +0300, Andrey Chernov wrote:
> > It seems counter.h is included before systm.h where critical_* are declared.
> It is weirder than that: sys/counter.h was added in the stable/10
> merge, but the header is not used in the HEAD sources. It is indeed
> needed for the stable/10 driver.  The critical_enter() prerequisite for
> counter.h only exists on i386, which probably explains why John's build
> test did not catch it.
> 
> I am preparing another MFC, so I committed the fix in r309529.

Thanks for fixing this.  I had indeed only tested it on amd64.

-- 
John Baldwin

Re: stable/11 -r307797 on BPi-M3 (cortex-a7): truss gets segmentation fault for handling unknown system call

2016-10-28 Thread John Baldwin

On Tuesday, October 25, 2016 11:40:38 AM Mark Millard wrote:
> [The following has been reported in: 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213778 .]
> 
> In trying to build lang/gcc6 xgcc's cc1 got some SIGSYS examples. In trying 
> to track things down I ran into truss getting a SIGSEGV when it tries to 
> handle the situation. . .
> 
> In truss's enter_syscall there is (from a live gdb on truss, after the 
> segmentation fault):
> 
> 380   t->cs.name = sysdecode_syscallname(t->proc->abi->abi, t->cs.number);
> 381   if (t->cs.name == NULL)
> (gdb) 
> 382   fprintf(info->outfile, "-- UNKNOWN %s SYSCALL %d --\n",
> 383   t->proc->abi->type, t->cs.number);
> 384   
> 385   sc = get_syscall(t->cs.name, narg);
> 386   t->cs.nargs = sc->nargs;
> 387   assert(sc->nargs <= nitems(t->cs.s_args));
> 388   
> 389   t->cs.sc = sc;
> 
> (gdb) print *t
> $2 = {entries = {le_next = 0x0, le_prev = 0x20617070}, proc = 0x20617060,
>   tid = 100150, in_syscall = 1, cs = {sc = 0x0, name = 0x0,
>   number = 580828064, args = 0x2061b0c0, nargs = 0, s_args = 0x2061b0ec},
>   before = {tv_sec = 1477418265, tv_nsec = 492342263},
>   after = {tv_sec = 1477418265, tv_nsec = 492496630}}
> 
> (gdb) print sc
> $3 = (struct syscall *) 0x0
> 
> So line 386 listed above gets a segmentation fault for sc->nargs when 
> t->cs.name is a NULL pointer: sc ends up NULL.
> 
> Looking at the two things that the fprintf on lines 382 and 383 would report:
> 
> (gdb) print t->proc->abi->type
> $4 = 0x10166 "FreeBSD ELF32"
> 
> (gdb) print t->cs.number
> $5 = 580828064
> 
> (gdb) print narg
> $6 = 0
> 
> (that last is for context for the get_syscall arguments).
> 
> FYI: 580828064 = 0x229EBBA0

I have a patchset in a git branch, which I have tested somewhat, that I believe
fixes handling of unknown system calls.  Please try this:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:truss_unknown

(Add .diff to get a diff you can apply with patch)
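
For context on the crash in the quoted report: the lookup was handed a NULL
name, returned NULL, and the result was then dereferenced for sc->nargs.  The
following is a minimal user-space sketch of one way to avoid that class of
bug, by always returning a usable descriptor.  It is an illustration only and
is not taken from the truss_unknown branch; struct sc_desc, lookup_desc() and
the table contents are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a syscall descriptor; not truss's struct syscall. */
struct sc_desc {
	const char *name;
	unsigned int nargs;
};

/* Tiny illustrative table of "known" system calls. */
static struct sc_desc known_descs[] = {
	{ "read", 3 },
	{ "write", 3 },
	{ "close", 1 },
};

/* Shared fallback used for a NULL or unrecognized name. */
static struct sc_desc unknown_desc = { "<unknown>", 0 };

/*
 * Look up a descriptor by name.  Never returns NULL, so callers can read
 * ->nargs without a separate check.
 */
static struct sc_desc *
lookup_desc(const char *name)
{
	size_t i;

	if (name == NULL)
		return (&unknown_desc);
	for (i = 0; i < sizeof(known_descs) / sizeof(known_descs[0]); i++) {
		if (strcmp(known_descs[i].name, name) == 0)
			return (&known_descs[i]);
	}
	return (&unknown_desc);
}
```

With a fallback like this, a caller that receives a NULL name for an
out-of-range syscall number still gets a descriptor with nargs == 0.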

-- 
John Baldwin

Re: nginx and FreeBSD11

2016-09-19 Thread John Baldwin

On Sunday, September 18, 2016 07:22:41 PM Slawa Olhovchenkov wrote:
> On Thu, Sep 15, 2016 at 10:28:11AM -0700, John Baldwin wrote:
> 
> > On Thursday, September 15, 2016 05:41:03 PM Slawa Olhovchenkov wrote:
> > > On Wed, Sep 07, 2016 at 10:13:48PM +0300, Slawa Olhovchenkov wrote:
> > > 
> > > > I have a strange issue with nginx on FreeBSD 11.
> > > > I have FreeBSD 11 installed over STABLE-10.
> > > > nginx built for FreeBSD 10 and run without recompiling works fine.
> > > > nginx built for FreeBSD 11 crashes inside rbtree lookups: the next
> > > > node is totally corrupted.
> > > > 
> > > > I see the following potential causes:
> > > > 
> > > > 1) a clang 3.8 code generation issue
> > > > 2) a system library issue
> > > > 
> > > > Maybe I am missing something?
> > > > 
> > > > How can I find the real cause?
> > > 
> > > I found the real cause, and it looks like a show-stopper for RELEASE.
> > > I use nginx with AIO, and AIO from one nginx process corrupts memory
> > > in another nginx process. Yes, this is cross-process memory
> > > corruption.
> > > 
> > > In the latest case, the process with pid 1060 dumped core at 15:45:14.
> > > The corrupted memory is at 0x860697000.
> > > I know of good memory at 0x86067f800.
> > > Dumping this region (from the core) to a file and analyzing it with
> > > hexdump, I found the start of the corrupted region: offset c8c0 from
> > > 0x86067f800.
> > > 0x86067f800+0xc8c0 = 0x86068c0c0
> > > 
> > > I had previously enabled debug logging of each started AIO operation to
> > > the nginx error log (memory address, file name, offset and size of
> > > transfer).
> > > 
> > > grep -i 86068c0c0 error.log near 15:45:14 gives the target file.
> > > grep ce949665cbcd.hls error.log near 15:45:14 gives the following result:
> > > 
> > > 2016/09/15 15:45:13 [notice] 1055#0: *11659936 AIO_RD 00082065DB60 
> > > start 00086068C0C0 561b0   2646736 ce949665cbcd.hls
> > > 2016/09/15 15:45:14 [notice] 1060#0: *10998125 AIO_RD 00081F1FFB60 
> > > start 00086FF2C0C0 6cdf0 140016832 ce949665cbcd.hls
> > > 2016/09/15 15:45:14 [notice] 1055#0: *11659936 AIO_RD 0008216B6B60 
> > > start 00086472B7C0 7ff70   2999424 ce949665cbcd.hls
> > 
> > Does nginx only use AIO for regular files or does it also use it with 
> > sockets?
> > 
> > You can try using this patch as a diagnostic (you will need to
> > run with INVARIANTS enabled, or at least enabled for vfs_aio.c):
> > 
> > Index: vfs_aio.c
> > ===
> > --- vfs_aio.c   (revision 305811)
> > +++ vfs_aio.c   (working copy)
> > @@ -787,6 +787,8 @@ aio_process_rw(struct kaiocb *job)
> >  * aio_aqueue() acquires a reference to the file that is
> >  * released in aio_free_entry().
> >  */
> > +   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
> > +   ("%s: vmspace mismatch", __func__));
> > if (cb->aio_lio_opcode == LIO_READ) {
> > auio.uio_rw = UIO_READ;
> > if (auio.uio_resid == 0)
> > @@ -1054,6 +1056,8 @@ aio_switch_vmspace(struct kaiocb *job)
> >  {
> >  
> > vmspace_switch_aio(job->userproc->p_vmspace);
> > +   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
> > +   ("%s: vmspace mismatch", __func__));
> >  }
> > 
> > If this panics, then vmspace_switch_aio() is not working for
> > some reason.
> 
> I tried using the following DTrace script:
> 
> #pragma D option dynvarsize=64m
> 
> int req[struct vmspace  *, void *];
> self int trace;
> 
> syscall:freebsd:aio_read:entry
> {
> this->aio = *(struct aiocb *)copyin(arg0, sizeof(struct aiocb));
> req[curthread->td_proc->p_vmspace, this->aio.aio_buf] = 
> curthread->td_proc->p_pid; 
> }
> 
> fbt:kernel:aio_process_rw:entry
> {
> self->job = args[0];
> self->trace = 1;
> }
> 
> fbt:kernel:aio_process_rw:return
> /self->trace/
> {
> req[self->job->userproc->p_vmspace, self->job->uaiocb.aio_buf] = 0;
> self->job = 0;
> self->trace = 0;
> }
> 
> fbt:kernel:vn_io_fault:entry
> /self->trace && !req[curthread->td_proc->p_vmspace, 
> args[1]->uio_iov[0].iov_base]/
> {
> this->buf = args[1]->uio_iov[0].iov_base;
> printf("%Y vn_io_fault %p:%p pid %d\n", walltimestamp, 
> curthread->td_proc->p_vmspace, this->buf, req[curthread->td_proc->p_vmspace, 
> this->buf]);
> }
> ===
> 
> And I don't get any messages near the nginx core dump.
> What can I check next?
> Maybe check the context/address space switch for the kernel process?

Which CPU are you using?  Perhaps try disabling PCID support (I think
vm.pmap.pcid_enabled=0 from the loader prompt or loader.conf)?  (Wondering if
pmap_activate() is somehow not switching.)

-- 
John Baldwin

Re: nginx and FreeBSD11

2016-09-15 Thread John Baldwin

On Thursday, September 15, 2016 10:09:48 PM Slawa Olhovchenkov wrote:
> On Thu, Sep 15, 2016 at 11:54:12AM -0700, John Baldwin wrote:
> 
> > > > Index: vfs_aio.c
> > > > ===
> > > > --- vfs_aio.c   (revision 305811)
> > > > +++ vfs_aio.c   (working copy)
> > > > @@ -787,6 +787,8 @@ aio_process_rw(struct kaiocb *job)
> > > >  * aio_aqueue() acquires a reference to the file that is
> > > >  * released in aio_free_entry().
> > > >  */
> > > > +   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
> > > > +   ("%s: vmspace mismatch", __func__));
> > > > if (cb->aio_lio_opcode == LIO_READ) {
> > > > auio.uio_rw = UIO_READ;
> > > > if (auio.uio_resid == 0)
> > > > @@ -1054,6 +1056,8 @@ aio_switch_vmspace(struct kaiocb *job)
> > > >  {
> > > >  
> > > > vmspace_switch_aio(job->userproc->p_vmspace);
> > > > +   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
> > > > +   ("%s: vmspace mismatch", __func__));
> > > >  }
> > > > 
> > > > If this panics, then vmspace_switch_aio() is not working for
> > > > some reason.
> > > 
> > > This issue occurs rarely. Would this panic be triggered only when the
> > > issue occurs, or on any aio request? (This is a production server.)
> > 
> > It would panic in the case that we are going to write into the wrong
> > process (so about as rare as your issue).
> 
> Can I configure an automatic reboot (rather than a halt) in this case?

FreeBSD in a stable branch should already reboot (after writing out a dump)
by default unless you have configured it otherwise.

-- 
John Baldwin

Re: nginx and FreeBSD11

2016-09-15 Thread John Baldwin

On Thursday, September 15, 2016 08:49:48 PM Slawa Olhovchenkov wrote:
> On Thu, Sep 15, 2016 at 10:28:11AM -0700, John Baldwin wrote:
> 
> > On Thursday, September 15, 2016 05:41:03 PM Slawa Olhovchenkov wrote:
> > > On Wed, Sep 07, 2016 at 10:13:48PM +0300, Slawa Olhovchenkov wrote:
> > > 
> > > > I have a strange issue with nginx on FreeBSD 11.
> > > > I have FreeBSD 11 installed over STABLE-10.
> > > > nginx built for FreeBSD 10 and run without recompiling works fine.
> > > > nginx built for FreeBSD 11 crashes inside rbtree lookups: the next
> > > > node is totally corrupted.
> > > > 
> > > > I see the following potential causes:
> > > > 
> > > > 1) a clang 3.8 code generation issue
> > > > 2) a system library issue
> > > > 
> > > > Maybe I am missing something?
> > > > 
> > > > How can I find the real cause?
> > > 
> > > I found the real cause, and it looks like a show-stopper for RELEASE.
> > > I use nginx with AIO, and AIO from one nginx process corrupts memory
> > > in another nginx process. Yes, this is cross-process memory
> > > corruption.
> > > 
> > > In the latest case, the process with pid 1060 dumped core at 15:45:14.
> > > The corrupted memory is at 0x860697000.
> > > I know of good memory at 0x86067f800.
> > > Dumping this region (from the core) to a file and analyzing it with
> > > hexdump, I found the start of the corrupted region: offset c8c0 from
> > > 0x86067f800.
> > > 0x86067f800+0xc8c0 = 0x86068c0c0
> > > 
> > > I had previously enabled debug logging of each started AIO operation to
> > > the nginx error log (memory address, file name, offset and size of
> > > transfer).
> > > 
> > > grep -i 86068c0c0 error.log near 15:45:14 gives the target file.
> > > grep ce949665cbcd.hls error.log near 15:45:14 gives the following result:
> > > 
> > > 2016/09/15 15:45:13 [notice] 1055#0: *11659936 AIO_RD 00082065DB60 
> > > start 00086068C0C0 561b0   2646736 ce949665cbcd.hls
> > > 2016/09/15 15:45:14 [notice] 1060#0: *10998125 AIO_RD 00081F1FFB60 
> > > start 00086FF2C0C0 6cdf0 140016832 ce949665cbcd.hls
> > > 2016/09/15 15:45:14 [notice] 1055#0: *11659936 AIO_RD 0008216B6B60 
> > > start 00086472B7C0 7ff70   2999424 ce949665cbcd.hls
> > 
> > Does nginx only use AIO for regular files or does it also use it with 
> > sockets?
> 
> Only for regular files.
> 
> > You can try using this patch as a diagnostic (you will need to
> > run with INVARIANTS enabled,
> 
> How much debug output does it produce?
> I have about 5-10K AIOs per second.
> 
> > or at least enabled for vfs_aio.c):
> 
> How can I do this (enable INVARIANTS for vfs_aio.c)?

Include INVARIANT_SUPPORT in your kernel and add a line with:

#define INVARIANTS

at the top of sys/kern/vfs_aio.c.
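
The effect of defining INVARIANTS in a single source file can be sketched in
user space as follows.  This is a simplified analogy, not the kernel's
KASSERT: the real macro panics, whereas the hypothetical SKETCH_KASSERT below
only counts failures so that the example can run to completion.  All names
here are made up for illustration.

```c
#include <stdio.h>

/* Defining this at the top of one file enables the checks in that file only. */
#define INVARIANTS

static int sketch_failures;	/* the kernel would panic instead of counting */

#ifdef INVARIANTS
#define SKETCH_KASSERT(exp, msg) do {					\
	if (!(exp)) {							\
		fprintf(stderr, "assertion failed: %s\n", (msg));	\
		sketch_failures++;					\
	}								\
} while (0)
#else
#define SKETCH_KASSERT(exp, msg) do { } while (0)
#endif

/* Hypothetical stand-ins for the kernel structures named in the patch. */
struct vmspace_sketch { int id; };
struct proc_sketch { struct vmspace_sketch *p_vmspace; };

/* Mirrors the shape of the check: worker and job owner must share a vmspace. */
static void
check_vmspace(const struct proc_sketch *cur, const struct proc_sketch *owner)
{
	SKETCH_KASSERT(cur->p_vmspace == owner->p_vmspace, "vmspace mismatch");
}
```

Removing the #define turns SKETCH_KASSERT into a no-op, which is why the
check costs nothing in files built without it.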

> 
> > Index: vfs_aio.c
> > ===
> > --- vfs_aio.c   (revision 305811)
> > +++ vfs_aio.c   (working copy)
> > @@ -787,6 +787,8 @@ aio_process_rw(struct kaiocb *job)
> >  * aio_aqueue() acquires a reference to the file that is
> >  * released in aio_free_entry().
> >  */
> > +   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
> > +   ("%s: vmspace mismatch", __func__));
> > if (cb->aio_lio_opcode == LIO_READ) {
> > auio.uio_rw = UIO_READ;
> > if (auio.uio_resid == 0)
> > @@ -1054,6 +1056,8 @@ aio_switch_vmspace(struct kaiocb *job)
> >  {
> >  
> > vmspace_switch_aio(job->userproc->p_vmspace);
> > +   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
> > +   ("%s: vmspace mismatch", __func__));
> >  }
> > 
> > If this panics, then vmspace_switch_aio() is not working for
> > some reason.
> 
> This issue occurs rarely. Would this panic be triggered only when the
> issue occurs, or on any aio request? (This is a production server.)

It would panic in the case that we are going to write into the wrong
process (so about as rare as your issue).

-- 
John Baldwin

Re: nginx and FreeBSD11

2016-09-15 Thread John Baldwin

On Thursday, September 15, 2016 05:41:03 PM Slawa Olhovchenkov wrote:
> On Wed, Sep 07, 2016 at 10:13:48PM +0300, Slawa Olhovchenkov wrote:
> 
> > I have a strange issue with nginx on FreeBSD 11.
> > I have FreeBSD 11 installed over STABLE-10.
> > nginx built for FreeBSD 10 and run without recompiling works fine.
> > nginx built for FreeBSD 11 crashes inside rbtree lookups: the next
> > node is totally corrupted.
> > 
> > I see the following potential causes:
> > 
> > 1) a clang 3.8 code generation issue
> > 2) a system library issue
> > 
> > Maybe I am missing something?
> > 
> > How can I find the real cause?
> 
> I found the real cause, and it looks like a show-stopper for RELEASE.
> I use nginx with AIO, and AIO from one nginx process corrupts memory
> in another nginx process. Yes, this is cross-process memory
> corruption.
> 
> In the latest case, the process with pid 1060 dumped core at 15:45:14.
> The corrupted memory is at 0x860697000.
> I know of good memory at 0x86067f800.
> Dumping this region (from the core) to a file and analyzing it with
> hexdump, I found the start of the corrupted region: offset c8c0 from
> 0x86067f800.
> 0x86067f800+0xc8c0 = 0x86068c0c0
> 
> I had previously enabled debug logging of each started AIO operation to
> the nginx error log (memory address, file name, offset and size of
> transfer).
> 
> grep -i 86068c0c0 error.log near 15:45:14 gives the target file.
> grep ce949665cbcd.hls error.log near 15:45:14 gives the following result:
> 
> 2016/09/15 15:45:13 [notice] 1055#0: *11659936 AIO_RD 00082065DB60 start 
> 00086068C0C0 561b0   2646736 ce949665cbcd.hls
> 2016/09/15 15:45:14 [notice] 1060#0: *10998125 AIO_RD 00081F1FFB60 start 
> 00086FF2C0C0 6cdf0 140016832 ce949665cbcd.hls
> 2016/09/15 15:45:14 [notice] 1055#0: *11659936 AIO_RD 0008216B6B60 start 
> 00086472B7C0 7ff70   2999424 ce949665cbcd.hls
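
The address arithmetic in the quoted report can be checked with a few lines
of C.  This sketch only redoes the sum and compares it with the buffer
address as it appears in the first AIO_RD log line; the helper names are made
up for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Start of the corrupted region: known-good base plus offset from hexdump. */
static uint64_t
corrupt_start(uint64_t base, uint64_t offset)
{
	return (base + offset);
}

/* Parse a buffer address as logged (hex digits, no 0x prefix). */
static uint64_t
parse_logged_addr(const char *hex)
{
	return ((uint64_t)strtoull(hex, NULL, 16));
}
```

corrupt_start(0x86067f800, 0xc8c0) equals parse_logged_addr("00086068C0C0"),
which is the address in the log line written by pid 1055, while the process
that dumped core was pid 1060.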

Does nginx only use AIO for regular files or does it also use it with sockets?

You can try using this patch as a diagnostic (you will need to
run with INVARIANTS enabled, or at least enabled for vfs_aio.c):

Index: vfs_aio.c
===
--- vfs_aio.c   (revision 305811)
+++ vfs_aio.c   (working copy)
@@ -787,6 +787,8 @@ aio_process_rw(struct kaiocb *job)
 * aio_aqueue() acquires a reference to the file that is
 * released in aio_free_entry().
 */
+   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
+   ("%s: vmspace mismatch", __func__));
if (cb->aio_lio_opcode == LIO_READ) {
auio.uio_rw = UIO_READ;
if (auio.uio_resid == 0)
@@ -1054,6 +1056,8 @@ aio_switch_vmspace(struct kaiocb *job)
 {
 
vmspace_switch_aio(job->userproc->p_vmspace);
+   KASSERT(curproc->p_vmspace == job->userproc->p_vmspace,
+   ("%s: vmspace mismatch", __func__));
 }

If this panics, then vmspace_switch_aio() is not working for
some reason.

-- 
John Baldwin

Re: 11.0-RELEASE status update

2016-09-06 Thread John Baldwin

On Thursday, September 01, 2016 02:22:04 PM Bryan Drewery wrote:
> On 9/1/2016 2:13 PM, Slawa Olhovchenkov wrote:
> > On Thu, Sep 01, 2016 at 09:10:00PM +, Glen Barber wrote:
> > 
> >> As some of you may be aware, a few last-minute showstoppers appeared
> >> since 11.0-RC1 (and before RC1).
> >>
> >> One of the showstoppers has been fixed in 12-CURRENT, and merged to
> >> stable/11 and releng/11.0 that affected booting from large volumes:
> >>
> >>  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212139
> >>
> >> There is one issue that is still being investigated, which we are
> >> classifying as an EN candidate, given the manifestations of the issue
> >> and reproducibility:
> >>
> >>  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212168
> >>
> >> There is one blocker before 11.0-RELEASE, that affects libarchive, which
> >> we are waiting for feedback.  Once feedback is received, the schedule
> >> for 11.0-RELEASE will be updated on the website to reflect reality.
> >>
> >> There are a few post-release EN items on our watch list as well, so if
> >> something was not mentioned here, that does not mean it will not be
> >> fixed in 11.0-RELEASE.
> >>
> >> Apologies for the delay, and as always, thank you for your patience.
> >>
> >> Glen
> >> On behalf of:  re@
> >>
> > 
> > 
> > Do you plan to fix the issue with the missing and deleted libmap32.conf?
> > 
> 
> This was done intentionally quite a while ago:
> https://svnweb.freebsd.org/base?view=revision=282421
> 
> Though it was later removed from ObsoleteFiles so 'make delete-old'
> would not remove it from users' systems in r282423.
> 
> etcupdate removing it is the problem really being reported here.

Mmm, etcupdate should not remove a modified file.  However, etcupdate
assumes that a file removed from /etc is supposed to be removed.  If
your libmap32.conf is unmodified then it truly is pointless since
/usr/lib32/private doesn't exist anymore in 11.
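
The rule described above can be sketched as a small decision function.  This
is an illustration of the stated behaviour only, not etcupdate's actual
implementation; the enum, the function name and the use of a plain string
comparison in place of a file comparison are assumptions.

```c
#include <string.h>

enum removal_action {
	REMOVE_FILE,	/* installed copy is unmodified: follow the removal */
	KEEP_MODIFIED	/* installed copy has local changes: leave it alone */
};

/*
 * Decide what to do with a file that the new source tree no longer ships.
 * old_stock is the content from the previous stock tree; installed is the
 * content currently on the system.
 */
static enum removal_action
on_upstream_removal(const char *old_stock, const char *installed)
{
	if (strcmp(old_stock, installed) == 0)
		return (REMOVE_FILE);
	return (KEEP_MODIFIED);
}
```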

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: ahci-timeout regression in beta3

2016-03-05 Thread John Baldwin

e ahci(4) fails when using MSI with ahcichX timeout!
> nooptions   RACCT   # Resource accounting framework
> nooptions   RACCT_DEFAULT_TO_DISABLED # Set
> kern.racct.enable=0 by default
> nooptions   RCTL# Resource limits
> 
> Perhaps it's related?!
> https://lists.freebsd.org/pipermail/freebsd-stable/2015-July/082706.html

I think it's related in the sense that there is a timing race in ahci and
that the /dev/random and RACCT changes alter the timing enough to trigger
the race simply by changing the relative order of SYSINIT's during boot
(and/or the amount of time between the ahci driver doing its initial
probe and the second probe that is run for the interrupt config hooks that
actually probes the attached SATA devices).

-- 
John Baldwin

Re: ahci-timeout regression in beta3

2016-03-02 Thread John Baldwin

On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
>  Regarding Harry Schmalzbauer's message of 28.02.2016 20:55 (localtime):
> >  Hello,
> >
> > I have a remote machine with a probably defective ODD, but until r294989
> > (from Jan 28th) I could boot with just these warnings:
> > (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
> > (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
> > (cd1:ahcich1:0:0:0): SCSI status: Check Condition
> > (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read
> > error)
> > (cd1:ahcich1:0:0:0): Error 5, Unretryable error
> > (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
> > …
> >
> > beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
> > ahcich2: Timeout on slot 11 port 0
> > ahcich2: is 0008 cs  ss  rs 0800 tfd 40 derr
> >  cmd 0004cb17
> > (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00
> > 00 00 00
> > ...
> > (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00
> > 00 00
> > (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
> > (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
> > ada1 detached
> > ...
> > The numbers (first ACB) and also the channel varies from time to time
> 
> I could narrow it down to r295480
> (https://svnweb.freebsd.org/base?view=revision=295480)
> 
> Reverting that lets the machine boot again.
> 
> I captured verbose boot messages, finding out that problem relaxes with
> verbose-booting, since ahci seems to recover:
> …
> TSC timecounter discards lower 1 bit(s)
> Timecounter "TSC-low" frequency 1746033500 Hz quality -100
> ahcich2: Timeout on slot 12 port 0
> ahcich2: is 0008 cs  ss  rs 1000 tfd 40 serr
>  cmd 0004cc17
> ahcich2: AHCI reset...
> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00
> 00 00 00
> (ada1:ahcich2:0:0:0): CAM status: Command timeout
> (ada1:ahcich2:0:0:0): Retrying command
> ahcich2: SATA connect time=100us status=0123
> ahcich2: AHCI reset: device found
> ahcich2: AHCI reset: device ready after 100ms
> ahcich1: SNTF 0x0001
> ahcich1: SNTF 0x0001
> …
> 
> I have checked twice that r295480 introduces boot failure here.
> 
> I have absolutely no idea where/how/why/what race happens...
> 
> Thanks for any hints,

That is most bizarre.  Does HEAD boot fine on this machine?  The change
in question probably alters the timing of startup a bit since the random
kthread is placed on the run queue later which might affect the relative
order of kthreads as they start executing, but that would just mean it is
exposing a race in some other part of the system.

-- 
John Baldwin

Re: ia64 10-stable about r292594: rescue crunchide *.lo unknown executable format

2016-01-29 Thread John Baldwin

On Wednesday, January 27, 2016 12:32:39 AM Anton Shterenlikht wrote:
> I asked about this already in stable@ and ia64@.
> Got no reply. Perhaps ia64 has been abandoned in
> 10-stable too? If so, I'd like to know.
> 
> If not, I'm getting:
> 
> # pwd
> /usr/obj/usr/src/rescue/rescue
> # crunchide -k _crunched_chio_stub cat.lo
> cat.lo: unknown executable format
> # crunchide -k _crunched_chio_stub chflags.lo
> chflags.lo: unknown executable format
> # crunchide -k _crunched_chio_stub chio.lo
> chio.lo: unknown executable format
> # file *lo
> cat.lo: ELF 64-bit LSB relocatable, IA-64, version 1 (FreeBSD), not 
> stripped
> chflags.lo: ELF 64-bit LSB relocatable, IA-64, version 1 (FreeBSD), not 
> stripped
> chio.lo:ELF 64-bit LSB relocatable, IA-64, version 1 (FreeBSD), not 
> stripped
> # file /usr/bin/crunchide
> /usr/bin/crunchide: ELF 64-bit LSB executable, IA-64, version 1 (FreeBSD), 
> dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 10.2 
> (1002504), stripped
> # 
> 
> This is on 10.2-STABLE #20 r292594.
> 
> I tried to buildworld up to r294823,
> and back to r291000, all with the same
> error. I cannot even build the same
> revision as the one I'm running now.
> 
> I've deleted /usr/obj completely,- still
> the same.
> 
> Please advise
> 
> Thanks

While ia64 is mostly abandoned, this build failure was fixed a few weeks ago:


r292885 | emaste | 2015-12-29 12:36:11 -0800 (Tue, 29 Dec 2015) | 4 lines

crunchide: Restore IA-64 support accidentally lost in r292421 mismerge

Reported by:ngie


-- 
John Baldwin


Re: smbfs crashes since approx. 10.1-RELEASE

2015-10-07 Thread John Baldwin

On Wednesday, October 07, 2015 08:52:30 AM Christian Kratzer wrote:
> Hi,
> 
> On Tue, 6 Oct 2015, John Baldwin wrote:
> 
> >> This crash is occurring when doing an mtx_unlock(). Unfortunately, I'm not
> >> conversant w.r.t. this code. I've cc'd jhb@ in case he has some insight.
> >> If you don't get any responses, I'd suggest reposting to freebsd-current@
> >> with "crashes in mtx_unlock()" in the subject line.
> >>
> >> Btw John, the code does tsleep() in a loop before the mtx_unlock(). I do
> >> remember that was once allowed, but am not sure if it still is (ie a
> >> tsleep() call while holding Giant)?
> >>
> >> Hopefully someone who knows what is special about Giant that might cause
> >> this will respond.
> >>
> >> Good luck with it, rick
> >
> > tsleep() with Giant is still allowed.  However, this sort of panic usually
> > means you unlocked a mutex you didn't hold (but without INVARIANTS enabled,
> > or you'd get an assertion failure earlier).
> >
> > I don't see anything obviously wrong in smb_iod_thread() however.
> >
> > If you have the crashdump, can you please run this in kgdb:
> >
> > frame 9
> > p (struct mtx *)c
> > p *(struct mtx *)c
> 
> yes I have. Here we go:
> 
> --snipp--
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x20
> fault code  = supervisor read data, page not present
> instruction pointer = 0x20:0x80996c7c
> stack pointer   = 0x28:0xfe004e79bac0
> frame pointer   = 0x28:0xfe004e79baf0
> code segment= base 0x0, limit 0xf, type 0x1b
>  = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags= resume, IOPL = 0
> current process = 12235 (smbiod172)
> trap number = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0x80984e30 at kdb_backtrace+0x60
> #1 0x809489e6 at vpanic+0x126
> #2 0x809488b3 at panic+0x43
> #3 0x80d4aadb at trap_fatal+0x36b
> #4 0x80d4addd at trap_pfault+0x2ed
> #5 0x80d4a47a at trap+0x47a
> #6 0x80d307f2 at calltrap+0x8
> #7 0x8092ebe0 at __mtx_unlock_sleep+0x60
> #8 0x8092eb69 at __mtx_unlock_flags+0x69
> #9 0x81a1b724 at smb_iod_thread+0xb4
> #10 0x8091244a at fork_exit+0x9a
> #11 0x80d30d2e at fork_trampoline+0xe
> Uptime: 1d18h34m4s
> Dumping 161 out of 999 MB:..10%..20%..30%..40%..50%..60%..70%..80%..90%..100%
> 
> Reading symbols from /boot/kernel/smbfs.ko.symbols...done.
> Loaded symbols for /boot/kernel/smbfs.ko.symbols
> Reading symbols from /boot/kernel/libiconv.ko.symbols...done.
> Loaded symbols for /boot/kernel/libiconv.ko.symbols
> Reading symbols from /boot/kernel/libmchain.ko.symbols...done.
> Loaded symbols for /boot/kernel/libmchain.ko.symbols
> #0  doadump (textdump=) at pcpu.h:219
> 219 pcpu.h: No such file or directory.
>  in pcpu.h
> (kgdb) frame 9
> #9  0x8092ebe0 in __mtx_unlock_sleep (c=0xf8002f531790, 
> opts=,
>  file=0x81a25801 "%s: Can't handle disordered parameters 
> %d:%d\n", line=1) at /usr/src/sys/kern/kern_mutex.c:791
> 791 /usr/src/sys/kern/kern_mutex.c: No such file or directory.
>  in /usr/src/sys/kern/kern_mutex.c
> Current language:  auto; currently minimal
> (kgdb) p (struct mtx *)c
> $1 = (struct mtx *) 0xf8002f531790
> (kgdb) p *(struct mtx *)c
> $2 = {lock_object = {lo_name = 0x6 , lo_flags = 0, 
> lo_data = 0, lo_witness = 0xf8002f531798},
>mtx_lock = 1444181401}

Ok, so that is a destroyed mutex.  This means it is probably not Giant, and
it might be some mutex in smb_iod_main() that shows up in smb_iod_thread() due
to inlining.

Actually, we know this from your earlier mail:

if (evp->ev_type & SMBIOD_EV_SYNC) {
SMB_IOD_EVLOCK(iod);
wakeup(evp);
SMB_IOD_EVUNLOCK(iod);

Line 624 is that SMB_IOD_EVUNLOCK().

Hmm, does 'p *evp' work at frame 10?  If not, can you try building the
devel/gdb port from a recent ports tree with the 'KGDB' option enabled and
use 'kgdb710' instead of 'kgdb' to see if you can print out '*evp'?

> (kgdb)
> --snipp--
> 
> I can build a GENERIC kernel with INVARIANTS enabled on the box to see if we
> get better assertions next time this happens.

That would be great, but please keep the existing core and kernel.  We might
be able to figure this out from that stil

Re: smbfs crashes since approx. 10.1-RELEASE

2015-10-06 Thread John Baldwin

On Monday, October 05, 2015 06:16:54 PM Rick Macklem wrote:
> Christian Kratzer wrote:
> > Hi,
> > 
> > I run a regular rsync job that runs from cron and copies stuff that gets
> > created on a Windows smbfs share.
> > 
> > Starting about 10.1-RELEASE the VM has become unstable and started panicking.
> > 
> > I have narrowed the issue down to the aforementioned rsync job.
> > 
> > When I move the job to a different VM, the other VM starts crashing and
> > the VM without the job becomes stable again.
> > 
> > I have panics and crashinfos stored in /var/crash if anybody is interested:
> > 
> >  root@noc2:/var/crash # uname -a
> >  FreeBSD noc2.cksoft.de 10.2-RELEASE FreeBSD 10.2-RELEASE #0 r28: 
> > Wed
> >  Aug 12 15:26:37 UTC 2015
> >  r...@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
> >  root@noc2:/var/crash # freebsd-version -u
> >  10.2-RELEASE-p5
> >  root@noc2:/var/crash # freebsd-version -k
> >  10.2-RELEASE
> >  root@noc2:/var/crash #
> > 
> > This is what I have in /var/crash/core.txt.0
> > 
> >  Fatal trap 12: page fault while in kernel mode
> >  cpuid = 0; apic id = 00
> >  fault virtual address   = 0x20
> >  fault code  = supervisor read data, page not present
> >  instruction pointer = 0x20:0x80996c7c
> >  stack pointer   = 0x28:0xfe003d6c0ac0
> >  frame pointer   = 0x28:0xfe003d6c0af0
> >  code segment= base 0x0, limit 0xf, type 0x1b
> > = DPL 0, pres 1, long 1, def32 0, gran 1
> >  processor eflags= resume, IOPL = 0
> >  current process = 1349 (smbiod10)
> >  trap number = 12
> >  panic: page fault
> >  cpuid = 0
> >  KDB: stack backtrace:
> >  #0 0x80984e30 at kdb_backtrace+0x60
> >  #1 0x809489e6 at vpanic+0x126
> >  #2 0x809488b3 at panic+0x43
> >  #3 0x80d4aadb at trap_fatal+0x36b
> >  #4 0x80d4addd at trap_pfault+0x2ed
> >  #5 0x80d4a47a at trap+0x47a
> >  #6 0x80d307f2 at calltrap+0x8
> >  #7 0x8092ebe0 at __mtx_unlock_sleep+0x60
> >  #8 0x8092eb69 at __mtx_unlock_flags+0x69
> >  #9 0x81a1b724 at smb_iod_thread+0xb4
> >  #10 0x8091244a at fork_exit+0x9a
> >  #11 0x80d30d2e at fork_trampoline+0xe
> >  Uptime: 2h43m55s
> >  Dumping 103 out of 999 MB: (CTRL-C to abort)
> >  ..16%..31%..47%..62%..78%..93%
> > 
> This crash is occurring when doing an mtx_unlock(). Unfortunately, I'm not
> conversant w.r.t. this code. I've cc'd jhb@ in case he has some insight.
> If you don't get any responses, I'd suggest reposting to freebsd-current@ with
> "crashes in mtx_unlock()" in the subject line.
> 
> Btw John, the code does tsleep() in a loop before the mtx_unlock(). I do
> remember that was once allowed, but am not sure if it still is (ie a tsleep()
> call while holding Giant)?
> 
> Hopefully someone who knows what is special about Giant that might cause this
> will respond.
> 
> Good luck with it, rick

tsleep() with Giant is still allowed.  However, this sort of panic usually means
you unlocked a mutex you didn't hold (but without INVARIANTS enabled, or you'd
get an assertion failure earlier).

I don't see anything obviously wrong in smb_iod_thread() however.

If you have the crashdump, can you please run this in kgdb:

frame 9
p (struct mtx *)c
p *(struct mtx *)c

-- 
John Baldwin

Re: suspend/resume regression

2015-07-29 Thread John Baldwin

On Saturday, July 25, 2015 03:54:40 PM Kevin Oberman wrote:
> John,
> 
> I'm concerned that two issues may be getting conflated.
> 
> The issue I thought we were looking at was the failure of some systems
> (T520, X220, T430) to resume after a number of PCI enhancements were MFCed.
> This is completely unrelated to the USB issue I was experiencing when
> trying to test the problem on HEAD. The more I think about it, the more I
> think that the USB issue is just how things need to work.

Well, the USB thing could be smarter, but it's a bit of a PITA.  What if
you take the USB stick out, mess with it on another system, then plug
it back in before resume?  All the cached file data in the RAM of the
resumed system would need to be invalidated, etc.

However, I ended up copying a HEAD kernel onto my USB stick and seeing 
that I at least got the console back before it panic'd.  This was sufficient
to let me test the reversion patch via the USB stick (and would be sufficient
for seeing if we can merge it again for 10.3).

> The real issue is just resuming the system after r281874 was MFCed as a
> part of 284034. No USB connected file systems are involved. I'm happy to
> see that it has been reverted for 10.2, but clearly, these changes are
> needed down the line and I hope the issue can be resolved well before 11.0.
> (This assumes a 10.3 before 11.0 happens next year.)

So it works fine in 11.0 on my x220, as other folks also reported in the PR,
so 11.0 is fine.  It is also needed for PCI-e hotplug to work after resume
(using out-of-tree patches for PCI-e hotplug that jmg@ has).  If I merge it
to 10.3 it won't be until I've verified that whatever I merge works on my
x220 as well as the T440.

-- 
John Baldwin

Re: suspend/resume regression

2015-07-22 Thread John Baldwin

On Saturday, July 18, 2015 10:22:33 PM Kevin Oberman wrote:
> I just confirmed that my system resumes on HEAD of July 16 but fails on
> 10.2-BETA2. So the problem is limited to 10. I'm guessing that some other
> change made to pci that has not been MFCed is the cause, but it is only
> causing a problem on some hardware. I have seen no reports about systems
> other than Lenovo systems.

So my x220 does fail with a USB disk on 10, but I also get a weird behavior
where it seems to wake up (disk lights up) and then goes back to sleep and
never resumes again.  I'm not sure if this is due to using a USB disk or
not.  I get the same result when I disable power management during suspend
which was reported to fix other laptops IIRC.

Please try this:

Index: sys/dev/acpica/acpi.c
===
--- sys/dev/acpica/acpi.c   (revision 285761)
+++ sys/dev/acpica/acpi.c   (working copy)
@@ -691,7 +691,7 @@
 static void
 acpi_set_power_children(device_t dev, int state)
 {
-   device_t child, parent;
+   device_t child;
device_t *devlist;
struct pci_devinfo *dinfo;
int dstate, i, numdevs;
@@ -703,13 +703,12 @@
 * Retrieve and set D-state for the sleep state if _SxD is present.
 * Skip children who aren't attached since they are handled separately.
 */
-   parent = device_get_parent(dev);
for (i = 0; i < numdevs; i++) {
child = devlist[i];
dinfo = device_get_ivars(child);
dstate = state;
if (device_is_attached(child) &&
-   acpi_device_pwr_for_sleep(parent, dev, &dstate) == 0)
+   acpi_device_pwr_for_sleep(dev, child, &dstate) == 0)
acpi_set_powerstate(child, dstate);
}
free(devlist, M_TEMP);
Index: sys/dev/pci/pci.c
===
--- sys/dev/pci/pci.c   (revision 285761)
+++ sys/dev/pci/pci.c   (working copy)
@@ -3671,7 +3671,7 @@
child = devlist[i];
dstate = state;
if (device_is_attached(child) &&
-   PCIB_POWER_FOR_SLEEP(pcib, dev, &dstate) == 0)
+   PCIB_POWER_FOR_SLEEP(pcib, child, &dstate) == 0)
pci_set_powerstate(child, dstate);
}
 }
Index: .
===
--- .   (revision 285761)
+++ .   (working copy)

Property changes on: .
___
Modified: svn:mergeinfo
   Merged /head:r274386,274397


-- 
John Baldwin

Re: suspend/resume regression

2015-07-15 Thread John Baldwin

On Tuesday, July 14, 2015 03:10:59 PM Brandon J.  Wandersee wrote:
 
 Please forgive me if this seems impudent, but has there been any progress on
 this? The status of the bug report hasn't changed since it was opened. I
 don't mean to be rude, and I certainly appreciate the effort that's gone
 into this already (especially Kevin's detective work), but support for
 suspend-to-RAM and my laptop's hotkeys were essentially the only reasons
 I started tracking 10-STABLE to begin with. Since both features were
 resolved many months ago, I was hoping to switch from -STABLE to 10.2-RELEASE
 when it came out, but I'm starting to get the feeling that won't happen
 because of a single errant commit. Having to continue following -STABLE
 would not be terrible, but it would be disappointing.

As noted previously, I have been moving house and generally offline since
mid-June (and I'm not really fully online yet).  My last request was if
Kevin (or someone else with an affected laptop) could test HEAD to see if
there is a missing bugfix on HEAD that needs to be merged.  This specific
change was tested on HEAD on both a T440 and X220 and on 10 to test the
MFC on the T440.

-- 
John Baldwin

Re: suspend/resume regression

2015-06-30 Thread John Baldwin

I'm traveling and AFK for a week or so more, but I did test this MFC including 
suspend/resume with CardBus, etc. on a T440 before committing it.  It would be 
good to know if HEAD works for you.  If it does then there's likely another fix 
from HEAD that you need merged.

-- 
John Baldwin

 On Jun 29, 2015, at 00:54, Kevin Oberman rkober...@gmail.com wrote:
 
 On Sun, Jun 28, 2015 at 11:07 PM, Adrian Chadd adrian.ch...@gmail.com 
 wrote:
 Ok, so which subset of changes is the culprit?
 
 (sorry, I'm tired.. :( )
 The merge of 281874 broke it. Unfortunately, this is a fairly large and 
 important change that touches five files, mainly dev/pci/pci.c and 
 dev/pci/pci_pci.c with a less significant update to dev/pccbb/pccbb_pci.c.
 
 Get some rest. This is an annoying regression, but not disastrous. Systems 
 still run and it sounds like many still resume. Unfortunately my T520 and 
 some contemporary ThinkPads don't.
 
 I now have enough data to open a fairly coherent ticket. I'll try to open it 
 tomorrow. (I'm tired, too.)
 --
 Kevin Oberman, Network Engineer, Retired
 E-mail: rkober...@gmail.com
 PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683
 
 
 -a
 
 
 On 28 June 2015 at 22:45, Kevin Oberman rkober...@gmail.com wrote:
  On Sun, Jun 28, 2015 at 4:54 PM, Kevin Oberman rkober...@gmail.com wrote:
 
  On Sun, Jun 28, 2015 at 10:38 AM, Joseph Mingrone j...@ftfl.ca wrote:
 
  Adrian Chadd adrian.ch...@gmail.com writes:
   ok. I've updated my x230 to the latest -head and it is okay at
   suspend/resume.
 
  No problem with -head on the X220 as well.
 
   I can go acquire an x220 (now that they're cheap) to have as another
   reference laptop.
 
  You might ping Allan Jude.  If I'm not mistaken he had at least two
  X220s at BSDCan.  Maybe he'd be willing to part with one.
 
 
  I have now merged all of the parts of 284034 except for 281874 and resume
  works correctly. As I suspected, something in that rather large commit is
  the problem and it is probably something that is tied to some other change
  in HEAD as Adrian has reported that it works fine in HEAD.
 
  I'll have to admit that I have no idea how to approach figuring this out.
  I'm not sure how I can even revert a part of the commit to get
  10.2-PRERELEASE working for me. I really wish that a commit as large as 
  this
  one had been MFCed separately. :-(  So far there has been only a single
  commit to pci and none to pccbb since 284034, so I built stable with the
  files modified in 281874 manually reverted.
 
 
  I now have r284916M running and it seems to be working fine. All of 284034
  committed except for the MFC from 281874. That left three files conflicting
  with STABLE:
  /usr/src/sys/dev/pci/pci.c
  /usr/src/sys/dev/pci/pci_pci.c
  /usr/src/sys/dev/pccbb/pccbb_pci.c
  --
  Kevin Oberman, Network Engineer, Retired
  E-mail: rkober...@gmail.com
  PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683
 
 

Re: Build failed in Jenkins: FreeBSD_stable_9 #729

2015-03-31 Thread John Baldwin

On Tuesday, March 31, 2015 05:02:04 PM jenkins-ad...@freebsd.org wrote:
 See https://jenkins.freebsd.org/job/FreeBSD_stable_9/729/changes
 
 Changes:
 
 [jhb] MFC 278760:
 Add two new counters for vnode life cycle events:
 - vfs.recycles counts the number of vnodes forcefully recycled to avoid
   exceeding kern.maxvnodes.
 - vfs.vnodes_created counts the number of vnodes created by successful
   calls to getnewvnode().

The actual error is unrelated (and also not in this really long e-mail).  It
appears to be:

mv -f dtparserparse.h dtparser.y.h
mv: rename dtparserparse.h to dtparser.y.h: No such file or directory
*** [dtparser.y.h] Error code 1

I suspect this is some sort of race with -j?

-- 
John Baldwin

Re: RELENG_10 performance regression (was Re: 35-40% performance drop releng9 vs releng10 openvpn

2015-03-21 Thread John Baldwin

On 3/21/15 12:31 PM, Adrian Chadd wrote:
 On 21 March 2015 at 08:52, John Baldwin j...@freebsd.org wrote:
 On 3/20/15 8:46 PM, Mike Tancsa wrote:
 On 3/20/2015 8:15 PM, Konstantin Belousov wrote:

 For the purpose of devfs, does it make sense to bump timestamps like
 normal filesystems for each read/write operation?  Looks like Mac OS X
 will bump timestamps for each operation but Debian doesn't.

 First question is, what timecounter hardware is used.  I would accept
 some slowdown from hardware like HPET, but it is indeed surprising
 if caused by TSC.



 David Wolfskill suggested trying the problem commit with

 vfs.timestamp_precision=0

 and it does indeed restore performance to what it was.  The raw dtrace
 files are available and FlameGraphs can all be found at

 http://tancsa.com/time/

 Do you know why you are using the HPET instead of TSC for timestamping?
 Using the TSC can make a non-trivial performance difference since userland
 can calculate timestamps without using system calls when it is used.
 (That is not related to this case, but switching to the TSC in general is
 preferable.)

 There are a few generations of Intel CPUs where you can't mix deeper sleep
 states with the TSC as timecounter, but those CPUs are getting to be a bit
 older at this point.
 
 What about various VMs?

It depends on the hypervisor.  bryanv@ is working on bits to allow us to use
very cheap timecounters under KVM for example (if that isn't already in the
tree).  I think bhyve permits guests to use the TSC already.  I think when we
talked about this on arch@ before the change was made folks felt that even
many embedded systems would have some sort of relatively cheap cycle counter,
especially going forward.

It may be that we end up picking a different default for guests as we do for
'hz' (though that has its downsides.  Luigi has noted that one of the things
he has to do to fix network performance in VMs is undo that and raise hz back
to 1000).  However, for bare metal I'd like to figure out why folks aren't
using the TSC and fix those if possible.

-- 
John Baldwin

Re: RELENG_10 performance regression (was Re: 35-40% performance drop releng9 vs releng10 openvpn

2015-03-21 Thread John Baldwin

On 3/20/15 8:46 PM, Mike Tancsa wrote:
 On 3/20/2015 8:15 PM, Konstantin Belousov wrote:

 For the purpose of devfs, does it make sense to bump timestamps like
 normal filesystems for each read/write operation?  Looks like Mac OS X
 will bump timestamps for each operation but Debian doesn't.

 First question is, what timecounter hardware is used.  I would accept
 some slowdown from hardware like HPET, but it is indeed surprising
 if caused by TSC.


 
 David Wolfskill suggested trying the problem commit with
 
 vfs.timestamp_precision=0
 
 and it does indeed restore performance to what it was.  The raw dtrace 
 files are available and FlameGraphs can all be found at
 
 http://tancsa.com/time/

Do you know why you are using the HPET instead of TSC for timestamping?
Using the TSC can make a non-trivial performance difference since userland
can calculate timestamps without using system calls when it is used.
(That is not related to this case, but switching to the TSC in general is
preferable.)

There are a few generations of Intel CPUs where you can't mix deeper sleep
states with the TSC as timecounter, but those CPUs are getting to be a bit
older at this point.

-- 
John Baldwin

Re: savecore problem

2015-03-16 Thread John Baldwin

On Monday, March 16, 2015 10:17:54 AM Brandon Allbery wrote:
 On Mon, Mar 16, 2015 at 9:40 AM, Michael BlackHeart amdm...@gmail.com
 wrote:
 
  Hello there. I've got a problem. Recently my personal server issued a
  kernel panic. Then there's a dump and so on. But there's no dump
  information after reboot. I do not know what was really the panic cause but
  assume that savecore failed because of RAID.
 
  Problem - minidump was done (I saw it was) but was not recovered by
  savecore after reboot into /var/crash
 
 (...)
 
  /dev/ufs/varfs  /var  ufs  rw,noatime  2  2
 
 
 Last I checked, savecore had to happen very early --- before filesystems
 other than / are mounted.

No, it can happen after that.  What really has to happen is that you don't
use swap (if you are dumping to your swap partition) before savecore runs.

-- 
John Baldwin

Re: HP EliteBook EFI boot failure

2015-03-16 Thread John Baldwin

On Sunday, March 15, 2015 02:28:41 PM Oliver Pinter wrote:
 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=194063

I am curious if the redzone fix I committed to the EFI loader last week
might help.  It was noticed because gzipped kernels were corrupted when
loaded from disk, but it might generate other random corruption even in
the non-gzip case.  I think the chance that it helps is low, but it
isn't quite zero.

-- 
John Baldwin

Re: savecore problem

2015-03-16 Thread John Baldwin

On Monday, March 16, 2015 11:54:52 AM Michael Jung wrote:
 On 2015-03-16 11:23, John Baldwin wrote:
  On Monday, March 16, 2015 10:17:54 AM Brandon Allbery wrote:
  On Mon, Mar 16, 2015 at 9:40 AM, Michael BlackHeart 
  amdm...@gmail.com
  wrote:
  
   Hello there. I've got a problem. Recently my personal server issued a
   kernel panic. Then there's a dump and so on. But there's no dump
   information after reboot. I do not know what was really the panic cause 
   but
   assume that savecore failed because of RAID.
  
   Problem - minidump was done (I saw it was) but was not recovered by
   savecore after reboot into /var/crash
  
  (...)
  
   /dev/ufs/varfs  /var  ufs  rw,noatime  2  2
  
  
  Last I checked, savecore had to happen very early --- before 
  filesystems
  other than / are mounted.
  
  No, it can happen after that.  What really has to happen is that you 
  don't
  use swap (if you are dumping to your swap partition) before savecore 
  runs.
 
 Can someone elaborate on not using swap as a dump device a little more? 
 I have
 had instances in the past where I had issues with getting a core dump
 and resorted to a dedicated dump device but didn't investigate further 
 nor have
 I read this as a requirement.

Typically the first swap partition is used as the dump partition.  If the
system writes anything out to swap before savecore runs, then it can
potentially overwrite part of the core.  (Note that the running kernel doesn't
know that there is a core on the swap partition to try to preserve, it just
sees that there is an available swap partition.)  To try to minimize the 
chances of
this happening, the dump is written at the end of the swap partition instead
of the start, but that is not foolproof.  Usually you don't run too many things
during early boot before savecore that would cause swapping, though a fsck
of a large filesystem might use quite a bit of RAM which could result in 
swapping.

 A second question - Can USB devices be used reliably for a dump device 
 for
 ZFS on boot systems?

I'm not sure if USB devices will work as a dump device or not.

-- 
John Baldwin

Re: On-going laptop brightness issues

2015-03-13 Thread John Baldwin

On Thursday, March 12, 2015 06:19:27 PM Kevin Oberman wrote:
 On Thu, Mar 12, 2015 at 12:40 PM, Adrian Chadd adr...@freebsd.org wrote:
 
  I thought jhb already mfc'ed it?
 
 
  -a
 
 
 
 Adrian,
 
 jhb asked that I ask you to MFC. I did so and, since you declined (as is
 your right and I understand that an MFC takes a fair bit of time to do
 correctly), I am hoping someone who has a commit bit and some spare time
 will take this one.

I was going to merge it, but there is another bug report where these changes
broke a different system, see the followups here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=190186

Hmm, looking at the bug report about the hang itself:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=193500

it seems that there is a workaround at least, but not yet a fix.  However,
since there is a viable workaround the gain from merging this probably
outweighs the downside.

-- 
John Baldwin

Re: There has to be a better way of merging /etc during a major freebsd-update

2015-03-12 Thread John Baldwin

On Wednesday, March 11, 2015 10:19:33 AM Peter Olsson wrote:
 On Tue, Mar 10, 2015 at 10:06:37PM +, Ben Morrow wrote:
  Quoth Peter Olsson list-freebsd-sta...@jyborn.se:
  
   (But I will try running freebsd-update without merging /etc,
   and use mergemaster -F instead. Should solve my problem.)
  
  I'm fairly sure this won't do what you want, and in fact won't work at
  all, unless your /etc is identical to the stock /etc installed from the
  ISO. (Which it isn't, of course.)
  
  installworld specifically avoids installing the files in /etc; then,
  when you run mergemaster, it installs the new versions of those files
  into a temporary directory and merges them with the existing /etc. 
  
  freebsd-update works a little differently: because it doesn't have a
  source tree available, it has to fetch the stock versions of the files
  in /etc for the release you're upgrading from, so that it can patch them
  to the new release and then merge the changes into your current /etc. If
  you tell freebsd-update to install /etc without merging it will blindly
  update files you haven't changed (which is probably what you want) but
  (I think) will fail to update the files that you have changed, because
  it uses binary patches which won't apply to your modified versions.
  
  If you want a rather hackish solution, you could try something like
  this:
  
  - Rename /etc to /oldetc.
  - Find yourself a copy of the stock /etc for the version you are
upgrading from. (tar -xpf base.txz --include /etc)
  - Run freebsd-update with /etc removed from the merge list. This
will (should?) give you a stock /etc for the version you are
upgrading to.
  - Rename /etc -> /tmp/etc, /oldetc -> /etc and run mergemaster with
-t /tmp.
  
  Obviously I would script this if I was doing more than one or two
  machines
  
  Ben
 
 I'm not really clear on what will happen if I remove /etc/ from
 MergeChanges in freebsd-update.conf. Will my /etc then be ignored
 by freebsd-update, or will my /etc be completely overwritten by
 freebsd-update?
 
 Anyway, your hack could be useful to me. There are no more than
 about ten files I usually change in /etc, so saving the current
 /etc, installing a stock /etc, running freebsd-update and then
 running diff -r to sort out my changes could work. But I'm a
 little worried about removing my /etc changes from a running server.

BTW, this is kind of how etcupdate works (except that it does a full 3-way
merge unlike mergemaster since it keeps the previous /etc around to compare
with the new /etc and apply the diffs to the real /etc).  It even has a mode
to allow it to generate tarballs on the build machine that can then be used
in place of having a source tree during upgrades so that freebsd-update could
be changed to ship the updated bundle on each update.  However, I haven't had
time to look at what it would take to update freebsd-update to do this (and
freebsd-update would have to include building the tarballs in its upstream
build process as well).

OTOH, if you ask freebsd-update to update your source tree after each
update, you can use etcupdate to manage /etc instead of using freebsd-update.
(Note that starting with 10.1 and 9.3 etcupdate is in base now and new
releases ship with an initial etcupdate database that matches the release
ISOs).

The (completely untested) process might go something like this:

Before your next freebsd-update run, ensure etcupdate is setup:

1) See if etcupdate already works by running 'etcupdate diff' and seeing if
   you get a sane diff.  If you get a nice diff (without lots of noise like
   $FreeBSD$ changes), skip to step 3.

2) Ensure you have an up-to-date source tree with your current world.
   Run 'etcupdate extract'.  'etcupdate diff' should now give you a
   reasonable diff of your changes to /etc files.  (Note that it does not
   show new files like /etc/fstab, just changes to files installed by
   a clean install.)

3) Review the output of 'etcupdate diff'.  If there are local changes that
   are not correct, you can edit the files in /etc to reduce the diffs.
   If you want to restore a file to its original state, you can use
   'cp /var/db/etcupdate/current/etc/foo /etc/foo' (I will someday add
   an 'etcupdate revert' command for this)

4) Ensure that freebsd-update is set to update your source tree on each
   update and to not do any /etc merges

After your next run of 'freebsd-update', run 'etcupdate' to merge in any
changes to '/etc'.  It can generally cope with simple merges similar to
'svn up'.  If it encounters a conflict, it saves off a copy of the file
with conflict markers for you to resolve via 'etcupdate resolve' but leaves
the old file in /etc untouched until you resolve the conflict.

-- 
John Baldwin

Re: Suspected libkvm infinite loop

2015-03-12 Thread John Baldwin

On Wednesday, March 11, 2015 02:00:41 PM Nick Frampton wrote:
 On 11/03/15 07:59, Mark Johnston wrote:
  On Tue, Mar 10, 2015 at 02:10:09PM -0400, John Baldwin wrote:
  Often loops using libkvm are due to programs trying to 
  read
  kernel data structures while they are changing.  However, if you use 
  sysctls
  to fetch this data instead, you should be able to get a stable snapshot of 
  the
  system state without getting stuck in a possible loop.  I believe for 
  libkvm
  to use sysctl instead of /dev/kmem you have to pass a NULL for the kernel 
  and
  /dev/null for the core image.
 
 In our code, we're invoking kvm_openfiles as you suggest:
 kd = kvm_openfiles (NULL, _PATH_DEVNULL, NULL, O_RDONLY, errbuf)
 
 
  It sounds like this issue might be the one fixed in r272566: if the
  KERN_PROC_ALL sysctl is read with an insufficiently large buffer, an
  sbuf error return value could bubble up and be treated as ERESTART,
  resulting in a loop.
 
  This can be confirmed with something like
 
 dtrace -n 'syscall:::entry /pid == $target/{@[probefunc] = count();} 
  tick-3s {exit(0);}' -p <pid of looping proc>
 
  If the output consists solely of __sysctl, this bug is likely the
  culprit.
 
 Unfortunately, I accidentally killed fstat this morning before I could do any 
 further debug.
 
 I ran truss -p on it yesterday and it was spinning solely on __sysctl.
 
 I'll try compiling with debug symbols in case it happens again. I haven't 
 been able to reproduce the 
 problem in a reasonable time frame so it could be days or weeks before we see 
 it happen again.

The truss output is consistent with Mark's suggestion, so I would try
his suggested fix of 272566.

-- 
John Baldwin

Re: Suspected libkvm infinite loop

2015-03-12 Thread John Baldwin

On Thursday, March 12, 2015 12:40:23 PM Konstantin Belousov wrote:
 On Wed, Mar 11, 2015 at 09:34:07PM -0700, Mark Johnston wrote:
  On Thu, Mar 12, 2015 at 02:05:32PM +1000, Nick Frampton wrote:
   On 12/03/15 00:38, John Baldwin wrote:
It sounds like this issue might be the one fixed in r272566: if the
 KERN_PROC_ALL sysctl is read with an insufficiently large buffer, 
 an
 sbuf error return value could bubble up and be treated as ERESTART,
 resulting in a loop.
 
 This can be confirmed with something like
 
 dtrace -n 'syscall:::entry/pid == $target/{@[probefunc] = 
  count();} tick-3s {exit(0);}' -p <pid of looping proc>
 
 If the output consists solely of __sysctl, this bug is likely the
 culprit.

Unfortunately, I accidentally killed fstat this morning before I 
could do any further debug.

I ran truss -p on it yesterday and it was spinning solely on __sysctl.

I'll try compiling with debug symbols in case it happens again. I 
haven't been able to reproduce the
problem in a reasonable time frame so it could be days or weeks 
before we see it happen again.
The truss output is consistent with Mark's suggestion, so I would try
his suggested fix of 272566.
   
   I patched the 10.1 kernel with r272566 and it appears to have fixed the 
   issue. Is this patch likely 
   to be MFCed back to 10-stable?
  
  I can't see any reason it shouldn't be, and there was an MFC reminder in
  the commit log entry for that revision. I've cc'ed kib@, who might have a
  reason.
 
 The mentioned commit depends on r271976, in fact it depends on the series of
 commits, including r271486 and r271489.
 
 I did not merge r271976 with manual resolution of the conflicts, since it
 means that the work done for HEAD needs to be redone for stable/10 to
 ensure that all cases are covered.  Later, when the mentioned series is
 merged, the work should be redone once more.
 
 And to note, r271489 is not trivially mergeable as well, just checked.

You could merge r272566 and just fixup the sbuf_bcat() in export_fd_to_sb()
in kern_descrip.c instead.  I hadn't really considered fo_fill_kinfo to be
something that was mergeable to 10.

-- 
John Baldwin

Re: Suspected libkvm infinite loop

2015-03-10 Thread John Baldwin

On Tuesday, March 10, 2015 10:17:07 AM Nick Frampton wrote:
 Hi,
 
 For the past several months, we have had an intermittent problem where a
 process calling kvm_openfiles(3) or kvm_getprocs(3) (not sure which) gets
 stuck in an infinite loop and goes to 100% cpu. We have just observed
 fstat -m do the same thing and suspect it may be the same problem.
 
 Our environment is a 10.1-RELEASE-p6 amd64 guest running in VirtualBox, with
 ufs root and zfs /home.
 
 Has anyone else experienced this? Is there anything we can do to investigate
 the problem further?

Often loops using libkvm are due to programs trying to read 
kernel data structures while they are changing.  However, if you use sysctls 
to fetch this data instead, you should be able to get a stable snapshot of the 
system state without getting stuck in a possible loop.  I believe for libkvm 
to use sysctl instead of /dev/kmem you have to pass a NULL for the kernel and 
/dev/null for the core image.  fstat -m should be doing that by default 
however, so if it is not that, can you ktrace fstat when it is spinning to see 
if it is spinning in userland or in the kernel?  If you see no activity via 
ktrace, then it is spinning in one of the two places without making any system 
calls, etc.  You can attach to it with gdb to pause it, then see where gdb 
thinks it is.  If gdb hangs attaching to it, then it is stuck in the kernel.  

If gdb attaches to it ok, then it is spinning in userland.  Unfortunately, for 
gdb to be useful, you really need debug symbols.  We don't currently provide 
those for release binaries or binaries provided via freebsd-update (though 
that is being worked on for 11.0).  If you build from source, then the 
simplest way to get this is to add 'WITH_DEBUG_FILES=yes' to /etc/src.conf and 
rebuild your world without NO_CLEAN.  If you are building from source and are 
able to reproduce with those binaries, then after attaching to the process 
with gdb, use 'bt' to see where it is hung and reply with that.

If it is hanging in the kernel, then you will need to use the kernel debugger 
to see where it is hanging.  The simplest way to do this is probably to force 
a crash via the debug.kdb.panic sysctl (set it to a non-zero value).  You will 
then need to fire up kgdb on the crash dump after it reboots, switch to the 
fstat process via the 'proc pid' command and get a backtrace via 'bt'.

-- 
John Baldwin

Re: 9.2-PRERELEASE #0 r254557 amd64: core dump on shutdown

2013-09-12 Thread John Baldwin

On Thursday, September 12, 2013 1:29:40 am Marko Cupać wrote:
 On Wed, 11 Sep 2013 11:11:24 -0400
 John Baldwin j...@freebsd.org wrote:
 
  Is this reproducible?
 
 It happened a few times before (maybe 3-4 times this year), but I can't
 reproduce it intentionally.

Hmm, I'm tempted to chalk this up to a hardware failure then.

-- 
John Baldwin

Re: 9.2-PRERELEASE #0 r254557 amd64: core dump on shutdown

2013-09-11 Thread John Baldwin

On Tuesday, September 10, 2013 10:50:55 am Marko Cupać wrote:
 My 9.2-PRERELEASE #0 r254557 amd64 just dumped core on shutdown. I
 updated src to Last Changed Rev: 255395 two days ago but did not get
 to rebuild world and kernel. Also I did not rebuild any ports since.
 Virtualbox was not running.
 
 pacija@kaa:/var/crash % uname -a
 FreeBSD kaa.mimar.rs 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0 r254557: Sun 
Aug 25 22:44:52 CEST 2013 pac...@kaa.mimar.rs:/usr/obj/usr/src/sys/KAAGEN  
amd64
 
 pacija@kaa:/var/crash % sudo cat core.txt.2 
 kaa.mimar.rs dumped core - see /var/crash/vmcore.2
 
 Tue Sep 10 16:41:45 CEST 2013
 
 FreeBSD kaa.mimar.rs 9.2-PRERELEASE FreeBSD 9.2-PRERELEASE #0 r254557: Sun 
Aug 25 22:44:52 CEST 2013 pac...@kaa.mimar.rs:/usr/obj/usr/src/sys/KAAGEN  
amd64
 
 panic: page fault

Is this reproducible?

 #6  0x80cdc843 in calltrap ()
 at /usr/src/sys/amd64/amd64/exception.S:232
 #7  0x80b71085 in swapoff_one (sp=0xfe0006296600, 
 cred=0xfe00037a0e00) at /usr/src/sys/vm/swap_pager.c:1753

Relevant line is:

1753            for (swap = swhash[i]; swap != NULL; swap = swap->swb_hnext) {

-- 
John Baldwin

Re: unexpected idprio 31 behavior on 9.2-BETA2 and 9.2-RC1

2013-09-04 Thread John Baldwin

On Thursday, August 08, 2013 10:41:12 am Eric van Gyzen wrote:
 On 08/08/2013 09:19, Eric van Gyzen wrote:
  On 08/06/2013 14:23, J David wrote:
  On Tue, Aug 6, 2013 at 1:59 PM, Eric van Gyzen e...@vangyzen.net wrote:
  on an otherwise idle amd64 system with 4 CPUs.  The first command in 
the
  build.log file:
 
  rm -rf /usr/obj/home/freebsd/tmp
 
  took over three minutes.  It should have taken about three /seconds/.
 
  uptime reported a load average of around 1.00.
  top showed no threads (user or kernel) using CPU.
  iostat showed an average of less than 20 tps on ada0.
  rm was usually in the RUN state.
  We are looking at something similar.  Would you be able to try to
  reproduce it using a kernel with:
 
  nooptions  SCHED_ULE
  options    SCHED_4BSD
 
  to see if it makes a difference?  It seems to, but the problem is
  inconsistent enough that I can't be sure.
  The 4BSD scheduler does /not/ exhibit this problem.  I tested with the
  latest releng/9.2 (r254054) and an otherwise GENERIC config.
 
 To be thorough, I built a GENERIC kernel at the same rev, and it still
 exhibits the problem.

Please try this change:

Index: sched_ule.c
===
--- sched_ule.c (revision 255020)
+++ sched_ule.c (working copy)
@@ -243,7 +243,7 @@ struct tdq {
        int             tdq_transferable;       /* Transferable thread count. */
        short           tdq_switchcnt;          /* Switches this tick. */
        short           tdq_oldswitchcnt;       /* Switches last tick. */
-       u_char          tdq_lowpri;             /* Lowest priority thread. */
+       u_short         tdq_lowpri;             /* Lowest priority thread. */
        u_char          tdq_ipipending;         /* IPI pending. */
        u_char          tdq_idx;                /* Current insert index. */
        u_char          tdq_ridx;               /* Current removal index. */
@@ -2323,7 +2323,7 @@ sched_choose(void)
                tdq->tdq_lowpri = td->td_priority;
                return (td);
        }
-       tdq->tdq_lowpri = PRI_MAX_IDLE;
+       tdq->tdq_lowpri = PRI_MAX_IDLE + 1;
        return (PCPU_GET(idlethread));
 }
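
The two hunks above go together: the new sentinel value only fits if the
field is widened.  A minimal userland sketch of why, assuming PRI_MAX_IDLE
is 255 on stable/9 (worth checking against sys/sys/priority.h in your tree);
the helper names are made up for illustration and are not part of the patch:

```c
#include <assert.h>
#include <limits.h>

/* Assumption: PRI_MAX_IDLE is 255 on stable/9 (see sys/sys/priority.h). */
#define SKETCH_PRI_MAX_IDLE	255

/* Hypothetical helper: the value an 8-bit tdq_lowpri would end up holding. */
static unsigned
lowpri_as_u_char(int pri)
{
	unsigned char field;

	assert(pri >= 0);
	field = (unsigned char)pri;	/* wraps modulo 256 */
	return (field);
}

/* Hypothetical helper: the value a 16-bit tdq_lowpri would end up holding. */
static unsigned
lowpri_as_u_short(int pri)
{
	unsigned short field;

	assert(pri >= 0 && pri <= USHRT_MAX);
	field = (unsigned short)pri;
	return (field);
}
```

Under that assumption, lowpri_as_u_char(SKETCH_PRI_MAX_IDLE + 1) comes back
as 0, which would read as the numerically best priority instead of "nothing
runnable", while lowpri_as_u_short() keeps the value 256 intact.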
 

-- 
John Baldwin

Re: Why are cardbus drivers cbb(4) and pccard(4) still included in GENERIC?

2013-08-29 Thread John Baldwin

On Thursday, August 29, 2013 6:56:53 am Adrian Chadd wrote:
 Hm! Are they dynamically loaded if you insert the cards?
 
 (Ie, has devd been taught about them as appropriate?)

These are drivers for the bridges, not for cards you plug into the bridges.  
If you autoloaded them at all you would load them during boot when you saw an 
appropriate PCI device.  Currently we don't autoload any PCI drivers, so I 
don't think that should be a blocker for taking these out of GENERIC.

Warner is probably the best person to ask.

-- 
John Baldwin

Re: [REGRESSION] Root zpool mounting broken between 06/30/2013 and 07/21/2013 when PS/2 support compiled into the kernel

2013-07-22 Thread John Baldwin

On Monday, July 22, 2013 10:30:32 am Garrett Cooper wrote:
 I have a KERNCONF that previously had PS/2 support compiled into the kernel. 
 If I comment out the following lines like so:
 
 # atkbdc0 controls both the keyboard and the PS/2 mouse
 #device atkbdc  # AT keyboard controller
 #device atkbd   # AT keyboard
 
 then I'm able to mount root again (it was failing with ENOXDEV).
 
 The working kernel was as follows:
 
 $ strings /boot/kernel.WORKING/kernel | grep -B 2 -A 2 BAYONETTA
 @(#)FreeBSD 9.1-STABLE #7 r+0304216: Sun Jun 30 15:22:55 PDT 2013
 FreeBSD 9.1-STABLE #7 r+0304216: Sun Jun 30 15:22:55 PDT 2013
 
 gcooper@bayonetta.local:/usr/obj/scratch/git/github/yaneurabeya-freebsd-stable-9/sys/BAYONETTA
 gcc version 4.2.1 20070831 patched [FreeBSD]
 FreeBSD
 9.1-STABLE
 BAYONETTA
 $ cd /usr/src; git log 0304216
 commit 03042167f73c213732b44218a24d8e1bbea00f8c
 Merge: 2edcad2 974abfb
 Author: Garrett Cooper yaneg...@gmail.com
 Date:   Mon Jun 24 19:00:45 2013 -0700
 
 Merge remote-tracking branch 'upstream/stable/9' into stable/9
 
 The working kernel [with atkbdc] was as follows:
 
 FreeBSD bayonetta.local 9.2-BETA1 FreeBSD 9.2-BETA1 #12 r+c178034: Sun Jul 21 
 20:19:38 PDT 2013 
root@bayonetta.local:/usr/obj/scratch/git/github/yaneurabeya-freebsd-stable-9/sys/BAYONETTA
  amd64
 $ git log c178034
 commit c17803445f4ffb97e1a46a1be5f7ea04692793f0
 Author: avg a...@freebsd.org
 Date:   Tue Jul 9 08:30:31 2013 +
 
 zfsboottest.sh: remove checks for things that are not strictly required
 
 MFC after:  10 days
 
 (Yes, I had to backport some things because they are busted on stable/9 due 
 to other incomplete/missing MFCs).
 
 I can test out patches, but I don't have time to bisect the actual commit 
 that caused the failure. That being said my intuition says it's this 
commit should be looked at first:
 
 commit 28f961058b0667841d7e9d8639bfd02ed8689faa
 Author: jhb j...@freebsd.org
 Date:   Wed Jul 17 14:04:18 2013 +
 
 MFC 252576:
 Don't perform the acpi_DeviceIsPresent() check for PCI-PCI bridges.  If
 we are probing a PCI-PCI bridge it is because we found one by enumerating
 the devices on a PCI bus, so the bridge is definitely present.  A few
 BIOSes report incorrect status (_STA) for some bridges that claimed they
 were not present when in fact they were.
 
 While here, move this check earlier for Host-PCI bridges so attach fails
 before doing any work that needs to be torn down.
 
 PR: kern/91594
 Approved by:re (marius)

I strongly doubt that this is related.  It would be most helpful if you could
obtain a dmesg from the new kernel however (perhaps via a serial console) to
rule it out.  All you would need to see is if the new kernel sees more pcib
devices than the old one to see if this change even has an effect on your
system.

-- 
John Baldwin

Re: syncer causing latency spikes

2013-07-17 Thread John Baldwin

On Wednesday, July 17, 2013 3:18:52 pm Konstantin Belousov wrote:
 On Wed, Jul 17, 2013 at 02:07:55PM -0400, Mark Johnston wrote:
  During such an fsync, DTrace shows me that syncer sleeps of 50-200ms are
  happening up to 8 or 10 times a second. When this happens, a bunch of
  postgres threads become blocked in vn_write() waiting for the vnode lock
  to become free. It looks like the write-clustering code is limited to
  using (nswbuf / 2) pbufs, and FreeBSD prevents one from setting nswbuf
  to anything greater than 256.
 Syncer is probably just a victim of profiling.  Would postgres called
 fsync(2), you then blame the fsync code for the pauses.
 
 Just add a tunable to allow the user to manually-tune the nswbuf,
 regardless of the buffer cache sizing.  And yes, nswbuf default max
 probably should be bumped to something like 1024, at least on 64bit
 architectures which do not starve for kernel memory.

Also, if you are seeing I/O stalls with mfi(4), then you might need a
firmware update for your mfi(4) controller.  cc'ing smh@ who knows more about 
that particular issue (IIRC).

-- 
John Baldwin

Re: locks under printf(9) and WITNESS = panic?

2013-07-11 Thread John Baldwin

On Saturday, June 29, 2013 9:19:24 pm Steven Hartland wrote:
 when booting stable/9 under a debug kernel with WITNESS
 enabled and verbose I get the following panic..
 
 It seems very much like the discussion from a year back on
 current: http://lists.freebsd.org/pipermail/freebsd-current/2012-January/031375.html
 
 Any ideas?

Yeah, that lock needs to be MTX_RECURSE (the cnputs_mtx).  However, it
only recurses under witness.  *sigh*

-- 
John Baldwin

Re: USB ports on Lenovo T400 do not work after a suspend/resume

2013-07-08 Thread John Baldwin

On Sunday, June 30, 2013 10:22:09 am Ian Smith wrote:
 On Sat, 29 Jun 2013, Adrian Chadd wrote:
   On 27 June 2013 04:58, Ian Smith smi...@nimnet.asn.au wrote:
We don't yet know if this is a bus, ACPI /or USB issue.  Home yet? :)
   
   Yup:
   
   http://people.freebsd.org/~adrian/usb/
   
   dmesg.boot = dmesg at startup
   
   1 - after powerup, usb device in
   2 - after acpiconf -s3 suspend/resume, w/ a USB device plugged in
   3 - after acpiconf -s3 suspend/resume, with a USB device removed
   before suspend/resume
 
 After removing [numbers] (for WITNESS?), diff started making sense.  
 The below is between the first and second suspend/resume cycles in 
 dmesg-3.txt, encompassing the others.
 
 Nothing of note that I can see, if that usb hub-to-bus remapping is 
 normal.  As you said, 'CPU0: local APIC error 0x40' looks maybe sus.  
 Maybe someone who knows might comment on that?

From sys/amd64/include/apicreg.h:

/* fields in ESR */
#define APIC_ESR_SEND_CS_ERROR  0x0001
#define APIC_ESR_RECEIVE_CS_ERROR   0x0002
#define APIC_ESR_SEND_ACCEPT0x0004
#define APIC_ESR_RECEIVE_ACCEPT 0x0008
#define APIC_ESR_SEND_ILLEGAL_VECTOR0x0020
#define APIC_ESR_RECEIVE_ILLEGAL_VECTOR 0x0040
#define APIC_ESR_ILLEGAL_REGISTER   0x0080

Receive illegal vector (if you look in Intel's SDM) means it
got an interrupt vector < 32 (probably zero).  Perhaps it asserted
an interrupt in an I/O APIC before the I/O APIC was properly reset?
Are you using MSI at all?
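
For reading values like the 0x40 in 'CPU0: local APIC error 0x40', a small
userland sketch that maps single ESR bits back to the macro names quoted
above.  The table contents come from the apicreg.h excerpt; the struct and
function names are hypothetical and not taken from the kernel:

```c
#include <assert.h>
#include <stddef.h>

struct esr_bit {
	unsigned	 mask;
	const char	*name;
};

/* Bit values as listed in the apicreg.h excerpt above. */
static const struct esr_bit esr_bits[] = {
	{ 0x0001, "APIC_ESR_SEND_CS_ERROR" },
	{ 0x0002, "APIC_ESR_RECEIVE_CS_ERROR" },
	{ 0x0004, "APIC_ESR_SEND_ACCEPT" },
	{ 0x0008, "APIC_ESR_RECEIVE_ACCEPT" },
	{ 0x0020, "APIC_ESR_SEND_ILLEGAL_VECTOR" },
	{ 0x0040, "APIC_ESR_RECEIVE_ILLEGAL_VECTOR" },
	{ 0x0080, "APIC_ESR_ILLEGAL_REGISTER" },
};

#define	N_ESR_BITS	(sizeof(esr_bits) / sizeof(esr_bits[0]))

/* Return the macro name for one ESR bit, or NULL if it is not in the table. */
static const char *
apic_esr_bit_name(unsigned bit)
{
	size_t i;

	/* The caller must pass exactly one bit. */
	assert(bit != 0 && (bit & (bit - 1)) == 0);
	for (i = 0; i < N_ESR_BITS; i++)
		if (esr_bits[i].mask == bit)
			return (esr_bits[i].name);
	return (NULL);
}

/* Count how many of the listed error bits are set in an ESR value. */
static int
apic_esr_count(unsigned esr)
{
	size_t i;
	int n = 0;

	for (i = 0; i < N_ESR_BITS; i++)
		if (esr & esr_bits[i].mask)
			n++;
	return (n);
}
```

Calling apic_esr_bit_name(0x40) returns the RECEIVE_ILLEGAL_VECTOR name, and
apic_esr_count(0x40) shows that only that one error bit is set.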

-- 
John Baldwin

Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found

2013-06-17 Thread John Baldwin

On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
 On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
  On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
   Each day at 5:15 we are generating snapshots on various machines.
   This used to work perfectly under 7-STABLE for years but since
   we started to use 9.1-STABLE the machine reboots in about 10%
   of all cases.
   
   After rebooting we find a new snapshot file which is a bit
   smaller than the good ones and with different permissions
   It does not succeed a fsck. In this example it is the one
   whose name is beginning with s3:
   
   -r--r-   1 root  operator  snapshot 72802894528 29 May 05:15 
   s2-2013.05.28-03.15.04
   -r   1 root  operator  snapshot 72802893824 29 May 05:15 
   s3-2013.05.29-03.15.03
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
   s4-2013.05.23-06.38.44
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
   s5-2013.05.24-03.15.03
   -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
   s6-2013.05.25-03.15.03
   
   After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
   I see the following LORs (mksnap_ffs starts exactly at 5:15):
   
   May 29 05:15:00 kern.crit palveli kernel: lock order reversal:
   May 29 05:15:00 kern.crit palveli kernel: 1st 0xc2371da8 ufs (ufs) @ 
   /src/src-9/sys/kern/vfs_mount.c:1240
   May 29 05:15:00 kern.crit palveli kernel: 2nd 0xc2371ec4 devfs (devfs) 
   @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
   May 29 05:15:04 kern.crit palveli kernel: lock order reversal:
   May 29 05:15:04 kern.crit palveli kernel: 1st 0xc228471c snaplk 
   (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
   May 29 05:15:04 kern.crit palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ 
   /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
   
   Unfortunatley no corefiles are being generated ;-(.
   
   I have checked and even rebuilt the (UFS1) fs in question
   from scratch. I have also seen this happen on an UFS2 on
   another machine and on a third one when running dump -L
   on a root fs.
   
   Any hints of how to proceed?
  
  Would it be possible to setup a serial console that is logged on this 
  machine
  to see if it is panic'ing but failing to write out a crashdump?
 
 Couldn't attach the serial console yet ;-(. But I had people
 attach a KVMoverIP switch and enabled the various KDB options
 in the kernel. Now we can see a bit more (see below) -- no
 crashdump is being generated though.

:(  Unfortunately these LORs don't really help with discerning the cause of
the reboot.  If you have remote power access (and still wanted to test this)
one option would be to change KDB to drop into the debugger on a panic.
Then you could connect over the KVM and take images of the original panic
along with a stack trace.

-- 
John Baldwin

Re: ACPI Warning, then hang

2013-06-14 Thread John Baldwin

On Monday, June 10, 2013 10:18:47 pm Bryce Edwards wrote:
 Verbose boot:
 
 https://www.dropbox.com/s/obm8rtavro68ea8/acpi-verbose.jpg

That is odd.  I had expected it to output some other messages.

Hmm, the line two lines up shows your RSDP (list of ACPI tables)
seems to be garbage as well.  I think the BIOS is just broken
I'm afraid. :(

-- 
John Baldwin

Re: zpool labelclear destroys GPT data

2013-06-14 Thread John Baldwin

On Friday, June 14, 2013 4:21:08 am Daniel O'Connor wrote:
 
 On 14/06/2013, at 17:48, Alban Hertroys haram...@gmail.com wrote:
  IMHO it would be helpful to verify what's there first and warn the user
  about it if such an operation will overwrite a different type of label
  than what is about to get written there.
  Perhaps it should even refuse to write (by issuing an error stating that
  there is already a label there - and preferably also what type) until the
  label that's already there gets explicitly cleared by the user or until
  the command gets forced.
  Does that make sense?
 
 The problem with this is that then each label tool needs to know about
 every other label format you want to detect for..
 
 If a label format has a checksum then you could ignore a request to nuke
 the label if there is no valid checksum (with a flag to force). No idea how
 many have checksums though..

Well, you could have zpool check if there is a valid ZFS label and prompt/warn 
if it doesn't find one on whatever device it's about to wipe.  That doesn't 
fix the gmirror/gpt case, but it might make zpool more intuitive to use.

-- 
John Baldwin

Re: Reproducable Infiniband panic

2013-06-10 Thread John Baldwin

On Monday, June 10, 2013 8:04:12 am Julian Stecklina wrote:
 On 06/07/2013 06:06 PM, John Baldwin wrote:
  On Friday, June 07, 2013 5:07:34 am Julian Stecklina wrote:
  On 06/06/2013 08:57 PM, John Baldwin wrote:
  On Thursday, June 06, 2013 9:54:35 am Andriy Gapon wrote:
  [...]
  The problem seems to be in incorrect interaction between devfs_close_f and
  linux_file_dtor.  The latter expects curthread->td_fpop to have a valid
  reasonable value.  But the former sets curthread->td_fpop to fp only around
  vnops.fo_close() call and then restores it back to some (what?) previous
  value before calling devfs_fpdrop->devfs_destroy_cdevpriv.  In this case
  the previous value is NULL.
 
  It is normally NULL in this case.  Why does linux_file_dtor even look at
  td_fpop?
 
  Ah.  I think it should not do that and make the data it uses in the dtor more
  self-contained:
 [...]
 
 Seems to fix my panic. Thanks!

Can you please retest this updated version?  I had thought that I didn't need
a reference count on the vnode, but devfs drops its reference count before the
cdevpriv destructor is called.

Index: sys/ofed/include/linux/fs.h
===
--- sys/ofed/include/linux/fs.h (revision 251604)
+++ sys/ofed/include/linux/fs.h (working copy)
@@ -73,6 +73,7 @@
        struct dentry   f_dentry_store;
        struct selinfo  f_selinfo;
        struct sigio    *f_sigio;
+       struct vnode    *f_vnode;
 };
 
 #define        file    linux_file
Index: sys/ofed/include/linux/linux_compat.c
===
--- sys/ofed/include/linux/linux_compat.c   (revision 251604)
+++ sys/ofed/include/linux/linux_compat.c   (working copy)
@@ -212,7 +212,8 @@
        struct linux_file *filp;
 
        filp = cdp;
-       filp->f_op->release(curthread->td_fpop->f_vnode, filp);
+       filp->f_op->release(filp->f_vnode, filp);
+       vdrop(filp->f_vnode);
        kfree(filp);
 }
 
@@ -232,6 +233,8 @@
        filp->f_dentry = &filp->f_dentry_store;
        filp->f_op = ldev->ops;
        filp->f_flags = file->f_flag;
+       vhold(file->f_vnode);
+       filp->f_vnode = file->f_vnode;
        if (filp->f_op->open) {
                error = -filp->f_op->open(file->f_vnode, filp);
                if (error) {
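
The vhold()/vdrop() pairing in this version can be pictured with a small
userland model.  This is only a sketch of the idea (take a hold when the
pointer is stashed at open time, drop it in the destructor), not the kernel
code; struct node and struct handle are hypothetical stand-ins for the vnode
and for struct linux_file:

```c
#include <assert.h>
#include <stddef.h>

struct node {			/* stand-in for struct vnode */
	int		 n_holdcnt;
};

struct handle {			/* stand-in for struct linux_file */
	struct node	*h_node;
};

static void
node_hold(struct node *n)
{
	n->n_holdcnt++;
}

static void
node_drop(struct node *n)
{
	assert(n->n_holdcnt > 0);
	n->n_holdcnt--;
}

/* Open: remember the node and take our own hold so it cannot go away. */
static void
handle_open(struct handle *h, struct node *n)
{
	node_hold(n);
	h->h_node = n;
}

/* Destructor: relies only on state stored in the handle itself. */
static void
handle_dtor(struct handle *h)
{
	assert(h->h_node != NULL);
	node_drop(h->h_node);
	h->h_node = NULL;
}

/* Report the hold count while open (run_dtor == 0) or after the dtor. */
static int
demo_holdcnt(int run_dtor)
{
	struct node n = { 0 };
	struct handle h = { NULL };

	handle_open(&h, &n);
	if (run_dtor)
		handle_dtor(&h);
	return (n.n_holdcnt);
}
```

The destructor never consults per-thread state, which is the property the
patch is after; the extra hold covers the window in which devfs has already
dropped its own reference.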

-- 
John Baldwin

Re: ACPI Warning, then hang

2013-06-10 Thread John Baldwin

On Monday, June 10, 2013 10:35:07 am Jeremy Chadwick wrote:
 On Mon, Jun 10, 2013 at 09:18:14AM -0500, Bryce Edwards wrote:
  I'm getting the following warning, and then the system locks:
  
  ACPI Warning: Incorrect checksum in table [(bunch of spaces)] - 0x29,
  should be 0x48
  
  Here's a pic: http://db.tt/O6dxONzI
  
  System is on a SuperMicro C7X58 motherboard that I just upgraded to
  BIOS 2.0a, which I would like to stay on if possible.  I tried
  adjusting all the ACPI related BIOS settings without success.
 
 The message in question refers to hard-coded data in one of the many
 ACPI tables (see acpidump(8) for the list -- there are many).  ACPI
 tables are stored within the BIOS -- the motherboard/BIOS vendor has
 full control over all of them and is fully 100% responsible for their
 content.
 
 It looks to me like they severely botched their BIOS, or somehow it got
 flashed wrong.
 
 You need to contact Supermicro Technical Support and tell them of the
 problem.  They need to either fix their BIOS, or help figure out what's
 become corrupted.  You can point them to this thread if you'd like.
 
 I should note that the corruption/issue is major enough that you are
 missing very key/important lines from your dmesg (after avail memory
 but before kdbX at kdbmuxX, which come from pure reliance upon ACPI.
 Lines such as:
 
 Event timer LAPIC quality 400
 ACPI APIC Table: PTLTDAPIC  
 FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 FreeBSD/SMP: 1 package(s) x 4 core(s)
  cpu0 (BSP): APIC ID:  0
  cpu1 (AP): APIC ID:  1
  cpu2 (AP): APIC ID:  2
  cpu3 (AP): APIC ID:  3
 ioapic0 Version 2.0 irqs 0-23 on motherboard
 ioapic1 Version 2.0 irqs 24-47 on motherboard
 
 In the meantime, you can try booting without ACPI support (there should
 be a boot-up menu option for that) and pray that works.  If it doesn't,
 then your workaround is to roll back to an older BIOS version and/or put
 pressure on Supermicro.  You will find their Technical Support folks are
 quite helpful/responsive to technical issues.
 
 Good luck and keep us posted on what transpires.

Actually, that message is mostly harmless.  All sorts of vendors ship
tables with busted checksums that are in fact fine. :(  However, the table
name looks very odd which is more worrying.  Booting without ACPI enabled
would be a good first step.  Trying a verbose boot to capture the last
message before the hang would also be useful.
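
For reference, the ACPI table checksum is a plain byte sum: all bytes of the
table, including the checksum byte, must add up to zero modulo 256.  A
minimal userland sketch of that rule follows; the function names are
hypothetical and this is not the ACPICA implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum every byte of the table modulo 256; a valid table sums to 0. */
static uint8_t
acpi_table_sum(const uint8_t *table, size_t len)
{
	uint8_t sum = 0;
	size_t i;

	assert(table != NULL || len == 0);
	for (i = 0; i < len; i++)
		sum = (uint8_t)(sum + table[i]);
	return (sum);
}

/* Compute the value the checksum byte at cksum_off should hold. */
static uint8_t
acpi_checksum_for(const uint8_t *table, size_t len, size_t cksum_off)
{
	uint8_t rest;

	assert(cksum_off < len);
	/* Sum of everything except the checksum byte itself. */
	rest = (uint8_t)(acpi_table_sum(table, len) - table[cksum_off]);
	return ((uint8_t)(0x100 - rest));
}
```

A table whose bytes sum to a non-zero value fails the check even when the
rest of its contents are usable, which is how a vendor can ship a table that
draws the warning and still works.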

-- 
John Baldwin

Re: Reproducable Infiniband panic

2013-06-07 Thread John Baldwin

On Friday, June 07, 2013 5:07:34 am Julian Stecklina wrote:
 On 06/06/2013 08:57 PM, John Baldwin wrote:
  On Thursday, June 06, 2013 9:54:35 am Andriy Gapon wrote:
 [...]
  The problem seems to be in incorrect interaction between devfs_close_f and
  linux_file_dtor.  The latter expects curthread->td_fpop to have a valid
  reasonable value.  But the former sets curthread->td_fpop to fp only around
  vnops.fo_close() call and then restores it back to some (what?) previous
  value before calling devfs_fpdrop->devfs_destroy_cdevpriv.  In this case
  the previous value is NULL.
  
  It is normally NULL in this case.  Why does linux_file_dtor even look at
  td_fpop?
  
  Ah.  I think it should not do that and make the data it uses in the dtor more
  self-contained:
  
  Index: sys/ofed/include/linux/linux_compat.c
  ===
  --- linux_compat.c  (revision 251465)
  +++ linux_compat.c  (working copy)
  @@ -212,7 +212,7 @@ linux_file_dtor(void *cdp)
          struct linux_file *filp;
   
          filp = cdp;
  -       filp->f_op->release(curthread->td_fpop->f_vnode, filp);
  +       filp->f_op->release(filp->f_vnode, filp);
          kfree(filp);
   }
   
  @@ -232,6 +232,7 @@ linux_dev_open(struct cdev *dev, int oflags, int d
          filp->f_dentry = &filp->f_dentry_store;
          filp->f_op = ldev->ops;
          filp->f_flags = file->f_flag;
  +       filp->f_vnode = file->f_vnode;
          if (filp->f_op->open) {
                  error = -filp->f_op->open(file->f_vnode, filp);
                  if (error) {
  
 
 Doesn't compile for me. Did you forget to add the f_vnode member to
 struct linux_file?
 
 sys/ofed/include/linux/linux_compat.c: In function 'linux_file_dtor':
 sys/ofed/include/linux/linux_compat.c:214: error: 'struct linux_file'
 has no member named 'f_vnode'
 sys/ofed/include/linux/linux_compat.c: In function 'linux_dev_open':
 sys/ofed/include/linux/linux_compat.c:234: error: 'struct linux_file'
 has no member named 'f_vnode'

Oof, it's in another header:

Index: sys/ofed/include/linux/fs.h
===
--- fs.h(revision 251494)
+++ fs.h(working copy)
@@ -73,6 +73,7 @@ struct linux_file {
        struct dentry   f_dentry_store;
        struct selinfo  f_selinfo;
        struct sigio    *f_sigio;
+       struct vnode    *f_vnode;
 };
 
 #define        file    linux_file


-- 
John Baldwin

Re: Reproducable Infiniband panic

2013-06-06 Thread John Baldwin

On Thursday, June 06, 2013 9:54:35 am Andriy Gapon wrote:
 on 06/06/2013 14:48 Julian Stecklina said the following:
  #7  0x807a3d83 in linux_file_dtor (cdp=0xfe000aeabb80) at
  /usr/home/julian/src/freebsd/sys/ofed/include/linux/linux_compat.c:214
  filp = (struct linux_file *) 0xfe000aeabb80
  #8  0x80513c39 in devfs_destroy_cdevpriv (p=0xfe0005772980)
  at /usr/home/julian/src/freebsd/sys/fs/devfs/devfs_vnops.c:159
  No locals.
  #9  0x80513e47 in devfs_close_f (fp=0xfe000b0e9aa0,
  td=value optimized out)
  at /usr/home/julian/src/freebsd/sys/fs/devfs/devfs_vnops.c:619
  error = 0
  fpop = (struct file *) 0x0
 
 The problem seems to be in incorrect interaction between devfs_close_f and
 linux_file_dtor.  The latter expects curthread->td_fpop to have a valid
 reasonable value.  But the former sets curthread->td_fpop to fp only around
 vnops.fo_close() call and then restores it back to some (what?) previous
 value before calling devfs_fpdrop->devfs_destroy_cdevpriv.  In this case the
 previous value is NULL.

It is normally NULL in this case.  Why does linux_file_dtor even look at
td_fpop?

Ah.  I think it should not do that and make the data it uses in the dtor more
self-contained:

Index: sys/ofed/include/linux/linux_compat.c
===
--- linux_compat.c  (revision 251465)
+++ linux_compat.c  (working copy)
@@ -212,7 +212,7 @@ linux_file_dtor(void *cdp)
        struct linux_file *filp;
 
        filp = cdp;
-       filp->f_op->release(curthread->td_fpop->f_vnode, filp);
+       filp->f_op->release(filp->f_vnode, filp);
        kfree(filp);
 }
 
@@ -232,6 +232,7 @@ linux_dev_open(struct cdev *dev, int oflags, int d
        filp->f_dentry = &filp->f_dentry_store;
        filp->f_op = ldev->ops;
        filp->f_flags = file->f_flag;
+       filp->f_vnode = file->f_vnode;
        if (filp->f_op->open) {
                error = -filp->f_op->open(file->f_vnode, filp);
                if (error) {

-- 
John Baldwin

Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found

2013-05-31 Thread John Baldwin

On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
 Each day at 5:15 we are generating snapshots on various machines.
 This used to work perfectly under 7-STABLE for years but since
 we started to use 9.1-STABLE the machine reboots in about 10%
 of all cases.
 
 After rebooting we find a new snapshot file which is a bit
 smaller than the good ones and with different permissions
 It does not succeed a fsck. In this example it is the one
 whose name is beginning with s3:
 
 -r--r-   1 root  operator  snapshot 72802894528 29 May 05:15 
 s2-2013.05.28-03.15.04
 -r   1 root  operator  snapshot 72802893824 29 May 05:15 
 s3-2013.05.29-03.15.03
 -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
 s4-2013.05.23-06.38.44
 -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
 s5-2013.05.24-03.15.03
 -r--r-   1 root  operator  snapshot 72802894528 28 May 14:22 
 s6-2013.05.25-03.15.03
 
 After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
 I see the following LORs (mksnap_ffs starts exactly at 5:15):
 
 May 29 05:15:00 kern.crit palveli kernel: lock order reversal:
 May 29 05:15:00 kern.crit palveli kernel: 1st 0xc2371da8 ufs (ufs) @ 
 /src/src-9/sys/kern/vfs_mount.c:1240
 May 29 05:15:00 kern.crit palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ 
 /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
 May 29 05:15:04 kern.crit palveli kernel: lock order reversal:
 May 29 05:15:04 kern.crit palveli kernel: 1st 0xc228471c snaplk (snaplk) @ 
 /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
 May 29 05:15:04 kern.crit palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ 
 /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
 
 Unfortunatley no corefiles are being generated ;-(.
 
 I have checked and even rebuilt the (UFS1) fs in question
 from scratch. I have also seen this happen on an UFS2 on
 another machine and on a third one when running dump -L
 on a root fs.
 
 Any hints of how to proceed?

Would it be possible to set up a serial console that is logged on this machine
to see if it is panic'ing but failing to write out a crashdump?

-- 
John Baldwin

Re: SunFire X2200 ilo's bge1 DOWN/UP

2013-05-30 Thread John Baldwin

On Thursday, May 30, 2013 2:44:35 am Daniel Braniss wrote:
  --/04w6evG8XlLl3ft
  Content-Type: text/x-diff; charset=us-ascii
  Content-Disposition: attachment; filename=bge.media_sts.diff
  
  Index: sys/dev/bge/if_bge.c
  ===
  --- sys/dev/bge/if_bge.c(revision 251021)
  +++ sys/dev/bge/if_bge.c(working copy)
  @@ -5583,6 +5583,10 @@ bge_ifmedia_sts(struct ifnet *ifp, struct ifmediar
   
          BGE_LOCK(sc);
   
  +       if ((ifp->if_flags & IFF_UP) == 0) {
  +               BGE_UNLOCK(sc);
  +               return;
  +       }
          if (sc->bge_flags & BGE_FLAG_TBI) {
                  ifmr->ifm_status = IFM_AVALID;
                  ifmr->ifm_active = IFM_ETHER;
  
  --/04w6evG8XlLl3ft--
 after 18hs, the logs are empty!
 it seems the patch fixes the problem.
 
 now maybe it's time to hunt for who is randomly calling for bge_ifmedia_sts
 ...

It could be any number of daemons that query interface state such as an
SNMP server, ladvd, etc.

If you wanted help tracking down the caller, you could modify the patch so
that it does something like this:

        if ((ifp->if_flags & IFF_UP) == 0) {
                BGE_UNLOCK(sc);
                if_printf(ifp, "state queried on down interface by pid %d (%s)\n",
                    curthread->td_proc->p_pid, curthread->td_proc->p_comm);
                return;
        }

-- 
John Baldwin

Re: System doesn't dump

2013-05-30 Thread John Baldwin

On Wednesday, May 29, 2013 2:41:38 am Dominic Fandrey wrote:
 I have a number of actions that reliably panic the system, such as
 performing shutdown -p (yes I'm booting into an inconsistent file
 system every time). Both with my notebook and my workstation.
 
 However I cannot get the system to dump.
 
 dumpdir=/var/crash
 and I've tried ada0s2b, /dev/ada0s2b, label/5swap, /dev/label/5swap and AUTO
 for dumpdev to no avail.
 
 The swap partition is 16g, the machines have 8g RAM and there's plenty
 of hard disk space available for /var/crash.
 
 I'm looking for that secret, undocumented trigger, that makes the
 system dump if a panic occurs. Once upon a time dumping just worked
 if the swap partition was large enough. I miss those olden days.

Does /dev/dumpdev exist and point to your swap partition after booting?

-- 
John Baldwin

Re: 9.1-REL Supermicro H8DCL-iF kernel panic

2013-04-03 Thread John Baldwin

On Monday, April 01, 2013 12:29:46 pm Xin Li wrote:
 Yes, this is a bandaid and the right fix should be refactor the code a
 little bit to make sure that no interrupt handler is installed before
 the driver have done other initializations but I don't have hardware
 that can reproduce this issue handy to validate changes like that.

It is not that easy.  I instrumented the crap out of the igb driver on the
one machine where I could reliably reproduce this and kept clearing the
interrupt cause register during attach multiple times and still got a
spurious interrupt.  I believe this is a chip bug of some sort, but I've
no idea whose fault it is.  It has only been reported on SuperMicro *8*
boards to date.

-- 
John Baldwin

Re: [patch] IPMI KCS can drop the lock while servicing a request

2013-04-02 Thread John Baldwin

On Saturday, March 23, 2013 11:11:20 pm Eric van Gyzen wrote:
 At work, we discovered that our application's IPMI thread would often 
 use a lot of CPU time.  The KCS thread uses DELAY to wait for the BMC, 
 so it can run without sleeping for a long time with a slow BMC.  It 
 also holds the ipmi_softc.ipmi_lock during this time.  When using 
 adaptive mutexes, an application thread that wants to operate on the 
 ipmi_pending_requests list will also spin during this same time.
 
 We see no reason that the KCS thread needs to hold the lock while 
 servicing a request.  We've been running with the attached patch for a 
 few months, with no ill effects.

The lock protects against concurrent access to the registers themselves
(though the thread sort of does this already).  However, even with a slow
BMC it shouldn't be waiting but so long.  I had some other comments about
this patch in my reply to when it was committed.

-- 
John Baldwin

Re: gptzfsboot: error 4 lba 30

2013-04-02 Thread John Baldwin

On Monday, March 25, 2013 7:52:04 am Kai Gallasch wrote:
 Hi.
 
 On one of my fresh installed servers I am seeing the following output
 during boot:
 
 gptzfsboot: error 4 lba 30
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 30
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31
 gptzfsboot: error 4 lba 31

Humm, do you have disks that the BIOS sees that are small?  An error code of 4
means 'sector not found' or 'read error'.  It would be interesting to see the
output of 'lsdev -v' from the loader prompt.
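
As a rough aid for reading those numbers, here is a sketch of a lookup for a
few BIOS INT 13h status codes.  It assumes the value gptzfsboot prints after
"error" is the BIOS status byte, and the list is a partial one written from
memory, so treat it as illustrative; the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Partial table of INT 13h status codes (assumption: this is the value
 * printed after "error").  Codes not listed here return NULL.
 */
static const char *
int13_status_str(unsigned code)
{
	assert(code <= 0xff);	/* the status is an 8-bit value */
	switch (code) {
	case 0x00:
		return ("success");
	case 0x01:
		return ("invalid function or parameter");
	case 0x02:
		return ("address mark not found");
	case 0x03:
		return ("disk write-protected");
	case 0x04:
		return ("sector not found / read error");
	default:
		return (NULL);	/* not covered by this sketch */
	}
}
```

With this table int13_status_str(4) gives the "sector not found / read
error" text, which is what a read at an LBA the BIOS thinks is past the end
of a small disk would produce.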

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: Core Dump / panic sleeping thread

2013-03-20 Thread John Baldwin

On Wednesday, March 20, 2013 9:22:22 am Konstantin Belousov wrote:
 On Wed, Mar 20, 2013 at 12:13:05PM +0100, Michael Landin Hostbaek wrote:
  
  On Mar 20, 2013, at 10:49 AM, Konstantin Belousov kostik...@gmail.com wrote:
   
   I do not like it. As I said in the previous response to Andrey,
   I think that moving the vnode_pager_setsize() after the unlock is
   better, since it reduces races with other thread seeing half-done
   attribute update or making attribute change simultaneously.
  
  OK - so should I wait for another patch - or? 
 
 I think the following is what I mean. As an additional note, why nfs
 client does not trim the buffers when server reported node size change ?

Will changing the size always result in an mtime change forcing the client to
throw away the data on the next read or fault anyway (or does it only affect
ctime)?

-- 
John Baldwin

Re: svn - but smaller?

2013-03-15 Thread John Baldwin

On Wednesday, March 13, 2013 10:11:28 pm John Mehr wrote:
  And svnup(1) really should mention that any files in the 
 target tree not 
  in the repository will be deleted, which was 
 (explicitly) not the case 
  with c{,v}sup.  I only lost a few acpi patches that I 
 think have likely 
  made it to stable/9 anyway, and it's a test system, but 
 I was surprised.
 
 I always thought csup did delete files.  I was looking at 
 csup's man page for things to put on the to-do list and 
 there's a csup command line parameter ( -d ) that puts a 
 limit on the number of files that can be deleted in a 
 given run.  Adding this feature is already on my to-do 
 list, and I've just added another item to let the user 
 choose whether svnup should delete extra files in the 
 local source tree.

csup deletes files that are deleted upstream (so if an svn
commit were to remove a file from the source tree).  It did
not delete files that were locally added (like work/ directories
for port builds, or kernel config files) that were never in
the repository in the first place.

I think that is the approach you probably want to take by default.
That is also how the stock svn client acts.
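
A minimal sketch of that default, assuming the tool keeps a sorted list of the repository files it saw on the previous run (the list files and the helper name files_to_delete are hypothetical, not part of csup or svnup). Only paths that were in the old list and are gone from the new one are candidates for deletion; files that were only ever local never appear in either list, so they are left alone.

```shell
# files_to_delete PREV NEXT
# PREV: sorted list of repository files recorded at the last sync
# NEXT: sorted list of repository files seen in this sync
# Prints the paths that were removed upstream (set difference PREV - NEXT).
files_to_delete() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

Locally added files such as work/ directories or kernel config files are never listed in PREV, so this rule cannot select them.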

-- 
John Baldwin

Re: mfi timeouts

2013-02-27 Thread John Baldwin

On Wednesday, February 27, 2013 12:58:11 am rihad wrote:
 Now about this part taken from here 
 http://lists.freebsd.org/pipermail/freebsd-scsi/2011-March/004839.html
   By issuing a dummy read operation (thus forcing a flush of data 
 buffers), this issue is largely averted.
 
 Does this mean that battery-backed cache (BBU) is effectively rendered 
 useless, as all write operations are forced on to the disk platters on 
 every interrupt?

No, this is a very different level.  This is forcing pending PCI DMA 
transactions on the PCI bus to flush by doing a read, not forcing I/O
buffers to be flushed to disk.

-- 
John Baldwin

Re: IPMI serial console

2013-02-26 Thread John Baldwin

On Thursday, February 21, 2013 5:55:01 pm Glen Barber wrote:
 On Thu, Feb 21, 2013 at 05:23:14PM -0500, John Baldwin wrote:
  On Thursday, February 21, 2013 4:56:02 pm Daniel O'Connor wrote:
   
   On 22/02/2013, at 2:19, John Baldwin j...@freebsd.org wrote:
Does anyone have any hints?

Rather than using all these hints, just use these three in loader.conf:

console=comconsole vidconsole
console_speed=115200
console_port=0xblah  (where blah is the correct I/O port for 
COM3, 
  0x3e8 
maybe?)
   
   
   No dice :(
   
   I also tried booting with '-D -h -S 115200' but nothing either.
  
  Sorry, those should be 'comconsole_speed' and 'comconsole_port'.  Also, you 
  should be able to get the loader prompt working if you enter those by hand 
  using an IPMI KVM or some such.
  
 
 John, this sounds very similar to a question I posed to you a few weeks
 ago.  I guess it's not just me with these weird SuperMicro BMCs. :(

I am using exactly this on many SuperMicro X8 and X9 boards.

-- 
John Baldwin

Re: IPMI serial console

2013-02-26 Thread John Baldwin

On Thursday, February 21, 2013 5:42:08 pm Daniel O'Connor wrote:
 
 On 22/02/2013, at 8:53, John Baldwin j...@freebsd.org wrote:
  I also tried booting with '-D -h -S 115200' but nothing either.
  
  Sorry, those should be 'comconsole_speed' and 'comconsole_port'.  Also, 
you 
  should be able to get the loader prompt working if you enter those by hand 
  using an IPMI KVM or some such.
 
 
 No luck with that either :(
 
 The IPMI serial console works for the BIOS & loader so I guess the 
comconsole parts work, however the kernel doesn't seem to use it even with '-D 
-h'.
 
 The uart(4) flags are correct (I believe)
 uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 on acpi0
 uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
 uart2: <16550 or compatible> port 0x3e8-0x3ef irq 5 flags 0x30 on acpi0

The way this works with the kernel is that the loader has to be setting a
hw.uart.console hint based on comconsole_port.  The hint.uart.X.flags settings
are completely ignored for this.  Also, for 9.1, you must set the speed before
you set the port (so the order of lines in loader.conf matters), or 
hw.uart.console will tell the kernel to use 9600 instead of 115200.
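
As a rough illustration of the ordering point, the sketch below checks a loader.conf-style file and reports whether comconsole_speed is set before comconsole_port. The helper name check_console_order is hypothetical (it is not a FreeBSD tool), and it assumes each variable is assigned at the start of a line.

```shell
# check_console_order [FILE]
# Returns 0 if comconsole_speed appears before comconsole_port,
# 1 if the order is reversed, 2 if either setting is missing.
check_console_order() {
    conf=${1:-/boot/loader.conf}
    # Line number of the first assignment of each variable.
    speed_line=$(grep -n '^comconsole_speed=' "$conf" | head -n 1 | cut -d: -f1)
    port_line=$(grep -n '^comconsole_port=' "$conf" | head -n 1 | cut -d: -f1)
    if [ -z "$speed_line" ] || [ -z "$port_line" ]; then
        echo "missing comconsole_speed or comconsole_port"
        return 2
    fi
    if [ "$speed_line" -lt "$port_line" ]; then
        echo "ok: speed (line $speed_line) precedes port (line $port_line)"
        return 0
    fi
    echo "reorder: port (line $port_line) precedes speed (line $speed_line)"
    return 1
}
```

A file that passes would set console="comconsole vidconsole", then comconsole_speed="115200", then comconsole_port="0x3e8" (0x3e8 being the port shown for uart2 in the probe output quoted above).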

-- 
John Baldwin

Re: FreeBSD-9.1 would not boot on pentium3 laptop

2013-02-26 Thread John Baldwin

On Tuesday, February 26, 2013 1:10:37 am Mikhail T. wrote:
 15.02.2013 08:49, John Baldwin wrote:
  Were you able to test this patch?
 Yes, with the patch my laptop boots -- even after I removed the
 work-around (hint.ichss.0.disabled=1 from device.hints). powerd is
 also able to regulate the frequency -- I'm not sure, how else to test
 the functionality.
 
 Thank you. Yours,

Perfect, thanks for testing!

-- 
John Baldwin

Re: mfi timeouts

2013-02-26 Thread John Baldwin

On Tuesday, February 26, 2013 1:31:44 pm rihad wrote:
  On 28/10/2011 04:14, Jan Mikkelsen wrote:
  /  Hi,
  //
  //  There is a patch linked to from this PR, which seems very similar:
  //
  //  http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/140416
  //
  //  http://lists.freebsd.org/pipermail/freebsd-scsi/2011-
March/004839.html
  //
  //  The problem is also consistent with running mfiutil clearing the 
problem.
  //
  //  I'm about to deploy mfi controllers in a similar configuration, so 
I'd be very curious about whether the patch fixes the problem for you.
  //
  /This looks promising, I'll give a try when I get a moment.
 
 Hi,
 
 Did the patch help? We're having the same issues & running mfiutil show 
 volumes every minute doesn't make the freezes go away.
 Will this small patch be ok on 8.2-RELEASE-p4? Thanks.

You can use the patch on 8.2.

-- 
John Baldwin

Re: IPMI serial console

2013-02-21 Thread John Baldwin

On Thursday, February 21, 2013 5:45:13 am Daniel O'Connor wrote:
 Hi all,
 A recent thread inspired me to try getting a proper serial console working 
on a Supermicro X9SCL motherboard with IPMI.
 
 However I find that while I see loader messages and the getty I enabled  
after boot I don't get any kernel messages which does somewhat limit the 
utility..
 
 The BMC creates COM3 (/dev/cuau2) which works with getty. I modified 
/boot/loader.conf like so..
 boot_multicons=yes
 boot_serial=YES
 console=comconsole vidconsole
 comconsole_speed=115200
 # Disable console flags on these 2 ports
 hint.uart.0.flags=0x00
 hint.uart.1.flags=0x00
 # Set console flag
 hint.uart.2.flags=0x10
 
 Does anyone have any hints?

Rather than using all these hints, just use these three in loader.conf:

console="comconsole vidconsole"
console_speed=115200
console_port=0xblah  (where blah is the correct I/O port for COM3, 0x3e8 maybe?)

-- 
John Baldwin

Re: IPMI serial console

2013-02-21 Thread John Baldwin

On Thursday, February 21, 2013 4:56:02 pm Daniel O'Connor wrote:
 
 On 22/02/2013, at 2:19, John Baldwin j...@freebsd.org wrote:
  Does anyone have any hints?
  
  Rather than using all these hints, just use these three in loader.conf:
  
  console=comconsole vidconsole
  console_speed=115200
  console_port=0xblah  (where blah is the correct I/O port for COM3, 
0x3e8 
  maybe?)
 
 
 No dice :(
 
 I also tried booting with '-D -h -S 115200' but nothing either.

Sorry, those should be 'comconsole_speed' and 'comconsole_port'.  Also, you 
should be able to get the loader prompt working if you enter those by hand 
using an IPMI KVM or some such.

-- 
John Baldwin

Re: 9-STABLE - NFS - NetAPP:

2013-02-19 Thread John Baldwin

On Friday, February 15, 2013 11:31:11 pm Marc Fournier wrote:
 
 Trying the patch now … but what do you mean by using 'SIGSTOP'?  I generally
 do a 'kill -HUP' then when that doesn't work 'kill -9' … should I use -STOP
 instead of 9?

No.  This patch only helps if you are using kill -STOP to pause processes and
later resume them.  If you aren't doing that, then the suspension could be due
to a different cause.  Please try this patch instead and let me know if you
see any of the 'Deferring' messages on the console:

Index: kern_thread.c
===
--- kern_thread.c   (revision 246122)
+++ kern_thread.c   (working copy)
@@ -794,7 +794,30 @@ thread_suspend_check(int return_instead)
 		    (p->p_flag & P_SINGLE_BOUNDARY) && return_instead)
 			return (ERESTART);
 
+#if 0
 		/*
+		 * Ignore suspend requests for stop signals if they
+		 * are deferred.
+		 */
+		if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
+		    td->td_flags & TDF_SBDRY) {
+			KASSERT(return_instead,
+			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
+			return (0);
+		}
+#else
+		/* Ignore suspend requests if stops are deferred. */
+		if (td->td_flags & TDF_SBDRY) {
+			if (!return_instead)
+				panic("TDF_SBDRY set, but return_instead not");
+			if (P_SHOULDSTOP(p) != P_STOPPED_SIG)
+				printf("Deferring non-STOP suspension: SHOULDSTOP: %x p_flag %x\n",
+				    P_SHOULDSTOP(p), p->p_flag);
+			return (0);
+		}
+#endif
+
+		/*
 		 * If the process is waiting for us to exit,
 		 * this thread should just suicide.
 		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.



-- 
John Baldwin

Re: 9-STABLE - NFS - NetAPP:

2013-02-15 Thread John Baldwin

On Thursday, February 14, 2013 10:05:56 pm Rick Macklem wrote:
 Marc Fournier wrote:
  On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca wrote:
  
  
   The pid that is in T state for the ps auxlH.
  
  Different server, last kernel update on Jan 22nd, https process this
  time instead of du last time.
  
  I've attached:
  
  ps auxlH
  ps auxlH of just the processes that are in TJ state (6 httpd servers)
  procstat output for each of the 6 process
  
  
  
  
  They are included as attachments … if these don't make it through, let
  me know, just figured I'd try and keep it compact ...
 Well, I've looked at this call path a little closer:
 16693 104135 httpd - mi_switch+0x186 
thread_suspend_check+0x19f sleepq_catch_signals+0x1c5
   sleepq_timedwait_sig+0x19 _sleep+0x2ca clnt_vc_call+0x763 
clnt_reconnect_call+0xfb newnfs_request+0xadb
   nfscl_request+0x72 nfsrpc_accessrpc+0x1df nfs34_access_otw+0x56 
nfs_access+0x306 vn_open_cred+0x5a8
   kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7 
 
 I am probably way off, since I am not familiar with this stuff, but it
 seems to me that thread_suspend_check() should just return 0 for the
 case where stop_allowed == SIG_STOP_NOT_ALLOWED (TDF_SBDRY flag set)
 instead of sitting in the loop and doing a mi_switch(). I'm not even
 sure if it should call thread_suspend_check() for this case, but there
 are cases in thread_suspend_check() that I don't understand.
 
 Although I don't really understand thread_suspend_check(), I've attached
 a simple patch that might be a starting point for fixing this?
 
 I wouldn't recommend trying the patch until kib and/or jhb weigh in
 on whether it makes any sense.

I think this is the right idea, but in HEAD with the sigdeferstop() changes it 
should just check for TDF_SBDRY instead of adding a new parameter.  I think
checking for TDF_SBDRY will work even in 9 (and will make the patch smaller).  
Also, I think this is only needed for stop signals.  Other suspend requests 
will eventually resume the thread; it is only stop signals that can cause the 
thread to get stuck indefinitely (since it depends on the user sending 
SIGCONT).

Marc, are you using SIGSTOP?
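
For reference, a small sketch of what "using SIGSTOP" means in practice: a process is paused with kill -STOP and stays stopped until something sends it SIGCONT. The sleep process here is only a stand-in for a real workload, and the one-second pauses are an assumption to give the signals time to be delivered.

```shell
# Start a throwaway process to act on.
sleep 30 &
pid=$!

kill -STOP "$pid"            # pause it; it will not run again on its own
sleep 1
stopped_state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "after SIGSTOP: $stopped_state"   # state begins with T while stopped

kill -CONT "$pid"            # the resume that a stopped process depends on
sleep 1
resumed_state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "after SIGCONT: $resumed_state"

kill "$pid"                  # clean up the demo process
```

Processes that are only ever sent 'kill -HUP' or 'kill -9' never go through this stop/continue cycle.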

Index: kern_thread.c
===
--- kern_thread.c   (revision 246122)
+++ kern_thread.c   (working copy)
@@ -795,6 +795,17 @@ thread_suspend_check(int return_instead)
 			return (ERESTART);
 
 		/*
+		 * Ignore suspend requests for stop signals if they
+		 * are deferred.
+		 */
+		if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
+		    td->td_flags & TDF_SBDRY) {
+			KASSERT(return_instead,
+			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
+			return (0);
+		}
+
+		/*
 		 * If the process is waiting for us to exit,
 		 * this thread should just suicide.
 		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.

-- 
John Baldwin

Re: FreeBSD-9.1 would not boot on pentium3 laptop

2013-02-15 Thread John Baldwin

On Thursday, February 07, 2013 2:25:17 pm John Baldwin wrote:
 On Thursday, February 07, 2013 1:28:30 pm Mikhail T. wrote:
  On 07.02.2013 13:16, John Baldwin wrote:
   Can you get pciconf -lc output?
  Here:
  
  hostb0@pci0:0:0:0:  class=0x06 card=0x
  chip=0x11308086 rev=0x02 hdr=0x00
   cap 09[88] = vendor (length 4) Intel cap 15 version 1
   cap 02[a0] = AGP 4x 2x 1x SBA disabled
 
 Looks like you have one of the systems the comment mentions.  Try this patch 
 to see if ichss is disabled automatically for you:

Were you able to test this patch?
 
 Index: ichss.c
 ===
 --- ichss.c   (revision 246122)
 +++ ichss.c   (working copy)
 @@ -67,7 +67,7 @@ struct ichss_softc {
  #define PCI_DEV_82801BA  0x244c /* ICH2M */
  #define PCI_DEV_82801CA  0x248c /* ICH3M */
  #define PCI_DEV_82801DB  0x24cc /* ICH4M */
 -#define PCI_DEV_82815BA  0x1130 /* Unsupported/buggy part */
 +#define PCI_DEV_82815_MC 0x1130 /* Unsupported/buggy part */
  
  /* PCI config registers for finding PMBASE and enabling SpeedStep. */
  #define ICHSS_PMBASE_OFFSET  0x40
 @@ -155,9 +155,6 @@ ichss_identify(driver_t *driver, device_t parent)
* E.g. see Section 6.1 PCI Devices and Functions and table 6.1 of
* Intel(r) 82801BA I/O Controller Hub 2 (ICH2) and Intel(r) 82801BAM
* I/O Controller Hub 2 Mobile (ICH2-M).
 -  *
 -  * TODO: add a quirk to disable if we see the 82815_MC along
 -  * with the 82801BA and revision < 5.
*/
   ich_device = pci_find_bsf(0, 0x1f, 0);
   if (ich_device == NULL ||
 @@ -167,6 +164,22 @@ ichss_identify(driver_t *driver, device_t parent)
   pci_get_device(ich_device) != PCI_DEV_82801DB))
   return;
  
 + /*
 +  * Certain systems with ICH2 and an Intel 82815_MC host bridge
 +  * where the host bridge's revision is < 5 lockup if SpeedStep
 +  * is used.
 +  */
 + if (pci_get_device(ich_device) == PCI_DEV_82801BA) {
 + device_t hostb;
 +
 + hostb = pci_find_bsf(0, 0, 0);
 + if (hostb != NULL &&
 + pci_get_vendor(hostb) == PCI_VENDOR_INTEL &&
 + pci_get_device(hostb) == PCI_DEV_82815_MC &&
 + pci_get_revid(hostb) < 5)
 + return;
 + }
 +
   /* Find the PMBASE register from our PCI config header. */
   pmbase = pci_read_config(ich_device, ICHSS_PMBASE_OFFSET,
   sizeof(pmbase));
 

-- 
John Baldwin

Re: 9-STABLE - NFS - NetAPP:

2013-02-15 Thread John Baldwin

On Friday, February 15, 2013 10:21:11 am Rick Macklem wrote:
 Konstantin Belousov wrote:
  On Fri, Feb 15, 2013 at 08:44:43AM -0500, John Baldwin wrote:
   On Thursday, February 14, 2013 10:05:56 pm Rick Macklem wrote:
Marc Fournier wrote:
 On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca
 wrote:

 
  The pid that is in T state for the ps auxlH.

 Different server, last kernel update on Jan 22nd, https process
 this
 time instead of du last time.

 I've attached:

 ps auxlH
 ps auxlH of just the processes that are in TJ state (6 httpd
 servers)
 procstat output for each of the 6 process




 They are included as attachments … if these don't make it
 through, let
 me know, just figured I'd try and keep it compact ...
Well, I've looked at this call path a little closer:
16693 104135 httpd - mi_switch+0x186
   thread_suspend_check+0x19f sleepq_catch_signals+0x1c5
  sleepq_timedwait_sig+0x19 _sleep+0x2ca clnt_vc_call+0x763
   clnt_reconnect_call+0xfb newnfs_request+0xadb
  nfscl_request+0x72 nfsrpc_accessrpc+0x1df nfs34_access_otw+0x56
   nfs_access+0x306 vn_open_cred+0x5a8
  kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
   
I am probably way off, since I am not familiar with this stuff,
but it
seems to me that thread_suspend_check() should just return 0 for
the
case where stop_allowed == SIG_STOP_NOT_ALLOWED (TDF_SBDRY flag
set)
instead of sitting in the loop and doing a mi_switch(). I'm not
even
sure if it should call thread_suspend_check() for this case, but
there
are cases in thread_suspend_check() that I don't understand.
   
Although I don't really understand thread_suspend_check(), I've
attached
a simple patch that might be a starting point for fixing this?
   
I wouldn't recommend trying the patch until kib and/or jhb weigh
in
on whether it makes any sense.
  
   I think this is the right idea, but in HEAD with the sigdeferstop()
   changes it
   should just check for TDF_SBDRY instead of adding a new parameter. I
   think
   checking for TDF_SBDRY will work even in 9 (and will make the patch
   smaller).
   Also, I think this is only needed for stop signals. Other suspend
   requests
   will eventually resume the thread, it is only stop signals that can
   cause the
   thread to get stuck indefinitely (since it depends on the user
   sending
   SIGCONT).
  
   Marc, are you using SIGSTOP?
  
   Index: kern_thread.c
   ===
   --- kern_thread.c (revision 246122)
   +++ kern_thread.c (working copy)
   @@ -795,6 +795,17 @@ thread_suspend_check(int return_instead)
 return (ERESTART);
  
 /*
   + * Ignore suspend requests for stop signals if they
   + * are deferred.
   + */
   + if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
   + td->td_flags & TDF_SBDRY) {
   + KASSERT(return_instead,
   + ("TDF_SBDRY set for unsafe thread_suspend_check"));
   + return (0);
   + }
   +
   + /*
  * If the process is waiting for us to exit,
  * this thread should just suicide.
  * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.
  
  This looks correct.
 Righto. Thanks jhb and kib for looking at this.
 
 Btw John, PBDRY still gets set for sleeps in the sys/rpc code. However,
 as far as I can tell, it just sets TDF_SBDRY when it is already set
 and seems harmless. (Since this code is supposed to be generic and not
 specific to NFS, maybe it should stay that way?)

In HEAD PBDRY is now a nop and the existing sigdeferstop() stuff should
cover the calls in sys/rpc.

 Also, since PBDRY on the sleeps sets TDF_SBDRY, I think the above patch
 is ok for stable/9 without your recent head patch.

Yep, exactly.

 Thanks everyone for your help, rick

Thanks for your debugging!

-- 
John Baldwin

Re: 9.1-RELEASE AMD64 crash under VBox 4.2.6 when IO APIC is disabled

2013-02-14 Thread John Baldwin

On Wednesday, February 13, 2013 6:56:06 pm CeDeROM wrote:
 On Wed, Feb 13, 2013 at 4:48 PM, John Baldwin j...@freebsd.org wrote:
  The simple answer that I have deduced is that APIC is MANDATORY for
  AMD64 machines and they won't run otherwise? This is why generic AMD64
  install fails when no APIC is enabled in the VBox?
 
  No, it is not quite like that.  x86 machines have two entirely different
  sets of interrupt controllers. (...)
 
 Hello John :-) Things now are more clear to me, thank you for your
 extensive explanation!! :-) I am wondering in that case if it wouldn't
 be a good idea to put atpic (old x86 IRQ handler) in the GENERIC
 configuration, or at least in the default installer kernel, so it is a
 safe fallback for a AMD64 machines with no APIC support, as for
 example VBox with APIC disabled..? Is atpic removed on purpose so it
 enforces use of new APIC and so better performance?

Real hardware should always use device apic on amd64.  Even for a VM you
should prefer apic.  That is, I think you should just enable APIC when
using VBox.
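
A hedged sketch of doing that from the command line, assuming the VBoxManage CLI is installed and the VM is powered off. The helper name enable_ioapic and the VBOXMANAGE override variable are illustrative additions, not part of VirtualBox; the override only exists so the command can be previewed with echo.

```shell
# enable_ioapic VMNAME
# Turns on the I/O APIC for the named VirtualBox VM.
# Set VBOXMANAGE=echo to print the command instead of running it.
enable_ioapic() {
    vm=$1
    ${VBOXMANAGE:-VBoxManage} modifyvm "$vm" --ioapic on
}
```

The same setting is available in the VirtualBox GUI under the VM's system settings.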

-- 
John Baldwin

Re: 9.1-RELEASE AMD64 crash under VBox 4.2.6 when IO APIC is disabled

2013-02-13 Thread John Baldwin

On Monday, February 11, 2013 4:34:37 pm CeDeROM wrote:
 On Mon, Feb 11, 2013 at 10:06 PM, John Baldwin j...@freebsd.org wrote:
  On Sunday, February 10, 2013 1:16:16 pm CeDeROM wrote:
  Hey :-) I have just noticed that booting installation media for
  FreeBSD 9.1-RELEASE AMD64 from ISO bootonly under VirtualBox 4.2.6
  results in a kernel panic both when ACPI is enabled and disabled in
 
  You will need to add 'device atpic' to your kernel config and build a 
custom
  kernel.  All real amd64-capable hardware has APICs.
 
 Hello John :-) Thank you for your reply, still I need some more
 information to understand why this happens :-)
 
 The simple answer that I have deduced is that APIC is MANDATORY for
 AMD64 machines and they won't run otherwise? This is why generic AMD64
 install fails when no APIC is enabled in the VBox?

No, it is not quite like that.  x86 machines have two entirely different
sets of interrupt controllers.  Old i386 machines only had a pair of 8259A
controllers (this is what 'device atpic' manages), and i386 kernels assume
they are always present (see sys/i386/conf/DEFAULTS).  When Intel added
SMP support to i386 machines starting with the 486 and Pentium they added
a new set of interrupt controllers called APICs (both I/O APICs to manage
device interrupts ala the 8259As and on-CPU APICs on Pentium and later called
local APICs).  device apic enables use of APICs.  The code to manage these
is actually shared between i386 and amd64 and any x86 kernel can use one or
the other of these _if_ the relevant driver is compiled in.  On i386
'device atpic' is enabled by default (via DEFAULTS) and 'device apic' is
enabled in GENERIC, so i386 kernels will work with both out of the box.  On 
amd64, 'device atpic' is not enabled by default (not in GENERIC), but
'device apic' is mandated to be on (it's not even an option, just always
compiled in).  So GENERIC on amd64 only supports 'device apic' by default.
You can use 'device atpic' on amd64 if you really want to, but APICs are
more efficient and required for using multiple CPUs, so unless you are
working around a specific hardware bug (or writing a hypervisor where
you haven't implemented APIC emulation yet), you should prefer APIC.

-- 
John Baldwin

Re: 9.1-RELEASE AMD64 crash under VBox 4.2.6 when IO APIC is disabled

2013-02-11 Thread John Baldwin

On Sunday, February 10, 2013 1:16:16 pm CeDeROM wrote:
 Hey :-) I have just noticed that booting installation media for
 FreeBSD 9.1-RELEASE AMD64 from ISO bootonly under VirtualBox 4.2.6
 results in a kernel panic both when ACPI is enabled and disabled in
 the boot dialog screen (seems different cause of crash), when IO APIC
 is disabled in VBox (which is a default). I thought AMD64 is not
 related to APIC..?
 Best regards :-)
 Tomek

You will need to add 'device atpic' to your kernel config and build a custom 
kernel.  All real amd64-capable hardware has APICs.

-- 
John Baldwin

Re: FreeBSD-9.1 would not boot on pentium3 laptop

2013-02-07 Thread John Baldwin

On Wednesday, February 06, 2013 1:24:57 am Mikhail T. wrote:
 On 05.02.2013 23:38, Mikhail T. wrote:
  What happened between 6.x and 7.x?
 Ok, what happened is that device cpufreq is now in GENERIC and the 
 ichss0 along with it.
 
 Setting
 
 set hint.ichss.0.disabled=1
 
 on the loader prompt allows me to boot -- both my own kernel as well as 
 the 9.1-RELEASE from CD. Solved... Annoying beyond belief, but solved.

I wonder if your system falls into this:

	/*
	 * ICH2/3/4-M I/O Controller Hub is at bus 0, slot 1F, function 0.
	 * E.g. see Section 6.1 PCI Devices and Functions and table 6.1 of
	 * Intel(r) 82801BA I/O Controller Hub 2 (ICH2) and Intel(r) 82801BAM
	 * I/O Controller Hub 2 Mobile (ICH2-M).
	 *
	 * TODO: add a quirk to disable if we see the 82815_MC along
	 * with the 82801BA and revision < 5.
	 */
	ich_device = pci_find_bsf(0, 0x1f, 0);
	if (ich_device == NULL ||
	    pci_get_vendor(ich_device) != PCI_VENDOR_INTEL ||
	    (pci_get_device(ich_device) != PCI_DEV_82801BA &&
	    pci_get_device(ich_device) != PCI_DEV_82801CA &&
	    pci_get_device(ich_device) != PCI_DEV_82801DB))
		return;

Can you get pciconf -lc output?

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: FreeBSD-9.1 would not boot on pentium3 laptop

2013-02-07 Thread John Baldwin

On Thursday, February 07, 2013 1:28:30 pm Mikhail T. wrote:
 On 07.02.2013 13:16, John Baldwin wrote:
  Can you get pciconf -lc output?
 Here:
 
 hostb0@pci0:0:0:0:  class=0x06 card=0x
 chip=0x11308086 rev=0x02 hdr=0x00
  cap 09[88] = vendor (length 4) Intel cap 15 version 1
  cap 02[a0] = AGP 4x 2x 1x SBA disabled

Looks like you have one of the systems the comment mentions.  Try this patch 
to see if ichss is disabled automatically for you:

Index: ichss.c
===
--- ichss.c (revision 246122)
+++ ichss.c (working copy)
@@ -67,7 +67,7 @@ struct ichss_softc {
 #define PCI_DEV_82801BA		0x244c /* ICH2M */
 #define PCI_DEV_82801CA		0x248c /* ICH3M */
 #define PCI_DEV_82801DB		0x24cc /* ICH4M */
-#define PCI_DEV_82815BA		0x1130 /* Unsupported/buggy part */
+#define PCI_DEV_82815_MC	0x1130 /* Unsupported/buggy part */
 
 /* PCI config registers for finding PMBASE and enabling SpeedStep. */
 #define ICHSS_PMBASE_OFFSET	0x40
@@ -155,9 +155,6 @@ ichss_identify(driver_t *driver, device_t parent)
 	 * E.g. see Section 6.1 PCI Devices and Functions and table 6.1 of
 	 * Intel(r) 82801BA I/O Controller Hub 2 (ICH2) and Intel(r) 82801BAM
 	 * I/O Controller Hub 2 Mobile (ICH2-M).
-	 *
-	 * TODO: add a quirk to disable if we see the 82815_MC along
-	 * with the 82801BA and revision < 5.
 	 */
 	ich_device = pci_find_bsf(0, 0x1f, 0);
 	if (ich_device == NULL ||
@@ -167,6 +164,22 @@ ichss_identify(driver_t *driver, device_t parent)
 	    pci_get_device(ich_device) != PCI_DEV_82801DB))
 		return;
 
+	/*
+	 * Certain systems with ICH2 and an Intel 82815_MC host bridge
+	 * where the host bridge's revision is < 5 lockup if SpeedStep
+	 * is used.
+	 */
+	if (pci_get_device(ich_device) == PCI_DEV_82801BA) {
+		device_t hostb;
+
+		hostb = pci_find_bsf(0, 0, 0);
+		if (hostb != NULL &&
+		    pci_get_vendor(hostb) == PCI_VENDOR_INTEL &&
+		    pci_get_device(hostb) == PCI_DEV_82815_MC &&
+		    pci_get_revid(hostb) < 5)
+			return;
+	}
+
 	/* Find the PMBASE register from our PCI config header. */
 	pmbase = pci_read_config(ich_device, ICHSS_PMBASE_OFFSET,
 	    sizeof(pmbase));

-- 
John Baldwin

Re: problems with the mfi

2013-02-05 Thread John Baldwin

On Tuesday, February 05, 2013 3:48:28 am Daniel Braniss wrote:
 after rebooting I get very often:
 ...
 mfi0: COMMAND 0xff800132d990 TIMEOUT AFTER 659 SECONDS
 mfi0: COMMAND 0xff800132d990 TIMEOUT AFTER 689 SECONDS
 mfi0: COMMAND 0xff800132d990 TIMEOUT AFTER 719 SECONDS
 ...
 
 another reboot usually fixes this.

Does it have the latest firmware?

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: bge numbering

2013-01-25 Thread John Baldwin

On Friday, January 25, 2013 3:46:10 am Daniel Braniss wrote:
 Hi,
 this server, a Dell R720 has 4 bge on board,
  Broadcom NetXtreme Gigabit Ethernet, ASIC rev. 0x572
  bge0: APE FW version: NCSI v1.1.7.0
  bge0: CHIP ID 0x0572; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E
  miibus0: MII bus on bge0
  ...
 
 I have connected the ethernet to port labeled 0, but it appears
 as bge2, how can this be corrected?

It can't really.  The order of PCI devices is determined by the layout of the 
PCI device hierarchy which is generally determined by the physical traces on 
your motherboard.

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: 9-STABLE - NFS - NetAPP:

2013-01-20 Thread John Baldwin

On Sunday, January 20, 2013 01:10:29 AM Hub- Marketing wrote:
 On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote:
  On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
  I'm running a few servers sitting on top of a NetAPP file server …
  everything runs great, but periodically I'm getting:
  
  nfs_getpages: error 13
  vm_fault: pager read error, pid 11355 (https)
  
  Are you using interruptible mounts (intr mount option)?
 
 192.168.1.253:/vol/vol1 /vm nfs rw,intr,soft,nolockd  0
   0
 
 I just added the 'soft' option to the mix … nolockd is enabled since I know
 for a fact that its not possible for two processes to access the same file
 on both mounts at the same time …

Ah, ok.  I just fixed a bug with interruptible mounts in HEAD where having
a signal interrupt an NFS request returns EACCES (13) rather than EINTR.
You should retest with that fix applied.

-- 
John Baldwin

Re: Failed to attach P_CNT - FreeBSD 9.1 RC3

2013-01-19 Thread John Baldwin

On Sunday, November 04, 2012 05:56:33 AM Shiv. Nath wrote:
 Dear FreeBSD Community Friends,
 
 It is FreeBSD 9.1 RC3, i get the following warning in the message log
 file. i need assistance to understand the meaning of this error, how
 serious is it?
 
 acpi_throttle23: failed to attach P_CNT

On newer CPUs that use est you don't want to use acpi_throttle anyway, so you 
can ignore the errors.  (est gives you power savings when it lowers your CPU 
speed; acpi_throttle generally does not, it only helps with lowering the 
temperature.)

-- 
John Baldwin

Re: Startup lapic messages

2013-01-19 Thread John Baldwin

On Tuesday, December 18, 2012 06:28:25 AM S.N.Grigoriev wrote:
 Hi list,
 
 I've installed FreeBSD 9.1R amd64 on a new Intel server.
 The following lapic messages appear during system startup:
 
 lapic18: Forcing LINT1 to edge trigger
 SMP: AP CPU #2 Launched!
 lapic50: Forcing LINT1 to edge trigger
 SMP: AP CPU #6 Launched!
 lapic20: Forcing LINT1 to edge trigger
 SMP: AP CPU #3 Launched!
 lapic32: Forcing LINT1 to edge trigger
 SMP: AP CPU #4 Launched!
 lapic2: Forcing LINT1 to edge trigger
 SMP: AP CPU #1 Launched!
 lapic34: Forcing LINT1 to edge trigger
 SMP: AP CPU #5 Launched!
 lapic52: Forcing LINT1 to edge trigger
 SMP: AP CPU #7 Launched!
 
 I've never seen such messages in the past.
 Does it mean I have some hardware problem or misconfiguration?

Your BIOS is slightly buggy, but in a harmless way.  You can ignore these.

-- 
John Baldwin

Re: 9-STABLE - NFS - NetAPP:

2013-01-19 Thread John Baldwin

On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
 I'm running a few servers sitting on top of a NetAPP file server …
 everything runs great, but periodically I'm getting:
 
 nfs_getpages: error 13
 vm_fault: pager read error, pid 11355 (https)

Are you using interruptible mounts (intr mount option)?

Also, can you get ps output that includes the 'l' flag to show what
the processes are stuck on?

-- 
John Baldwin

Re: Failsafe on kernel panic

2013-01-17 Thread John Baldwin

On Wednesday, January 16, 2013 4:27:53 pm Sami Halabi wrote:
 Thank you for your response, very helpful.
 One question: how do I configure auto-reboot when a kernel panic occurs?

Unless you've added DDB and KDB to your kernel it will reboot by default
on a panic.  Stable kernel configs also include the unattended option so
that even with the debugger present they reboot by default on a panic.

-- 
John Baldwin

Re: Failsafe on kernel panic

2013-01-16 Thread John Baldwin

On Wednesday, January 16, 2013 2:25:33 pm Sami Halabi wrote:
 Hi everyone,
 I have a production box on which I want to install a new kernel without any
 remote KVM.
 My problem is that it's 2 hours away, and if a kernel panic occurs I have a
 problem.
 I wonder if I can set up a failsafe script to load the old kernel in case of
 a panic.

man nextboot (if you are using UFS)

-- 
John Baldwin

Re: Possible to reset PCI device at boot? [Was: Re: msi-x enabled igb works only if module loaded twice]

2012-10-23 Thread John Baldwin

On Tuesday, October 23, 2012 8:40:44 am Harald Schmalzbauer wrote:
  Harald Schmalzbauer wrote on 23.10.2012 11:49 (localtime):
   Harald Schmalzbauer wrote on 22.10.2012 21:48 (localtime):
   Harald Schmalzbauer wrote on 22.10.2012 21:33 (localtime):
   Hello,
 
  when using igb as a module, no packets are received.
  If I send anything out, I see the packet with tcpdump, and the switch
  learns the MAC address, but nothing comes back in: total silence, no
  broadcasts, nothing.
  If I unload the module and load it again, everything works as expected!
  No matter whether I load it from the 4th loader or later, I always have to
  unload it first and then load it again.
  It's late here; I'll see tomorrow whether things change when it is compiled
  into the kernel.
  It doesn't matter if igb is loaded as a module or compiled into the kernel.
 
  Maybe somebody has an idea what the source of the problem could be.
  Please find attached some info; the OS is 9-RC2-amd64 on ESXi5.1 and the
  NICs are PCI passthrough.
  I found one possibly relevant difference:
 
  Non-working state: dev.igb.0.link_irq: 0
  Working state:     dev.igb.0.link_irq: 2
  This is only true with MSI-X!!!
  If I disable MSI-X, the problem itself vanishes. igb just works fine
  from the initial loading (with dev.igb.0.link_irq=0!).
  So dev.igb.0.link_irq is only relevant with MSI-X.
  But what makes me curious is why it also works with MSI-X enabled after
  the second kldload!?!
  
 I think I found the root cause:
 When ESXi powers up the guest, the passthru-devices are initialized with:
 VMKPCIPassthru: 2565: BDF = 02:00.1 intrType = 2 numVectors: 1
 intrType=2 seems to mean MSI.
 I guess, IOMMUIntel is instructed to remap one irq-vector for the
 device. But igb uses MSI-X and wants 3 vectors.
 If I unload if_igb and reload again, the ESXi-log shows the following:
 VMKPCIPassthru: 2565: BDF = 02:00.1 intrType = 4 numVectors: 3
 intrType=4 seems to mean MSI-X.
 After that initialization, if_igb works fine and saves 25kIRQ/s!
 
 I haven't found a way to change the power-up behaviour for the guest
 with ESXi.
 
 Is it possible to re-init a pci device from userland?

The problem is you want the igb driver to retry MSI-X even after a re-init
and that basically requires a full detach/attach, so your existing workaround
is actually the best way to do this. :(  Alternatively, you could try
forcing igb to not use MSI, only use either MSI-X or INTx.

-- 
John Baldwin

Re: ${CTFCONVERT_CMD} expands to empty string

2012-10-22 Thread John Baldwin

On Sunday, October 21, 2012 8:25:32 pm Andrey Chernov wrote:
 Those lines cause this error:
 .if ${MK_CTF} != "no"
 CTFCONVERT_CMD= ${CTFCONVERT} ${CTFFLAGS} ${.TARGET}
 .elif ${MAKE_VERSION} >= 5201111300
 CTFCONVERT_CMD=
 .else
 CTFCONVERT_CMD= @:
 .endif
 
 My make version is 9201206140
 So, either the check for >= 5201111300 is incorrect or the change for empty
 make variable expansion is not merged into stable-9

I can't reproduce this doing a buildworld of a stable/9 checkout on a
9.0-stable machine btw.  What exact contents of /etc/src.conf and commands are
you using to reproduce this?

I also can't find the string "empty string" in the output of my stable/9
'make universe' build before I committed this.

-- 
John Baldwin

Re: ${CTFCONVERT_CMD} expands to empty string

2012-10-22 Thread John Baldwin

On Monday, October 22, 2012 1:01:53 pm Andrey Chernov wrote:
 All that happens because this commit is not merged into stable-9.
 Do you plan to merge it yourself?
 
 r228157 | fjoe | 2011-11-30 22:07:38 +0400 (Wed, 30 Nov 2011) | 10 lines
 
 - Fix segmentation fault when running +command when run with -jX -n due
 to Compat_RunCommand() being called with `cmd' that is not on the
 node->commands list
 - Make ellipsis (... command) handling consistent: check for ... command
 in job make after variables expansion to match compat make behavior
 - Fix empty command handling (after variables expansion and @+- modifiers
 are processed): now empty commands are ignored in compat make and are not
 printed in job make case
 - Bump MAKE_VERSION to 5-2011-11-30-0

As soon as I can reproduce something that tests it, sure (I want to have a 
test case I can reproduce so that I can also check for 8).  Your test
Makefile does break on 8 and 9, want to do some more tests.

 On 22.10.2012 20:45, Andrey Chernov wrote:
  And a simple test case proving that make v9201206140 dislikes empty commands.
  Makefile:
  
  CTFCONVERT_CMD=
  all:
  echo ${MAKE_VERSION}
  ${CTFCONVERT_CMD}
  echo b
  
  make
  echo 9201206140
  9201206140
  ${CTFCONVERT_CMD} expands to empty string
  echo b
  b
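
The version arithmetic behind this failure can be sketched in C. This is only
an illustration: it assumes the check in the .mk files is a plain numeric
comparison against the MAKE_VERSION that r228157 introduced (5-2011-11-30-0,
taken here as 5201111300), and the function name is hypothetical. Both values
exceed 32 bits, hence int64_t.

```c
#include <stdint.h>

/* 5-2011-11-30-0 written as one number; an assumed encoding of the bump. */
#define	EMPTY_CMD_MAKE_VERSION	INT64_C(5201111300)

/*
 * Nonzero when a numeric "MAKE_VERSION >= threshold" test would conclude
 * that this make tolerates empty commands.
 */
int
claims_empty_cmd_support(int64_t make_version)
{
	return (make_version >= EMPTY_CMD_MAKE_VERSION);
}
```

With the reported 9201206140 the comparison passes, so CTFCONVERT_CMD ends up
empty even though this make still complains about empty commands.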

-- 
John Baldwin

Re: ${CTFCONVERT_CMD} expands to empty string

2012-10-20 Thread John Baldwin

On Friday, October 19, 2012 09:06:55 PM Andrey Chernov wrote:
 On recent -stable I now get a lot of these (see subject) due to CTF changes
 in *.mk files.
 I have
 WITHOUT_CDDL=yes
 in my /etc/src.conf, and WITHOUT_CDDL has a wider scope than the suggested
 WITHOUT_CTF, but WITHOUT_CDDL is not checked in the recent CTF changes.
 Please fix this.

Which stable?

-- 
John Baldwin

Re: FreeBSD 9.1-RC2 Available...

2012-10-19 Thread John Baldwin

On Friday, October 19, 2012 2:26:45 pm Alex de Joode wrote:
 https://sabotage.org/FBSD/FBSD-9.1RC2.jpg
 
 Screen shot. Basically the only difference between the two R210s is the disks:
 the one with 2x2TB works, and the one with 2x1TB fails with the above error.
 
 Both are software mirrored. No hardware RAID, and AHCI SATA settings.

Hummm, somehow we are executing data, not code:

  8c 39 00 00 01 82 44 45  4c 4c 20 20 50 45 5f 53  |.9....DELL  PE_S|

That isn't a valid instruction. :(

Also, your eip value is not anything that would be normal.

Actually, your eip value looks like a pointer into the BIOS (0xf000:bf6a).

I bet something in your BIOS had a buffer overrun and trashed the stack or
some such.  Or it overran an I/O buffer which trashed the return stack of
the userland process somehow.
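
Both observations are easy to check by hand. A small sketch, assuming ordinary
userland C (the function names are made up for illustration), that renders the
dumped bytes as text and converts a real-mode segment:offset pair to a linear
address:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Render bytes the way hexdump -C does in its right-hand column. */
void
ascii_column(const unsigned char *bytes, size_t len, char *out)
{
	size_t i;

	for (i = 0; i < len; i++)
		out[i] = isprint(bytes[i]) ? (char)bytes[i] : '.';
	out[len] = '\0';	/* caller provides len + 1 bytes */
}

/* Real-mode segment:offset to linear address: segment * 16 + offset. */
uint32_t
real_mode_linear(uint16_t segment, uint16_t offset)
{
	return (((uint32_t)segment << 4) + offset);
}

/* The legacy PC system BIOS is conventionally mapped at 0xF0000-0xFFFFF. */
int
in_bios_rom(uint32_t linear)
{
	return (linear >= 0xF0000 && linear <= 0xFFFFF);
}
```

Fed the sixteen bytes from the dump, ascii_column() yields ".9....DELL  PE_S",
which is text rather than code, and real_mode_linear(0xf000, 0xbf6a) gives
0xfbf6a, which falls inside the BIOS range.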

-- 
John Baldwin

Re: mpt irq timeout problem after reboot - only if non-verbose booting !?!

2012-10-18 Thread John Baldwin

On Wednesday, October 17, 2012 3:14:52 pm Harald Schmalzbauer (mobil) wrote:
 -Original Message-
  From: John Baldwin j...@freebsd.org
  To: freebsd-stable@freebsd.org
  Cc: h.schmalzba...@omnilan.de
  Sent: 17.10.'12,  20:46
  
  On Tuesday, October 16, 2012 5:24:44 am Harald Schmalzbauer wrote:
   Hello,
  
  I have 9.1-RC2 running in an ESXi 5.1 guest.
  I use 'lsisas' as virtual SCSI-Controller and mpt attaches and finds 1068E.
  
  Everything is working fine until the first 'shutdown -r now':
  The second boot pauses for ~2 minutes after probing disks and continues
  with this error:
  mpt0: Timedout requests already complete. Interrupts may not be functioning.
  
  To be clear, you only see this at the end of reboot, and the hardware is 
  fine
  once the machine is back up?
 .
 
 Thanks for your attention!
 The timeout occurs after the first 'shutdown -r', during device probing in the
 second boot.  Perhaps this is amd64 specific. Today I had a new i386
 setup which doesn't exhibit this timeout. But it's on different hardware and
 the hv-host was 5.0 instead of 5.1. So not really representative...

Hmmm, ok.  In that case my patch is not relevant.  It would only fix that
message occurring during the shutdown.

-- 
John Baldwin

Re: mpt irq timeout problem after reboot - only if non-verbose booting !?!

2012-10-17 Thread John Baldwin

, mpt_vol,
"mpt_verify_mwce: Get request failed!\n");
@@ -965,7 +966,7 @@ static void
rv = mpt_issue_raid_req(mpt, mpt_vol, /*disk*/NULL, req,
MPI_RAID_ACTION_CHANGE_VOLUME_SETTINGS,
data, /*addr*/0, /*len*/0,
-   /*write*/FALSE, /*wait*/TRUE);
+   /*write*/FALSE, /*wait*/TRUE, sleep_ok);
if (rv == ETIMEDOUT) {
mpt_vol_prt(mpt, mpt_vol, "mpt_verify_mwce: "
"Write Cache Enable Timed-out\n");
@@ -1018,7 +1019,8 @@ mpt_verify_resync_rate(struct mpt_softc *mpt, stru
rv = mpt_issue_raid_req(mpt, mpt_vol, /*disk*/NULL, req,
MPI_RAID_ACTION_SET_RESYNC_RATE,
mpt->raid_resync_rate, /*addr*/0,
-   /*len*/0, /*write*/FALSE, /*wait*/TRUE);
+   /*len*/0, /*write*/FALSE, /*wait*/TRUE,
+   /*sleep_ok*/TRUE);
if (rv == ETIMEDOUT) {
mpt_vol_prt(mpt, mpt_vol, "mpt_refresh_raid_data: "
"Resync Rate Setting Timed-out\n");
@@ -1054,7 +1056,8 @@ mpt_verify_resync_rate(struct mpt_softc *mpt, stru
rv = mpt_issue_raid_req(mpt, mpt_vol, /*disk*/NULL, req,
MPI_RAID_ACTION_CHANGE_VOLUME_SETTINGS,
data, /*addr*/0, /*len*/0,
-   /*write*/FALSE, /*wait*/TRUE);
+   /*write*/FALSE, /*wait*/TRUE,
+   /*sleep_ok*/TRUE);
if (rv == ETIMEDOUT) {
mpt_vol_prt(mpt, mpt_vol, "mpt_refresh_raid_data: "
"Resync Rate Setting Timed-out\n");
@@ -1314,7 +1317,7 @@ mpt_refresh_raid_vol(struct mpt_softc *mpt, struct
return;
}
rv = mpt_issue_raid_req(mpt, mpt_vol, NULL, req,
-   MPI_RAID_ACTION_INDICATOR_STRUCT, 0, 0, 0, FALSE, TRUE);
+   MPI_RAID_ACTION_INDICATOR_STRUCT, 0, 0, 0, FALSE, TRUE, TRUE);
if (rv == ETIMEDOUT) {
mpt_vol_prt(mpt, mpt_vol,
"mpt_refresh_raid_vol: Progress Indicator fetch timeout\n");
@@ -1474,7 +1477,7 @@ mpt_refresh_raid_data(struct mpt_softc *mpt)
mpt_vol->flags |= MPT_RVF_UP2DATE;
mpt_vol_prt(mpt, mpt_vol, "%s - %s\n",
mpt_vol_type(mpt_vol), mpt_vol_state(mpt_vol));
-   mpt_verify_mwce(mpt, mpt_vol);
+   mpt_verify_mwce(mpt, mpt_vol, TRUE);
 
if (vol_pg->VolumeStatus.Flags == 0) {
continue;
@@ -1752,7 +1755,7 @@ mpt_raid_set_vol_mwce(struct mpt_softc *mpt, mpt_r
mpt_vol_prt(mpt, mpt_vol, "WARNING - Unsafe shutdown "
"detected.  Suggest full resync.\n");
}
-   mpt_verify_mwce(mpt, mpt_vol);
+   mpt_verify_mwce(mpt, mpt_vol, TRUE);
}
mpt->raid_mwce_set = 1;
MPT_UNLOCK(mpt);


-- 
John Baldwin

Re: FreeBSD 9.1-RC2 Available...

2012-10-12 Thread John Baldwin

On Thursday, October 11, 2012 3:49:51 am Sami Halabi wrote:
 Hi,
 
 there's a patch in the list you mentioned.
 it should go to rc3 i guess.

No, that patch would break all other interrupt config hooks like probes for 
SATA and SCSI disks and USB disks.

Some driver's config hook is not finishing.  Each driver's hook is responsible
for deregistering itself once it has finished its interrupt probing, which is
why it is not obvious how the list becomes empty.

-- 
John Baldwin

Re: stable/9 panic Bad tailq NEXT(0xffffffff80e52660->tqh_last) != NULL

2012-10-03 Thread John Baldwin

On Tuesday, October 02, 2012 7:12:59 pm Sean Bruno wrote:
 On Tue, 2012-10-02 at 14:06 -0700, John Baldwin wrote:
  On Tuesday, October 02, 2012 3:05:30 pm Sean Bruno wrote:
   On Mon, 2012-10-01 at 05:47 -0700, John Baldwin wrote:
    Can you add extra printfs to see where exactly attach is failing?  I
    would start with the attach routine in sys/dev/acpica/acpi_pcib_pci.c:


   
   hrm ... interesting side effects.  After adding my printf's I don't hit
   the panic any more.  :-)
   
   I changed the ret val of acpi_pcib_pci_attach() and put in some
   instrumentation in acpi_pcib_attach().  The key value is that
   acpi_DeviceIsPresent() appears to be returning FALSE in this case.
   
   patch used --> http://people.freebsd.org/~sbruno/acpi_pcib.txt
  
  What happens if you just comment out the acpi_DeviceIsPresent() check?
  
 
 
 wow, it booted up and seems to be fine.  huh ...
 pcib7: <ACPI PCI-PCI bridge> at device 28.0 on pci0
 pcib7:   domain0
 pcib7:   secondary bus 7
 pcib7:   subordinate bus   7
 pcib7:   no prefetched decode
 pci7: <ACPI PCI bus> on pcib7
 pci7: domain=0, physical bus=7

Is there anything on the bus?

-- 
John Baldwin

Re: panic Sleeping thread owns a non-sleepable lock via cv_timedwait_signal, was rsync over NFS

2012-10-02 Thread John Baldwin

On Tuesday, October 02, 2012 11:21:06 am Norbert Aschendorff wrote:
 I'll compile a kernel with
 
 options WITNESS
 options WITNESS_KDB
 
 ok? Or should I include WITNESS_SKIPSPIN too?

Yes, you should include WITNESS_SKIPSPIN.  We should probably make that the 
default.

-- 
John Baldwin

Re: stable/9 panic Bad tailq NEXT(0xffffffff80e52660->tqh_last) != NULL

2012-10-02 Thread John Baldwin

On Tuesday, October 02, 2012 3:05:30 pm Sean Bruno wrote:
 On Mon, 2012-10-01 at 05:47 -0700, John Baldwin wrote:
  Can you add extra printfs to see where exactly attach is failing?  I
  would start with the attach routine in sys/dev/acpica/acpi_pcib_pci.c:
  
  
 
 hrm ... interesting side effects.  After adding my printf's I don't hit
 the panic any more.  :-)
 
 I changed the ret val of acpi_pcib_pci_attach() and put in some
 instrumentation in acpi_pcib_attach().  The key value is that
 acpi_DeviceIsPresent() appears to be returning FALSE in this case.
 
 patch used --> http://people.freebsd.org/~sbruno/acpi_pcib.txt

What happens if you just comment out the acpi_DeviceIsPresent() check?

-- 
John Baldwin

Re: panic Sleeping thread owns a non-sleepable lock via cv_timedwait_signal, was rsync over NFS

2012-10-02 Thread John Baldwin

On Tuesday, October 02, 2012 2:19:35 pm Norbert Aschendorff wrote:
 Well...
 
 Here the results for a kernel without WITNESS_SKIPSPIN (I'll compile one
 including that tomorrow, but until then...)
 
 Good news is: The kernel crashed with activated WITNESS.
 Bad news is: I have to turn power off after the crash with WITNESS. The
 crash dump is _not_ written to disk :(
 
 Good news II is: It wrote something to the syslog. Actually, it wrote
 very much to the syslog, some megabytes in total. Most of it is the
 same, here the latest messages logfile:
 http://lbo.spheniscida.de/Files/nfs-crash.log (94K)
 
 It specifies the file, line and zone. Maybe it's useful...

That does help.  It tells us that the lock being held is a vnode interlock 
that was last acquired in vinactive().

I don't see how though, unless the lock was recursively acquired elsewhere.

You could try adding a different WITNESS check (using WITNESS_WARN) to see
which NFS proc returns with a lock held so you can catch this when it first 
occurs rather than much later after the fact.  Do you have the start of the 
log messages?

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: stable/9 panic Bad tailq NEXT(0xffffffff80e52660->tqh_last) != NULL

2012-10-01 Thread John Baldwin

On Thursday, September 27, 2012 4:53:49 pm Sean Bruno wrote:
 On Thu, 2012-09-27 at 10:52 -0700, Sean Bruno wrote:
   
pcib7: <ACPI PCI-PCI bridge> irq 19 at device 28.7 on pci0
panic: Bad tailq NEXT(0xffffffff80e52660->tqh_last) != NULL
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
panic() at panic+0x1d8
rman_init() at rman_init+0x17c
pcib_alloc_window() at pcib_alloc_window+0x9f
pcib_attach_common() at pcib_attach_common+0x457
acpi_pcib_pci_attach() at acpi_pcib_pci_attach+0x1c
device_attach() at device_attach+0x72
bus_generic_attach() at bus_generic_attach+0x1a
acpi_pci_attach() at acpi_pci_attach+0x164
device_attach() at device_attach+0x72
bus_generic_attach() at bus_generic_attach+0x1a
acpi_pcib_attach() at acpi_pcib_attach+0x1a7
acpi_pcib_acpi_attach() at acpi_pcib_acpi_attach+0x1f6
device_attach() at device_attach+0x72
bus_generic_attach() at bus_generic_attach+0x1a
acpi_attach() at acpi_attach+0xbc1
device_attach() at device_attach+0x72
bus_generic_attach() at bus_generic_attach+0x1a
nexus_acpi_attach() at nexus_acpi_attach+0x69
device_attach() at device_attach+0x72
bus_generic_new_pass() at bus_generic_new_pass+0xd6
bus_set_pass() at bus_set_pass+0x7a
configure() at configure+0xa
mi_startup() at mi_startup+0x77
btext() at btext+0x2c
Uptime: 1s
Automatic reboot in 15 seconds - press a key on the console to abort
-- Press a key on the console to reboot,
-- or switch off the system now.
   
   
   --
   Andriy Gapon
   
  
  resurrecting this thread from my sent items folder, not sure if 
  mailman will thread this correctly or not
  
  Anyway, after disabling the broken pci bridge via some hackery
  that jhb and eadler had lying around, I was able to get the r620 up 
  on the new BIOS and get an acpidump before and after the firmware update.
  
  I can poke at the machines, but I don't quite see in this nonsense where
  it breaks acpi_pcib_pci_attach().  Where should I start poking next?
  
  
  http://people.freebsd.org/~sbruno/acpi_112_r620.txt
  
  http://people.freebsd.org/~sbruno/acpi_126_r620.txt
  
  
 
 For fun, I added the pciconf output to see if there's anything obviously
 wrong with pcib7.  But, as usual, I have no idea how to interpret this.
 
 http://people.freebsd.org/~sbruno/r620_pciconf.txt

Can you add extra printfs to see where exactly attach is failing?  I would
start with the attach routine in sys/dev/acpica/acpi_pcib_pci.c:

static int
acpi_pcib_pci_attach(device_t dev)
{
    struct acpi_pcib_softc *sc;

    ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__);

    pcib_attach_common(dev);
    sc = device_get_softc(dev);
    sc->ap_handle = acpi_get_handle(dev);
    return (acpi_pcib_attach(dev, &sc->ap_prt, sc->ap_pcibsc.secbus));
}

Hmm, so that can only fail inside of acpi_pcib_attach() in
sys/dev/acpica/acpi_pcib.c.  I would add printfs to annotate that.

-- 
John Baldwin
