Re: Anyone else seeing hangs at "Trying to mount root" after recent commits?

2019-09-10 Thread Terry Kennedy
  Since I'm not the only one seeing this, I've opened PR 240487:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240487

    Terry Kennedy http://www.glaver.org  New York, NY USA
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Anyone else seeing hangs at "Trying to mount root" after recent commits?

2019-09-10 Thread Terry Kennedy
  I have a system I updated from r352025 (which worked fine) to r352200.
Any attempt to boot r352200 results in the system just sitting there after
displaying the normal "Trying to mount root from ufs:/dev/da0p3 [rw]..."
message. Nothing further, not even after a half hour. The console is non-
responsive (not that I'd really expect anything, but...).

  This persists across multiple resets, power cycles, etc. Booting the
previous r352025 works fine, as expected.

  Before I start trying to bisect this, is anyone else seeing this? amd64
on a standard Dell PowerEdge R730 if that matters.
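
  If it does come to a bisect, the first build to try would sit halfway
between the two revisions named above. A minimal sketch, assuming plain
integer revision numbers (the helper name bisect_mid is made up for
illustration, and not every number in the range is necessarily a commit on
the stable branch):

```shell
#!/bin/sh
# Hypothetical helper: midpoint between a known-good and a known-bad revision.
bisect_mid() {
    good=$1
    bad=$2
    echo $(( (good + bad) / 2 ))
}

# r352025 boots, r352200 hangs; the midpoint is the next revision to build.
bisect_mid 352025 352200
```

  If the midpoint boots it becomes the new "good" revision, otherwise the
new "bad" one, and the step repeats until the two are adjacent. For a range
of 175 revisions that is roughly eight builds.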

    Terry Kennedy http://www.glaver.org  New York, NY USA


Re: mpr causing a boot hang sometime after r348368

2019-08-31 Thread Terry Kennedy
  In case it wasn't clear from my original post, all of the successful
boots with the card in a full-height CPU 1 slot were with r351637.

    Terry Kennedy http://www.glaver.org  New York, NY USA


mpr causing a boot hang sometime after r348368 - NUMA related?

2019-08-31 Thread Terry Kennedy
  TL;DR - mpr controller becomes increasingly likely to hang boot when on
the 2nd CPU as FreeBSD 12.0-STABLE moves forward.

  I have a Dell PowerEdge R730 (configuration details available if needed)
with a PERC H730 mini (mrsas driver) and a "12Gbps external HBA", Dell part
number T93GD (mpr driver). An external Dell LTO4 drive is attached to the
external HBA and is the only thing connected to it.

  r348368 boots normally, and the HBA and tape are recognized as:

mpr0:  port 0x8000-0x80ff mem 0xc910-0xc910,0xc800-0xc80f irq 64 at device 0.0 numa-domain 1 on pci17
mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c
mpr0: Found device ,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1   )
sa0 at mpr0 bus 0 scbus14 target 7 lun 0

  The next revision I tried was r350268. That boots most of the time, but
sometimes hangs with various messages, not in any particular order, such
as (forgive any typos, I could only get these as screen grabs):

mpr_config_get_dpm_pg0: request for page completed with error 60
mpr0: Out of chain frames, consider increasing hw.mpr.max_chains
(probe0:mpr0:0:7:0): Down reving Protocol Version from 4 to 0?
mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60)
mpr0: Reinit success
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config

  This all happens whether or not the external tape drive is plugged into
the system (unplugged at the system end, so no dangling cables). 

  The problem goes away (with unacceptable loss of performance) if I boot
in safe mode. Setting hw.mpr.disable_msi=1 and hw.mpr.disable_msix=1 has
no effect.

  r350970 behaves in much the same way: it boots some of the time, but only
safe mode boots reliably.

  r351637 never seems to boot normally, but boots every time in safe mode.

  Dell has replaced the controller and the problem persists. Since it still
happens with the tape drive disconnected, I didn't have them replace the
drive and cable.

  The one thing I noted when Dell had the chassis open was that the slot
this card is in is labeled "CPU 2", which would seem to be confirmed by
the "numa-domain 1" in the working dmesg output. Unfortunately, all of the
low-profile slots in this chassis are on CPU 2, and the part number of my
card (and the Dell spare) is a low-profile-only card. I had the tech put
the card in one of the full-height CPU 1 slots (which involved removing
the card bracket and installing it "naked", which he wasn't comfortable
with). Lo and behold, it boots when the card is in numa-domain 0:

mpr0:  port 0x2000-0x20ff mem 0x9360-0x9360,0x9250-0x925f irq 32 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c
mpr0: Found device ,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1   )
sa0 at mpr0 bus 0 scbus2 target 7 lun 0

  I was able to do 4 consecutive working boots before the tech got antsy
and wanted to either put the card back in a low-profile slot or start the
meter for billable time.

  Based on this, it seems to be a timing-related issue when the mpr card
is on the 2nd CPU (and when SMP is enabled).
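
  For anyone wanting to check which domain their own controller landed in,
the attach line can be filtered out of dmesg. A rough sketch, assuming the
attach line is a single line in the real dmesg (it is only wrapped in this
mail) and using a made-up function name:

```shell
#!/bin/sh
# Hypothetical helper: print the numa-domain from a device's attach line.
# $1 is the device name (for example mpr0); dmesg text is read from stdin.
numa_domain_of() {
    sed -n "s/^$1: .* numa-domain \([0-9][0-9]*\).*/\1/p" | head -n 1
}

# The r348368 attach line from above, joined back into one line.
printf '%s\n' 'mpr0:  port 0x8000-0x80ff mem 0xc910-0xc910,0xc800-0xc80f irq 64 at device 0.0 numa-domain 1 on pci17' |
    numa_domain_of mpr0
```

  On a live system the input would come from "dmesg | numa_domain_of mpr0".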

  Any suggestions for further diagnostic information, other things to try,
or (preferably) "here, try this patch"?

Terry Kennedy http://www.glaver.org  New York, NY USA


Re: FreeBSD 11.1-R network slowness on Samba

2017-08-01 Thread Terry Kennedy
> Then I updated to 11.1-R by recompiling from svn, using the same
> kernelconfig from 10.3, and now my windows client shows timeouts and
> really slow connection. File copy never past kilobytes per second :(

  I think the issue is with the SAMBA port. I was running the samba36
port under FreeBSD 10.x and things were fine (500Mbyte/sec transfers
using 10GbE). At some point the samba36 port broke due to changes in
one or more of talloc/tdb/tevent and I tried various samba4x ports (I
am using samba44 now, samba46 doesn't seem to work with XP-type client
systems).

  Various directory traversal operations spike the CPU load up to 100%
and clients see very bursty behavior. A lot seems to depend on the 
application in use - my benchmark is clrmamepro:
https://mamedev.emulab.it/clrmamepro/

  But even things like Windows (7) Photo Viewer will just sit at the
"Loading..." message for a random length of time before displaying
the next picture.

  I am seeing easily a 10:1 performance degradation with any of the 
samba4x ports. I have tried large numbers of SAMBA config tuning
changes, different port build options, etc. without any success. I
"solved" the problem by using an old FreeBSD 8.4 box with a 10GbE
card as an NFS client to the storage server, and then exporting the
NFS-mounted storage to clients with samba36. This whole chain is
nearly as fast as the old samba36-on-storage-server setup.

  I may try resurrecting the samba36 port under FreeBSD 11.1 to see
if that has the performance I used to get. I'm not sure how hard
it will be to build samba36, though. Things have changed since the
port was retired.

Terry Kennedy http://www.glaver.org  New York, NY USA


Re: Trouble with SM961 in SuperMicro X11

2017-07-23 Thread Terry Kennedy
> I bought this card but never could get it to fail, despite trying them in a
> number of different systems :(. So lighten up please...

  They are obviously working for some people, but enough people are seeing
the same problem over and over that there is definitely something wrong.

  I spent a good deal of time gathering the requested traces for the
developer who was working on it with me, going so far as to purchase a
different board (same model, different firmware) and a second adapter card,
and tried all of the above in multiple systems. I then sent off the
requested info, reiterated my offer of remote access to one of the systems
showing the problem, and heard... nothing. A follow-up some months later
also got no response, while more and more people are running into the
issue and reporting it either on the lists or in the forums.

  My offer of a test system with the card is still open, if someone wants
to pick this up again. I will note that in my case, it only happens in my
Supermicro systems (but again, Linux works well with it on those boxes).
The SM961 in the same adapter works fine in a Dell system, and an Optane
card also works fine in the Supermicro system. 

    Terry Kennedy http://www.glaver.org  New York, NY USA


Re: Trouble with SM961 in SuperMicro X11

2017-07-23 Thread Terry Kennedy
> It's an SM961, not PM951.

  Welcome to the club! 8-{

  See PR 211713 - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713
as well as the forums: https://forums.freebsd.org/threads/58170/#post-334061

  I (and others) have offered developers remote console access to systems
that exhibit the problem, confirmed that the same hardware works under
Linux, gathered the requested traces and so on, and then things just
sort of... died.

  You should probably pile onto both the forum discussion and the PR with
a "me too!" so it becomes more and more obvious that this is affecting a
larger number of people as time goes on and these modules become more
popular.

    Terry Kennedy http://www.glaver.org  New York, NY USA


Re: 10.3-BETA2 regression in MPT

2016-02-20 Thread Terry Kennedy
> Can you build 10-STABLE and merge back the mpt driver prior to r285840 . It
> looks like a change was merged in about 7 weeks ago that
> has to do with probing the devices.
>
> https://svnweb.freebsd.org/base/stable/10/sys/dev/mpt/mpt.c?view=log

  Same behavior with or without that change reverted.

    Terry Kennedy http://www.glaver.org  New York, NY USA


Re: 10.3-BETA2 regression in MPT

2016-02-19 Thread Terry Kennedy
> Can you get the status of the controller and disks via mptutil ? Also 
> what does camcontrol devlist -v show ?

  8.4 (8-STABLE):

# mptutil show adapter
mpt0 Adapter:
       Board Name: SAS6IR
   Board Assembly:
        Chip Name: C1068E
    Chip Revision: UNUSED
      RAID Levels: RAID0, RAID1, RAID1E
    RAID0 Stripes: 64k
   RAID1E Stripes: 64k
 RAID0 Drives/Vol: 2-10
 RAID1 Drives/Vol: 2
RAID1E Drives/Vol: 3-10

# mptutil show drives
mpt0 Physical Drives:
   0 (  137G) ONLINE  SAS bus 0 id 1
   1 (  137G) ONLINE  SAS bus 0 id 9

# mptutil show volumes
mpt0 Volumes:
  Id     Size    Level   Stripe  State    Write-Cache  Name
   0 (  136G)    RAID-1          OPTIMAL  Enabled

# camcontrol devlist -v
scbus0 on mpt0 bus 0:
   at scbus0 target 0 lun 0 (da0,pass0)
at scbus0 target 8 lun 0 (ses0,pass1)
<> at scbus0 target -1 lun -1 ()
scbus1 on mpt0 bus 1:
   at scbus1 target 0 lun 0 (pass2)
<> at scbus1 target -1 lun -1 ()
scbus-1 on xpt0 bus 0:
<> at scbus-1 target -1 lun -1 (xpt0)

10.3-BETA2:

# mptutil show adapter
mpt0 Adapter:
       Board Name: SAS6IR
   Board Assembly:
        Chip Name: C1068E
    Chip Revision: UNUSED
      RAID Levels: RAID0, RAID1, RAID1E
    RAID0 Stripes: 64K
   RAID1E Stripes: 64K
 RAID0 Drives/Vol: 2-10
 RAID1 Drives/Vol: 2
RAID1E Drives/Vol: 3-10

# mptutil show drives
mpt0 Physical Drives:
   0 (  137G) ONLINE  SAS bus 0 id 1
   1 (  137G) ONLINE  SAS bus 0 id 9

# mptutil show volumes
mpt0 Volumes:
  Id     Size    Level   Stripe  State    Write-Cache  Name
   0 (  136G)    RAID-1          OPTIMAL  Enabled

# camcontrol devlist -v
scbus0 on mpt0 bus 0:
   at scbus0 target 0 lun 0 (pass0,da0)
at scbus0 target 8 lun 0 (ses0,pass1)
<> at scbus0 target -1 lun  ()
scbus1 on mpt0 bus 1:
   at scbus1 target 0 lun 0 (pass2)
<> at scbus1 target -1 lun  ()
scbus2 on ata2 bus 0:
   at scbus2 target 1 lun 0 (pass3,cd0)
<> at scbus2 target -1 lun  ()
scbus3 on ata3 bus 0:
<> at scbus3 target -1 lun  ()
scbus4 on ata4 bus 0:
<> at scbus4 target -1 lun  ()
scbus5 on ata5 bus 0:
<> at scbus5 target -1 lun  ()
scbus-1 on xpt0 bus 0:
<> at scbus-1 target -1 lun  (xpt0)

  To clarify, things seem to work fine on 10.3-BETA2 after the system
has booted, but there is a _long_ pause while the kernel is probing
the mpt0 controller, followed by the spew of CAM error messages from
the probes.

  Let me know if you need any additional info.

Terry Kennedy http://www.glaver.org  New York, NY USA


10.3-BETA2 regression in MPT

2016-02-18 Thread Terry Kennedy
hru on these controllers -
as you can see above, there are 2 physical disks attached to the controller,
used as a mirror volume. But only one of the members appears as a passN
device, which means that the other one can't be monitored with smartmontools.
If I'm remembering correctly, a volume with more than 2 drives creates a
passN device for all but one of the drives.

Terry Kennedy http://www.glaver.org  New York, NY USA


Re: ZFS panic after replacing log device

2010-11-16 Thread Terry Kennedy
> I would say it is definitely very odd that writes are a problem.  Sounds
> like it might be a hardware problem.  Is it possible to export the pool, 
> remove the ZIL and re-import it?  I myself would be pretty nervous trying
> that, but it would help isolate the problem?  If you can risk it.

  I think it is unlikely to be a hardware problem. While I haven't run any
destructive testing on the ZFS pool, the fact that it can be read without
error, combined with ECC throughout the system and the panic always
happening on the first write, makes me think that it is a software issue in
ZFS.

  When I do:

zpool export data; zpool remove data da0

  I get a "No such pool: data". I then re-imported the pool and did:

zpool offline data da0; zpool export data; zpool import data

  After doing that, I can write to the pool without a panic. But once I
online the log device and do any writes, I get the panic again.
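
  For reference, the sequence that got the pool writable again can be
written down as a small script. This is only a dry-run sketch: it prints
the commands instead of running them, the pool and device names are the
ones from this thread, and the run and log_offline_cycle helpers are made
up for illustration:

```shell
#!/bin/sh
# Dry-run sketch: each command is printed, not executed.
POOL=data
LOGDEV=da0

run() {
    echo "+ $*"      # replace this line with: "$@"   to really run it
}

# Take the log device offline, then export and re-import the pool.
log_offline_cycle() {
    run zpool offline "$POOL" "$LOGDEV"
    run zpool export "$POOL"
    run zpool import "$POOL"
}

log_offline_cycle
```

  Keeping the commands behind run() makes it easy to read the sequence
back before letting it touch a real pool.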

  As I mentioned, I have this data replicated elsewhere, so I can
experiment with the pool if it will help track down this issue.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: ZFS panic after replacing log device

2010-11-15 Thread Terry Kennedy
> I am no ZFS kernel-code dude or anything, but it is well known that losing
> the ZIL can corrupt things pretty bad with ZFS.

  First, thanks for writing back!

  I agree that this could be the problem. As I mentioned in my original post,
I followed the steps recommended by "zpool status" - clearing the device and
then doing a replace. The fix may be as simple as testing for whether the
device in question is a log device and if so, erroring out with "You can't do
that".

  Also note that multiple scrubs pass with no errors detected - it is only
writes that trigger the panic. It looks like something isn't being cleaned
up in the clear / replace path.

  I would save a crash dump for people to look at, but unfortunately the
last time a crash dump actually worked for me (on dozens of systems) was
back in the FreeBSD 6.2 days.

  There wasn't any data corruption (the filesystem was not being written at
the time the log device failed) - I have my own checksum files written by
the sysutils/cfv port, and the data all matches.

> All in all, if I was in your situation I would give a whirl at installing
> OpenSolaris and going from there, being sure not to upgrade the pool vers-
> ion past what is supported by FreeBSD and going from there.

  I have the data on another server (see my prior "snapshots are not
backups" discussion on freebsd-stable if interested). So, fortunately, this
is not a case of data recovery.

> Unfortunately we all find ourselves in a bit of a pickle with ZFS right 
> now with the Oracle acquisition of Sun.  For myself, I would stick with 
> deploying on FreeBSD but I think its going to be FBSD 9.1 before its go-
> ing to be truly ready for production.

  The problem with hardware on the leading edge is that the software often
needs time to catch up. In this particular case, the ZFS pool is 32TB. I
can't begin to imagine how long a UFS fsck would take on such a partition,
even if it were possible to create one. It was bad enough on the previous
generation of my servers (2TB UFS partitions).

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: ZFS panic after replacing log device

2010-11-15 Thread Terry Kennedy
> I can give a developer remote console / root access to the box if that would 
> help. I have a couple days before I will need to nuke the pool and restore it 
> from backups. 

I haven't heard from anyone that wants to look into this. I need to get the 
pool back into service soon. If I don't get any requests to postpone or offers 
to investigate by 00:00 GMT on the 18th, I'll proceed with re-initializing the 
pool (minus the SSD, which is persona non grata). 

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


ZFS panic after replacing log device

2010-11-13 Thread Terry Kennedy
I'm posting this to the freebsd-stable and freebsd-fs mailing lists. Followups
should probably happen on freebsd-fs.

I have a ZFS pool configured as:

zpool create data raidz da1 da2 da3 da4 da5 raidz da6 da7 da8 da9 da10 
raidz da11 da12 da13 da14 da15 spare da16 log da0

where da1-16 are WD2003FYYS drives (2TB RE4) and da0 is a 256GB PCI-Express
SSD (name omitted to protect the guilty).

The SSD has been dropping offline randomly - it seems that one or more flash 
modules pop out of their sockets and need to be re-seated frequently for some 
reason.

The most recent time it did that, I replaced the SSD with another one (for some 
reason, the manufacturer ties the flash modules to a particular controller, so 
just moving the modules results in an offline SSD and inability to manage it 
due to "license limits exceeded" or some such nonsense).

ZFS wasn't happy with the log device being changed, and reported it as 
corrupted, with the suggested corrective action being to "zpool clear" it. I 
did that, and then did a "zpool replace data da0 da0" and it claimed to 
successfully resilver it. I then did a "zpool scrub" and the scrub completed 
with no errors. So far, so good.

However, any attempt to write to the array results in a near-immediate panic:

panic: solaris assert: sm->sm_space + size <= sm->sm_size, file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c,
line: 93 cpuid=2

(Screenshot at http://www.tmk.com/transient/zfs-panic.png in case I mis-typed
something).

This is repeatable across reboot / scrub / test cycles. System is 8-STABLE as 
of Fri Nov  5 19:08:35 EDT 2010, on-disk pool is version 4/15, same as the 
kernel.

I know that certain operations on log devices aren't supported until pool 
version 19 or thereabouts, but the error messages and zpool command results 
gave the impression that what I was doing was supported and worked (when it 
didn't). If this is truly a "you can't do that in pool version 15", perhaps a 
warning could be added so users don't get fooled into thinking it worked?

I can give a developer remote console / root access to the box if that would 
help. I have a couple days before I will need to nuke the pool and restore it 
from backups.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: Bogus "igb1: Could not setup receive structures" in 8-STABLE

2010-10-14 Thread Terry Kennedy
> The problem is mbuf resources, the driver is autoconfiguring the number of
> queues based on the number of cores, on newer systems with lots of them
> this is outstripping the mbuf resource pool.

  That would make sense, as these systems have 16 cores (dual E5520's).

> I have decided to hard limit the queues to 8, you can fix the number
> manually by searching for num_queues in if_igb.c and setting it to
> something other than 0 for now.

  I changed it to 8, and saw the same problem. I noted that the igb boot
messages changed from:

Oct 14 18:28:02 rz1m kernel: igb0: Using MSIX interrupts with 10 vectors
Oct 14 18:28:02 rz1m kernel: igb1: Using MSIX interrupts with 10 vectors

  to:

Oct 14 21:53:44 rz1m kernel: igb0: Using MSIX interrupts with 9 vectors
Oct 14 21:53:44 rz1m kernel: igb1: Using MSIX interrupts with 9 vectors

  So I dropped the value to 3 (on the assumption that the system uses one
more than the specified value per interface), and got:

igb0: Using MSIX interrupts with 4 vectors
igb1: Using MSIX interrupts with 4 vectors

  and both igb interfaces came up. I didn't try to find the maximum
number of queues that would work.
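
  The "one more than the specified value" guess can be written down as a
one-liner. A sketch only: the relationship is inferred from the two data
points above (8 gave 9 vectors, 3 gave 4), not from the driver source, and
the function name is made up:

```shell
#!/bin/sh
# Hypothetical helper: expected MSI-X vectors per interface for a given
# num_queues, assuming one extra vector on top of the queues.
igb_expected_vectors() {
    echo $(( $1 + 1 ))
}

igb_expected_vectors 8
igb_expected_vectors 3
```

  Both results match the boot messages above, which is all the evidence
there is for the assumption.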

> I am at work on a number of issues with igb and em right now which is why
> there has not been an MFC yet.

  Understood. Thanks for the quick response and workaround.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Bogus "igb1: Could not setup receive structures" in 8-STABLE

2010-10-14 Thread Terry Kennedy
ue7.txd_head: 100
dev.igb.0.queue7.txd_tail: 100
dev.igb.0.queue7.tx_packets: 50
dev.igb.0.queue7.rxd_head: 74
dev.igb.0.queue7.rxd_tail: 73
dev.igb.0.queue7.rx_packets: 1098
dev.igb.0.queue7.rx_bytes: 116918
dev.igb.0.queue8.txd_head: 11
dev.igb.0.queue8.txd_tail: 11
dev.igb.0.queue8.tx_packets: 6
dev.igb.0.queue8.rxd_head: 25
dev.igb.0.queue8.rxd_tail: 24
dev.igb.0.queue8.rx_packets: 25
dev.igb.0.queue8.rx_bytes: 3698
dev.igb.0.mac_stats.total_pkts_recvd: 3138
dev.igb.0.mac_stats.good_pkts_recvd: 1815
dev.igb.0.mac_stats.bcast_pkts_recvd: 383
dev.igb.0.mac_stats.mcast_pkts_recvd: 114
dev.igb.0.mac_stats.rx_frames_64: 217
dev.igb.0.mac_stats.rx_frames_65_127: 673
dev.igb.0.mac_stats.rx_frames_128_255: 779
dev.igb.0.mac_stats.rx_frames_256_511: 61
dev.igb.0.mac_stats.rx_frames_512_1023: 46
dev.igb.0.mac_stats.rx_frames_1024_1522: 39
dev.igb.0.mac_stats.good_octets_recvd: 270378
dev.igb.0.mac_stats.good_octets_txd: 252216
dev.igb.0.mac_stats.total_pkts_txd: 1554
dev.igb.0.mac_stats.good_pkts_txd: 1554
dev.igb.0.mac_stats.bcast_pkts_txd: 36
dev.igb.0.mac_stats.mcast_pkts_txd: 65
dev.igb.0.mac_stats.tx_frames_64: 32
dev.igb.0.mac_stats.tx_frames_65_127: 219
dev.igb.0.mac_stats.tx_frames_128_255: 1222
dev.igb.0.mac_stats.tx_frames_256_511: 55
dev.igb.0.mac_stats.tx_frames_512_1023: 19
dev.igb.0.mac_stats.tx_frames_1024_1522: 7
dev.igb.0.interrupts.asserts: 3350
dev.igb.0.interrupts.rx_pkt_timer: 1815
dev.igb.0.interrupts.tx_abs_timer: 1815
dev.igb.0.interrupts.tx_queue_empty: 1554
dev.igb.0.host.rx_good_bytes: 270378
dev.igb.0.host.tx_good_bytes: 252216

  Both ports are cabled to the same Cisco switch. If I swap the cables at
the FreeBSD end between igb0 and igb1, the problem stays with igb1 so I
don't think it is the switch.

  I have 3 identical systems, all of which exhibit this same issue.
Unfortunately, I don't have any other hardware with dual igb's.

  I can give a developer root access as well as a web-based remote console
if needed to track this down.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: Read / write timeouts on SATA disks connected to ICH9

2010-05-15 Thread Terry Kennedy

> Interesting. Which version of FreeBSD is this system running? I guess
> you didn't experience any of the timeouts I'm seeing?

  8-STABLE as of the 11th of this month, or thereabouts. No, I've never
seen a disk timeout on that box.

> Yeah, this R300 was bought second-hand and unfortunately the owner
> pulled the RAID card out. It's something to consider, getting one of
> those cards. Do you use the RAID-features of the drive and if so, does
> that work well? I'm a bit hesitant to use hardware raid; it would be a
> big plus if the RAID disks could also be used stand-alone if need be
> (which is easy with gmirror because of its metadata being stored in the
> drive's last sector).

Does your system have hot-swap drive bays and the SAS backplane? If it
at least has hot-swap bays, then you could always add the backplane,
cable, and controller.

 I'm using the hardware mirroring on the SAS 6/iR card (with a pair of
WD3000HLFS drives, since the previous owner took the factory drives out
before selling the system).

  I haven't tried taking one of those drives and seeing if it will boot
on a standalone SATA port. I have removed both drives, installed a scratch
drive, and installed Windows on it to run one of the Dell update
installers (not all of them come in DOS or Linux flavors). The controller
didn't mind the swap a bit (or the swap back to the 2 RAID drives). That's
a lot better than the old amr-based RAID cards.

   Terry Kennedy http://www.tmk.com
   te...@tmk.com New York, NY USA


RE: Read / write timeouts on SATA disks connected to ICH9

2010-05-14 Thread Terry Kennedy
On Fri May 14 22:42:38 UTC 2010, Jeremy Chadwick wrote:
> Finally, your vmstat -i output:
>
> > # vmstat -i
> > interrupt  total   rate
> > irq23: atapci0 371021299  10423
>
> Good to know there's no IRQ sharing going on, but what does worry me is
> the interrupt rate (10K interrupts/second).  That seems *extremely*
> high, but it also depends on what kind of disk I/O is happening on this
> system -- especially since you have 2 disks attached to the same
> controller.

I have a bunch of R300's here. From one that is using the on-board SATA
and 2 drives in a gmirror setup (very similar to the OP) after 18 hours
of uptime:

[0:2] speedtest:~> vmstat -i
interrupt                          total       rate
irq23: atapci0                    254116          3

  I haven't specifically done any stress testing on this box, though I did
do a "make -j8 buildworld" during the initial gmirror synchronization. 8-}
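
  To make the same comparison on other boxes, the rate column can be
filtered with awk. A small sketch, assuming the rate is the last field on
each line as in the output above; the function name and the threshold are
made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print "vmstat -i" lines whose rate (last field)
# is above the limit given as $1. The header line is skipped because its
# last field is not a number.
flag_high_irq_rate() {
    awk -v limit="$1" '$NF ~ /^[0-9]+$/ && $NF + 0 > limit + 0 { print }'
}

# The OP's line and the one from my box, with a limit of 1000/second.
printf '%s\n' \
    'interrupt total rate' \
    'irq23: atapci0 371021299 10423' \
    'irq23: atapci0 254116 3' |
    flag_high_irq_rate 1000
```

  Only the OP's line gets through. On a live system the input would be
"vmstat -i | flag_high_irq_rate 1000".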

  The drives are a pair of Dell-labeled 160GB "SAMSUNG HE161HJ 1AC01121"
that shipped with the box.

  I also have another R300 with Dell's "SAS 6/iR" card (a re-branded LSI
1068-something, seen as "mpt" by FreeBSD). While Dell only sells that as
part of a package deal with the hot-swap backplane and redundant power
supplies, there's no reason you couldn't pick one up on eBay and add it
yourself. You'll need some sort of breakout cable to get from the big
connector on the SAS 6 to individual SATA ports.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

2010-05-14 Thread Terry Kennedy
> Oops, you're right that other CPUs are running.
>
> The stop_cpus() call is only made if kdb is entered.  doadump() is called
> out of boot() which comes later.  At Isilon we've been running with a patch
> that does stop_cpus() pretty close to the front of panic(9).

  This is interesting, and changing the behavior will probably allow the
crash dump for the original problem (repeatable crash in the bce driver)
to be analyzed.

  At the moment, I'm more interested in dealing with the original problem
of the crash in bce. Right now, I'm running this vendor's product under
Linux compatibility mode. The vendor is hard at work building a native
FreeBSD version of their product. One of two things is going to happen
here: 1) the crash doesn't happen in native mode due to different code
paths being taken, and I lose the ability to reproduce the crash when the
box goes into production, or 2) the crash continues to happen and the
vendor gets the impression FreeBSD is unstable and not worth supporting.
I'd like to avoid that.

  So, any ideas on how to troubleshoot the panic in bce?

Thanks,
Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

2010-05-14 Thread Terry Kennedy
> > Hmm.  You could try changing the code to not do a nested panic in that
> > case.  You would update subr_turnstile.c to just return if panicstr is
> > not NULL rather than calling panic.  However, there is still a good
> > chance you will end up deadlocking in that case.  I have another patch I
> > can send you next week that prevents blocking on mutexes during a panic
> > which may also help.
>
> It would be instructive to know exactly why we were in turnstile(9) but
> it's likely due to mtx contention.
>

> AIX has some code at the beginning of all the locking operations to avoid 
> taking locks if we were running code out of kdb, though getting that worked 
> out was slightly tricky with our variant of mtx_assert(9).  I seem to recall
> there was also some "lockbusting" code that forcibly reset all owned locks 
> to have no owner, at least in some paths.

> Given that the system is single-cpu and should be single-threaded when 
> dumping, this seems to me to be something worth working through to get 
> more reliable dumps.  Except for mtx_assert(9) I can't think of a reason
> to take locks once we start dumping or when in the debugger.

  As an aside, this is a quad-core in one package CPU (an X3363). On both
this box and a similar one with an X5470, console messages continue to
print out after "the system has been halted - press any key to reboot" -
in particular, the shutdown makes a bunch of the "behind the scenes"
management stuff like the virtual keyboard and monitor appear. Plugging or
unplugging USB devices will go through the whole deal of detecting and
making their service available.

  I know the other CPUs are considered to still be running (hence the
"halting other CPUs" when you press a key to reboot), but this is the
first time I've seen device detection, attachment, etc. show up on the
console after a shutdown.

  Is this behavior to be expected, or is it as unexpected as it was to
me? Systems are Dell PowerEdge R300s, 8-STABLE amd64.

> As an aside, with terribly corrupted locks I've seen double panics when the
> attempt to print the lock name faulted in strlen(9) called for printf(9),
> due to a bad lockname pointer.  We have been able to get enough info off
> these crashes to debug them, but it's useful to remember that the system
> may be in a very unstable state depending on why it panics.

  True. In these crashes, the system is doing essentially nothing except
the one application (which, unfortunately, I don't have the source code
for). The second crash happened right after booting the system, logging in,
and firing off the application. It left an identical footprint (other than
the 0x10 byte offset due to a recompiled kernel) from the first one, where
the system had been up for 13+ hours.

  So, in this case I don't think there was a bunch of corruption piling up
which triggered the fault, but instead the one simple operation and right
away - splat!

  As I mentioned in the original posting, I'd be glad to give a developer
complete access to the system via the remote console (Dell DRAC 5 web
interface) and to the underlying FreeBSD if it'll help pin down the prob-
lem.

  Another thing I could try (would take a couple days until I could get
someone to the site) would be to try this using a bge port instead of
the bce one. That might help pin it down to either something in the bce-
specific code path, or somewhere else in the stack.

Thanks,
Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

2010-05-14 Thread Terry Kennedy

> The crash was a "page fault while in kernel mode" with the current process
> being the interrupt service routine for the bce0 GigE. Things progressed
> reasonably until partway through the dump, when the system locked up with a
> "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> same PID as reported in the main crash.

Hmm.  You could try changing the code to not do a nested panic in that
case.  You would update subr_turnstile.c to just return if panicstr is
not NULL rather than calling panic.  However, there is still a good
chance you will end up deadlocking in that case.  I have another patch I
can send you next week that prevents blocking on mutexes during a panic
which may also help.


 Ok, I'll be glad to try that.


> 3) Is there any way to rig the system to obtain more info if this happens
> again? Right now I'm using an embedded remote console server, but I could
> switch the system to a serial port if enabling the kernel debugger might help.
> But I think that the sleeping thread bit would happen even at the debugger
> prompt, wouldn't it?

Include DDB and enable the 'trace_on_panic' sysctl knob perhaps.


 Hmmm. Do you think it will get very far before the sleeping thread business
locks it up?


> Is it possible to correlate the source line in the kernel with the instruction
> pointer in the panic?

If you are booted into the same kernel with the same modules loaded, you
can probably run 'kgdb' as root and do 'l *' followed by the address.


 I did that and discovered that the 0x20: prefix is probably unwanted:

(kgdb) l *0x20:0x801e3c06
A syntax error in expression, near `:0x801e3c06'.
(kgdb) l *0x801e3c06
0x801e3c06 is in bce_start_locked (/usr/src/sys/dev/bce/if_bce.c:6996).
6991            }
6992
6993            count++;
6994
6995            /* Send a copy of the frame to any BPF listeners. */
6996            ETHER_BPF_MTAP(ifp, m_head);
6997    }
6998
6999    /* Exit if no packets were dequeued. */
7000    if (count == 0) {
(kgdb) 


 This kernel does have BPF compiled in, but I don't think it was in use at
the time. 
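
 For anyone repeating this lookup: the panic message prints the instruction
pointer as selector:offset (the 0x20: part is presumably the code segment
selector), while kgdb's 'l *' wants only the offset. A minimal sketch in
Python, assuming that format; the helper name kgdb_list_command is made up
for illustration and is not part of any FreeBSD tool:

```python
def kgdb_list_command(instruction_pointer: str) -> str:
    """Turn a panic-style 'selector:offset' value into a kgdb 'l *' command.

    Hypothetical helper for illustration only.
    """
    # Keep only the part after the last colon; a bare address passes through.
    offset = instruction_pointer.rsplit(":", 1)[-1].strip()
    # Raises ValueError if what remains is not a hexadecimal number.
    int(offset, 16)
    return f"l *{offset}"


print(kgdb_list_command("0x20:0x801e3c06"))
```

 The printed string can be pasted at the (kgdb) prompt as-is.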


 Any further suggestions for what to look at? (Remember, this system is in
another state from me and all I have is remote access to the framebuffer - I'd
have to go there and set up a serial console to be able to talk to the debugger
if it crashes.)

Thanks,
   Terry Kennedy http://www.tmk.com
   te...@tmk.com New York, NY USA


Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

2010-05-13 Thread Terry Kennedy
  I'm reposting this over here at the suggestion of the Forums moderator.
The original post is at http://forums.freebsd.org/showthread.php?t=14163

Got an interesting crash just now (well, as interesting as a crash on a 
soon-to-be production system can be).

This is 8-STABLE/amd64, last cvsup'd early in the morning of May 9th.

The system didn't complete the crash dump, so it needed a manual reset to get 
it going again.

The crash was a "page fault while in kernel mode" with the current process 
being the interrupt service routine for the bce0 GigE. Things progressed 
reasonably until partway through the dump, when the system locked up with a 
"Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the 
same PID as reported in the main crash.

Screen capture at http://www.tmk.com/transient/crash-20100513002317.png
Complete dmesg, etc. available on request.

As I mentioned above, the system needed a hard reset to get going again. 
savecore doesn't think there's a usable dump, so I don't think there's any
more info to gather.

I just cvsup'd the box and built a new kernel, in case the previous cvsup was 
in between related commits, or to see if anything changed since. I still have 
the old kernel around in case any useful info can be gathered from it.

So, a couple questions:

1) Anything known to be funky w/ bce?

2) Should the part of the system that caused the panic be able to lock up the 
crash dump process? Obviously, if the disk driver causes a panic, all bets are 
off when trying to use it to write the dump, but this crash seems to have been 
from a network driver. Shouldn't a double panic just give up on the dump and 
try a reboot?

3) Is there any way to rig the system to obtain more info if this happens 
again? Right now I'm using an embedded remote console server, but I could 
switch the system to a serial port if enabling the kernel debugger might help. 
But I think that the sleeping thread bit would happen even at the debugger 
prompt, wouldn't it? 

I just booted the new kernel and tried this again, and got another crash. The 
message is identical to the first, except that the instruction pointer changed 
by 0x10 (presumably due to code differences between the old and new kernels) 
and it got 6MB further writing the crash dump.

Since it seems I can reproduce this at will, I'll be glad to either perform 
additional information-gathering or give a developer access to the box for 
testing purposes.

Is it possible to correlate the source line in the kernel with the instruction 
pointer in the panic? 

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: amr driver broken since March 12

2009-03-28 Thread Terry Kennedy
Danny Braniss danny at cs.huji.ac.il writes:
> at least for me :-)
> [and sorry for the cross posting]
>
[...]
>
> amr0:  mem 
> 0xfbef-0xfbef,0xfe58-0xfe5f 
> irq 27 at device 0.0 on pci4
> amr0: [ITHREAD]
> amr0: delete logical drives supported by controller
> amr0:  Firmware 414I, BIOS A100, 
> 128MB RAM
> amr0: adapter is busy
> amr0: adapter is busy
> amr0: delete logical drives supported by controller
> (probe0:amr0:0:6:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 
> (probe0:amr0:0:6:0): CAM Status: SCSI Status Error
> (probe0:amr0:0:6:0): SCSI Status: Check Condition
> (probe0:amr0:0:6:0): ILLEGAL REQUEST asc:24,0
> (probe0:amr0:0:6:0): Invalid field in CDB
> (probe0:amr0:0:6:0): Unretryable error

  FWIW, I have an amr device (Dell PERC 3/DC) which is working fine with
a -STABLE dated after March 12th:

FreeBSD 7.2-PRERELEASE #2: Thu Mar 26 09:41:58 EDT 2009
te...@test4.tmk.com:/usr/obj/usr/src/sys/PE1550
[snip]
amr0:  mem 0xf000-0xf7ff irq 25 at device 0.0 
on pci3
amr0: [ITHREAD]
amr0: delete logical drives supported by controller
amr0:  Firmware 199D, BIOS 3.35, 128MB RAM
amr0: delete logical drives supported by controller
amrd0:  on amr0
amrd0: 69360MB (142049280 sectors) RAID 5 (optimal)
ses0 at amr0 bus 0 target 6 lun 0
ses0:  Fixed Processor SCSI-2 device 
ses0: SAF-TE Compliant Device
Trying to mount root from ufs:/dev/amrd0s1a

  This is on a dual-processor Dell PowerEdge 1550.

  So this may only affect certain models or firmware revisions of amr
devices. Of course, since each LSI OEM uses their own firmware and
BIOS numbering scheme, it'll be hard to tell which one is newer than
the other.

  I have a bazillion of these cards if one would be helpful to a de-
veloper.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: rdump stuck in sbwait state (RELENG_7)

2009-01-05 Thread Terry Kennedy
rything has been ack'd by the other side.


 If you have kgdb handy, it would be useful to
look at *so and *so->so_domain in the soreceive_generic frame of proc 4439.
If it's an inet socket, we'd like to see *(struct inpcb *)so->so_pcb, and if
it's a TCP socket, *(struct tcpcb *)((struct inpcb *)so->so_pcb)->inp_ppcb.


 Sorry, you lost me here. Can you give me detailed instructions on how to
examine this data? I got as far as "proc 4439" in kgdb, but then got lost.

Thanks,
   Terry Kennedy http://www.tmk.com
   te...@tmk.com New York, NY USA


Re: rdump stuck in sbwait state (RELENG_7)

2009-01-05 Thread Terry Kennedy

I may have missed this earlier in the thread, but I don't see a kernel stack
trace of the stuck thread/process.  Could you grab one using procstat -k, DDB,
or KGDB?  I'd like to confirm that the 'sbwait' really reflects waiting to
send, rather than waiting to receive, which (for better or worse) uses the
same wmesg.  procstat -k may be the simplest of the above to do if your system
is reasonably recent.


 I didn't post that earlier as no-one had asked for it 8-)

 The system is current as of December 29th. Here's the relevant info:

(0:10) test4:/sysprog/terry# uname -a
FreeBSD test4.tmk.com 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Mon Dec 29 
11:48:04 EST 2008 te...@test4.tmk.com:/usr/obj/usr/src/sys/PE1550  i386
(0:11) test4:/sysprog/terry# ps -axwww | grep dump
 UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
   0  4436  4411   0   8  0 35896 34552 wait   I+    p1    0:00.70 /sbin/rdump 0uLa -b 64 -C 32 -f server /usr (rdump)
   0  4439  4436   0   4  0 35896 34784 sbwait I+    p1    0:03.05 rdump: /dev/amrd0s1f: pass 4: 18.48% done, finished in 0:17 at Sat Jan  3 21:02:05 2009 (rdump)
   0  4440  4439   0  20  0 35896 34624 pause  I+    p1    0:05.26 /sbin/rdump 0uLa -b 64 -C 32 -f server /usr (rdump)
   0  4441  4439   0  20  0 35896 34624 pause  I+    p1    0:05.26 /sbin/rdump 0uLa -b 64 -C 32 -f server /usr (rdump)
   0  4442  4439   0   4  0 35896 34624 sbwait I+    p1    0:05.26 /sbin/rdump 0uLa -b 64 -C 32 -f server /usr (rdump)
(0:12) test4:/sysprog/terry# procstat -k 4436
  PID    TID COMM             TDNAME           KSTACK
 4436 100115 rdump            -                mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep kern_wait wait4 syscall Xint0x80_syscall
(0:13) test4:/sysprog/terry# procstat -k 4439
  PID    TID COMM             TDNAME           KSTACK
 4439 100127 rdump            -                mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep sbwait soreceive_generic soreceive soo_read dofileread kern_readv read syscall Xint0x80_syscall
(0:14) test4:/sysprog/terry# procstat -k 4440
  PID    TID COMM             TDNAME           KSTACK
 4440 100131 rdump            -                mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep kern_sigsuspend sigsuspend syscall Xint0x80_syscall
(0:15) test4:/sysprog/terry# procstat -k 4441
  PID    TID COMM             TDNAME           KSTACK
 4441 100105 rdump            -                mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep kern_sigsuspend sigsuspend syscall Xint0x80_syscall
(0:16) test4:/sysprog/terry# procstat -k 4442
  PID    TID COMM             TDNAME           KSTACK
 4442 100135 rdump            -                mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep sbwait soreceive_generic soreceive soo_read dofileread kern_readv read syscall Xint0x80_syscall


 As I understand it, the processes in sbwait state are waiting to receive.
That would seem to indicate that they don't see the ACKs from the other
end, despite the tcpdump showing that they were received.
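
 Since send and receive waits share the sbwait wmesg, the frame underneath
sbwait is what tells them apart. A rough sketch in Python of that reading,
assuming the receive path shows a soreceive* frame and the send path a
sosend* frame; the function name sbwait_direction is made up for
illustration:

```python
def sbwait_direction(kstack: str) -> str:
    """Guess which socket buffer a thread in sbwait is blocked on.

    Hypothetical helper: it only inspects frame names from 'procstat -k'.
    """
    frames = kstack.split()
    if "sbwait" not in frames:
        return "not in sbwait"
    # The receive path sleeps in sbwait from a soreceive* function.
    if any(frame.startswith("soreceive") for frame in frames):
        return "receive"
    # Assumption: the send path would show a sosend* frame instead.
    if any(frame.startswith("sosend") for frame in frames):
        return "send"
    return "unknown"


# KSTACK column for PID 4439, copied from the procstat output above.
stack_4439 = (
    "mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep "
    "sbwait soreceive_generic soreceive soo_read dofileread kern_readv "
    "read syscall Xint0x80_syscall"
)
print(sbwait_direction(stack_4439))
```

 Applied to the stacks above, both 4439 and 4442 come out as blocked on the
receive side, which matches reading the traces by eye.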

 Let me know if you need more information.

   Terry Kennedy http://www.tmk.com
   te...@tmk.com New York, NY USA


Re: rdump stuck in sbwait state (RELENG_7)

2009-01-03 Thread Terry Kennedy
> Sorry, I can't think of any - by the time you see it hung, whatever
> went wrong has already happened.  You might glean some insight from
> the TCP socket state (on the FreeBSD side, use 'netstat -A' to print
> the PCB address and gdb to dump the contents but I'm not sure how to
> get this data out of OpenVMS).  The '-C' and '-W' options to tcpdump
> will help.

  Ok, I found some time to reproduce this while capturing a trace with
tcpdump.

  Here's the relevant output from netstat / kgdb:

(0:31) test4:~terry# netstat -A
Active Internet connections
Tcpcb    Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
c73eeae0 tcp4       0      0  test4.892              server.shell           ESTABLISHED
[snip]

(0:32) test4:~terry# kgdb
GNU gdb 6.1.1 [FreeBSD]
[snip]
#0  0x in ?? ()
(kgdb) print * (struct tcpcb *) 0xc73eeae0
$1 = {t_segq = {lh_first = 0x0}, t_segqlen = 0, t_dupacks = 0, 
  t_timers = 0xc73eec24, t_inpcb = 0xc7387708, t_state = 4, t_flags = 484, 
  snd_una = 292841209, snd_max = 292841209, snd_nxt = 292841209, 
  snd_up = 292780017, snd_wl1 = 3606352422, snd_wl2 = 292841209, 
  iss = 3955646224, irs = 3606284909, rcv_nxt = 3606352422, 
  rcv_adv = 3606415910, rcv_wnd = 63488, rcv_up = 3606352422, snd_wnd = 65535, 
  snd_cwnd = 65535, snd_bwnd = 1073725440, snd_ssthresh = 1073725440, 
  snd_bandwidth = 0, snd_recover = 3955646224, t_maxopd = 1460, 
  t_rcvtime = 11273919, t_starttime = 11024967, t_rtttime = 0, 
  t_rtseq = 292839154, t_bw_rtttime = 11024966, t_bw_rtseq = 3955646224, 
  t_rxtcur = 230, t_maxseg = 1448, t_srtt = 145, t_rttvar = 34, 
  t_rxtshift = 0, t_rttmin = 30, t_rttbest = 67, t_rttupdated = 232101, 
  max_sndwnd = 65535, t_softerror = 0, t_oobflags = 0 '\0', t_iobc = 0 '\0', 
  snd_scale = 0 '\0', rcv_scale = 3 '\003', request_r_scale = 3 '\003', 
  ts_recent = 1207233, ts_recent_age = 11273919, ts_offset = 0, 
  last_ack_sent = 3606352422, snd_cwnd_prev = 0, snd_ssthresh_prev = 0, 
  snd_recover_prev = 0, t_badrxtwin = 0, snd_limited = 0 '\0', 
  snd_numholes = 0, snd_holes = {tqh_first = 0x0, tqh_last = 0xc73eebb8}, 
  snd_fack = 0, rcv_numsacks = 0, sackblks = {{start = 0, end = 0}, {
  start = 0, end = 0}, {start = 0, end = 0}, {start = 0, end = 0}, {
  start = 0, end = 0}, {start = 0, end = 0}}, sack_newdata = 0, 
  sackhint = {nexthole = 0x0, sack_bytes_rexmit = 0}, t_rttlow = 1, 
  rfbuf_ts = 0, rfbuf_cnt = 0, t_pspare = {0x0, 0x0, 0x0}, t_tu = 0x0, 
  t_toe = 0x0}
(kgdb) q

  Rather than pasting the decoded tcpdump output here, the raw capture
file is at http://www.tmk.com/transient/rdump30.gz (it is only 76KB
compressed, 270KB uncompressed). It looks to me like the remote host
(the VMS box) has correctly ack'd all outstanding data from the FreeBSD
host, but that the FreeBSD host is just sitting there for some reason.
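
  The tcpcb dump above is consistent with that reading. A small sketch in
Python of the arithmetic, using only values copied from the kgdb output;
the helper seq_diff and the variable names are mine, and the
interpretation assumes the usual meaning of these fields (snd_una is the
oldest unacknowledged byte, snd_max the highest byte sent, rcv_adv the
right edge of the advertised receive window):

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def seq_diff(a: int, b: int) -> int:
    """Distance from b forward to a in 32-bit sequence space."""
    return (a - b) % SEQ_MOD


# Values copied from the 'print * (struct tcpcb *) 0xc73eeae0' output.
tcpcb = {
    "snd_una": 292841209,
    "snd_nxt": 292841209,
    "snd_max": 292841209,
    "rcv_nxt": 3606352422,
    "rcv_adv": 3606415910,
    "rcv_wnd": 63488,
    "last_ack_sent": 3606352422,
}

# Bytes we have sent that the peer has not yet acknowledged.
unacked = seq_diff(tcpcb["snd_max"], tcpcb["snd_una"])
# Receive window currently advertised to the peer.
advertised = seq_diff(tcpcb["rcv_adv"], tcpcb["rcv_nxt"])
# Bytes received from the peer that we have not yet acknowledged.
unsent_ack = seq_diff(tcpcb["rcv_nxt"], tcpcb["last_ack_sent"])

print(unacked, advertised, unsent_ack)
```

  Zero unacknowledged bytes in either direction, with a full receive window
open, suggests the connection itself is idle rather than stalled on a
retransmit.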

  As before, I have this sitting in the wedged state so if anyone needs
more data, I can either collect it or give you access to the system.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: rdump stuck in sbwait state (RELENG_7)

2008-12-30 Thread Terry Kennedy
> I'm pretty sure it's caused by FreeBSD.  It can very well be related to
> PR 117603, a real nasty dump(8) bug that was introduced in 7.0 on SMP
> systems.  But it should have been patched back in March by this:
[...]
> So I'm real surprised it shows up again. We got a pretty large backup
> environment with dump(8) being a critical element of it.  I just hope
> the problem will be resolved before 7.1-RELEASE hits the streets.
>
> Terry, please file a bug report on this and get in touch with iedowse@
> who was implementing the aforementioned patch.

  I don't think my hang is related to that problem - mine seems to be in
the TCP code while that problem seems to be in the kernel / filesystem
code (or at least that's what I recall of it from prior discussions).

  Plus, my problem just showed up in a recent build. The last time
subr_sleepqueue was touched seems to have been back in September.

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Re: rdump stuck in sbwait state (RELENG_7)

2008-12-30 Thread Terry Kennedy
> Unfortunately, you need the last packets that were exchanged in order
> to identify which end has the problem (and hopefully provide some
> pointers as to why).  If possible, can you repeat the dump whilst you
> run a tcpdump on the rdump flow and then post the last dozen or so
> packets in each direction.

  That could be pretty unpleasant - this happens at a random point while
dumping 4GB or so. If I have to, I'll do it but I was hoping there was
a better way.

  Shouldn't this get torn down by a keepalive at some point? It has been
sitting for 9 hours or so at this point...

Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


rdump stuck in sbwait state (RELENG_7)

2008-12-29 Thread Terry Kennedy
  I upgraded a box (Dell PowerEdge 1550, dual PIII processors) from a kernel +
world of December 8th to one from today (December 29th) and I am experiencing
a new problem with rdump.

  The symptom is that rdump stops sending data to the remote system. It is
responsive to ^T and can be aborted with ^C. Here's the ^T status on the
sending box (the aforementioned Dell RELENG_7 system):

  DUMP: dumping (Pass IV) [regular files]
  DUMP: 20.49% done, finished in 0:19 at Mon Dec 29 19:58:57 2008
  DUMP: 38.00% done, finished in 0:16 at Mon Dec 29 20:00:52 2008
  DUMP: 55.45% done, finished in 0:12 at Mon Dec 29 20:01:37 2008
load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
load: 0.00  cmd: rdump 1494 [pause] 2.30u 11.22s 0% 34616k
load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k
load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
load: 0.00  cmd: rdump 1492 [sbwait] 2.46u 4.89s 0% 34800k
load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k
load: 0.02  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k
load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k

  A tcpdump on both the sending and receiving systems shows no packets
between them from the rdump processes. However, I can rshell both ways
and get the expected output, so the link isn't down.

  ps shows the same thing as ^T. The sbwait process looks like this:

0  1492  1489   0   4  0 36024 34808 sbwait I+    p0    0:07.35 rdump: /dev/amrd0s1f: pass 4: 69.66% done, finished in 0:08 at Mon Dec 29 20:01:53 2008 (rdump)

  and the status never changes.
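
  One way to confirm from the ^T lines alone that nothing is progressing is
to check that the user and system CPU times for each PID never change. A
minimal sketch in Python, assuming the status-line layout shown above; the
regular expression and the function name cpu_times_by_pid are mine:

```python
import re
from collections import defaultdict

# Matches the ^T status lines shown above, e.g.
# "load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k"
STATUS_RE = re.compile(
    r"load: (?P<load>[\d.]+)\s+cmd: (?P<cmd>\S+) (?P<pid>\d+) "
    r"\[(?P<wchan>[^\]]+)\] (?P<user>[\d.]+)u (?P<sys>[\d.]+)s"
)


def cpu_times_by_pid(lines):
    """Collect the distinct (user, system) CPU time pairs seen per PID."""
    seen = defaultdict(set)
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            seen[int(m["pid"])].add((float(m["user"]), float(m["sys"])))
    return dict(seen)


# A subset of the status lines from the listing above.
samples = [
    "load: 0.00  cmd: rdump 1493 [sbwait] 2.32u 11.25s 0% 34616k",
    "load: 0.00  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k",
    "load: 0.00  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k",
    "load: 0.00  cmd: rdump 1494 [pause] 2.30u 11.22s 0% 34616k",
    "load: 0.00  cmd: rdump 1492 [sbwait] 2.46u 4.89s 0% 34800k",
    "load: 0.02  cmd: rdump 1492 [running] 2.46u 4.89s 0% 34800k",
    "load: 0.02  cmd: rdump 1495 [pause] 2.37u 11.25s 0% 34616k",
]

times = cpu_times_by_pid(samples)
# A PID with exactly one distinct pair used no measurable CPU between
# samples (a PID that appears only once trivially qualifies).
stalled = sorted(pid for pid, pairs in times.items() if len(pairs) == 1)
print(stalled)
```

  Every rdump PID shows a single unchanging pair, even the one reported as
[running], so none of them is accumulating CPU time.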

  The remote (receiving) system is a HP DS10 running OpenVMS 8.3 with
MultiNet 5.1A as the TCP stack. Despite this being a rather rare envir-
onment, I haven't had any problems until this most recent kernel build.
I have a large number (over a dozen) other systems running a variety
of releases (6.4, 7.0, 7.1-PRERELEASE) which can do this same dump oper-
ation without difficulty.

  I have the offending dump process still in this stuck state, so I can
generate whatever sort of debugging information is needed. The box is a
test box, so I can crash it and get a core dump if that's what is needed.

    Terry Kennedy http://www.tmk.com
te...@tmk.com New York, NY USA


Correct way to perform minidumps on gmirror device?

2007-01-31 Thread Terry Kennedy
  I'm trying to find out the best way to set up minidumps on my gmirror
drives. I have some questions about what is considered the correct way
of doing this. In particular:

  1) The release notes say to "sysctl debug.minidump=1", but isn't doing
 this in /etc/sysctl.conf too late in the process?

  2) The examples I've found (http://ezine.daemonnews.org/200608/gmirror_1.html)
 say to add gmirror configure commands to /etc/rc.early and /etc/rc.local,
 but the manpage says these are deprecated?

  Where should these commands be placed?

Terry Kennedy http://www.tmk.com
[EMAIL PROTECTED] New York, NY USA


Re: Dell PowerEdge 750 & 850 environmental monitoring

2006-07-06 Thread Terry Kennedy
> Does anybody have temperature and fan monitoring working on Dell 
> PowerEdge 750's & 850's?  I have done my share of googling without
> much luck.

  With the DRAC III/XT card installed, I am monitoring PE750's using
the IPMI device ("device ipmi" in the kernel config of 6.1-STABLE)
and the ipmitool port. You can view the results at:

  http://www.tmk.com/cgi-bin/ipmi.cgi

Terry Kennedy http://www.tmk.com
[EMAIL PROTECTED] New York, NY USA


Re: Strange TCP-related hang on startup w/ recent CVSUP

2005-06-09 Thread Terry Kennedy
  Just to close this out, this was caused by the uipc_socket.c change that
also broke NFS. All better now...

Terry Kennedy http://www.tmk.com
[EMAIL PROTECTED] New York, NY USA


Re: Strange TCP-related hang on startup w/ recent CVSUP

2005-06-08 Thread Terry Kennedy
Kris Kennaway (kris at obsecurity.org) writes:

> Are you sure you didn't change your kernel config or forget to rebuild
> modules?

  No changes to the kernel config, and I did the usual sequence of make
buildworld ; make buildkernel; make installkernel; reboot to single-user
mode; make installworld; mergemaster.

  Any ideas?

    Terry Kennedy http://www.tmk.com
[EMAIL PROTECTED] New York, NY USA


Strange TCP-related hang on startup w/ recent CVSUP

2005-06-07 Thread Terry Kennedy
0s 0% 844k
load: 0.75  cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k
load: 0.69  cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k
load: 0.64  cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k
load: 0.64  cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k
load: 0.02  cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k
^CScript /etc/rc.d/ntpdate interrupted
Starting rpcbind.
NFS access cache time=2
ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/X11R6/lib /usr/local/lib
a.out ldconfig path: /usr/lib/aout /usr/lib/compat/aout /usr/X11R6/lib/aout
Starting mountd.
Starting nfsd.
Starting statd.
Starting lockd.
Jun  8 00:58:41 rz2 rpc.lockd: 100024 RPC: Port mapper failure
Starting usbd.
Starting local daemons:.
Updating motd.
Starting ntpd.
Starting rwhod.
Configuring syscons: blanktime screensaver.

  Here is the list of changed kernel files since the previous (working)
kernel:

find . -newer /tmp/date
./alpha/alpha/busdma_machdep.c
./amd64/amd64/identcpu.c
./amd64/amd64/machdep.c
./conf/NOTES
./conf/files
./conf/options
./dev/ata/ata-chipset.c
./dev/ata/ata-pci.h
./dev/bktr/bktr_reg.h
./dev/sound/pci/ich.c
./kern/uipc_socket.c
./kern/kern_event.c
./kern/sysv_shm.c
./kern/subr_unit.c
./modules/netgraph/device
./modules/netgraph/device/Makefile
./modules/netgraph/Makefile.inc
./modules/udbp/Makefile
./netgraph/bluetooth/drivers/ubt
./netgraph/bluetooth/drivers/ubt/ng_ubt.c
./netgraph/ng_eiface.c
./netgraph/netgraph.h
./netgraph/ng_base.c
./netgraph/ng_iface.c
./netgraph/ng_device.c
./netgraph/ng_ksocket.c
./netinet/tcp_subr.c
./netinet/ip_divert.c
./netinet/ip_icmp.c
./netinet/ip_icmp.h
./netinet6/ipsec.c
./sys/systm.h
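
  The list above came from comparing modification times against a reference
file. For anyone who wants to post-process such a list, a rough Python
equivalent is sketched below; the function name newer_than is made up, and
unlike find it does not report the starting directory itself:

```python
import os


def newer_than(root: str, reference: str) -> list:
    """Return paths under root modified more recently than reference.

    Rough stand-in for 'find root -newer reference'; hypothetical helper.
    """
    ref_mtime = os.stat(reference).st_mtime
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Check directories as well as files, as find does.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that a symlink is judged by its own timestamp.
            if os.lstat(path).st_mtime > ref_mtime:
                hits.append(path)
    return sorted(hits)
```

  Run from the kernel source directory, newer_than(".", "/tmp/date") should
give roughly the same set of paths, sorted.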

  Any ideas? Also, if there is additional information needed, let me know
what to do and I'll report back.

Terry Kennedy http://www.tmk.com
[EMAIL PROTECTED] New York, NY USA


Apparent interrupt routing problem in 5.4-PRERELEASE

2005-03-07 Thread Terry Kennedy
  I've communicated with a few people about this off-list, and it was sug-
gested I give the issue some wider exposure on this list in the hope of
having it addressed for 5.4-RELEASE. It may or may not be related to the other
interrupt storm problems some people are seeing.

  I have a number of systems running the latest 5-STABLE (as of 4 PM today
or so). I've been seeing this issue for quite some time, though (5.3-RELEASE
at least, though I don't remember it happening in 5.2.1-RELEASE).

  The first symptom is that at boot time, I see these messages:

Interrupt storm detected on "irq16: uhci0"; throttling interrupt source
Interrupt storm detected on "irq17: ichsmb0"; throttling interrupt source

  Next, during the whole time the system is up, a "systat -v" shows that my
uhci0 and ichsmb0 devices have active interrupt counts (despite no activity
on them) which happen to *exactly* correspond with "real" interrupt activity
on other devices.
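
  To make the "exactly correspond" point concrete, here is a small Python
sketch that looks for interrupt sources whose rates are identical in every
sample. The rates are copied from the Interrupts column of the two
systat -v captures attached to this message; the function name
mirrored_sources is made up, and the labels are the truncated ones systat
prints:

```python
from itertools import combinations


def mirrored_sources(samples):
    """Return pairs of sources with equal, nonzero rates in every sample.

    Hypothetical helper; 'samples' is a list of {label: rate} dicts.
    """
    names = sorted(samples[0])
    pairs = []
    for a, b in combinations(names, 2):
        if all(s[a] == s[b] and s[a] > 0 for s in samples):
            pairs.append((a, b))
    return pairs


# Per-source rates from the two captures (first, then second).
samples = [
    {"16: uhc": 3175, "17: ich": 332, "48: em0": 3175, "72: twa": 332},
    {"16: uhc": 3344, "17: ich": 347, "48: em0": 3344, "72: twa": 347},
]

print(mirrored_sources(samples))
```

  In both captures irq16 tracks em0 on irq48 and irq17 tracks twa on irq72,
which is what one would expect if the idle devices were being charged for
another device's interrupts.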

  The motherboards are Tyan S2721-533's. The rest should be apparent from the
dmesg output.

  I'm attaching two consecutive screen captures of the systat -v output as
well as the dmesg output.

  This happens on both SMP and UP (with a UP kernel) configs, and also with
or without ACPI enabled (by option at boot time).

  Let me know if anyone needs further information to help diagnose this. I
can also provide remote access to a test system if a developer needs it.

Terry Kennedy http://www.tmk.com
[EMAIL PROTECTED] New York, NY USA

3 usersLoad  0.00  0.00  0.00  Mar  7 21:57

Mem:KBREALVIRTUAL VN PAGER  SWAP PAGER
Tot   Share  TotShareFree in  out in  out
Act   19276336862804 4176 2873112 count 1
All 11083206276  3306404 8148 pages 1
  zfod   Interrupts
Proc:r  p  d  s  wCsw  Trp  Sys  Int  Sof  Fltcow7244 total
 35 159231 663810513  230  213256 wire1: atkb
12032 act 6: fdc0
 8.1%Sys   3.7%Intr  0.2%User  0.0%Nice 87.9%Idl   884388 inact   128 8: rtc
||||||||||cache   12: psm
++2873112 free13: npx
  daefr   15: ata
Namei Name-cacheDir-cache prcfr  3175 16: uhc
Calls hits% hits% react   332 17: ich
  pdwak   24: twa
  pdpgs  3175 48: em0
Disks   da0   da1   sa0 pass0 pass1 pass2 intrn   332 72: twa
KB/t   9.67   127  0.00  0.00  0.00  0.00  114464 buf 98: ahd
tps   2   167 0 0 0 0   9 dirty   99: ahd
MB/s   0.02 20.62  0.00  0.00  0.00  0.00  10 desir   100e0: clk
% busy071 0 0 0 0   64552 numvnodes
14155 freevnodes

3 usersLoad  0.07  0.02  0.00  Mar  7 21:57

Mem:KBREALVIRTUAL VN PAGER  SWAP PAGER
Tot   Share  TotShareFree in  out in  out
Act   19276336862804 4176 2552792 count
All 14286406276  3821844 8148 pages
  zfod   Interrupts
Proc:r  p  d  s  wCsw  Trp  Sys  Int  Sof  Fltcow7610 total
 35 166862 696811000  237  213324 wire1: atkb
12040 act 6: fdc0
 9.2%Sys   3.7%Intr  0.4%User  0.0%Nice 86.8%Idl  1204632 inact   128 8: rtc
||||||||||cache   12: psm
=+>   2552792 free13: npx
  daefr   15: ata
Namei Name-cacheDir-cache prcfr  3344 16: uhc
Calls hits% hits% react   347 17: ich
  pdwak   24: twa
  pdpgs  3344 48: em0
Disks   da0   da1   sa0 pass0 pass1 pass2 intrn   347 72: twa
KB/t   0.00   127  0.00  0.00  0.00  0.00  114464 buf 98: ahd
tps   0   174 0 0 0 0   9 dirty   99: ahd
MB/s   0.00 21.69  0.00  0.00  0.00  0.00  10 desir   100e0: clk
% busy069 0 0 0 0   64552 numvnodes