Bug#1053929: linux-image-amd64: kernel fails to find all nvme SSDs

2023-10-24 Thread Jeffrey Mark Siskind
Supermicro provided a workaround: boot with the kernel command line
parameter pci=realloc=off.
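
To make the workaround survive reboots on a GRUB-based Debian system, the parameter can be appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The following is only a sketch: add_kernel_param is a hypothetical helper name, and it assumes the variable already exists as a single double-quoted line.

```shell
# Hypothetical helper (not a Debian tool): append a kernel parameter to
# GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Skip the edit if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}
```

After running add_kernel_param pci=realloc=off as root, update-grub regenerates the boot configuration, and cat /proc/cmdline after the reboot shows whether the parameter took effect.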

As an aside, Rocky 9.2 does not have this issue even though it boots
without that kernel command line parameter.

Jeff (http://engineering.purdue.edu/~qobi)



bugs 1053927 and 1053929

2023-10-19 Thread Jeffrey Mark Siskind
I filed bugs 1053927 and 1053929 with reportbug against
linux-image-amd64. Reportbug suggested that I file against a specific
kernel version package, not a metapackage. But I filed against the
metapackage because both bugs affect multiple kernel versions.

Should I refile? Or should they be moved?

Both have severity important. But 1053929 is far more critical. It
prevents me from accessing my NVMe drives. This appears to be a
pervasive bug because I also booted Ubuntu Live 22.04 which has a 5
series kernel and it also exhibits the bug.

Jeff (http://engineering.purdue.edu/~qobi)



Bug#1053929: linux-image-amd64: kernel fails to find all nvme SSDs

2023-10-14 Thread Jeffrey Mark Siskind
Package: linux-image-amd64
Version: 6.4.4-3~bpo12+1
Severity: important

Dear Maintainer,

   * What led up to the situation?

I purchased a new server: Supermicro AS-8125GS-TNHR. It has 17 NVME
drives installed:

   1x Micron 7450
   12x Micron 9300
   4x Micron 9400

Upon boot, /dev/nvme* only shows 10 drives: the Micron 7450, 8 of the
Micron 9300s, and 1 of the Micron 9400s. Before I plugged in the 12x
Micron 9300, /dev/nvme* only showed 2 drives: the Micron 7450 and 1 of
the Micron 9400s.

I run bookworm stable. Upon first install, it ran kernel
6.1.0-12. After an apt upgrade it ran 6.1.0-13. I also installed
6.4.0-0.deb12.2 from bookworm backports. All 3 exhibit the same
issue. The only difference is that under 6.1.0-13 the 10 drives that do
appear show up as /dev/nvme{0,1,2,3,4,5,6,7,8,9} while under 6.4.0-0.deb12.2
the 10 drives that do appear show up under different numbers, with some missing.
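
Because each controller also shows up as namespace and partition nodes, counting controllers by eye in an ls listing is error-prone. A small sketch that counts distinct controllers from a list of device names follows; count_nvme_controllers is a hypothetical helper name, and it assumes the usual /dev/nvmeN, /dev/nvmeNnM, /dev/nvmeNnMpK naming.

```shell
# Hypothetical helper: count distinct NVMe controllers named on stdin.
# Accepts names separated by spaces, tabs, or newlines (for example ls output).
count_nvme_controllers() {
    tr -s ' \t' '\n' |
        sed -n 's|^/dev/\(nvme[0-9][0-9]*\).*|\1|p' |   # keep only the controller part
        sort -u |
        wc -l |
        tr -d ' '
}
```

Typical use would be ls /dev/nvme* | count_nvme_controllers. The result can be compared with the number of NVMe functions on the PCI bus, for example lspci -nn | grep -c '\[0108\]', where 0108 is the PCI class code for non-volatile memory controllers.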

The 1x 7450 has 3 partitions: EFI, / formatted as btrfs, and swap.
The 12x 9300s are all formatted with 1 partition. There are 6 pairs of
2. Each pair has a btrfs raid1 file system. The 9400s are not yet formatted.

   * What exactly did you do (or not do) that was effective (or
 ineffective)?

I tried 3 kernels: 6.1.0-12, 6.1.0-13, and 6.4.0-0.deb12.2.
I tried with and without the 12x 9300.
I enclose the output of ls /dev/nvme*, lspci -k, and hwinfo --disks below.

   * What was the outcome of this action?

All exhibit the same issue.

   * What outcome did you expect instead?

I had hoped that I would be able to access all 17 drives (and format
the 4x 9400s as a single btrfs raid1 filesystem).

-- System Information:
Debian Release: 12.2
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.4.0-0.deb12.2-amd64 (SMP w/383 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, 
TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages linux-image-amd64 depends on:
ii  linux-image-6.4.0-0.deb12.2-amd64  6.4.4-3~bpo12+1

linux-image-amd64 recommends no packages.

linux-image-amd64 suggests no packages.

-- no debconf information

Thanks,
Jeff (http://engineering.purdue.edu/~qobi)

qobi@poto>ls /dev/nvme*
/dev/nvme1       /dev/nvme12n1    /dev/nvme15n1    /dev/nvme5n1
/dev/nvme10      /dev/nvme12n1p1  /dev/nvme15n1p1  /dev/nvme5n1p1
/dev/nvme10n1    /dev/nvme14      /dev/nvme1n1     /dev/nvme6
/dev/nvme10n1p1  /dev/nvme14n1    /dev/nvme1n1p1   /dev/nvme6n1
/dev/nvme11      /dev/nvme14n1p1  /dev/nvme3       /dev/nvme8
/dev/nvme11n1    /dev/nvme14n1p2  /dev/nvme3n1     /dev/nvme8n1
/dev/nvme11n1p1  /dev/nvme14n1p3  /dev/nvme3n1p1   /dev/nvme8n1p1
/dev/nvme12      /dev/nvme15      /dev/nvme5
qobi@poto>
-
qobi@poto>lspci -k
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 14a4 (rev 01)
   Subsystem: Super Micro Computer Inc Device 1d1c
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Device 149e (rev 01)
   Subsystem: Advanced Micro Devices, Inc. [AMD] Device 149e
00:00.3 Generic system peripheral [0807]: Advanced Micro Devices, Inc. [AMD] Device 14a6
   Subsystem: Advanced Micro Devices, Inc. [AMD] Device 14a6
   Kernel driver in use: pcieport
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01)
   Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1234
   Kernel driver in use: pcieport
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
00:05.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14aa (rev 01)
   Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1234
   Kernel driver in use: pcieport
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14a7 (rev 01)
   Subsystem: Advanced Micro Devices, Inc. [AMD] Device 14a4
   Kernel driver in use: pcieport
00:07.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14a7 (rev 01)
   Subsystem: Advanced Micro Devices, Inc. [AMD] Device 14a4
   Kernel driver in use: pcieport
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
   Subsystem: Super Micro Computer Inc FCH SMBus Controller
   Kernel driver in use: piix4_smbus
   Kernel modules: i2c_piix4, sp5100_tco
00:14.3 ISA 

Bug#1038105: upgrade-reports: resume from suspend/hibernate broken by upgrade from bullseye to bookworm

2023-06-16 Thread Jeffrey Mark Siskind
I have more information.

As per the suggestion of mooff, I noticed

qobi@sapiencia>ls -l /etc/modprobe.d/nvidia-options.conf
lrwxrwxrwx 1 root root 45 Sep  7  2022 /etc/modprobe.d/nvidia-options.conf -> /etc/alternatives/nvidia--nvidia-options.conf
qobi@sapiencia>ls -l /etc/alternatives/nvidia--nvidia-options.conf
lrwxrwxrwx 1 root root 49 Sep  7  2022 /etc/alternatives/nvidia--nvidia-options.conf -> /etc/nvidia/nvidia-525.105.17/nvidia-options.conf

So I moved /etc/nvidia/nvidia-525.105.17/nvidia-options.conf to
/etc/nvidia/nvidia-525.105.17/nvidia-options.conf.orig and edited 
/etc/nvidia/nvidia-525.105.17/nvidia-options.conf to add

options nvidia-current NVreg_PreserveVideoMemoryAllocations=1

This prevented the machine from properly suspending/hibernating in the first
place. I know this because the Lenovo P71 has a red LED on the lid that is on
when the machine is running and pulsates when it is in
suspend/hibernation. Before I made the change it would pulsate when the lid
was closed. But after I made the change, the LED would stay on. But when I
would open the lid the screen would show the password request, and the mouse
would work, but the keyboard would no longer work. I would need to power cycle.

Searching the web, I found

https://unix.stackexchange.com/questions/743506/pop-os-22-04-with-nvidia-driver-525-fails-to-suspend-on-a-hybrid-laptop

So I tried changing /etc/nvidia/nvidia-525.105.17/nvidia-options.conf to
instead add

options nvidia-current NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/var/tmp

I also did

sudo update-initramfs -u

But I observed the same behaviour where the LED would stay on when closing the
lid and the keyboard wouldn't work when opening the lid.
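
Before drawing conclusions from the suspend behaviour, it may be worth confirming that the loaded module actually picked up the options. The sketch below reads a value from the driver's parameter listing. nvidia_param is a hypothetical helper name, and the code assumes that /proc/driver/nvidia/params lists one "Name: value" pair per line with the NVreg_ prefix dropped.

```shell
# Hypothetical helper: print the value of one NVIDIA driver parameter.
# The default path and the "Name: value" layout are assumptions about the
# proprietary driver's procfs interface.
nvidia_param() {
    key=$1
    file=${2:-/proc/driver/nvidia/params}
    sed -n "s/^$key:[[:space:]]*//p" "$file"
}
```

For example, nvidia_param PreserveVideoMemoryAllocations would be expected to print 1 if the option from nvidia-options.conf reached the running module.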

So I backed out the change and am back in a state where it appears to
suspend/hibernate (because the LED pulsates) but 30% of the time it properly
resumes and 70% of the time it does not (the disk light flashes a few times,
the screen is blank, and there is no response when using the keyboard or mouse).

I don't know if I should try putting it in

/etc/modprobe.d/nvidia-power-management.conf

instead. I haven't tried that yet.

Jeff (http://engineering.purdue.edu/~qobi)



R815 machine checks under jessie

2017-07-27 Thread Jeffrey Mark Siskind
Jerry - I had some communication with you about a year ago.

I have four R815s. They have been running Debian since purchase about 6 years
ago. I upgraded from wheezy to jessie about a year ago. For the past year,
all four exhibit sporadic machine checks that cause them to crash. I never
observed this before the upgrade. I upgraded a dozen T5500s and four C6145s
from wheezy to jessie at the same time and none of them exhibit this problem.
This appears to be specific to the R815s and jessie.

The R815s run jessie fine, except that they machine check about a week or a few
weeks after reboot. I read on the net that this may be related to fan speed or
temperature. But I can't find that web page now.

Suggestions on how I might fix this?

Thanks,
Jeff (http://engineering.purdue.edu/~qobi)

Relevant output from ipmitool

root@upplysingaoflun:~# ipmitool sel elist
   1 | 07/12/2017 | 18:29:12 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
   2 | 07/17/2017 | 15:03:13 | Power Supply Status | Failure detected () | Asserted
   3 | 07/17/2017 | 15:03:14 | Power Supply PS Redundancy | Redundancy Lost | Asserted
   4 | 07/17/2017 | 15:03:15 | Power Supply Status | Failure detected () | Deasserted
   5 | 07/17/2017 | 15:03:19 | Power Supply PS Redundancy | Fully Redundant | Asserted
   6 | 07/18/2017 | 09:18:03 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
   7 | 07/18/2017 | 09:18:04 | Unknown #0x28 |  | Asserted
   8 | 07/18/2017 | 09:18:04 | Unknown #0x28 |  | Asserted
   9 | 07/18/2017 | 09:18:04 | Unknown #0x28 |  | Asserted
   a | 07/18/2017 | 09:18:04 | Unknown #0x28 |  | Asserted
   b | 07/18/2017 | 09:18:04 | Unknown #0x28 |  | Asserted
   c | 07/18/2017 | 09:18:05 | Unknown #0x28 |  | Asserted
   d | 07/18/2017 | 09:18:05 | Unknown #0x28 |  | Asserted
   e | 07/18/2017 | 09:18:05 | Unknown #0x28 |  | Asserted
   f | 07/22/2017 | 10:50:29 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  10 | 07/22/2017 | 10:50:29 | Unknown #0x28 |  | Asserted
  11 | 07/22/2017 | 10:50:29 | Unknown #0x28 |  | Asserted
  12 | 07/22/2017 | 10:50:30 | Unknown #0x28 |  | Asserted
  13 | 07/22/2017 | 10:50:30 | Unknown #0x28 |  | Asserted
  14 | 07/22/2017 | 10:50:30 | Unknown #0x28 |  | Asserted
  15 | 07/22/2017 | 10:50:30 | Unknown #0x28 |  | Asserted
  16 | 07/22/2017 | 10:50:31 | Unknown #0x28 |  | Asserted
  17 | 07/22/2017 | 10:50:31 | Unknown #0x28 |  | Asserted
  18 | 07/24/2017 | 20:26:16 | Power Supply Status | Failure detected () | Asserted
  19 | 07/24/2017 | 20:26:17 | Power Supply Status | Failure detected () | Deasserted
  1a | 07/24/2017 | 20:26:22 | Power Supply PS Redundancy | Fully Redundant | Asserted
root@upplysingaoflun:~#
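
To compare crash times across the four machines, the machine check entries can be pulled out of the event log. This is a minimal sketch: machine_check_events is a hypothetical helper name, and it assumes one event per line with fields separated by "|", which is how ipmitool prints the log before any mail wrapping.

```shell
# Hypothetical helper: print "date time" for each machine check event
# found in `ipmitool sel elist` output supplied on stdin.
machine_check_events() {
    awk -F'|' '/Machine Chk/ {
        gsub(/^ +| +$/, "", $2)   # trim the date field
        gsub(/^ +| +$/, "", $3)   # trim the time field
        print $2, $3
    }'
}
```

Typical use would be ipmitool sel elist | machine_check_events, run on each host.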



Re: machine checks on Dell R815 under jessie

2016-08-10 Thread Jeffrey Mark Siskind
   From: Ritesh Raj Sarraf 

   I (still) have MCE errors on my new laptop [1]. But so far, hasn't created
   any problem.

It causes my servers to halt.

Jeff (http://engineering.purdue.edu/~qobi)



machine checks on Dell R815 under jessie

2016-08-09 Thread Jeffrey Mark Siskind
I upgraded four Dell R815s from wheezy to jessie a few weeks ago. Prior to the
upgrade, they were running reliably for about 5 years. Since the upgrade, two
machines have been getting periodic machine checks. The machines boot fine and
run for a day or more. The machine checks appear to happen sporadically. I
can't determine a correlation with anything in particular.

The front panel on the first machine says the machine check was on CPU #4. The
front panel on the second machine said the first machine check was on CPU #1
and the second machine check was on CPU #2.

I am skeptical that this is really a hardware problem. Three CPUs began
exhibiting machine checks within a few weeks of each other, all immediately
after upgrading wheezy to jessie, after working reliably for five years.

Has anybody else encountered this issue? Any suggestions on how to debug and
fix?

Thanks,
Jeff (http://engineering.purdue.edu/~qobi)
---
root@arivu:~# ipmitool sel elist
   1 | 08/05/2016 | 00:12:47 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
   2 | 08/06/2016 | 11:35:17 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
   3 | 08/06/2016 | 11:35:17 | Unknown #0x28 |  | Asserted
   4 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   5 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   6 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   7 | 08/06/2016 | 11:35:18 | Unknown #0x28 |  | Asserted
   8 | 08/06/2016 | 11:35:19 | Unknown #0x28 |  | Asserted
   9 | 08/06/2016 | 11:35:19 | Unknown #0x28 |  | Asserted
   a | 08/06/2016 | 11:35:19 | Unknown #0x28 |  | Asserted
root@arivu:~# 

root@perisikan:~# ipmitool sel elist
[...]
  1c | 08/08/2016 | 12:23:02 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  1d | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  1e | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  1f | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  20 | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  21 | 08/08/2016 | 12:23:03 | Unknown #0x28 |  | Asserted
  22 | 08/08/2016 | 12:23:04 | Unknown #0x28 |  | Asserted
  23 | 08/08/2016 | 12:23:04 | Unknown #0x28 |  | Asserted
  24 | 08/08/2016 | 12:23:04 | Unknown #0x28 |  | Asserted
  25 | 08/09/2016 | 18:37:46 | Processor CPU Machine Chk | Transition to Non-recoverable | Asserted
  26 | 08/09/2016 | 18:37:46 | Unknown #0x28 |  | Asserted
  27 | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  28 | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  29 | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  2a | 08/09/2016 | 18:37:47 | Unknown #0x28 |  | Asserted
  2b | 08/09/2016 | 18:37:48 | Unknown #0x28 |  | Asserted
  2c | 08/09/2016 | 18:37:48 | Unknown #0x28 |  | Asserted
  2d | 08/09/2016 | 18:37:48 | Unknown #0x28 |  | Asserted
root@perisikan:~#



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-27 Thread Jeffrey Mark Siskind
   I had big issues with mptsas and 3.16 in jessie, so I am still using
   3.2.0-4-rt-amd64

Will jessie run with 3.2.0-4-rt-amd64? If so, where do I get it and how do I
install it on a fresh jessie install that wasn't dist-upgraded from wheezy?

Jeff (http://engineering.purdue.edu/~qobi)



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-27 Thread Jeffrey Mark Siskind
   The non-determinism in which identifiers are shown might be a bug in the
   installer, or it might be caused by failure of ID commands to the
   drives.

   I think most of the problems you're still having must be caused by a
   bug in the RAID driver, mpt2sas (or its firmware, if that's not
   embedded in the BIOS).

Thanks. Please let me know how I can report the potential bug(s) and what I
can do to help track them down.

Jeff (http://engineering.purdue.edu/~qobi)



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-27 Thread Jeffrey Mark Siskind
I'd like to thank everyone for helping out.

Here is an update on installing jessie on R815s.

I succeeded in installing on three of my four R815s. But I am holding off on
the last because it is my file server and there are still issues. Please read
on. I don't believe that the problem is solved and there may be a bug lurking
that can lead to data loss.

Here is what I did.

 1. Before the install, while still running wheezy, I upgraded the BIOS.
      R815_BIOS_JF8YH_LN_3.2.2.BIN
    This seemed to alleviate the problem of the jessie installer failing to
    find the ISO. More on this later.

 2. Before the install, while still running wheezy, I reduced the number of
    components of md0 from 6 to 4. This was in response to Steve's suggestion.
      mdadm /dev/md0 --fail /dev/sdf1
      mdadm /dev/md0 --fail /dev/sde1
      mdadm /dev/md0 --remove /dev/sdf1
      mdadm /dev/md0 --remove /dev/sde1

 3. I did a fresh USB install of jessie. More on this later.

 4. When it asked on which devices to install grub, I answered "manual" and
    then typed /dev/sdb. More on this later.

 5. After the fresh install, I rebooted, and in grub, I added rootdelay=20.
    This was in response to Don's suggestion.

 6. After the reboot, I ran my standard post-install script. Among other
    things, this installs numerous packages, makes a small number of mods to
    /etc, and does a dpkg-reconfigure grub-pc. When it did that, I specified
    only the 4 drives with active components of md0 and added rootdelay=20.

 7. I rebooted. More on this later.
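
The component reduction in step 2 can be written as a reviewable dry run. The sketch below only prints commands and executes nothing. shrink_raid1_cmds is a hypothetical helper name, and the final --grow line is an assumption of mine, not something the steps above did: it tells md that the mirror is now meant to have fewer members, so that the array is not reported as degraded.

```shell
# Hypothetical helper: print (do not run) the mdadm commands that drop the
# given members from a RAID1 array and then shrink its member count.
# Usage: shrink_raid1_cmds ARRAY KEEP_COUNT MEMBER...
shrink_raid1_cmds() {
    array=$1
    keep=$2
    shift 2
    for member in "$@"; do
        echo "mdadm $array --fail $member"
        echo "mdadm $array --remove $member"
    done
    # Assumption: not part of the steps reported above.
    echo "mdadm --grow $array --raid-devices=$keep"
}
```

For example, shrink_raid1_cmds /dev/md0 4 /dev/sdf1 /dev/sde1 prints five commands that can be inspected before running any of them as root.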

Now for the issues.

 A. Even after the BIOS upgrade, when it no longer fails to find the ISO,
    during the installer phase where it searches for an ISO, I notice
    nondeterministic behavior. Sometimes it searches sdb{1,2,3}, sdc{1,2,3},
    sdd{1,2,3}, sde{1,2,3}, sdf{1,2,3}, sdg{1,2,3}, sd{a,b,c,d,e,f,g} and
    eventually finds an ISO (sda is the USB dongle). Sometimes it finds the
    ISO right away without any searching. This doesn't cause problems but I
    believe that it is symptomatic of other problems.

 B. I'm not sure that reducing the number of components of md0 to 4 and/or
    adding rootdelay=20 really solved the problem. I think it just reduced the
    likelihood of occurrence. On one of the machines (arivu), during the
    reboot in step (7), at an early phase of the boot, the machine first
    reported that it found all 4 components of md0 and all 6 components of md1.
    Then at a later phase it reported that there were errors on 3 of the 4
    components. After the machine came up, md0 had only one component. Three
    of the four components were in failed (F) state. I did mdadm --remove to
    them and then mdadm --add to them. This doesn't happen all of the time. But
    it happens some of the time.


  qobi@upplysingaoflun>all-n-3g dmesg --level=err
  upplysingaoflun:
  verstand:
  arivu:
  [   28.012558] mpt2sas0: fault_state(0x265d)!
  [   29.231355] end_request: I/O error, dev sdb, sector 2056
  [   29.231600] end_request: I/O error, dev sdc, sector 2056
  [   29.231773] end_request: I/O error, dev sde, sector 2056
  [   29.232020] end_request: I/O error, dev sda, sector 2056
  perisikan:
  [   13.035132] mpt2sas0: fault_state(0x265d)!
  [   28.600099] mpt2sas0: fault_state(0x265d)!
  qobi@upplysingaoflun>

  qobi@upplysingaoflun>all-n-3g "dmesg --level=warn|fgrep -i error|fgrep -v ACPI"
  upplysingaoflun:
  verstand:
  arivu:
  [   29.231430] md: super_written gets error=-5, uptodate=0
  [   29.231670] md: super_written gets error=-5, uptodate=0
  [   29.231869] md: super_written gets error=-5, uptodate=0
  [   29.232117] md: super_written gets error=-5, uptodate=0
  perisikan:
  qobi@upplysingaoflun>

    (These are my four R815s. upplysingaoflun is the file server that has not
    been updated. The other three have.) Note that one machine reports no
    "mpt2sas0: fault_state(0x265d)" errors, one machine reports one, and one
    machine reports two. Note that the machine that dropped three components
    of md0 during boot reported I/O errors on all 4 disks with the 4
    components of md0. I don't believe that there really are faulty disks.
    Whenever I observe any of the behavior reported in this email, it is
    almost always associated with dmesg reporting the same error on the same
    sector 2056 (sometimes 2058 or 2062). Given the dozens of attempted
    reinstalls and reboots, at this point, I have seen this on almost all, if
    not all, of the six disks on each of the four machines. I don't believe
    that 24 disks all have the same bad sectors.

 C. In step (3), sometimes, but not always, during the install, I get a screen
    that says that some partition failed. It offers a menu of two options. I
    select "retry". Sometimes, but not always, this causes md0 to drop
    components in the installer, which I fix by going to ctrl-alt-f2 

Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-24 Thread Jeffrey Mark Siskind
Please note that booting with rootdelay=20 does not solve the problem. It only
masks it.

 1. If I attempt a fresh USB install of jessie, when md0 is correctly built
    before the install, the process of doing the fresh install breaks
    md0. When it gets to grub install, components of md0 are missing (even
    though all six components were present before the install). And
    grub-install fails. At this point it is impossible to complete the install
    and produce a bootable system.

 2. If I do a fresh minimal USB install of wheezy, rebuilding md0 in the
    process, and then do a dist-upgrade to jessie, I can manually add
    rootdelay=20 in grub and boot into jessie with all six components of md0
    present. But if I do so, then after boot, if I do dpkg-reconfigure grub-pc,
    doing that gives errors, drops components of md0, precludes me from adding
    them back, fails to install grub, and leaves the machine in an unbootable
    state.

I fear that there is a problem writing to disk. Even if I boot with
rootdelay=20, unless the kind of writes that dpkg-reconfigure grub-pc does are
different, doing ordinary writes to disk may also corrupt the disk.
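
One way to gather evidence on whether writes are failing broadly is to tally kernel I/O errors by device and sector. This is a sketch under stated assumptions: io_error_tally is a hypothetical helper name, and it assumes the jessie-era kernel message format "end_request: I/O error, dev sdX, sector N".

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "device sector count" for every distinct I/O error location.
io_error_tally() {
    sed -n 's|.*I/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*|\1 \2|p' |
        sort |
        uniq -c |
        awk '{ print $2, $3, $1 }'
}
```

Typical use would be dmesg | io_error_tally. If the same few low sector numbers recur on many different disks, that points away from genuinely bad media.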

Please let me know what new information you would like me to gather.

Jeff (http://engineering.purdue.edu/~qobi)



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-22 Thread Jeffrey Mark Siskind
   >Are you certain that there isn't a PERC H700 in this machine? [Sort of
   >odd that mpt2sas is triggering a state error in your screenshot if there
   >actually isn't one.]
   > 
   > There could be one. But I probably don't use it. I use software RAID. Dell
   > wouldn't sell an R815 without an OS. I think I purchased it with RHEL which
   > may have needed the PERC H700. But I never even booted RHEL. The first
   > thing I did was a fresh install of squeeze, or maybe wheezy.

   We definitely sell PowerEdge systems without an OS and have for quite a
   while. However, we do limit configuration for higher end systems to include
   hardware RAID.

My apologies. I may be misremembering. I purchased the machines (twelve T5500s,
four R815s, and four C6145s) about 5 years ago and don't remember precisely
the arrangements. I'd have to check archived email to know for sure.

The machines were purchased through ECN (Purdue's Engineering IT services). I'm
a lowly professor. But I software-maintain my own machines. I definitely
didn't spec out a hardware RAID controller. The mechanisms by which one was
included are unclear at this point.

   There's definitely a PERC controller in there based on 

   "05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)"

   I'm not seeing the subvendor/subsystem ID's there but it's presumably the
   PERC 6/i. If you're really not using it at all, you might be able to pull
   it out if the driver for it is causing problems. However, I suspect you
   need it to connect to the drive backplane. Stuart (CCed) may be able to
   offer some more insight into driver issues you might see.

   The SATA controller should only really be in use by the optical drive if
   present. Some of the mid-tier systems of that generation support SATA
   drives connected directly to a controller on the motherboard, but support
   for that under Linux was spotty from my recollection.

My T5500s have optical drives. But neither my R815s nor my C6145s have optical
drives. All my machines have SATA drives. The R815s in question each have
six ST9500530NS drives. They have been running squeeze and then wheezy with
software RAID for 5 years since purchase.

Now that I have someone from Dell on the line who appears to be
Debian-friendly, it would be nice if you made firmware upgrades
Debian-friendly. I have been able to apply

  R815_BIOS_JF8YH_LN_3.2.2.BIN

but have not been able to apply

  ESM_Firmware_7N76T_LN32_1.07_A00.BIN
  ESM_Firmware_J7YYK_LN32_2.85_A00.BIN
  SATA_FRMW_LX_R300994.BIN

(I don't even know if either of the ESM upgrades is for my hardware. But the
shell scripts don't run.)

Jeff (http://engineering.purdue.edu/~qobi)



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-22 Thread Jeffrey Mark Siskind
I conjecture that there may be two to five separate issues.

 1. Setting up md0 upon boot takes a long time. rootdelay=20 fixes this.
 2. There is a problem writing to disk. Perhaps just writing to certain blocks.
    Because even when the machine boots with rootdelay=20, and md0 has all 6
    components, grub-install fails and causes md0 to drop some/most of its
    components.

Both of these are observed on the dist-upgrade path from a fresh USB install
of wheezy to jessie. Separate from this, there are two other errors observed
with a direct fresh USB install of jessie.

 3. The installer can't find the ISO.
 4. grub-install fails.
    This may be the same as (2) above.

This is yet distinct from the fact that

 5. a fresh direct USB install of jessie on the Dell Poweredge C6145s takes a
    really long time (an hour) for each hardware probe (three times, once
    before finding the ISO, once before partitioning, and once before grub
    install).

Jeff (http://engineering.purdue.edu/~qobi)



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-22 Thread Jeffrey Mark Siskind
   > and attempted
   > 
   >mdadm /dev/md0 --add /dev/sda1
   >mdadm /dev/md0 --add /dev/sdb1
   >mdadm /dev/md0 --add /dev/sdc1
   >mdadm /dev/md0 --add /dev/sdd1
   >mdadm /dev/md0 --add /dev/sde1
   >mdadm /dev/md0 --add /dev/sdf1
   > 
   > but these all failed.

   This is the wrong command; it should be mdadm --assemble /dev/md0
   /dev/sd[abcdef]1;

   And that should only be done if the md0 device doesn't show up in the
   initrd when you cat /proc/mdstat.

   What's happened is that the raid1 device now has 12 drives instead of 6,
   which basically isn't going to work at all.

You can see from the transcript that md0 is there and has only 6 drives. Just
that 5 of the six are marked as failed. And you can see that it refused to do
the mdadm --add.

   http://upplysingaoflun.ecn.purdue.edu/~qobi/upgrade-jessie2.script

   root@verstand:~# cat /proc/mdstat
   Personalities : [raid1] [raid6] [raid5] [raid4] 
   md1 : active raid5 sda2[0] sdf2[5] sdd2[4] sdc2[3] sde2[2] sdb2[1]
         1953118720 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UU]

   md0 : active raid1 sda1[6](F) sdd1[8](F) sdb1[7](F) sde1[9](F) sdc1[10] sdf1[11](F)
         39157688 blocks super 1.2 [6/1] [__U___]

   unused devices: <none>
   root@verstand:~# mdadm /dev/md0 --add /dev/sda1
   mdadm: Cannot open /dev/sda1: Device or resource busy
   root@verstand:~# mdadm /dev/md0 --add /dev/sdb1
   mdadm: Cannot open /dev/sdb1: Device or resource busy
   root@verstand:~# mdadm /dev/md0 --add /dev/sdd1
   mdadm: Cannot open /dev/sdd1: Device or resource busy
   root@verstand:~# mdadm /dev/md0 --add /dev/sde1
   mdadm: Cannot open /dev/sde1: Device or resource busy
   root@verstand:~# mdadm /dev/md0 --add /dev/sdf1
   mdadm: Cannot open /dev/sdf1: Device or resource busy
   root@verstand:~# mdadm /dev/md0 --add /dev/sdc1
   mdadm: Cannot open /dev/sdc1: Device or resource busy
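
The failed members in a listing like the one above can be extracted mechanically, which helps when the set changes from boot to boot. A minimal sketch: failed_members is a hypothetical helper name, and it assumes the usual /proc/mdstat layout in which each array's member list sits on one line and failed members carry an (F) suffix.

```shell
# Hypothetical helper: list the members of one md array that are marked (F).
# Usage: failed_members md0 [mdstat-file]
failed_members() {
    array=$1
    file=${2:-/proc/mdstat}
    awk -v a="$array" '$1 == a && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                sub(/\[.*/, "", $i)   # strip "[slot](F)" to leave the device name
                print $i
            }
    }' "$file"
}
```

Members reported this way are still attached to the array, which is a likely reason for the "Device or resource busy" errors above. They would normally need mdadm --remove before mdadm --add can succeed.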

   You should be able to just directly reinstall jessie on this machine;

In earlier posts I explained how this fails. If I do a direct install from
USB, I observe two kinds of errors.

 1. Sometimes, but not every time, (it is nondeterministic) after the first 3
    questions, the installer complains that it can't find the ISO.
 2. Whenever it does find the ISO, the install progresses without error all
    the way to the grub install and then complains that it can't install grub.
    I've tried several different things. Sometimes, I just answer sda to the
    grub install question. (Actually sometimes sdb, because if I plug the USB
    into the front port, the USB gets sdg and the drives get sd[a-f] but if I
    plug the USB into the back port, the USB gets sda and the drives get
    sd[b-g].) But this always fails. Sometimes, I go into ctrl-alt-f2 and do
      chroot target
      grub-install /dev/sda
      ...
      grub-install /dev/sdf
      (or b-g as appropriate)
    but this also fails. At that point, I have no way to install grub. (If I
    abort the install, the machine is unbootable.) Whenever I'm in this state
    I do cat /proc/mdstat and it shows that some components of md0 are failed
    or missing. Some are present. This is nondeterministic. Which components
    are present and which are missing changes each time I attempt this. If I
    attempt to do mdadm --add I get errors. If I reinstall fresh wheezy from
    USB and then in wheezy do mdadm --add, it works and rebuilds the
    array. When it is done it has all 6 components. And then I immediately do
    a fresh install of jessie from USB and the same problem happens.

   I'd also zero out the superblocks on the devices in /dev/md0,

What command?

Jeff (http://engineering.purdue.edu/~qobi)



Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-22 Thread Jeffrey Mark Siskind
   Are you certain that there isn't a PERC H700 in this machine? [Sort of
   odd that mpt2sas is triggering a state error in your screenshot if there
   actually isn't one.]

There could be one. But I probably don't use it. I use software RAID. Dell
wouldn't sell an R815 without an OS. I think I purchased it with RHEL which
may have needed the PERC H700. But I never even booted RHEL. The first thing I
did was a fresh install of squeeze, or maybe wheezy.

   OK. This:

   > 00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode]

   makes me think that the SATA controller is in IDE/Legacy mode instead of
   AHCI. In theory, this shouldn't matter, but it's possible that this is
   also a problem. I'd try switching it in the bios and see what happens.

I'll do that in a bit. Before I got your current post, I tried some things in
response to your previous post. I'll report on that here and then go back and
try the new things.

Here is what I did.

I had a fresh minimal USB install of wheezy running. That install was done
with debian-wheezy-DI-b1-amd64-netinst.iso from Jul 15  2012. I also put the
non-free firmware on the USB. When I did that, I unchecked all of the boxes
during the install for any extra packages. The only thing that I installed
after that was

   apt-get install less

I then did

   nano /etc/apt/sources.list
   (change all wheezy to jessie)
   apt-get update
   apt-get dist-upgrade

I answered all of the defaults.

(default) all
(default) no
(default) cron
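
The manual edit of the release name can also be scripted, which makes the upgrade easier to repeat on several machines. This is a sketch only: retarget_sources is a hypothetical helper name, and it relies on GNU sed (the \b word boundary and -i), which is what Debian ships.

```shell
# Hypothetical helper: replace one release codename with another in an APT
# sources file, keeping a backup copy next to it.
# Usage: retarget_sources wheezy jessie [file]
retarget_sources() {
    from=$1
    to=$2
    file=${3:-/etc/apt/sources.list}
    cp -p "$file" "$file.bak"
    # \b keeps the match to whole words, so "wheezy/updates" is rewritten too.
    sed -i "s/\b$from\b/$to/g" "$file"
}
```

After retarget_sources wheezy jessie, run as root, the apt-get update and apt-get dist-upgrade steps proceed as described above.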

I captured this with

   script -t 2>upgrade-jessie1 time -a ~/upgrade-jessie1.script

(My mistake. I forgot a period between upgrade-jessie1 and time.)

   http://upplysingaoflun.ecn.purdue.edu/~qobi/time
   http://upplysingaoflun.ecn.purdue.edu/~qobi/upgrade-jessie1

You can see that it all worked.

You can see that at the end I did

   apt-get install firmware-linux

   dpkg-reconfigure grub-pc
   # default
   # default
   # check all /dev/sd?

and it all worked.

You can also see that at the end I did

   cat /proc/mdstat

and all 6 components of both md0 and md1 were there.

Then I did

   /sbin/reboot

The first reboot failed. It gave a screen similar to the one that you
already saw.

Then I did a second reboot, with delay=20. That did the same.

Then I did a third reboot, with rootdelay=20. That worked. I got a login
prompt, logged in, and got a root shell.

At that point, I did a 

   cat /proc/mdstat

and all 6 components of both md0 and md1 were there.

Then I did a

   dpkg-reconfigure grub-pc

My intent was to add rootdelay=20 to the command line. But I got lots of
errors while doing so. I realized that I should have done this under script.
So I did

   script -t 2>upgrade-jessie2.time -a ~/upgrade-jessie2.script

(this time with the period) and redid

   dpkg-reconfigure grub-pc

and also did

   cat /proc/mdstat

and attempted

   mdadm /dev/md0 --add /dev/sda1
   mdadm /dev/md0 --add /dev/sdb1
   mdadm /dev/md0 --add /dev/sdc1
   mdadm /dev/md0 --add /dev/sdd1
   mdadm /dev/md0 --add /dev/sde1
   mdadm /dev/md0 --add /dev/sdf1

but these all failed.
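
Before re-adding components, it can help to list which arrays the kernel
itself reports as degraded. The following is a minimal sketch, assuming the
usual /proc/mdstat layout in which each array's status line carries a
"[total/active]" pair such as "[6/4]"; degraded_arrays is a hypothetical
helper name.

```shell
# Hypothetical helper: print "name active/total" for every md array whose
# active component count is below the configured total.
# Usage: degraded_arrays [FILE]   (FILE defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # Status lines carry "[total/active]", for example "[6/4]"
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name, n[2] "/" n[1]
        }
    ' "${1:-/proc/mdstat}"
}
```

An empty result means every array listed in the file has all of its
components active.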

   http://upplysingaoflun.ecn.purdue.edu/~qobi/upgrade-jessie2.script
   http://upplysingaoflun.ecn.purdue.edu/~qobi/upgrade-jessie2.time

The machine is now in the state left at the end of the above script. If you
want me to do some more things in this state, let me know. Or I can do a fresh
USB install of wheezy and rebuild md0.

   >What does the kernel output while it is detecting the disks and
   >partitions?

   Remove the quiet option from the kernel command line by editing it in grub.

I will do this next time.

   > Do all of the drives show up properly?

   echo /dev/sd*; should give you an idea of what is there in the initramfs.

I will do this next time.

   >When the boot fails, can you read from the underlying block
   >devices?

   more /dev/sda; should work, I believe.

I will do this next time.

   > I don't know what one can do at the initramfs command prompt. If you give
   > me some commands, I will try them out and post the output.
   > 
   >Does specifying delay=20 or similar result in a successful boot?

   > I will try this.

   This should actually be rootdelay=20; sorry.

Done. See above.

   > I will try to get this info. It will require me to redo the exercise
   > of a fresh jessie install from USB. I'll have to take and post screen
   > pictures because I have no way to capture the console output.

   I believe the R815 still has a serial port; you can just plug in a
   serial cable and append an appropriate serial tty option to the kernel
   command line to get output as text.
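
As an illustration of the serial console suggestion, the settings below are
one possible fragment for /etc/default/grub, followed by update-grub. The
port (ttyS0, GRUB unit 0) and the speed (115200) are assumptions; the thread
does not say which serial port the R815 exposes or what the other end
expects.

```shell
# Fragment for /etc/default/grub (assumed values; adjust port and speed).
# Kernel messages go to both the local screen and the first serial port.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
# Optionally show the GRUB menu on the serial line as well.
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
```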

I figured out how to use script. That will work for most situations.

   What I'm trying to do is get enough information so that the error is
   obvious.

Thanks. Let me know what you want me to try next. Do you still wish me to do
the following?

   >What does the kernel output while it 

Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-21 Thread Jeffrey Mark Siskind
Thanks for your help.

   > Here is a screen picture.

   Could you upload this to an image paste site or send it along (or use a
   serial console to get it as text?)

http://upplysingaoflun.ecn.purdue.edu/~qobi/20160619_140357.jpg

(The other screen picture of a machine (not an R815) that does boot but that
takes a really long time to bring up the network is at

http://upplysingaoflun.ecn.purdue.edu/~qobi/IMG-20160609-WA.jpeg

)

   > I conjecture that the jessie kernel has difficulty accessing the MD
   > array on disk. The same problem occurs when I attempt a direct fresh
   > install of jessie with the installer.

   Which add-in card are you using on the R815s?

I don't believe that I have any add-in cards. The machine was purchased
straight from Dell. It has six SATA disks and 4 gigabit ethernet ports. It has
four 12-core AMD CPUs and 128GB RAM. The output of lspci on an identical
machine purchased at the same time that is still running wheezy is enclosed
below.

   What does the kernel
   output while it is detecting the disks and partitions? Do all of the
   drives show up properly? Are the blocksizes correct for the partitions?

I don't know how to get this info when in the initramfs after boot. If you
tell me what commands I should give I will redo this exercise. Right now, I
have a fresh minimal wheezy reinstalled. But after the reinstall of wheezy,
everything works. I did not repartition either during the (re)install of
jessie or during the (re)install of wheezy. I go back and forth. The
(re)install of wheezy works and the (re)install of jessie does not.

   When the boot fails, can you read from the underlying block devices? Do
   the block devices get detected after the boot fails?

I don't know what one can do at the initramfs command prompt. If you give
me some commands, I will try them out and post the output.

   Does specifying delay=20 or similar result in a successful boot?

I will try this.

 I made the dongle
   > as follows:
   > 
   ># cd /tmp
   ># wget http://ftp.nl.debian.org/debian/dists/jessie/main/installer-amd64/current/images/hd-media/boot.img.gz
   ># wget http://cdimage.debian.org/cdimage/unofficial/non-free/cd-including-firmware/8.5.0+nonfree/amd64/iso-cd/firmware-8.5.0-amd64-netinst.iso
   ># zcat boot.img.gz >/dev/sdf
   ># mount /dev/sdf /mnt
   ># cp firmware-8.5.0-amd64-netinst.iso /mnt/.

   You can actually just cat firmware-8.5.0-amd64-netinst.iso > /dev/sdf;

Please see my other post to debian-user

subject: how to make bootable live wheezy USB that doesn't use isohybrid

One of the exercises I tried was when the machine failed to boot after a fresh
USB-install of jessie, I tried to boot a live wheezy from USB by using a USB
dongle that I made by catting the isohybrid live wheezy ISO to the USB. But
the BIOS failed to detect the USB as bootable. I haven't tried to do that with
the netinst ISO but I suspect that it also won't be detected as bootable. But
when I build the USB dongle as per above it is detected by the BIOS as bootable.

   > Every time so far, md1 has all 6 components. But md0 has only some of
   > the components, sometimes 5/6, sometimes 4/6, and sometimes 1/6. And
   > every time it is a different set of components. Even though, just a
   > few minutes earlier, I was running wheezy and md0 had all 6
   > components. I do
   > 
   > mdadm /dev/md0 --add 
   > 
   > but it refuses. I forget the error.

   The error would be useful to know. Most likely one or more of them
   dropped out of the array for some reason and you're booting off of one
   which has a lower event count and it won't assemble.

   But it could be any number of things.

   The output of mdadm --examine /dev/sd[abcdef]1; when md0 fails to
   assemble would also be useful.
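
To compare event counts across components at a glance, the --examine output
can be reduced to one line per device. This is a sketch that assumes each
device section starts with a "/dev/...:" line and contains an "Events : N"
line; event_counts is a hypothetical helper name.

```shell
# Hypothetical helper: read "mdadm --examine" output on stdin and print
# one "device events" pair per component.
event_counts() {
    awk '
        # Section header, for example "/dev/sda1:"
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }
        # Event counter line, for example "         Events : 120"
        $1 == "Events" && $2 == ":" { print dev, $3 }
    '
}

# Intended use as root (not run here):
#   mdadm --examine /dev/sd[abcdef]1 | event_counts
```

Components whose count is lower than the others are the ones that fell
behind the rest of the array.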

I will try to get this info. It will require me to redo the exercise of a
fresh jessie install from USB. I'll have to take and post screen pictures
because I have no way to capture the console output. (I guess that I could use
iDRAC but I don't know how to and would have to learn.) If you let me know all
of the info you would like me to collect, I will try to collect it all in the
same retry of the fresh install.

But again note, that I do not believe that there are any disk hardware
errors. And I do not believe that there are any data errors in the layout of
the ext3 file system, the layout of the md0 raid array, or the partition
tables. The reason is that after the failed jessie install, I reinstall a
fresh wheezy from USB. I don't repartition. And I don't rebuild md1 and don't
rebuild /aux. But I do rebuild md0 and / as part of the fresh install. And it
works. I have done this over and over, switching between wheezy and jessie,
about a half dozen times. Each time, the jessie install leaves a different
collection of md0 components out. And each time, as part of the wheezy
install, I add them back in.

Thanks for your help.
Jeff (http://engineering.purdue.edu/~qobi)

Re: jessie won't install/boot on a Dell Poweredge R815

2016-06-21 Thread Jeffrey Mark Siskind
I get no
errors during the upgrade. And after the upgrade, before reboot, all 6
components of md0 are there. (That is still running the wheezy kernel.) All I
do is /sbin/reboot and then it comes up in the initramfs. And if I then do a
fresh reinstall of wheezy, I need to rebuild md0.

So it seems to me that something in the jessie kernel is broken, probably
related to the disk driver.

Also note that I upgraded to the latest BIOS. But the same exact problems
occurred both before the BIOS upgrade and after.

   booting jessie also takes hours to do systemd
   > configuration of the network

FYI, here is a screen picture where it takes minutes for systemd to bring up
the network. Note that I am not using DHCP. As per the enclosed, each host has
a fixed IPv4 address. There are fixed DNS servers. I am at a university and IT
services maintains the network for thousands of machines. I do not observe
issues bringing up the network when running wheezy.

Jeff (http://engineering.purdue.edu/~qobi)

default Install
default English
default United States
default American English
Go Back
default Configure network manually
128.46.115.211
default netmask
default gateway
128.210.11.57 128.210.11.5 128.46.154.76
default hostname
default domain name
root password
root password
Jeffrey Mark Siskind
qobi
password
password
default Eastern
Manual
RAID1 #1
Ext3 journaling file system
Format the partition: yes, format it
Mount point: /
Done setting up the partition
RAID5 #1
Ext3 journaling file system
default Format the partition: no, keep existing data
Mount point: /aux
Done setting up the partition
Finish partitioning and write changes to disk
Yes
default United States
default ftp.us.debian.org
default blank
Yes
uncheck all
Yes
/dev/sda
Continue
---
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0080

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048    78319615    39158784   fd  Linux raid autodetect
/dev/sda2        78319616   859570175   390625280   fd  Linux raid autodetect
/dev/sda3       859570176   976771071    58600448   82  Linux swap / Solaris



jessie won't install/boot on a Dell Poweredge R815

2016-06-19 Thread Jeffrey Mark Siskind
I am attempting to install jessie on a Dell Poweredge R815 (I have four). It
has been running wheezy reliably for years. And running squeeze reliably for
years before that. But no matter what I try it won't install or boot.

I have tried two ways.

 1. I attempt a fresh install from a USB dongle. It gets all the way to
installing grub and then fails.

 2. I do a fresh install of wheezy from a USB dongle. It boots wheezy just fine.
I do nothing but

  nano /etc/apt/sources.list
  (change all instances of wheezy to jessie, save, and exit)
  apt-get update
  apt-get dist-upgrade
  (It upgrades without error. I answer the default to all questions.)
  /sbin/reboot

Then it fails to reboot and goes into the initramfs. I have a picture of
the screen if anybody wishes.

I can reliably install and run wheezy over and over. I have not been able to
install or boot jessie despite numerous attempts.

Any suggestions?

If anybody wishes, I can provide precise details of how I built the USB image
of the installer and what answers I gave to all the installer questions. But
in short, I did exactly the same thing on numerous other machines,
including Dell T5500s, Dell Poweredge C6145s, and HP DL165s and was able to
successfully install and boot jessie. I'm almost certain that the kernel
upgrade from wheezy to jessie tickles something that is incompatible with the
R815.

Also note that while the fresh install was successful on the C6145 (all four
that I own), the hardware detection phase of the fresh install from USB took
hours (all four) and booting jessie also takes hours to do systemd
configuration of the network and MD arrays (all four). dmesg (all four)
reports continual

  usb 1-5.2: reset high-speed USB device number 4 using ehci-pci

even though nothing is connected to USB. The C6145s (all four) ran wheezy
reliably for years (and squeeze for years before that) without such issues. I
did the exact same install on my T5500s (eleven). These have exactly the same
disks partitioned exactly the same way and none of the T5500s exhibit any
issues. I believe that the change of kernel from wheezy to jessie also tickles
something that is incompatible with the C6145, but less severely than whatever
it tickles on the R815.

Jeff (http://engineering.purdue.edu/~qobi)