Unfortunately, the system is unusable this morning. Still trying to
recover it.  May have to flatline it again.

It seems I have gotten myself stuck in a loop:
1. try to reboot, which causes a kernel panic
2. after that happens a few times, the NVMe drive needs an fsck because
of corrupt group descriptors
3. `fsck -CVvfy` the drive (twice for the ext partition and once for the EFI)
4. after doing 1-3 a few times, packages and symlinks start breaking.  I
try to repair them manually until eventually I can't get into the system
anymore.

I tried to run memtest.  If it is set to one CPU at a time, it runs
without error until it eventually hangs on a random (inconsistent) test.
If I run with all CPUs, it shows tons of errors pretty quickly.  Always
on the same bit of every bank (i.e. 80808080 -> 8080A080) and always off
by two.  But again, it doesn't do that unless multiple CPUs are running
at the same time. I thought it could be the other security features
(interleaving, memory encryption, etc.) that the BIOS has set to auto.
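
To pin down which bit differs in the pattern quoted above, the two
words can be XOR-ed. This is only a minimal Python sketch. The helper
name `flipped_bits` is made up for illustration and is not part of
memtest.

```python
def flipped_bits(expected, actual, width=32):
    """Return the positions (0 = least significant) of bits that differ."""
    diff = expected ^ actual
    return [bit for bit in range(width) if (diff >> bit) & 1]


expected = 0x80808080  # pattern written, as quoted above
actual = 0x8080A080    # pattern read back, as quoted above

print(hex(expected ^ actual))          # 0x2000
print(flipped_bits(expected, actual))  # [13]
```

For this pair the difference is a single bit (bit 13), which is the
hex digit 8 turning into A, i.e. "off by two" in that digit.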

Launching the live usb and just sitting at a terminal with `journalctl
--follow`, the last thing that happens before it hangs is usually
cleaning temp files, but I haven't run that enough times to know if it
is a pattern.

From the BIOS, I can set it to auto overclock or manual -- there is no
option to disable overclocking; so I cleared the CMOS and tried again
immediately after that without any change.

I have attempted 44 bionic installs this month.  4 of those went through
to completion. Two normal and two minimal.  The rest failed during
ubiquity.

grub-install almost always succeeds when acpi=off and almost always
hangs when it isn't.

I also have to have pcie_aspm=off or the system is spammed with errors
and crashes quickly.  Others have reported the same thing for
threadripper.
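
A quick way to confirm that both parameters actually reached the
running kernel is to look at /proc/cmdline. The following is a rough
Python sketch. The function name `missing_params` is hypothetical, and
it assumes a Linux system where /proc/cmdline is readable.

```python
from pathlib import Path


def missing_params(cmdline, wanted=("acpi=off", "pcie_aspm=off")):
    """Return the wanted boot parameters that are absent from cmdline."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():  # only present on Linux
        missing = missing_params(proc_cmdline.read_text())
        print("missing:", missing if missing else "none")
```

An empty result means both options were passed to the kernel for this
boot. It says nothing about whether they were honoured.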

I have tried with and without livepatch enabled.

The system is stable when mining or gaming, and seems unstable when
underutilized -- so I tried disabling the C-states in the BIOS.  I have
tried disabling every form of power management I could find in the OS
and in the BIOS.  I am sure I have missed quite a few.
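
One way to check from the OS side whether idle states are still
exposed after changing the BIOS setting is to read the cpuidle entries
in sysfs. This is a sketch only. It assumes the usual
/sys/devices/system/cpu/cpuN/cpuidle/stateM layout, and `idle_states`
is a made-up helper name.

```python
from pathlib import Path


def idle_states(cpu_dir):
    """Return (name, disabled) for each cpuidle state under one CPU dir."""
    state_dirs = sorted(
        Path(cpu_dir).glob("cpuidle/state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric, not lexical
    )
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() == "1"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    cpu0 = Path("/sys/devices/system/cpu/cpu0")
    if cpu0.exists():  # only present on Linux
        for name, disabled in idle_states(cpu0):
            print(name, "disabled" if disabled else "enabled")
```

If no states are listed at all, the kernel has no cpuidle driver
active for that CPU, which may or may not reflect the BIOS setting.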

I have tried manually updating the kernel (per your requests) as well as
using ukuu.  Since it is my primary machine, I tend to have things
installed that have to then be uninstalled for that to work well (like
nvidia drivers, virtualbox, etc).

I am seeing a ton of segfaults, even from the live usb.  They happen
more often when the machine has been sitting idle for a few minutes
(which is what had me thinking about power management).  I thought it
could be the memory, but since the modules don't fail memtest (if I run
them one CPU at a time)....
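
To see whether the segfaults cluster in particular programs, the kernel
log lines can be tallied by process name. This is a minimal Python
sketch. It assumes the usual "name[pid]: segfault at ..." kernel
message format, `tally_segfaults` is a made-up name, and the sample
lines are invented for illustration, not taken from this machine.

```python
import re
from collections import Counter

# Matches e.g. "bash[1234]: segfault at 8 ip ... sp ... error 4 in ..."
SEGFAULT = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
)


def tally_segfaults(lines):
    """Count kernel segfault messages per process name."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


# Invented sample lines, for illustration only.
sample = [
    "kernel: bash[1201]: segfault at 8 ip 00007f00aa10 sp 00007ffd10 error 4 in libc.so[7f00+1000]",
    "kernel: bash[1377]: segfault at 0 ip 00007f00bb20 sp 00007ffd20 error 6 in libc.so[7f00+1000]",
    "kernel: systemd-journal[402]: segfault at 10 ip 00005600cc30 sp 00007ffd30 error 4",
    "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]
for comm, count in tally_segfaults(sample).most_common():
    print(comm, count)
```

The same function can be fed the output of `journalctl -k` or `dmesg`
line by line. Process names containing spaces would be truncated by
this pattern.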

I know that "Erase disk and reinstall" will not solve the problem.  It
would be nice to figure out how to solve the problem before I do that
again.

So... I'm not sure how I can try a new kernel for you.  If there is some
way for me to update a live usb with an alternate kernel from a live
usb, that might work, since I see errors on the daily bionic iso as well.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1765838

Title:
  BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1

