Hi Ivan,

The best way to engage Canonical Support on this issue is to file a support case on support.canonical.com and attach an sosreport from the affected system, collected while the issue is happening. See my previous comment #5 for details on sosreport. Please check with Stephen Zarkos if you need access to the Canonical Support Portal.
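
Comment #5 is not part of this message, so the following is only a sketch of what a non-interactive collection could look like. It assumes the sosreport package is installed and that the installed version accepts the --batch, --case-id and --tmp-dir options (worth confirming with "sosreport --help"). The helper name collect_sosreport, the DRY_RUN switch and the case number in the usage line are hypothetical.

```shell
# Sketch: collect an sosreport without interactive prompts.
# collect_sosreport and DRY_RUN are hypothetical names used for illustration.
#   $1  support case number (placeholder value supplied by the caller)
#   $2  directory for the resulting archive (defaults to /var/tmp)
collect_sosreport() {
    # --batch skips the prompts, --case-id labels the archive,
    # --tmp-dir selects where the archive is written.
    set -- sosreport --batch --case-id "$1" --tmp-dir "${2:-/var/tmp}"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"     # only print the command that would be run
    else
        sudo "$@"     # sosreport needs root to read system state
    fi
}
```

Running "DRY_RUN=1 collect_sosreport 00012345" prints the command without executing it, which is a cheap way to review the options before running it on a node that is already struggling.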

One other idea that may help if your system becomes unresponsive is to log the serial console output in a GNU screen or tmux session. Inside this console session, you can enable the maximum console log level ("echo 9 > /proc/sysrq-trigger"; you might have to run "sysctl -w kernel.sysrq=1" first to enable sysrq) and run "dmesg -w", which dumps the kernel log and then keeps printing new entries as they arrive. This way you don't depend on logs being saved to disk to see what's going on, since disk access could freeze at the moment of the failure.

You can also enable kdump and all the "panic_on_X" sysctl settings (see the section "Enabling various types of panics" in the CrashdumpRecipe article[1]). If the system locks up hard enough to trigger one of these panics, kdump may then capture a dump so that we can see what's going on. Refer to the CrashdumpRecipe article[1] for more information.

[1] https://wiki.ubuntu.com/Kernel/CrashdumpRecipe

Thank you,
David

--
You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/1788643

Title:
  zombies pile up, system becomes unresponsive

Status in systemd package in Ubuntu:
  New

Bug description:
Description:    Ubuntu 16.04.5 LTS
Release:        16.04

systemd:
  Installed: 229-4ubuntu21.4
  Candidate: 229-4ubuntu21.4
  Version table:
 *** 229-4ubuntu21.4 500
        500 http://azure.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     229-4ubuntu21.1 500
        500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
     229-4ubuntu4 500
        500 http://azure.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

This problem is in Azure. We are seeing these problems on different systems.

Worker nodes (Ubuntu 16.04) in a hadoop cluster start piling up zombies and become unresponsive. The syslog and the kernel logs don't provide much information. The only error we could correlate with what we are seeing was in the audit logs.
See the "Connection timed out" and the "Cannot create session: Already running in a session" messages at the end of this message. Our first suspect was memory pressure on the machines. We added logging and settings to reboot on out of memory, but all of these turned out to be red herrings.

Aug 18 19:11:08 wn2-d3ncsp su[112600]: Successful su for root by root
Aug 18 19:11:08 wn2-d3ncsp su[112600]: + ??? root:root
Aug 18 19:11:08 wn2-d3ncsp su[112600]: pam_unix(su:session): session opened for user root by (uid=0)
Aug 18 19:11:08 wn2-d3ncsp systemd-logind[1486]: New session c8 of user root.
Aug 18 19:11:26 wn2-d3ncsp sshd[112690]: Did not receive identification string from 10.84.93.35
Aug 18 19:11:34 wn2-d3ncsp su[112600]: pam_systemd(su:session): Failed to create session: Connection timed out
Aug 18 19:11:34 wn2-d3ncsp su[112600]: pam_unix(su:session): session closed for user root
Aug 18 19:11:34 wn2-d3ncsp systemd-logind[1486]: Removed session c8.
Aug 18 19:12:03 wn2-d3ncsp sudo: ehiadmin : TTY=pts/1 ; PWD=/home/ehiadmin ; USER=root ; COMMAND=/bin/su -
Aug 18 19:12:03 wn2-d3ncsp sudo: pam_unix(sudo:session): session opened for user root by ehiadmin(uid=0)
Aug 18 19:12:03 wn2-d3ncsp su[113085]: Successful su for root by root
Aug 18 19:12:03 wn2-d3ncsp su[113085]: + /dev/pts/1 root:root
Aug 18 19:12:03 wn2-d3ncsp su[113085]: pam_unix(su:session): session opened for user root by ehiadmin(uid=0)
Aug 18 19:12:03 wn2-d3ncsp su[113085]: pam_systemd(su:session): Cannot create session: Already running in a session
Aug 18 19:12:42 wn2-d3ncsp sshd[113274]: Did not receive identification string from 10.84.93.42
Aug 18 19:13:37 wn2-d3ncsp su[113085]: pam_unix(su:session): session closed for user root
Aug 18 19:13:37 wn2-d3ncsp sudo: pam_unix(sudo:session): session closed for user root
Aug 18 19:13:37 wn2-d3ncsp sshd[112285]: pam_unix(sshd:session): session closed for user ehiadmin
Aug 18 19:13:37 wn2-d3ncsp systemd-logind[1486]: Removed session 1291.
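
To check whether other nodes show the same two pam_systemd messages, the log can be filtered for them. This is a minimal sketch, not something taken from the report: the helper name pam_systemd_failures is hypothetical, it reads log lines on standard input, and the /var/log/auth.log path in the usage line is an assumption, since the report does not say which file the excerpt came from.

```shell
# Sketch: keep only the pam_systemd session failures seen in the excerpt above.
# pam_systemd_failures is a hypothetical helper; it reads log lines on stdin.
pam_systemd_failures() {
    # Matches both "Failed to create session: ..." and
    # "Cannot create session: ..." for any PAM service (su, sshd, ...).
    grep -E 'pam_systemd\([^)]*\): (Failed to create|Cannot create) session'
}
```

A possible invocation is "pam_systemd_failures < /var/log/auth.log". Piping the result through "wc -l" gives a rough count that can be compared across nodes or over time.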

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1788643/+subscriptions

--
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp