[Kernel-packages] [Bug 1909062] Re: Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure
** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux (Ubuntu Groovy)
       Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Groovy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Groovy)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Summary changed:

- Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure
+ qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP tx csum offload

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1909062

Title:
  qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not
  supporting IPIP tx csum offload

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Focal:
  In Progress
Status in linux source package in Groovy:
  In Progress

Bug description:
  With QL41xxx NICs and an Ubuntu DNS server, DNS failures are seen
  after updating to the latest Ubuntu 20.04.1 LTS kernel,
  5.4.0-52-generic. The issue was not observed with 4.5 ubuntu-linux.

  Problem Definition:

  OS Version: /etc/os-release shows Ubuntu 18.04.4 LTS, but the booted
  kernel is the latest Ubuntu 20.04.1 LTS version, 5.4.0-52-generic.

  NIC: 2 dual-port (4 ports) QLogic Corp. FastLinQ QL41000 Series
  10/25/40/50GbE Controller [1077:8070] (rev 02), inbox driver qede
  v8.37.0.20.

  Complete Detailed Problem Description:

  Anything that uses the internal Kubernetes DNS server fails. If an
  external DNS server is used, resolution works for non-Kubernetes IPs.
  A similar issue is described here:
  https://github.com/kubernetes/kubernetes/issues/95365

  The patch below, recently merged upstream, fixes this. (Note that the
  issue was introduced by the driver's tunnel offload support, which
  was added after the 4.5 kernel.)

  commit 5d5647dad259bb416fd5d3d87012760386d97530
  Author: Manish Chopra
  Date:   Mon Dec 21 06:55:30 2020 -0800
  Subject: qede: fix offload for IPIP tunnel packets

    IPIP tunnel packets are unknown to the device, hence these packets
    are incorrectly parsed and cause packet corruption, so disable
    offloads for such packets at run time.

  Signed-off-by: Manish Chopra
  Signed-off-by: Sudarsana Kalluru
  Signed-off-by: Igor Russkikh
  Link: https://lore.kernel.org/r/20201221145530.7771-1-mani...@marvell.com
  Signed-off-by: Jakub Kicinski

  Thanks,
  Manish

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1909062/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
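Until a kernel carrying the patch is available, one possible stopgap (my own assumption, not something stated in the bug report) is to disable tx checksum offload on the affected qede ports, so the NIC never mis-parses the IPIP packets. A dry-run sketch, where the interface name is a placeholder:

```shell
# Dry run: print the ethtool commands that would disable tx checksum
# offload on a qede port. IFACE=ens1f0 is a placeholder; drop the echo
# and run as root to actually apply it. This trades some CPU for
# uncorrupted IPIP/tunnel traffic.
IFACE=ens1f0
echo "ethtool -K $IFACE tx off"
echo "ethtool -k $IFACE | grep tx-checksumming"   # verify it then reads: off
```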
[Kernel-packages] [Bug 1909062] Re: Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure
** Also affects: linux (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

-- 
https://bugs.launchpad.net/bugs/1909062

Title:
  Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS
  failure

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Focal:
  New
Status in linux source package in Groovy:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1909062/+subscriptions
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Thanks Tobias for the testing. Good to hear it functions as intended.

Performing verification for Bionic.

I installed adcli 0.8.2-1ubuntu1.2 from -proposed, and joined a domain
without using the --use-ldaps flag:
https://paste.ubuntu.com/p/RByVZRPhCK/

Next, I added the firewall rules from the test section:

# ufw deny out 389
# ufw deny out 3268
# ufw enable

Now I tried to join, again without --use-ldaps:
https://paste.ubuntu.com/p/KMPNtS5SYK/

I got rejected, due to the firewall. Now, let's try connecting with
--use-ldaps:
https://paste.ubuntu.com/p/bKzx6K6PXd/

The realm join works, and I checked with strace to see which port is
being used:

connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0

We see port 636, as expected. I am happy with the packages in
-proposed: they implement the new feature properly and, more
importantly, fix the regression from bug 1906627. Happy to mark as
verified.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1908473] [NEW] rsyslog-relp: imrelp module leaves sockets in CLOSE_WAIT state which leads to file descriptor leak
Public bug reported:

[Impact]

In recent versions of rsyslog and librelp, the imrelp module leaks
file descriptors due to a bug where it does not correctly close
sockets, and instead leaves them in the CLOSE_WAIT state.

This causes rsyslogd on busy servers to eventually hit the limit of
maximum open files allowed, which locks rsyslogd up until it is
restarted.

A workaround is to restart rsyslogd every month or so to manually
close all of the open sockets.

Only users of the imrelp module are affected, and not rsyslog users in
general.

[Testcase]

Install the rsyslog-relp module like so:

$ sudo apt install rsyslog rsyslog-relp

Next, create a working directory, and make a config file that loads
the relp module:

$ sudo mkdir /workdir
$ cat << EOF >> ./spool.conf
\$LocalHostName spool
\$AbortOnUncleanConfig on
\$PreserveFQDN on

global(
  workDirectory="/workdir"
  maxMessageSize="256k"
)

main_queue(queue.type="Direct")

module(load="imrelp")
input(
  type="imrelp"
  name="imrelp"
  port="601"
  ruleset="spool"
  MaxDataSize="256k"
)

ruleset(name="spool" queue.type="direct") {
}

# Just so rsyslog doesn't whine that we do not have outputs
ruleset(name="noop" queue.type="direct") {
  action(
    type="omfile"
    name="omfile"
    file="/workdir/spool.log"
  )
}
EOF

Verify that the config is valid, then start an rsyslog server:

$ sudo rsyslogd -f ./spool.conf -N9
$ sudo rsyslogd -f ./spool.conf -i /workdir/rsyslogd.pid

Fetch the rsyslogd PID and check for open files:

$ RLOGPID=$(cat /workdir/rsyslogd.pid)
$ sudo ls -l /proc/$RLOGPID/fd
total 0
lr-x------ 1 root root 64 Dec 17 01:22 0 -> /dev/urandom
lrwx------ 1 root root 64 Dec 17 01:22 1 -> 'socket:[41228]'
lrwx------ 1 root root 64 Dec 17 01:22 3 -> 'socket:[41222]'
lrwx------ 1 root root 64 Dec 17 01:22 4 -> 'socket:[41223]'
lrwx------ 1 root root 64 Dec 17 01:22 7 -> 'anon_inode:[eventpoll]'

We have 3 sockets open by default.

Next, use netcat to open 100 connections:

$ for i in {1..100} ; do nc -z 127.0.0.1 601 ; done

Now check for open file descriptors, and there will be an extra 100
sockets in the list:

$ sudo ls -l /proc/$RLOGPID/fd
https://paste.ubuntu.com/p/f6NQVNbZcR/

We can check the state of these sockets with:

$ ss -t
https://paste.ubuntu.com/p/7Ts2FbxJrg/

The listening sockets will be in CLOSE-WAIT, and the netcat sockets
will be in FIN-WAIT-2.

If you install the test package available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf299578-test
then when you open connections with netcat, they will be closed
properly, and the file descriptor leak will be fixed.

[Where problems could occur]

If a regression were to occur, it would be limited to users of the
imrelp module, which is part of the rsyslog-relp package and depends
on librelp.

rsyslog-relp is not part of a default installation of rsyslog, and is
opt-in by changing a configuration file to enable imrelp.

The changes to rsyslog implement a testcase which exercises the
problematic code to ensure things are working as expected, and should
run during autopkgtest time.

[Other]

Upstream bug list:
https://github.com/rsyslog/rsyslog/issues/4350
https://github.com/rsyslog/rsyslog/issues/4005
https://github.com/rsyslog/librelp/issues/188

The following commits fix the problem:

rsyslogd
===
commit baee0bd5420649329793746f0daf87c4f59fe6a6
Author: Andre Lorbach
Date:   Thu Apr 9 13:00:35 2020 +0200
Subject: testbench: Add test for imrelp to check broken session handling.
Link: https://github.com/rsyslog/rsyslog/commit/baee0bd5420649329793746f0daf87c4f59fe6a6

librelp
===
commit 7907c9c57f6ed94c8ce5a4e63c3c4e019f71cff0
Author: Andre Lorbach
Date:   Mon May 11 14:59:55 2020 +0200
Subject: fix memory leak on session break.
Link: https://github.com/rsyslog/librelp/commit/7907c9c57f6ed94c8ce5a4e63c3c4e019f71cff0

commit 4a6ad8637c244fd3a1caeb9a93950826f58e956a
Author: Andre Lorbach
Date:   Wed Apr 8 15:55:32 2020 +0200
Subject: replsess: fix double free of sendbuf in some cases.
Link: https://github.com/rsyslog/librelp/commit/4a6ad8637c244fd3a1caeb9a93950826f58e956a

** Affects: librelp (Ubuntu)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: rsyslog (Ubuntu)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: librelp (Ubuntu Focal)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: rsyslog (Ubuntu Focal)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: librelp (Ubuntu Groovy)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: rsyslog
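While running the testcase above, the CLOSE_WAIT sockets on the relp port can also be counted straight from /proc, without ss. This is my own sketch, not part of the original testcase; it assumes imrelp is listening on port 601 (0x0259 in hex), as configured above, and that CLOSE_WAIT is state 08 in /proc/net/tcp:

```shell
# Count sockets in CLOSE_WAIT (state code 08 in /proc/net/tcp) whose
# local port is 601 (hex 0259), i.e. the imrelp listener from the
# testcase config. Field 2 is the local address, field 4 the state.
port_hex="0259"
close_wait=$(awk -v p=":${port_hex}" '$2 ~ p"$" && $4 == "08"' /proc/net/tcp | wc -l)
echo "$close_wait"
```

On a patched librelp this count stays flat while the netcat loop runs; on a leaking one it climbs by one per connection.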
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Hi Tobias,

If you have a moment, could you please help test the new adcli package
in -proposed? Mainly focusing on testing Bionic, to ensure the
regression has been fixed. Can you run through some tests with and
without the --use-ldaps flag?

You can install the new adcli package from -proposed like so:

1) Enable -proposed by running the following command to make a new
sources.list.d entry:

cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF

2) sudo apt update
3) sudo apt install adcli
4) sudo apt-cache policy adcli | grep Installed
   Installed: 0.8.2-1ubuntu1.2
5) sudo apt-cache policy libsasl2-modules-gssapi-mit | grep Installed
   Installed: 2.1.27~101-g0780600+dfsg-3ubuntu2.3
6) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
7) sudo apt update

In my testing, everything works as intended. This new version fixes
the regression from bug 1906627, as GSS-SPNEGO is now compatible with
the implementation in Active Directory.

I will be marking this bug as verified in the coming days, once I am
satisfied with my own testing.

Thanks,
Matthew

** Tags removed: verification-done verification-failed-bionic
** Tags added: verification-needed verification-needed-bionic

https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions
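Step 1 above just writes a one-line apt source entry. As a sketch, here is the line it produces for a fixed release codename (bionic is assumed purely for illustration; the real command substitutes $(lsb_release -cs) for the running release):

```shell
# Render the -proposed sources.list entry for a given codename.
# "bionic" is an assumption for illustration; normally this comes from
# $(lsb_release -cs).
codename=bionic
entry="deb http://archive.ubuntu.com/ubuntu/ ${codename}-proposed main universe"
echo "$entry"
```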
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
To anyone following this bug:

As we get ready to re-release the new adcli package which implements
the --use-ldaps flag, if you are happy to spend a few moments testing
the new package, I would really appreciate it. I really don't want to
cause another regression.

You can install the new adcli package from -proposed like so:

1) Enable -proposed by running the following command to make a new
sources.list.d entry:

cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF

2) sudo apt update
3) sudo apt install adcli
4) sudo apt-cache policy adcli | grep Installed
   Installed: 0.8.2-1ubuntu1.2
5) sudo apt-cache policy libsasl2-modules-gssapi-mit | grep Installed
   Installed: 2.1.27~101-g0780600+dfsg-3ubuntu2.3
6) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
7) sudo apt update

From there, join your domain like normal, and if you like, try out
other adcli or realm commands to ensure they work.

Let me know how the new adcli package in -proposed goes. In my
testing, it fixes the regression and works as intended.

To Jason Alavaliant, thanks! I really appreciate the help testing.

Thanks,
Matthew

https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions
[Touch-packages] [Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu.
https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

Status in adcli package in Ubuntu:
  Fix Released
Status in cyrus-sasl2 package in Ubuntu:
  Fix Released
Status in adcli source package in Bionic:
  Fix Committed
Status in cyrus-sasl2 source package in Bionic:
  Fix Committed

Bug description:
  [Impact]

  A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a
  regression for some users when attempting to join an Active
  Directory realm.

  adcli introduced a default behaviour change, moving from GSS-API to
  GSS-SPNEGO as the default channel encryption algorithm. adcli uses
  the GSS-SPNEGO implementation from libsasl2-modules-gssapi-mit, a
  part of cyrus-sasl2.

  The implementation seems to have some compatibility issues with
  particular configurations of Active Directory on recent Windows
  Server systems. In particular, adcli sends an LDAP query to the
  domain controller, which responds with a TCP ACK, but never returns
  an LDAP response. The connection just hangs at this point and no
  more traffic is sent. You can see it in the packet trace below:
  https://paste.ubuntu.com/p/WRnnRMGBPm/

  On Focal, where the implementation of GSS-SPNEGO is working, we see
  a full exchange, and adcli works as expected:
  https://paste.ubuntu.com/p/8668pJrr2m/

  The fix is to not assume use of confidentiality and integrity modes,
  and instead use the flags negotiated by GSS-API during the initial
  handshake, as required by Microsoft's implementation.

  [Testcase]

  You will need to set up a Windows Server 2019 system, install and
  configure Active Directory, enable LDAP extensions, configure LDAPS,
  and import the AD SSL certificate to the Ubuntu client. Create some
  users in Active Directory.

  On the Ubuntu client, set up /etc/hosts with the hostname of the
  Windows Server machine, if your system isn't configured for AD DNS.

  From there, install adcli 0.8.2-1 from -release:

  $ sudo apt install adcli

  Set up a packet trace with tcpdump:

  $ sudo tcpdump -i any port '(389 or 3268 or 636 or 3269)'

  Next, join the AD realm using the normal GSS-API:

  # adcli join --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL

  You will be prompted for Administrator's password. The output should
  look like the below:
  https://paste.ubuntu.com/p/NWHGQn746D/

  Next, enable -proposed, and install adcli 0.8.2-1ubuntu1, which
  caused the regression. Repeat the above steps. Now you should see
  the connection hang:
  https://paste.ubuntu.com/p/WRnnRMGBPm/

  Finally, install the fixed cyrus-sasl2 packages from the test ppa:
  https://launchpad.net/~mruffell/+archive/ubuntu/lp1906627-test

  $ sudo apt-get update
  $ sudo apt install libsasl2-2 libsasl2-modules libsasl2-modules-db libsasl2-modules-gssapi-mit

  Repeat the steps. GSS-SPNEGO should be working as intended, and you
  should get output like below:
  https://paste.ubuntu.com/p/W5cJNGvCsx/

  [Where problems could occur]

  Since we are changing the implementation of GSS-SPNEGO, and
  cyrus-sasl2 is the library which provides it, we can potentially
  break any package which depends on libsasl2-modules-gssapi-mit for
  GSS-SPNEGO.

  $ apt rdepends libsasl2-modules-gssapi-mit
  libsasl2-modules-gssapi-mit
  Reverse Depends:
   |Suggests: ldap-utils
    Depends: adcli
    Conflicts: libsasl2-modules-gssapi-heimdal
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Performing verification for Bionic.

Firstly, I installed adcli and libsasl2-modules-gssapi-mit from
-updates:

adcli 0.8.2-1
libsasl2-modules-gssapi-mit 2.1.27~101-g0780600+dfsg-3ubuntu2.1

From there, I joined an Active Directory realm:
https://paste.ubuntu.com/p/zJhvpRzktk/

Next, I enabled -proposed and installed the fixed cyrus-sasl2 and
adcli packages:
https://paste.ubuntu.com/p/cRrbkjjFmw/

We see that installing adcli 0.8.2-1ubuntu1.2 automatically pulls in
the fixed cyrus-sasl2 2.1.27~101-g0780600+dfsg-3ubuntu2.3 packages
because of the depends we set.

Next, I joined an Active Directory realm using the same commands as
before, i.e. not using the new --use-ldaps flag, but instead falling
back to GSS-API and the new GSS-SPNEGO changes:
https://paste.ubuntu.com/p/WdKYxxDBQm/

The join succeeds, and does not get stuck. This shows that the
implementation of GSS-SPNEGO is now compatible with Active Directory,
and that the new adcli package is using the new implementation.

Looking at the packet trace, we see the full 30 or so packets
exchanged, which matches the expected count:
https://paste.ubuntu.com/p/k9njh3jYHh/

With these changes, the adcli and cyrus-sasl2 packages in -proposed
can join realms in the same ways that the initial packages in -updates
can. These changes fix the recent adcli regression. Happy to mark
verified.

** Tags removed: regression-update verification-needed verification-needed-bionic
** Tags added: verification-done-bionic

https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi @hloeung, these patches are available in 4.15.0-128-generic, and 5.4.0-58-generic. They are both re-spins of 4.15.0-126-generic and 5.4.0-56-generic, respectively. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Status in linux source package in Focal: Fix Released Bug description: BugLink: https://bugs.launchpad.net/bugs/1898786 [Impact] Systems that utilise bcache can experience extremely high IO wait times when under constant IO pressure. The IO wait times stay at a consistent 1 second, and never drop as long as the bcache shrinker is enabled. If you disable the shrinker, IO wait drops significantly, to normal levels. We did some perf analysis, and it seems we spend a huge amount of time in bch_mca_scan(), likely waiting for the mutex "c->bucket_lock". Looking at the recent commits in Bionic, we found the following commit, merged in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li
Date: Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
> and releasing mutex c->bucket_lock.

This seems to be what is going on here, but the above commit only addresses the case when nr is 0.
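The arithmetic that the quoted commit message describes can be sketched as follows. This is a hedged sketch, not the kernel source: nr_to_shrink is a hypothetical helper that mirrors the integer division bch_mca_scan() applies to sc->nr_to_scan.

```python
# Hedged sketch (not the kernel code): bch_mca_scan() scales the
# requested scan count by c->btree_pages using integer division, so
# small scan requests round down to zero and nothing is shrunk.
def nr_to_shrink(nr_to_scan: int, btree_pages: int) -> int:
    return nr_to_scan // btree_pages

print(nr_to_shrink(2, 4))   # small request rounds down: 0 nodes shrunk
print(nr_to_shrink(16, 4))  # larger request: 4 nodes considered
```

With nr_to_scan commonly arriving as 1 or 2, the result is 0, and the call does nothing but take and release c->bucket_lock.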
From what I can see, the problems we are experiencing are when nr is 1 or 2: again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to release it, since the shrinker loop never executes because there is no work to do. [Fix] The following commits fix the problem, and all landed in 5.6-rc1:

commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
Author: Coly Li
Date: Fri Jan 24 01:01:40 2020 +0800
Subject: bcache: remove member accessed from struct btree
Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
Author: Coly Li
Date: Fri Jan 24 01:01:41 2020 +0800
Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

commit e3de04469a49ee09c89e80bf821508df458ccee6
Author: Coly Li
Date: Fri Jan 24 01:01:42 2020 +0800
Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

The first commit is a dependency of the other two. It removes a "recently accessed" marker, used to indicate that a particular cache entry has been used recently and, if so, to exclude it from cache eviction. The commit mentions that under heavy IO, all caches end up being recently accessed, and nothing is ever shrunk. The second commit changes a previous design decision of skipping the first 3 caches when shrinking, since it is common to call bch_mca_scan() with nr being 1 or 2, just as 0 was common in the very first commit I mentioned. In that case the loop exits and nothing happens, and we waste time waiting on locks, just as before. The fix is to try to shrink caches from the tail of the list, not the beginning.
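The reap-from-tail behaviour can be illustrated with a small sketch. This is a hedged illustration only: a plain Python list stands in for c->btree_cache (head holding the most recently used entries), and reap_from_tail is a hypothetical helper, not the kernel function.

```python
# Hedged illustration: reaping from the tail frees the oldest cache
# entries even when nr is small, without reordering the rest of the
# list - which is what the third commit preserves.
def reap_from_tail(cache: list, nr: int) -> list:
    freed = []
    for _ in range(min(nr, len(cache))):
        freed.append(cache.pop())  # take the last (oldest) entry
    return freed

cache = ["newest", "recent", "old", "oldest"]  # head = most recent
print(reap_from_tail(cache, 2))  # frees 'oldest', then 'old'
print(cache)                     # remaining entries keep their order
```

Reaping from the head, by contrast, would evict exactly the entries most likely to be needed again under heavy IO.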
The third commit fixes a minor issue where we don't want to re-arrange the linked list c->btree_cache, which is what the second commit ended up doing, and instead, just shrink the cache at the end of the linked list, and not change the order. One minor backport / context adjustment was required in the first commit for Bionic, and the rest are all clean cherry picks to Bionic and Focal. [Testcase] This is kind of hard to test, since the problem shows up in production environments that are under constant IO pressure, with many different items entering and leaving the cache. The Launchpad git server is currently suffering this issue, and has been sitting at a constant IO wait of 1 second / slightly over 1 second which was causing slow response times, which was leading to build failures when git clones ended up timing out. We installed a test kernel, which is available in the following PPA:
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin, The respun kernel has now landed in -updates, and is version 4.15.0-128-generic. Please re-schedule the maintenance window for the Launchpad git server, and re-attempt moving to the fixed kernel. Thanks, Matthew -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Performing verification for Focal. I spun up an m5d.4xlarge instance on AWS, to utilise the 2x 300GB NVMe drives that support block discard. I enabled -proposed, and installed the 5.4.0-58-generic kernel. The following is the repro session running through the full testcase: https://paste.ubuntu.com/p/Zr4C2pMbrk/ A 2-disk Raid10 array was created, LVM created and formatted ext4. I let the consistency checks finish, then created and deleted a file. Did another consistency check, then performed an fstrim. After another consistency check, we unmount and perform a fsck on each individual disk.

root@ip-172-31-1-147:/home/ubuntu# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

root@ip-172-31-1-147:/home/ubuntu# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

Both of them pass; there is no corruption to the filesystem. 5.4.0-58-generic fixes the problem, and the revert is effective. Marking bug as verified for Focal. ** Tags removed: verification-needed-focal ** Tags added: verification-done-focal -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262 Title: raid10: discard leads to corrupted file system Status in linux package in Ubuntu: Confirmed Status in linux source package in Trusty: Invalid Status in linux source package in Xenial: Invalid Status in linux source package in Bionic: Fix Committed Status in linux source package in Focal: Fix Committed Status in linux source package in Groovy: Fix Committed Bug description: Seems to be closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578 After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a severe number of mismatches between two RAID10 component devices. This bug affects several machines in our company with different HW configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected. How to reproduce:

- Create a RAID10 LVM and filesystem on two SSDs:
  mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
  pvcreate -ff -y /dev/md0
  vgcreate -f -y VolGroup /dev/md0
  lvcreate -n root -L 100G -ay -y VolGroup
  mkfs.ext4 /dev/VolGroup/root
  mount /dev/VolGroup/root /mnt
- Write some data, sync and delete it:
  dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
  sync
  rm /mnt/data.raw
- Check the RAID device:
  echo check >/sys/block/md0/md/sync_action
- After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
  cat /sys/block/md0/md/mismatch_cnt
- Trigger the bug:
  fstrim /mnt
- Re-check the RAID device:
  echo check >/sys/block/md0/md/sync_action
- After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
  cat /sys/block/md0/md/mismatch_cnt

After investigating this issue on several machines it *seems* that the first drive does the trim correctly while the second one goes wild. At least the number and severity of errors found by a USB stick live session fsck.ext4 suggests this.
To perform the single drive evaluation, the RAID10 was started using a single drive at a time:

mdadm --assemble /dev/md127 /dev/nvme0n1p2
mdadm --run /dev/md127
fsck.ext4 -n -f /dev/VolGroup/root
vgchange -a n /dev/VolGroup
mdadm --stop /dev/md127
mdadm --assemble /dev/md127 /dev/nvme1n1p2
mdadm --run /dev/md127
fsck.ext4 -n -f /dev/VolGroup/root

When starting these fscks without -n, on the first device the directory structure seems OK, while on the second device there is only the lost+found folder left. Side-note: another machine using HWE kernel 5.4.0-56 (after using -53 before) seems to have a quite similar issue. Unfortunately the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issues during FS creation. This bug, on the other hand, is triggered by a weekly service (fstrim), causing severe file system corruption. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe :
[Bug 1907262] Re: raid10: discard leads to corrupted file system
Performing verification for Bionic. I spun up an m5d.4xlarge instance on AWS, to utilise the 2x 300GB NVMe drives that support block discard. I enabled -proposed, and installed the 4.15.0-128-generic kernel. The following is the repro session running through the full testcase: https://paste.ubuntu.com/p/VpwjbRRcy6/ A 2-disk Raid10 array was created, LVM created and formatted ext4. I let the consistency checks finish, then created and deleted a file. Did another consistency check, then performed an fstrim. After another consistency check, we unmount and perform a fsck on each individual disk.

root@ip-172-31-10-77:~# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

root@ip-172-31-10-77:~# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

Both of them pass; there is no corruption to the filesystem. 4.15.0-128-generic fixes the problem, and the revert is effective. Marking bug as verified for Bionic. ** Tags removed: verification-needed-bionic ** Tags added: verification-done-bionic -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1907262 Title: raid10: discard leads to corrupted file system To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Re: [Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
Hi Lukasz, I think you understand the plan correctly. Here it is in bullet points:

1) Re-instate Bionic sssd 1.16.1-1ubuntu1.7 and Focal sssd 2.2.3-3ubuntu0.1 to -updates. Their [what could go wrong] still holds, as their changes are behind an opt-in configuration file option, and they have been tested by me, the customer, and the original bug reporter. They are unlikely to cause regressions, and if they do, the changes are opt-in via an intentional configuration file change.

2) Re-instate Groovy adcli 0.9.0-1ubuntu1.2 to -updates. The changes to adcli on Groovy are minimal, and will not cause any problems.

3) Build (likely in the special security PPA), and accept the cyrus-sasl2 upload to bionic-proposed. We need to start the ball rolling on fixing the root cause, which is the bad GSS-SPNEGO implementation in Bionic.

4) Delete adcli 0.8.2-1ubuntu2 from the bionic-proposed upload queue. It is likely a bit late for a revert package now; affected users would have downgraded to adcli from -release. We will push for a fix instead.

5) Go with option one from the previous email: build, and accept adcli 0.8.2-1ubuntu2.1 to bionic-proposed. This builds on 0.8.2-1ubuntu1 with the SRU changes, and depends on the fixed cyrus-sasl2 package. https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

6) Although adcli for Focal should be safe for release, we will play it safe and only release it when adcli for Bionic is ready.

7) I will re-test and verify adcli on both Bionic and Focal, as well as test and verify cyrus-sasl2. I will also get the customer to perform some testing.

8) Once all testing has been completed, we will release adcli for Bionic and Focal, and cyrus-sasl2, to -updates.

I hope this action plan is okay. Feel free to ask for clarifications before we put the plan into action. Thanks, Matthew On Thu, Dec 10, 2020 at 5:29 AM Lukasz Zemczak wrote: > > Ok, thanks for the clarification!
> > So, if I understand correctly, we should reinstate the reverted sssd > for all the series, and adcli for focal and groovy? Then for bionic > accept the cyrus-sasl2 upload + possibly an adcli with the changes > that were reverted? I suppose adcli would need a breaks statement in > that case. > > Anyway, I'm around if any SRU reviews or package copying is needed. > Let me reach out to Eric. > > Cheers, > > On Wed, 9 Dec 2020 at 05:13, Matthew Ruffell > wrote: > > > > > Ok, so there was a LOT happening in this thread, so I'd use some quick > > > summary. > > > Since what I'd like to know: > > > > > 1) Does this cyrus-sasl2 fix both the adcli and sssd regressions? > > > Since we reverted both as people were reporting regressions first for sssd > > > and then for adcli - not sure which one was the actual cause of it though > > > > The cyrus-sasl2 fix fixes the adcli regression, due to adcli changing to > > using > > GSS-SPNEGO by default, which was broken. > > > > sssd never had a regression in the first place, due to the changes having > > nothing to do with GSS-SPNEGO. > > > > The confusion with sssd came from confused users who did not know that adcli > > is the program under the hood of realm, and thought that sssd had broken, > > when > > in reality, it was adcli. > > > > > 2) Does it need fixing for all the stable series where we updated adcli > > > and > > > (additionally) sssd? > > > > cyrus-sasl2 is only broken in Bionic. Focal onward already have the patch > > and > > work fine. > > > > Let me know if you have any more questions, happy to answer. > > > > Thanks, > > Matthew > > > > On Tue, Dec 8, 2020 at 4:57 PM Matthew Ruffell > > wrote: > > > > > > Hello Eric and Lukasz, > > > > > > I have created new debdiffs for adcli. Please review and also sponsor one > > > of them to -proposed. > > > > > > Since there are multiple versions of adcli floating around I made two > > > debdiffs. > > > > > > Please choose the one most convenient / cleanest to apply. 
> > > > > > The first simply builds ontop of 0.8.2-1ubuntu1 currently in -proposed, > > > and is > > > the version pull-lp-source pulls down. It simply adds the dependency > > > to the fixed > > > libsasl2-modules-gssapi-mit package with a greater than or equal to > > > relationship. > > > > > > Use of this debdiff requires 0.8.2-1ubuntu2 to be deleted from the upload > > > queue, > > > and treated as 0.8.2-1ubuntu2 never existed. > > > > > > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachm
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Hi Markus, I am deeply sorry for causing the regression. We are aware, and are tracking the issue in bug 1907262. The kernel team have started an emergency revert, and you can expect fixed kernels to be released in the next day or so. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1896578 Title: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: Fix Released Status in linux source package in Focal: Fix Released Status in linux source package in Groovy: Fix Released Bug description: BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases that invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices that support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and 11 minutes, where the same mkfs.xfs operation on Raid 0 takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
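The scale of that splitting can be checked with quick arithmetic (a sketch; 1.9 TB is taken as a round decimal figure):

```python
# Hedged arithmetic sketch: discarding ~1.9 TB in 512 KiB bios.
total_bytes = 1_900_000_000_000   # ~1.9 TB to discard
chunk = 512 * 1024                # raid10 discard_max_bytes (512 KiB)
bios = -(-total_bytes // chunk)   # ceiling division
print(bios)  # roughly 3.6 million bio requests
```

Each of those millions of bios has to pass through the raid10 write path individually, which is where the minutes-long mkfs and fstrim times come from.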
  For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

  $ cat /sys/block/nvme0n1/queue/discard_max_bytes
  2199023255040
  $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
  2199023255040

  Where the Raid10 md device only supports 512k:

  $ cat /sys/block/md0/queue/discard_max_bytes
  524288
  $ cat /sys/block/md0/queue/discard_max_hw_bytes
  524288

  If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

  $ sudo cat /proc/1626/stack
  [<0>] wait_barrier+0x14c/0x230 [raid10]
  [<0>] regular_request_wait+0x39/0x150 [raid10]
  [<0>] raid10_write_request+0x11e/0x850 [raid10]
  [<0>] raid10_make_request+0xd7/0x150 [raid10]
  [<0>] md_handle_request+0x123/0x1a0
  [<0>] md_submit_bio+0xda/0x120
  [<0>] __submit_bio_noacct+0xde/0x320
  [<0>] submit_bio_noacct+0x4d/0x90
  [<0>] submit_bio+0x4f/0x1b0
  [<0>] __blkdev_issue_discard+0x154/0x290
  [<0>] blkdev_issue_discard+0x5d/0xc0
  [<0>] blk_ioctl_discard+0xc4/0x110
  [<0>] blkdev_common_ioctl+0x56c/0x840
  [<0>] blkdev_ioctl+0xeb/0x270
  [<0>] block_ioctl+0x3d/0x50
  [<0>] __x64_sys_ioctl+0x91/0xc0
  [<0>] do_syscall_64+0x38/0x90
  [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [Fix]

  Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1.
  commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
  Author: Xiao Ni
  Date: Tue Aug 25 13:42:59 2020 +0800
  Subject: md: add md_submit_discard_bio() for submitting discard bio
  Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0

  commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
  Author: Xiao Ni
  Date: Tue Aug 25 13:43:00 2020 +0800
  Subject: md/raid10: extend r10bio devs to raid disks
  Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3

  commit f046f5d0d79cdb968f219ce249e497fd1accf484
  Author: Xiao Ni
  Date: Tue Aug 25 13:43:01 2020 +0800
  Subject: md/raid10: pull codes that wait for blocked dev into one function
  Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484

  commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
  Author: Xiao Ni
  Date: Wed Sep 2 20:00:22 2020 +0800
  Subject: md/raid10: improve raid10 discard request
  Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9

  commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
  Author: Xiao Ni
  Date: Wed Sep 2 20:00:23 2020 +0800
  Subject: md/raid10: improve discard request for far layout
  Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359

  There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Raid10 to use large discards, instead of splitting them into many bios, since the technical hurdles have now been removed.

  commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
  Author: Mike Snitzer
  Date: Thu Sep 24 13:14:52 2020 -0400
  Subject: dm raid: fix discard limits for raid1 and raid10
  Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512

  commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
  Author: Mike Snitzer
  Date: Thu Sep 24 16:40:12 2020 -0400
  Subject: dm raid: remove unnecessary discard limits for raid10
  Link:
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Hi Thimo,

Firstly, thank you for your bug report, we really, really appreciate it.

You are correct, the recent raid10 patches appear to cause filesystem corruption on raid10 arrays. I have spent the day reproducing, and I can confirm that the 4.15.0-126-generic, 5.4.0-56-generic and 5.8.0-31-generic kernels are affected.

The kernel team are aware of the situation, and we have begun an emergency revert of the patches, and we should have new kernels available in the next few hours / day or so.

The current mainline kernel is affected, so I have written to the raid subsystem maintainer, and the original author of the raid10 block discard patches, to aid with debugging and fixing the problem. You can follow the upstream thread here:

https://www.spinics.net/lists/kernel/msg3765302.html

As for the data corruption on your servers, I am deeply sorry for causing this regression. When I was testing the raid10 block discard patches on the Ubuntu stable kernels, I did not think to fsck each of the disks in the array; instead, I was content with the speed of creating new arrays, writing a basic dataset to the disks, and rebooting the server to ensure the array came up again with those same files.

Since the first disk seems to be okay, there is at least a small window of opportunity for you to restore any data that you have not backed up.

I will keep you informed of getting the patches reverted, and getting the root cause fixed upstream. If you have any questions, feel free to ask, and if you have any more details from your own debugging, feel free to share them in this bug, or on the upstream mailing list discussion.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Bionic: In Progress
Status in linux source package in Focal: In Progress
Status in linux source package in Groovy: In Progress

Bug description:

  Seems to be closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a severe number of mismatches between two RAID10 component devices. This bug affects several machines in our company with different HW configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected.

  How to reproduce:

  - Create a RAID10, LVM and filesystem on two SSDs:
      mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
      pvcreate -ff -y /dev/md0
      vgcreate -f -y VolGroup /dev/md0
      lvcreate -n root -L 100G -ay -y VolGroup
      mkfs.ext4 /dev/VolGroup/root
      mount /dev/VolGroup/root /mnt
  - Write some data, sync and delete it:
      dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
      sync
      rm /mnt/data.raw
  - Check the RAID device:
      echo check >/sys/block/md0/md/sync_action
  - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
      cat /sys/block/md0/md/mismatch_cnt
  - Trigger the bug:
      fstrim /mnt
  - Re-check the RAID device:
      echo check >/sys/block/md0/md/sync_action
  - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
      cat /sys/block/md0/md/mismatch_cnt

  After investigating this issue on several machines it *seems* that the first drive does the trim correctly while the second one goes wild. At least the number and severity of errors found by a USB stick live session fsck.ext4 suggests this.

  To perform the single drive evaluation, the RAID10 was started using a single drive at a time:

      mdadm --assemble /dev/md127 /dev/nvme0n1p2
      mdadm --run /dev/md127
      fsck.ext4 -n -f /dev/VolGroup/root
      vgchange -a n /dev/VolGroup
      mdadm --stop /dev/md127

      mdadm --assemble /dev/md127 /dev/nvme1n1p2
      mdadm --run /dev/md127
      fsck.ext4 -n -f /dev/VolGroup/root

  When starting these fscks without -n, on the first device it seems the directory structure is OK, while on the second device there is only the lost+found folder left.

  Side-note: Another machine using HWE kernel 5.4.0-56 (after using -53 before) seems to have a quite similar issue.

  Unfortunately the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issues during FS creation. This bug, on the other hand, is triggered by a weekly service (fstrim), causing severe file system corruption.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help :
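The before/after mismatch check in the reproduction steps can be wrapped up as two small helpers. This is an editorial sketch, not part of the bug report; SYSFS and MD are overridable so the logic can be read, or exercised against a mock sysfs tree, without touching a live array:

```shell
SYSFS="${SYSFS:-/sys}"
MD="${MD:-md0}"

# Kick off a raid consistency check (completion is visible in /proc/mdstat).
request_check() {
    echo check > "$SYSFS/block/$MD/md/sync_action"
}

# Read the mismatch counter; non-zero after fstrim indicates the bug.
mismatch_count() {
    cat "$SYSFS/block/$MD/md/mismatch_cnt"
}
```

Usage follows the steps above: run request_check, wait for the check to finish, record mismatch_count, run fstrim on the mounted filesystem, then repeat; a jump from 0 reproduces the corruption described in this report.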
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Bionic)
   Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
   Status: New => In Progress

** Changed in: linux (Ubuntu Groovy)
   Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Groovy)
   Importance: Undecided => High

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Bionic: In Progress
Status in linux source package in Focal: In Progress
Status in linux source package in Groovy: In Progress
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
Re: [Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
> Ok, so there was a LOT happening in this thread, so I'd use some quick
> summary. Since what I'd like to know:
> 1) Does this cyrus-sasl2 fix both the adcli and sssd regressions?
> Since we reverted both as people were reporting regressions first for sssd
> and then for adcli - not sure which one was the actual cause of it though

The cyrus-sasl2 fix fixes the adcli regression, due to adcli changing to using GSS-SPNEGO by default, which was broken. sssd never had a regression in the first place, since its changes had nothing to do with GSS-SPNEGO. The confusion with sssd came from users who did not know that adcli is the program under the hood of realm, and thought that sssd had broken, when in reality it was adcli.

> 2) Does it need fixing for all the stable series where we updated adcli and
> (additionally) sssd?

cyrus-sasl2 is only broken in Bionic. Focal onward already have the patch and work fine.

Let me know if you have any more questions, happy to answer.

Thanks,
Matthew

On Tue, Dec 8, 2020 at 4:57 PM Matthew Ruffell wrote:
>
> Hello Eric and Lukasz,
>
> I have created new debdiffs for adcli. Please review and also sponsor one
> of them to -proposed.
>
> Since there are multiple versions of adcli floating around I made two
> debdiffs. Please choose the one most convenient / cleanest to apply.
>
> The first simply builds on top of 0.8.2-1ubuntu1 currently in -proposed, and
> is the version pull-lp-source pulls down. It simply adds the dependency on
> the fixed libsasl2-modules-gssapi-mit package with a greater than or equal
> to relationship.
>
> Use of this debdiff requires 0.8.2-1ubuntu2 to be deleted from the upload
> queue, and treated as 0.8.2-1ubuntu2 never existed.
> https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff
>
> Option two builds upon 0.8.2-1ubuntu2, and re-applies all of the --use-ldaps
> patches from the previous SRU which 0.8.2-1ubuntu2 reverts. It also adds the
> dependency on the fixed libsasl2-modules-gssapi-mit package with a greater
> than or equal to relationship.
>
> https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441873/+files/lp1906627_adcli_option_two.debdiff
>
> My preference is for option one, but use whatever is required. I only made
> both of these to lower round trip time due to timezones if you don't like
> the option one idea.
>
> Thanks,
> Matthew
>
> On Mon, Dec 7, 2020 at 3:25 PM Matthew Ruffell wrote:
> >
> > Hi Eric, Lukasz,
> >
> > Please review and potentially sponsor the cyrus-sasl2 debdiff attached
> > to LP1906627.
> >
> > [1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
> >
> > It fixes the root cause of the GSS-SPNEGO implementation being
> > incompatible with Microsoft's implementation in Active Directory.
> >
> > If you are still planning to re-release adcli and sssd to -security, then
> > you should also build cyrus-sasl2 in the same way:
> >
> > https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
> >
> > Again, I am sorry for causing the regression and these patches should fix
> > the underlying cause.
> >
> > Thanks,
> > Matthew

--
Mailing list: https://launchpad.net/~sts-sponsors
Post to : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help : https://help.launchpad.net/ListHelp
PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
# cat /sys/block/md0/md/mismatch_cnt
205324928

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 nvme1n1[1] nvme2n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices:

# cat /sys/block/md0/md/mismatch_cnt
205324928

Now, we need to take the raid10 array down, and perform a fsck on one disk at a time:

# umount /mnt
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

Let's do the first disk:

# mdadm --assemble /dev/md127 /dev/nvme1n1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
# mdadm --run /dev/md127
mdadm: started array /dev/md/lv-raid
# vgchange -a y /dev/VolGroup
  1 logical volume(s) in volume group "VolGroup" now active
# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127

The second disk:

# mdadm --assemble /dev/md127 /dev/nvme2n1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
# mdadm --run /dev/md127
mdadm: started array /dev/md/lv-raid
# vgchange -a y /dev/VolGroup
  1 logical volume(s) in volume group "VolGroup" now active
# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Resize inode not valid. Recreate? no
Pass 1: Checking inodes, blocks, and sizes
Inode 7 has illegal block(s). Clear? no
Illegal indirect block (1714656753) in inode 7. IGNORED.
Error while iterating over blocks in inode 7: Illegal indirect block found
/dev/VolGroup/root: ** WARNING: Filesystem still has errors **
e2fsck: aborted
/dev/VolGroup/root: ** WARNING: Filesystem still has errors **
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127

There are no panics or anything in dmesg. The directory structure of the first disk is intact, but the second disk only has lost+found present.

I can confirm it is the patches listed at the top of the email, but I have not had an opportunity to bisect to find the exact root cause. I will do that once we confirm which Ubuntu stable kernels are affected and begin reverting the patches.

Let me know if you need any more details.

Thanks,
Matthew Ruffell
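For reference, the single-member fsck procedure used in the session above can be expressed as a loop. This is an editorial sketch rather than the exact commands run: the device names are the ones from this session, and RUN defaults to echo so the script only prints what it would do until RUN is set empty and it is run as root:

```shell
RUN="${RUN:-echo}"   # dry run by default; set RUN= (empty) to actually execute

for dev in /dev/nvme1n1 /dev/nvme2n1; do
    $RUN mdadm --assemble /dev/md127 "$dev"
    $RUN mdadm --run /dev/md127                 # start the degraded array
    $RUN vgchange -a y /dev/VolGroup
    $RUN fsck.ext4 -n -f /dev/VolGroup/root     # -n: report only, never fix
    $RUN vgchange -a n /dev/VolGroup
    $RUN mdadm --stop /dev/md127
done
```

The -n flag matters here: it keeps fsck from "repairing" a member whose mirror copy may still hold the intact data.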
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Hi Thimo,

Thank you for the very detailed bug report. I will start investigating this immediately.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
Re: Bug Triage - Friday 4th December
Hi Christian,

> Maybe when you go for adcli and sssd in LP #1868703 again - they might
> have their dependency to libsasl2-modules-gssapi-mit be versioned to
> be greater or equal the fixed cyrus_sasl2?

That is an excellent idea. I will do exactly that. I have prepared a new debdiff for adcli which adds a dependency on libsasl2-modules-gssapi-mit at the new upload version of 2.1.27~101-g0780600+dfsg-3ubuntu2.2.

https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

Thanks for suggesting!
Matthew

On Tue, Dec 8, 2020 at 12:28 AM Christian Ehrhardt wrote:
>
> On Mon, Dec 7, 2020 at 3:45 AM Matthew Ruffell wrote:
> >
> > ...
> > Again, I apologise for the regression, and things are on their way to
> > being fixed.
>
> Thanks for jumping on it once it was identified.
>
> One suggestion for the coming related uploads.
> Do you think it would make sense to ensure that the now-known-bad
> combinations of packages won't be allowed together?
> Maybe when you go for adcli and sssd in LP #1868703 again - they might
> have their dependency to libsasl2-modules-gssapi-mit be versioned to
> be greater or equal the fixed cyrus_sasl2?
>
> > [1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff
> > [2] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627

--
ubuntu-server mailing list
ubuntu-server@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server
More info: https://wiki.ubuntu.com/ServerTeam
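Concretely, Christian's suggestion would look something like this in adcli's debian/control stanza (an illustrative sketch, not the actual packaging; only the versioned libsasl2-modules-gssapi-mit entry is the point, the other fields are abbreviated):

```
Package: adcli
Depends: ${shlibs:Depends}, ${misc:Depends},
         libsasl2-modules-gssapi-mit (>= 2.1.27~101-g0780600+dfsg-3ubuntu2.2)
```

With a versioned relation like this, apt upgrades the SASL GSSAPI module alongside adcli instead of leaving the known-bad combination installed.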
Re: [Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
Hello Eric and Lukasz,

I have created new debdiffs for adcli. Please review and also sponsor one of them to -proposed.

Since there are multiple versions of adcli floating around I made two debdiffs. Please choose the one most convenient / cleanest to apply.

The first simply builds on top of 0.8.2-1ubuntu1 currently in -proposed, and is the version pull-lp-source pulls down. It simply adds the dependency on the fixed libsasl2-modules-gssapi-mit package with a greater than or equal to relationship.

Use of this debdiff requires 0.8.2-1ubuntu2 to be deleted from the upload queue, and treated as 0.8.2-1ubuntu2 never existed.

https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

Option two builds upon 0.8.2-1ubuntu2, and re-applies all of the --use-ldaps patches from the previous SRU which 0.8.2-1ubuntu2 reverts. It also adds the dependency on the fixed libsasl2-modules-gssapi-mit package with a greater than or equal to relationship.

https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441873/+files/lp1906627_adcli_option_two.debdiff

My preference is for option one, but use whatever is required. I only made both of these to lower round trip time due to timezones if you don't like the option one idea.

Thanks,
Matthew

On Mon, Dec 7, 2020 at 3:25 PM Matthew Ruffell wrote:
>
> Hi Eric, Lukasz,
>
> Please review and potentially sponsor the cyrus-sasl2 debdiff attached
> to LP1906627.
>
> [1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
>
> It fixes the root cause of the GSS-SPNEGO implementation being incompatible
> with Microsoft's implementation in Active Directory.
>
> If you are still planning to re-release adcli and sssd to -security, then you
> should also build cyrus-sasl2 in the same way:
>
> https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
>
> Again, I am sorry for causing the regression and these patches should fix the
> underlying cause.
>
> Thanks,
> Matthew

--
Mailing list: https://launchpad.net/~sts-sponsors
Post to : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help : https://help.launchpad.net/ListHelp
[Touch-packages] [Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is option two: a debdiff for adcli which builds on 0.8.2-1ubuntu2, re-introduces all of the --use-ldaps patches, and adds a versioned Depends on the fixed libsasl2-modules-gssapi-mit (greater than or equal). Use this if option one is a no go.

** Patch added: "debdiff for adcli on Bionic option two"
   https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441873/+files/lp1906627_adcli_option_two.debdiff

--
You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu.
https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression

Status in adcli package in Ubuntu: Fix Released
Status in cyrus-sasl2 package in Ubuntu: Fix Released
Status in adcli source package in Bionic: In Progress
Status in cyrus-sasl2 source package in Bionic: In Progress

Bug description:

  [Impact]

  A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a regression for some users when attempting to join an Active Directory realm. adcli introduced a default behaviour change, moving from GSS-API to GSS-SPNEGO as the default channel encryption algorithm.

  adcli uses the GSS-SPNEGO implementation from libsasl2-modules-gssapi-mit, a part of cyrus-sasl2. The implementation seems to have some compatibility issues with particular configurations of Active Directory on recent Windows Server systems. In particular, adcli sends an ldap query to the domain controller, which responds with a tcp ack, but never returns an ldap response. The connection just hangs at this point and no more traffic is sent.
  You can see it in the packet trace below:

  https://paste.ubuntu.com/p/WRnnRMGBPm/

  On Focal, where the implementation of GSS-SPNEGO is working, we see a full exchange, and adcli works as expected:

  https://paste.ubuntu.com/p/8668pJrr2m/

  The fix is to not assume use of confidentiality and integrity modes, and instead use the flags negotiated by GSS-API during the initial handshake, as required by Microsoft's implementation.

  [Testcase]

  You will need to set up a Windows Server 2019 system, install and configure Active Directory, enable LDAP extensions, configure LDAPS, and import the AD SSL certificate to the Ubuntu client. Create some users in Active Directory.

  On the Ubuntu client, set up /etc/hosts with the hostname of the Windows Server machine, if your system isn't configured for AD DNS.

  From there, install adcli 0.8.2-1 from -release:

  $ sudo apt install adcli

  Set up a packet trace with tcpdump:

  $ sudo tcpdump -i any port '(389 or 3268 or 636 or 3269)'

  Next, join the AD realm using the normal GSS-API:

  # adcli join --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL

  You will be prompted for Administrator's password. The output should look like the below:

  https://paste.ubuntu.com/p/NWHGQn746D/

  Next, enable -proposed, and install adcli 0.8.2-1ubuntu1, which caused the regression. Repeat the above steps. Now you should see the connection hang.

  https://paste.ubuntu.com/p/WRnnRMGBPm/

  Finally, install the fixed cyrus-sasl2 package, which is available from the below ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp1906627-test

  $ sudo add-apt-repository ppa:mruffell/lp1906627-test
  $ sudo apt-get update
  $ sudo apt install libsasl2-2 libsasl2-modules libsasl2-modules-db libsasl2-modules-gssapi-mit

  Repeat the steps.
GSS-SPNEGO should be working as intended, and you should get output like below: https://paste.ubuntu.com/p/W5cJNGvCsx/ [Where problems could occur] Since we are changing the implementation of GSS-SPNEGO, and cyrus- sasl2 is the library which provides it, we can potentially break any package which depends on libsasl2-modules-gssapi-mit for GSS-SPNEGO. $ apt rdepends libsasl2-modules-gssapi-mit libsasl2-modules-gssapi-mit Reverse Depends: |Suggests: ldap-utils Depends: adcli Conflicts: libsasl2-modules-gssapi-heimdal |Suggests: libsasl2-modules Conflicts: libsasl2-modules-gssapi-heimdal |Recommends: sssd-krb5-common |Suggests: slapd |Suggests: libsasl2-modules |Suggests: ldap-utils |Depends: msktutil Conflicts: libsasl2-modules-gssapi-heimdal |Depends: libapache2-mod-webauthldap Depends: freeipa-server Depends: freeipa-client Depends: adcli Depends: 389-ds-base |Recommends: sssd-krb5-common |Suggests: slapd |Suggests: libsasl2-modules While this SRU makes cyrus-sasl2 work with Microsoft implementations of GSS-SPNEGO, which will be the more common usecase, it may change the behaviour when connecting to a MIT krb5 server with the GSS-SPNEGO protocol, as krb5 assumes use of confidentiality and integrity
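The essence of the fix described in the bug report above — use the security
layers actually negotiated by the GSS-API handshake instead of assuming
confidentiality and integrity — can be reduced to a small sketch. The
following Python snippet is purely illustrative (the real fix lives in
cyrus-sasl2's C plugin); the constants and function names are invented for
the example, and the values only conceptually mirror flags like
GSS_C_CONF_FLAG / GSS_C_INTEG_FLAG:

```python
# Illustrative sketch, NOT the cyrus-sasl2 source. The regression boiled
# down to the client *assuming* the confidentiality and integrity security
# layers, rather than using what the GSS-API handshake negotiated.

CONF = 0x1    # confidentiality layer negotiated (think GSS_C_CONF_FLAG)
INTEG = 0x2   # integrity layer negotiated (think GSS_C_INTEG_FLAG)

def old_behaviour(negotiated_flags: int) -> int:
    """Buggy: always claim conf+integ, ignoring the handshake result."""
    return CONF | INTEG

def fixed_behaviour(negotiated_flags: int) -> int:
    """Fixed: advertise only the layers GSS-API actually negotiated."""
    return negotiated_flags & (CONF | INTEG)

# Against a domain controller that negotiated neither layer, the old
# behaviour disagrees with the server and the LDAP exchange stalls; the
# fixed behaviour stays consistent with the handshake.
print(old_behaviour(0))               # 3 -> mismatch with the server
print(fixed_behaviour(0))             # 0 -> matches the server
print(fixed_behaviour(CONF | INTEG))  # 3 when both layers were negotiated
```

Broadly speaking, the actual patches consult the flags produced during
GSS-API context establishment to make this decision, rather than
hard-coding the layers.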
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is option one: a debdiff for adcli, which builds on 0.8.2-1ubuntu1
and simply adds a versioned Depends on the fixed libsasl2-modules-gssapi-mit
(greater than or equal to the fixed version). This will require the
0.8.2-1ubuntu2 package in the -unapproved queue to be deleted.

** Patch added: "debdiff for adcli on Bionic"
   https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
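The versioned dependency described in the option-one debdiff could be
expressed in debian/control roughly as below. This is a sketch under
assumptions: the exact fixed version of libsasl2-modules-gssapi-mit is not
stated in this message, so a placeholder is used instead of a real version
number.

```
Package: adcli
Depends: libsasl2-modules-gssapi-mit (>= <fixed version>),
         ${shlibs:Depends}, ${misc:Depends}
```

A versioned Depends like this makes apt pull in the fixed SASL plugin
whenever the patched adcli is installed, so the new GSS-SPNEGO default can
never run against the broken plugin.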
Re: Bug Triage - Friday 4th December
Status update:

- There is a new build of adcli, version 0.8.2-1ubuntu2, which reverts the
  patches introduced in the previous build, on the -unapproved queue in
  -proposed. This is likely to be released to fix anyone using the faulty
  0.8.2-1ubuntu1 package.

- As mentioned in previous messages, I have determined the root cause of
  the failure to be an incompatible implementation of GSS-SPNEGO in
  cyrus-sasl2, and I have created a debdiff which fixes the problem [1].

- I have added an SRU template for cyrus-sasl2 in [2], and asked for the
  changes to be sponsored and placed into -proposed.

This regression will be resolved when either the cyrus-sasl2 fixes have
made their way to -updates, likely in a week's time, or when the adcli
package with the reverted patches is released. Once the fixed cyrus-sasl2
is released, we will re-perform verification on the changes to adcli and
sssd in LP #1868703, and hopefully go for release again.

Again, I apologise for the regression; things are on their way to being
fixed.

Thanks,
Matthew

[1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff
[2] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627

On Sat, Dec 5, 2020 at 3:32 PM Matthew Ruffell wrote:
>
> Status update:
>
> - all recent releases of sssd and adcli have been pulled from -updates
>   and -security, and placed back into -proposed.
>
> - I made a debdiff to revert the problematic patches for adcli in Bionic,
>   Lukasz has built it in
>   https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
>
> - Currently waiting for adcli - 0.8.2-1ubuntu2 to be bin-synced from the
>   above ppa to bionic-proposed for testing.
>
> - We need to release adcli - 0.8.2-1ubuntu2 to -updates and -security
>   after.
>
> - I have written to customers and confirmed the regression to be limited
>   to adcli on Bionic, and given them instructions to downgrade to the
>   version in the -release pocket.
>
> Again, I am sorry for causing the regression. On Monday I will begin
> fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO
> implementation.
>
> Thanks,
> Matthew
>
> On Sat, Dec 5, 2020 at 12:33 PM Matthew Ruffell wrote:
> >
> > Hi everyone,
> >
> > Firstly, I deeply apologise for causing the regression.
> >
> > Even with three separate people testing the test packages and the
> > packages in -proposed, the failure still went unnoticed. I should have
> > considered the impacts of changing the default behaviour of adcli a
> > little more deeply than treating it like a normal SRU.
> >
> > Here are the facts:
> >
> > The failure is limited to adcli, version 0.8.2-1ubuntu1 on Bionic. At
> > the time of writing, it is still in the archive. To archive admins,
> > this needs to be pulled.
> >
> > adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in
> > Groovy and 0.9.0-1ubuntu2 in Hirsute are not affected.
> >
> > sssd 1.16.1-1ubuntu1.7 in Bionic, and 2.2.3-3ubuntu0.1 in Focal are
> > not affected.
> >
> > Bug Reports:
> >
> > There are two launchpad bugs open:
> >
> > LP #1906627 "adcli fails, can't contact LDAP server"
> > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
> >
> > LP #1906673 "Realm join hangs"
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673
> >
> > Customer Cases:
> >
> > SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain"
> > https://canonical.my.salesforce.com/5004K03u9EW
> >
> > SF 00299039 "Regression Issue due to
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673"
> > https://canonical.my.salesforce.com/5004K03uAkL
> >
> > Root Cause:
> >
> > The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD
> > requirements (ADV190023)"
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703
> >
> > introduced two changes for adcli on Bionic. The first was to change
> > from GSS-API to GSS-SPNEGO, and the second was to implement support
> > for the --use-ldaps flag.
> >
> > I built an upstream master of adcli, and it still fails on Ubuntu.
> > This indicates that the failure is not actually in the adcli package.
> > adcli does not implement GSS-SPNEGO; it is linked in from the
> > libsasl2-modules-gssapi-mit package, which is a part of cyrus-sasl2.
> >
> > I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic,
> > and it works with the proble
Re: [Sts-sponsors] sssd/adcli regression after last upload
>
> Again, I am sorry for causing the regression. On Monday I will begin
> fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO
> implementation.
>
> Thanks,
> Matthew
>
> On Sat, Dec 5, 2020 at 12:23 PM Sergio Durigan Junior wrote:
> >
> > On Friday, December 04 2020, Matthew Ruffell wrote:
> >
> > > Hi everyone,
> > >
> > > Firstly, I deeply apologise for causing the regression.
> >
> > Thanks for working on this and for the detailed analysis, Matthew.
> >
> > --
> > Sergio
> > GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14

--
Mailing list: https://launchpad.net/~sts-sponsors
Post to     : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help   : https://help.launchpad.net/ListHelp
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
** Tags added: sts-sponsor -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
Hi Eric, Lukasz,

Please review and potentially sponsor the cyrus-sasl2 debdiff attached to
LP #1906627 [1].

[1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627

It fixes the root cause of the GSS-SPNEGO implementation being incompatible
with Microsoft's implementation in Active Directory.

If you are still planning to re-release adcli and sssd to -security, then
you should also build cyrus-sasl2 in the same way:
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages

Again, I am sorry for causing the regression; these patches should fix the
underlying cause.

Thanks,
Matthew
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is a debdiff for cyrus-sasl2 on Bionic, which resolves the incompatibilities of the GSS-SPNEGO implementation with the one in Active Directory. ** Patch added: "cyrus-sasl2 debdiff for Bionic" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Touch-packages] [Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is a debdiff for cyrus-sasl2 on Bionic, which resolves the incompatibilities of the GSS-SPNEGO implementation with the one in Active Directory.

** Patch added: "cyrus-sasl2 debdiff for Bionic" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff

-- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu. https://bugs.launchpad.net/bugs/1906627

Title: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression

Status in adcli package in Ubuntu: Fix Released
Status in cyrus-sasl2 package in Ubuntu: Fix Released
Status in adcli source package in Bionic: In Progress
Status in cyrus-sasl2 source package in Bionic: In Progress

Bug description:

[Impact]

A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a regression for some users when attempting to join an Active Directory realm. adcli introduced a default behaviour change, moving from GSS-API to GSS-SPNEGO as the default channel encryption algorithm.

adcli uses the GSS-SPNEGO implementation from libsasl2-modules-gssapi-mit, a part of cyrus-sasl2. The implementation seems to have some compatibility issues with particular configurations of Active Directory on recent Windows Server systems.

Particularly, adcli sends an ldap query to the domain controller, which responds with a tcp ack, but never returns an ldap response. The connection just hangs at this point and no more traffic is sent.
You can see it on the packet trace below: https://paste.ubuntu.com/p/WRnnRMGBPm/

On Focal, where the implementation of GSS-SPNEGO is working, we see a full exchange, and adcli works as expected: https://paste.ubuntu.com/p/8668pJrr2m/

The fix is to not assume use of confidentiality and integrity modes, and instead use the flags negotiated by GSS-API during the initial handshake, as required by Microsoft's implementation.

[Testcase]

You will need to set up a Windows Server 2019 system, install and configure Active Directory, enable LDAP extensions, configure LDAPS, and import the AD SSL certificate to the Ubuntu client. Create some users in Active Directory. On the Ubuntu client, set up /etc/hosts with the hostname of the Windows Server machine if your system isn't configured for AD DNS.

From there, install adcli 0.8.2-1 from -release:

$ sudo apt install adcli

Set up a packet trace with tcpdump:

$ sudo tcpdump -i any port '(389 or 3268 or 636 or 3269)'

Next, join the AD realm using the normal GSS-API:

# adcli join --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL

You will be prompted for Administrator's password. The output should look like the below: https://paste.ubuntu.com/p/NWHGQn746D/

Next, enable -proposed and install adcli 0.8.2-1ubuntu1, which caused the regression. Repeat the above steps. Now you should see the connection hang: https://paste.ubuntu.com/p/WRnnRMGBPm/

Finally, install the fixed cyrus-sasl2 package, which is available from the below ppa: https://launchpad.net/~mruffell/+archive/ubuntu/lp1906627-test

$ sudo add-apt-repository ppa:mruffell/lp1906627-test
$ sudo apt-get update
$ sudo apt install libsasl2-2 libsasl2-modules libsasl2-modules-db libsasl2-modules-gssapi-mit

Repeat the steps.
GSS-SPNEGO should be working as intended, and you should get output like below: https://paste.ubuntu.com/p/W5cJNGvCsx/

[Where problems could occur]

Since we are changing the implementation of GSS-SPNEGO, and cyrus-sasl2 is the library which provides it, we can potentially break any package which depends on libsasl2-modules-gssapi-mit for GSS-SPNEGO.

$ apt rdepends libsasl2-modules-gssapi-mit
libsasl2-modules-gssapi-mit
Reverse Depends:
 |Suggests: ldap-utils
  Depends: adcli
  Conflicts: libsasl2-modules-gssapi-heimdal
 |Suggests: libsasl2-modules
  Conflicts: libsasl2-modules-gssapi-heimdal
 |Recommends: sssd-krb5-common
 |Suggests: slapd
 |Suggests: libsasl2-modules
 |Suggests: ldap-utils
 |Depends: msktutil
  Conflicts: libsasl2-modules-gssapi-heimdal
 |Depends: libapache2-mod-webauthldap
  Depends: freeipa-server
  Depends: freeipa-client
  Depends: adcli
  Depends: 389-ds-base
 |Recommends: sssd-krb5-common
 |Suggests: slapd
 |Suggests: libsasl2-modules

While this SRU makes cyrus-sasl2 work with Microsoft implementations of GSS-SPNEGO, which is the more common use case, it may change the behaviour when connecting to an MIT krb5 server with the GSS-SPNEGO protocol, as krb5 assumes use of confidentiality and integrity modes. This shouldn't be a problem, as the krb5 implementation signals its intentions by setting the correct flags
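The fix described in the bug (honouring the flags GSS-API actually negotiated instead of assuming confidentiality and integrity are always granted) can be sketched as follows. This is an illustrative shell sketch, not the actual cyrus-sasl2 code; the flag bit values follow RFC 2744, and the function name is invented:

```shell
# Illustration only: pick the SASL security layer from the flags that
# GSS-API negotiated during the handshake, rather than assuming
# confidentiality + integrity are available. Bit values per RFC 2744.
GSS_C_CONF_FLAG=16    # confidentiality (encryption) granted
GSS_C_INTEG_FLAG=32   # integrity (MIC) granted

select_layer() {
    flags=$1
    if [ $(( flags & GSS_C_CONF_FLAG )) -ne 0 ]; then
        echo "confidentiality"
    elif [ $(( flags & GSS_C_INTEG_FLAG )) -ne 0 ]; then
        echo "integrity"
    else
        # An AD domain controller may grant neither flag for SASL
        # GSS-SPNEGO over LDAP; code that assumes both here
        # desynchronises the connection, matching the observed hang.
        echo "none"
    fi
}

select_layer $(( GSS_C_CONF_FLAG | GSS_C_INTEG_FLAG ))   # prints: confidentiality
select_layer 0                                           # prints: none
```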
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
** Summary changed: - adcli fails, can't contact LDAP server + GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression ** Description changed: - Package: adcli - Version: 0.8.2-1ubuntu1 - Release: Ubuntu 18.04 LTS + [Impact] - When trying to join the domain with this new version of adcli, it gets - to the point of 'Using GSS-SPNEGO for SASL bind' and then it will not do - anything for 10 minutes. It will then fail, complaining it can't reach - the LDAP server. + A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a + regression for some users when attempting to join a Active Directory + realm. adcli introduced a default behaviour change, moving from GSS-API + to GSS-SPNEGO as the default channel encryption algorithm. - Logs: - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com - Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com - Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind - Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup domain short name: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup domain short name: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: process exited: 6459 - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain + adcli uses the GSS-SPNEGO implementation from libsasl2-modules-gssapi- + mit, a part of cyrus-sasl2. The implementation seems to have some + compatibility issues with particular configurations of Active Directory + on recent Windows Server systems. - On the network level, adcli gets to the point of send an ldap query to - the domain controller and the domain controller returns an ack tcp - packet, but then there is no more traffic between the domain controller - and the server except for ntp packets until it fails. + Particularly, adcli sends a ldap query to the domain controller, which + responds with a tcp ack, but never returns a ldap response. The + connection just hangs at this point and no more traffic is sent. - The domain controller traffic also shows that it is receiving the ldap - query packet from the server but it never sends a
Re: Bug Triage - Friday 4th December
Hi everyone,

Firstly, I deeply apologise for causing the regression. Even with three separate people testing the test packages and the packages in -proposed, the failure still went unnoticed. I should have considered the impacts of changing the default behaviour of adcli a little more deeply than treating it like a normal SRU.

Here are the facts:

The failure is limited to adcli, version 0.8.2-1ubuntu1 on Bionic. At the time of writing, it is still in the archive. To archive admins, this needs to be pulled.

adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in Groovy and 0.9.0-1ubuntu2 in Hirsute are not affected. sssd 1.16.1-1ubuntu1.7 in Bionic, and 2.2.3-3ubuntu0.1 in Focal, are not affected.

Bug Reports:

There are two Launchpad bugs open:

LP #1906627 "adcli fails, can't contact LDAP server" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
LP #1906673 "Realm join hangs" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673

Customer Cases:

SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain" https://canonical.my.salesforce.com/5004K03u9EW
SF 00299039 "Regression Issue due to https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673" https://canonical.my.salesforce.com/5004K03uAkL

Root Cause:

The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD requirements (ADV190023)" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703 introduced two changes for adcli on Bionic. The first was to change from GSS-API to GSS-SPNEGO, and the second was to implement support for the flag --use-ldaps.

I built an upstream master of adcli, and it still fails on Ubuntu. This indicates that the failure is not actually in the adcli package. adcli does not implement GSS-SPNEGO; it is linked in from the libsasl2-modules-gssapi-mit package, which is a part of cyrus-sasl2.

I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic, and it works with the problematic adcli package.
The root cause is that the implementation of GSS-SPNEGO in cyrus-sasl2 on Bionic is broken, and has never worked. There are more details about the commits the cyrus-sasl2 package in Bionic is missing in comment #5 of LP #1906627: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/comments/5

Steps taken yesterday:

I added regression-update to LP #1906627, and I pinged ubuntu-archive in #ubuntu-release with these details, but they seem to have been lost in the noise. Located root cause to cyrus-sasl2 on Bionic.

Next steps:

We don't need to revert any changes for adcli or sssd on Focal onward. We don't need to revert any changes on sssd on Bionic. We need to push a new adcli into Bionic with the recent patches reverted. We need to fix the GSS-SPNEGO implementation in cyrus-sasl2 in Bionic. We need to re-release all the SRUs from LP #1868703 after some very thorough testing and validation.

Again, I am deeply sorry for causing this regression. I will fix it, starting with getting adcli removed from the Bionic archive.

Thanks,
Matthew

On Fri, Dec 4, 2020 at 10:40 PM Lukasz Zemczak wrote: > > Hey! > > I prefer broken upgrades to get pulled anyway. Besides, packages are > updated by unattended-upgrades in up-to 24 hours, so some users might > have not gotten it yet. And there's also those not using > undattended-upgrades. Let me demote it back to -proposed from -updates > as well. > > On Fri, 4 Dec 2020 at 10:00, Christian Ehrhardt > wrote: > > > > On Fri, Dec 4, 2020 at 9:49 AM Lukasz Zemczak > > wrote: > > > > > > Hey Christian! > > > > > > This sounds bad indeed, let's see what Matthew has to say. In the > > > meantime I have backed it out from both bionic-security and > > > focal-security. > > > > Thank you > > > > > Should we also consider dropping it from -updates? > > > > Well, compared to other cases in this case we don't even yet have a > > "ok this is a mess, but this is how you can resolve it afterwards to > > work again".
> > Therefore I think pulling it from -updates as well makes sense until > > Matthew had time to look at it in detail and give all-clear (or not). > > > > P.S.: you slightly raced vorlon who had a different assessment > > [09:30] cpaelzer: well, by this point almost everyone will > > have picked it up from security via unattended-upgrades so there's not > > much point > > But having it pulled for now is on the safe-side and we can re-instate > > it at any time once we know more. > > > > > Cheers, > > > > > > On Fri, 4 Dec 2020 at 09:01, Christian Ehrhardt > > > wrote: > > > > > > > > I was looking at 16 recently touched bugs. Of these a few needed a > > > > comment or > > > > task update but not a lot of work. Worth to mention are two of them. > > > > > > > > First we've had "one more" kind of conflicting mysql packages from > > > > third party breaking install/upgrade of the one provided by Ubuntu. I > > > > dupped it onto bug 1771630 which is our single place to unite all > > > > those. > > > > > > > >
Re: Bug Triage - Friday 4th December
Status update:

- all recent releases of sssd and adcli have been pulled from -updates and -security, and placed back into -proposed.
- I made a debdiff to revert the problematic patches for adcli in Bionic; Lukasz has built it in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
- Currently waiting for adcli 0.8.2-1ubuntu2 to be bin-synced from the above ppa to bionic-proposed for testing.
- We need to release adcli 0.8.2-1ubuntu2 to -updates and -security after.
- I have written to customers and confirmed the regression to be limited to adcli on Bionic, and given them instructions to downgrade to the version in the -release pocket.

Again, I am sorry for causing the regression. On Monday I will begin fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO implementation.

Thanks,
Matthew

On Sat, Dec 5, 2020 at 12:33 PM Matthew Ruffell wrote: > > Hi everyone, > > Firstly, I deeply apologise for causing the regression. > > Even with three separate people testing the test packages and the packages in > -proposed, the failure still went unnoticed. I should have considered > the impacts > of changing the default behaviour of adcli a little more deeply than treating > it > like a normal SRU. > > Here are the facts: > > The failure is limited to adcli, version 0.8.2-1ubuntu1 on Bionic. At the time > of writing, it is still in the archive. To archive admins, this needs > to be pulled. > > adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in Groovy and > 0.9.0-1ubuntu2 in Hirsute are not affected. > > sssd 1.16.1-1ubuntu1.7 in Bionic, and 2.2.3-3ubuntu0.1 in Focal are > not affected.
> > Bug Reports: > > There are two launchpad bugs open: > > LP #1906627 "adcli fails, can't contact LDAP server" > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627 > > LP #1906673 "Realm join hangs" > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673 > > Customer Cases: > > SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain" > https://canonical.my.salesforce.com/5004K03u9EW > > SF 00299039 "Regression Issue due to > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673; > https://canonical.my.salesforce.com/5004K03uAkL > > Root Cause: > > The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD > requirements (ADV190023)" > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703 > > introduced two changes for adcli on Bionic. The first, was to change from > GSS-API to GSS-SPNEGO, and the second was to implement support for the flag > --use-ldaps. > > I built a upstream master of adcli, and it still fails on Ubuntu. This > indicates > that the failure is not actually in the adcli package. adcli does not > implement > GSS-SPNEGO, it is linked in from the libsasl2-modules-gssapi-mit package, > which is a part of cyrus-sasl2. > > I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic, and it > works with the problematic adcli package. > > The root cause is that the implementation of GSS-SPNEGO in cyrus-sasl2 on > Bionic is broken, and has never worked. > > There is more details about commits which the cyrus-sasl2 package in Bionic is > missing in comment #5 in LP #1906627. > > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/comments/5 > > Steps taken yesterday: > > I added regression-update to LP #1906627, and I pinged ubuntu-archive in > #ubuntu-release with these details, but they seem to have been lost in the > noise. > > Located root cause to cryus-sasl2 on Bionic. > > Next steps: > > We don't need to revert any changes for adcli or sssd on Focal onward. 
> > We don't need to revert any changes on sssd on Bionic. > > We need to push a new adcli into Bionic with the recent patches reverted. > > We need to fix the GSS-SPNEGO implementation in cyrus-sasl2 in Bionic. > > We need to re-release all the SRUs from LP #1868703 after some very thorough > testing and validation. > > Again, I am deeply sorry for causing this regression. I will fix it, starting > with getting adcli removed from the Bionic archive. > > Thanks, > Matthew > > On Fri, Dec 4, 2020 at 10:40 PM Lukasz Zemczak > wrote: > > > > Hey! > > > > I prefer broken upgrades to get pulled anyway. Besides, packages are > > updated by unattended-upgrades in up-to 24 hours, so some users might > > have not gotten it yet. And there's also those not using > > undattended-upgrades. Let me demote it back to -proposed from -updates > > as well. > > > > On Fri, 4 Dec 2020 at 10:00, Christian Ehrhardt > > wrote: > &
Re: [Sts-sponsors] sssd/adcli regression after last upload
Status update:

- all recent releases of sssd and adcli have been pulled from -updates and -security, and placed back into -proposed.
- I made a debdiff to revert the problematic patches for adcli in Bionic; Lukasz has built it in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
- Currently waiting for adcli 0.8.2-1ubuntu2 to be bin-synced from the above ppa to bionic-proposed for testing.
- We need to release adcli 0.8.2-1ubuntu2 to -updates and -security after.
- I have written to customers and confirmed the regression to be limited to adcli on Bionic, and given them instructions to downgrade to the version in the -release pocket.

Again, I am sorry for causing the regression. On Monday I will begin fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO implementation.

Thanks,
Matthew

On Sat, Dec 5, 2020 at 12:23 PM Sergio Durigan Junior wrote: > > On Friday, December 04 2020, Matthew Ruffell wrote: > > > Hi everyone, > > > > Firstly, I deeply apologise for causing the regression. > > Thanks for working on this and for the detailed analysis, Matthew. > > -- > Sergio > GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14 -- Mailing list: https://launchpad.net/~sts-sponsors Post to : sts-sponsors@lists.launchpad.net Unsubscribe : https://launchpad.net/~sts-sponsors More help : https://help.launchpad.net/ListHelp
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu) Status: Confirmed => Fix Released -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Touch-packages] [Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu) Status: Confirmed => Fix Released -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server Status in adcli package in Ubuntu: Fix Released Status in cyrus-sasl2 package in Ubuntu: Fix Released Status in adcli source package in Bionic: In Progress Status in cyrus-sasl2 source package in Bionic: In Progress Bug description: Package: adcli Version: 0.8.2-1ubuntu1 Release: Ubuntu 18.04 LTS When trying to join the domain with this new version of adcli, it gets to the point of 'Using GSS-SPNEGO for SASL bind' and then it will not do anything for 10 minutes. It will then fail, complaining it can't reach the LDAP server. Logs: Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: process exited: 6459 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain

On the network level, adcli gets to the point of sending an ldap query to the domain controller, and the domain controller returns a tcp ack packet, but then there is no more traffic between the domain controller and the server except for ntp packets until it fails.

The domain controller traffic also shows that it is receiving the ldap query packet from the server, but it never sends a reply, and there is no log in directory services regarding the query. We also couldn't find anything in procmon regarding this query either.

Workaround/Fix: Downgrading the adcli package back to version 0.8.2-1 fixes the issue and domain join works properly again.

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions --
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu Bionic) Status: Confirmed => In Progress ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Importance: Undecided => Medium ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Assignee: (unassigned) => Matthew Ruffell (mruffell) -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Touch-packages] [Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu Bionic) Status: Confirmed => In Progress ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Importance: Undecided => Medium ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Assignee: (unassigned) => Matthew Ruffell (mruffell) -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server Status in adcli package in Ubuntu: Fix Released Status in cyrus-sasl2 package in Ubuntu: Confirmed Status in adcli source package in Bionic: In Progress Status in cyrus-sasl2 source package in Bionic: In Progress Bug description: Package: adcli Version: 0.8.2-1ubuntu1 Release: Ubuntu 18.04 LTS When trying to join the domain with this new version of adcli, it gets to the point of 'Using GSS-SPNEGO for SASL bind' and then it will not do anything for 10 minutes. It will then fail, complaining it can't reach the LDAP server. Logs: Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: process exited: 6459 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain On the network level, adcli gets to the point of sending an LDAP query to the domain controller, and the domain controller returns a TCP ACK packet, but then there is no more traffic between the domain controller and the server except for NTP packets until it fails. The domain controller traffic also shows that it is receiving the LDAP query packet from the server, but it never sends a reply, and there is no log in directory services regarding the query. We also couldn't find anything in procmon regarding this query either. Workaround/Fix: Downgrading the adcli package back to version 0.8.2-1 fixes the issues and domain join works properly again. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions --
[Bug 1906673] Re: Realm join hangs
*** This bug is a duplicate of bug 1906627 *** https://bugs.launchpad.net/bugs/1906627 ** This bug has been marked a duplicate of bug 1906627 adcli fails, can't contact LDAP server -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906673 Title: Realm join hangs To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1906627] Re: adcli fails, can't contact LDAP server
Attached is a debdiff to revert the changes we made to adcli to restore functionality to GSS-API. ** Patch added: "Debdiff for adcli on Bionic" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441133/+files/lp1906627_adcli_bionic.debdiff -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
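For anyone who wants to verify the revert locally before it lands in the archive, a test build with the attached debdiff applied might look like the following sketch. The debdiff filename is the one attached above; the unpacked directory name and the presence of deb-src entries for bionic are assumptions.

```shell
# Hedged sketch: rebuild Bionic's adcli with the attached debdiff applied.
# Assumes deb-src lines for bionic are enabled in sources.list.
sudo apt install -y devscripts build-essential
apt-get source adcli                          # unpacks the bionic source tree
sudo apt-get build-dep -y adcli
cd adcli-0.8.2                                # directory name may differ
patch -p1 < ../lp1906627_adcli_bionic.debdiff
debuild -us -uc                               # unsigned local build
sudo dpkg -i ../adcli_*.deb
```

After installing, retry the realm/adcli join that previously hung to confirm the revert restores the old GSS-API behaviour.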
Re: [Sts-sponsors] sssd/adcli regression after last upload
Hi everyone,

Firstly, I deeply apologise for causing the regression. Even with three separate people testing the test packages and the packages in -proposed, the failure still went unnoticed. I should have considered the impact of changing the default behaviour of adcli more deeply, rather than treating it like a normal SRU.

Here are the facts: the failure is limited to adcli version 0.8.2-1ubuntu1 on Bionic. At the time of writing, it is still in the archive. To archive admins: this needs to be pulled. adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in Groovy and 0.9.0-1ubuntu2 in Hirsute are not affected. sssd 1.16.1-1ubuntu1.7 in Bionic and 2.2.3-3ubuntu0.1 in Focal are not affected.

Bug Reports: There are two Launchpad bugs open:
LP #1906627 "adcli fails, can't contact LDAP server" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
LP #1906673 "Realm join hangs" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673

Customer Cases:
SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain" https://canonical.my.salesforce.com/5004K03u9EW
SF 00299039 "Regression Issue due to https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673; https://canonical.my.salesforce.com/5004K03uAkL

Root Cause: The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD requirements (ADV190023)" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703 introduced two changes for adcli on Bionic. The first was to change from GSS-API to GSS-SPNEGO, and the second was to implement support for the --use-ldaps flag. I built an upstream master of adcli, and it still fails on Ubuntu. This indicates that the failure is not actually in the adcli package. adcli does not implement GSS-SPNEGO; it is linked in from the libsasl2-modules-gssapi-mit package, which is part of cyrus-sasl2. I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic, and it works with the problematic adcli package.

The root cause is that the implementation of GSS-SPNEGO in cyrus-sasl2 on Bionic is broken, and has never worked. There are more details about the commits the Bionic cyrus-sasl2 package is missing in comment #5 of LP #1906627: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/comments/5

Steps taken yesterday: I added regression-update to LP #1906627, and I pinged ubuntu-archive in #ubuntu-release with these details, but they seem to have been lost in the noise. Located the root cause in cyrus-sasl2 on Bionic.

Next steps:
We don't need to revert any changes for adcli or sssd on Focal onward.
We don't need to revert any changes on sssd on Bionic.
We need to push a new adcli into Bionic with the recent patches reverted.
We need to fix the GSS-SPNEGO implementation in cyrus-sasl2 in Bionic.
We need to re-release all the SRUs from LP #1868703 after some very thorough testing and validation.

Again, I am deeply sorry for causing this regression. I will fix it, starting with getting adcli removed from the Bionic archive.

Thanks,
Matthew

On Sat, Dec 5, 2020 at 10:37 AM Jamie Strandboge wrote:
> Looping in security@
> On Fri, 04 Dec 2020, Sergio Durigan Junior wrote:
> > Hi Matthew,
> >
> > How are things? I'm writing to you because the last upload to sssd/adcli introduced a regression that is causing "realm join" to hang. The bug in question is this one:
> >
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673
> >
> > There is also a SalesForce case opened from AWS:
> >
> > https://canonical.my.salesforce.com/5004K03uAkLQAU
> >
> > (I don't have access to it, but cnewcomer said it's basically the same issue, but that AWS is actually reporting it against adcli).
> >
> > I am not entirely sure whether this bug affects both sssd and adcli, or just one of them. It is possible that this is just affecting adcli, based on input from Tobias Karnat, but we have to investigate this further.
> >
> > This regression was introduced because of the work done here:
> >
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703
> >
> > Lukasz (sil2100) has already pulled the sssd package from the -security/-update pockets. I've asked him to also pull the adcli package. At the time of this writing, he hasn't done that yet (he had to go AFK), but he told me he would. In any case, this is not going to help much because by now most systems probably have the updates already because of unattended-upgrades.
> >
> > Having said all that, would it be possible for you to handle this issue? I can offer any help you need, of course, but I feel like you already have all the context in your head and would be able to make progress much faster.
> >
> > Thanks in advance,
> >
> > --
> > Sergio
> > GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14
>
> --
> Jamie Strandboge | http://www.canonical.com

-- Mailing list: https://launchpad.net/~sts-sponsors Post
[Bug 1906627] Re: adcli fails, can't contact LDAP server
Yes, when --use-ldaps is specified, adcli will make a TLS connection to the domain controller and speak LDAPS. This works, and is the reason why this bug slipped through our regression testing; I should have tested without the --use-ldaps flag as well. Regardless, this bug seems to be caused by the GSS-SPNEGO implementation in the cyrus-sasl2 package being broken. adcli links to libsasl2-modules-gssapi-mit, which is part of cyrus-sasl2, since adcli does not implement GSS-SPNEGO itself and relies on the cyrus-sasl libraries. I downloaded the source package of cyrus-sasl2 2.1.27+dfsg-2 from Focal, built it on Bionic, and installed it. I then tried an adcli join, and it worked: https://paste.ubuntu.com/p/R8PyHJMNtT/

Looking at the cyrus-sasl2 source repo, it seems the Bionic version is missing a lot of commits related to GSS-SPNEGO support.

Commit 816e529043de08f3f9dcc4097380de39478b0b16
From: Simo Sorce
Date: Thu, 16 Feb 2017 15:25:56 -0500
Subject: Fix GSS-SPNEGO mechanism's incompatible behavior
Link: https://github.com/cyrusimap/cyrus-sasl/commit/816e529043de08f3f9dcc4097380de39478b0b16

Commit 4b0306dcd76031460246b2dabcb7db766d6b04d8
From: Simo Sorce
Date: Mon, 10 Apr 2017 19:54:19 -0400
Subject: Add support for retrieving the mech_ssf
Link: https://github.com/cyrusimap/cyrus-sasl/commit/4b0306dcd76031460246b2dabcb7db766d6b04d8

Commit 31b68a9438c24fc9e3e52f626462bf514de31757
From: Ryan Tandy
Date: Mon, 24 Dec 2018 15:07:02 -0800
Subject: Restore LIBS after checking gss_inquire_sec_context_by_oid
Link: https://github.com/cyrusimap/cyrus-sasl/commit/31b68a9438c24fc9e3e52f626462bf514de31757

This doesn't even seem to be a complete list, and if we backport these patches to the Bionic cyrus-sasl2 package, it fails to build for numerous reasons.
I also found a similar bug report in Debian, which references the third commit above: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917129 From what I can tell, GSS-SPNEGO in cyrus-sasl2 for Bionic has never worked, and changing it to the default was a bad idea. So, we have a decision to make. If supporting the new Active Directory requirements in ADV190023 [1][2], which adds --use-ldaps to adcli as part of bug 1868703, is important and something the community wants, we need to fix up cyrus-sasl2 to have a working GSS-SPNEGO implementation. [1] https://msrc.microsoft.com/update-guide/en-us/vulnerability/ADV190023 [2] https://support.microsoft.com/en-us/help/4520412/2020-ldap-channel-binding-and-ldap-signing-requirements-for-windows If we don't want --use-ldaps for adcli, then we can revert the patches for adcli on Bionic and go back to what was working previously, with GSS-API. ** Bug watch added: Debian Bug tracker #917129 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917129 ** Also affects: cyrus-sasl2 (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
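The cross-build test described in this comment (Focal's cyrus-sasl2 built and installed on a Bionic box) can be reproduced roughly as follows. pull-lp-source comes from the ubuntu-dev-tools package; the unpacked directory name and the exact set of binary packages to install are assumptions.

```shell
# Hedged sketch: build Focal's cyrus-sasl2 2.1.27+dfsg-2 on Bionic and
# install the rebuilt SASL libraries for testing an adcli join.
sudo apt install -y ubuntu-dev-tools devscripts
pull-lp-source cyrus-sasl2 focal              # fetch the Focal source package
sudo apt-get build-dep -y cyrus-sasl2
cd cyrus-sasl2-2.1.27+dfsg                    # directory name may differ
debuild -us -uc                               # unsigned local build
sudo dpkg -i ../libsasl2-2_*.deb ../libsasl2-modules-gssapi-mit_*.deb
```

With the rebuilt libraries installed, retrying the join with the problematic adcli 0.8.2-1ubuntu1 should succeed if the GSS-SPNEGO breakage is indeed in cyrus-sasl2.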
[Bug 1906627] Re: adcli fails, can't contact LDAP server
I built the current upstream master branch of adcli, and it too fails on Bionic: https://paste.ubuntu.com/p/vsgfxyb9X7/ This must be why the exact same patches work on Focal. The problem probably isn't adcli itself, but more likely a library it depends on.

# apt depends adcli
adcli
  Depends: libsasl2-modules-gssapi-mit
  Depends: libc6 (>= 2.14)
  Depends: libgssapi-krb5-2 (>= 1.6.dfsg.2)
  Depends: libk5crypto3 (>= 1.7+dfsg)
  Depends: libkrb5-3 (>= 1.10+dfsg~alpha1)
  Depends: libldap-2.4-2 (>= 2.4.7)

I will try upgrading each of these one at a time to see if it improves the situation. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1906627] Re: adcli fails, can't contact LDAP server
Hi Rolf, I sincerely apologise for causing this regression; it seems my testing was not good enough during the recent SRU. I recently made a change to adcli in bug 1868703 to add the --use-ldaps flag, so adcli can communicate with a domain controller over LDAPS. It also introduced a change where adcli uses GSS-SPNEGO by default and enforces channel signing, instead of doing everything in cleartext, which was the old default. The good news is that the breakage seems to be limited to Bionic only, and even though Focal got the exact same patches, Focal seems unaffected. For anyone experiencing this bug, you can downgrade to a working adcli with: $ sudo apt install adcli=0.8.2-1 I am working to fix this now. Comparison of logging and packet traces from various versions:

Bionic adcli 0.8.2-1 https://paste.ubuntu.com/p/NWHGQn746D/
Bionic adcli 0.8.2-1ubuntu1 https://paste.ubuntu.com/p/WRnnRMGBPm/
Focal adcli 0.9.0-1ubuntu0.20.04.1 https://paste.ubuntu.com/p/8668pJrr2m/

We can see that Bionic 0.8.2-1ubuntu1 stops at "Couldn't lookup computer account: BIONIC$: Can't contact LDAP server". Starting debugging now. Will update soon. ** Changed in: adcli (Ubuntu) Status: Confirmed => Fix Released ** Changed in: adcli (Ubuntu Bionic) Status: New => In Progress ** Changed in: adcli (Ubuntu Bionic) Importance: Undecided => High ** Changed in: adcli (Ubuntu Bionic) Assignee: (unassigned) => Matthew Ruffell (mruffell) -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
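Since unattended-upgrades will otherwise reinstall the broken 0.8.2-1ubuntu1, it may be worth pinning the package after the downgrade. The version string is the one given in the comment above; the apt-mark hold step is my suggestion, not part of the original workaround.

```shell
# Downgrade to the last known-good adcli and hold it so unattended-upgrades
# does not pull the broken version back in.
sudo apt install adcli=0.8.2-1
sudo apt-mark hold adcli
# Once a fixed package is published:
#   sudo apt-mark unhold adcli && sudo apt upgrade
```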
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Tags added: regression-update -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Also affects: adcli (Ubuntu Bionic) Importance: Undecided Status: New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin, I have good news. The SRU has completed, and the new kernels have now been released to -updates. Their versions are: Bionic: 4.15.0-126-generic Focal: 5.4.0-56-generic You can go ahead and schedule that maintenance window now, to install the latest kernel from -updates. These kernels also have full livepatch support, which is good news for you. Let me know how the 4.15.0-126-generic kernel goes on the Launchpad git server, since it should perform just as well as the test kernel you are currently running. Thanks, Matthew -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
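A quick way to confirm a machine actually rebooted into the released kernel is to compare `uname -r` against the versions given above; the version strings come from the comment, and the commands themselves are generic.

```shell
# Check whether the running kernel is one of the fixed versions from -updates
# (4.15.0-126 on Bionic, 5.4.0-56 on Focal, per the comment above).
kernel="$(uname -r)"
echo "running: $kernel"
case "$kernel" in
  4.15.0-126-*|5.4.0-56-*) echo "fixed kernel from -updates" ;;
  *) echo "not one of the fixed versions; install updates and reboot" ;;
esac
```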
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Status in linux source package in Focal: Fix Released Bug description: BugLink: https://bugs.launchpad.net/bugs/1898786 [Impact] Systems that utilise bcache can experience extremely high IO wait times when under constant IO pressure. The IO wait times seem to stay at a consistent 1 second, and never drop as long as the bcache shrinker is enabled. If you disable the shrinker, then IO wait drops significantly to normal levels. We did some perf analysis, and it seems we spend a huge amount of time in bch_mca_scan(), likely waiting for the mutex c->bucket_lock.
Looking at the recent commits in Bionic, we found the following commit merged in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li
Date: Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more than acquiring
> and releasing mutex c->bucket_lock.

This seems to be what is going on here, but the above commit only addresses the case when nr is 0. From what I can see, the problems we are experiencing are when nr is 1 or 2: again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to release it, since the shrinker loop never executes because there is no work to do.
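The arithmetic the commit message describes can be spelled out with shell integer division; the btree_pages value below is illustrative (the real value depends on the cache configuration), but the truncation to zero is the point.

```shell
# Sketch of the truncation described in the quoted commit message: when the
# VM asks the shrinker to scan only 1 or 2 objects and btree_pages is larger,
# integer division makes nr 0, so bch_mca_scan() does nothing except take
# and drop c->bucket_lock.
nr_to_scan=2      # typical small value passed in sc->nr_to_scan
btree_pages=64    # illustrative; depends on bucket size
nr=$((nr_to_scan / btree_pages))
echo "nr=$nr"     # prints nr=0 -> no btree nodes shrunk
```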
  [Fix]

  The following commits fix the problem, and all landed in 5.6-rc1:

  commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
  Author: Coly Li
  Date: Fri Jan 24 01:01:40 2020 +0800
  Subject: bcache: remove member accessed from struct btree
  Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

  commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
  Author: Coly Li
  Date: Fri Jan 24 01:01:41 2020 +0800
  Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

  commit e3de04469a49ee09c89e80bf821508df458ccee6
  Author: Coly Li
  Date: Fri Jan 24 01:01:42 2020 +0800
  Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

  The first commit is a dependency of the other two. It removes a "recently accessed" marker, used to indicate that a particular cache has been used recently and should therefore not be considered for cache eviction. The commit mentions that under heavy IO, all caches end up being recently accessed, and nothing is ever shrunk.

  The second commit changes a previous design decision of skipping the first 3 caches to shrink, since it is common for bch_mca_scan() to be called with nr being 1 or 2, just as 0 was common in the very first commit I mentioned. In that case the loop exits and nothing happens, and we waste time waiting on locks, just like before. The fix is to try to shrink caches from the tail of the list, not the beginning.

  The third commit fixes a minor issue where we don't want to re-arrange the linked list c->btree_cache, which is what the second commit ended up doing. Instead, we just shrink the cache at the end of the linked list and leave the order unchanged.
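The effect of the second and third commits can be sketched the same way: reap entries from the tail of the list and leave the survivors' order alone. This is purely illustrative, not the kernel code (which walks struct btree entries under c->bucket_lock):

```shell
# Toy model of the fixed behaviour: reap nr entries from the tail of
# the cache list without reordering the remaining entries.
cache='n0
n1
n2
n3
n4'
nr=2
total=$(printf '%s\n' "$cache" | wc -l)
keep=$(( total - nr ))
printf '%s\n' "$cache" | tail -n "$nr"    # entries reclaimed: n3, n4
printf '%s\n' "$cache" | head -n "$keep"  # survivors keep their order
```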
One minor backport / context adjustment was required in the first commit for Bionic, and the rest are all clean cherry picks to Bionic and Focal. [Testcase] This is kind of hard to test, since the problem shows up in production environments that are under constant IO pressure, with many different items entering and leaving the cache. The
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin,

No worries about being busy. Now, the kernel is scheduled to be released early next week, around the 30th of November. I think at this stage it is best to wait it out and install the kernel once it reaches -updates. That way you will have a fixed kernel that is supported by livepatch, and you don't have to justify a reboot twice.

I did some regression testing in my comments above, and everything looks okay. These patches also worked great in your test kernel. We have done the best we can to verify the kernel in the time given, so don't worry about testing at this stage.

I'll let you know once the kernel has reached -updates, likely Monday or Tuesday next week.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Verification for sssd on Bionic:

The customer tested sssd from -updates, version 1.16.1-1ubuntu1.6, and the package from -proposed, version 1.16.1-1ubuntu1.7.

Begins:

Before applying the patch [package from -proposed] I confirmed open ports to our domain controllers using ss and grepping for the DC IPs. Before the patch, ports 389 and 3268 were being actively used. After the patch [installing the package from -proposed] (and after running a few user queries with `id`), ports 636 and 3269 were being used.

Ends.

This matches my testing and the testing Tobias has done, so I am happy to mark sssd as verified for Bionic.

** Tags removed: verification-needed

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
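The port check described in the customer's test can be sketched as a small helper. Everything below is illustrative: the DC address and the ss output lines are fabricated samples, not output from this verification.

```shell
# Hypothetical helper: given `ss -tn` output on stdin, print the
# distinct remote ports in use toward a domain controller address.
dc_ports() {  # usage: ss -tn | dc_ports <dc-ip>
  grep "$1" | sed -n "s/.*$1:\([0-9]*\).*/\1/p" | sort -un
}

# Fabricated sample of ss output, for demonstration only:
printf '%s\n' \
  'ESTAB 0 0 10.0.0.5:51514 192.0.2.10:636' \
  'ESTAB 0 0 10.0.0.5:51520 192.0.2.10:3269' |
  dc_ports 192.0.2.10
```

On a real host you would pipe live output instead, e.g. `ss -tn | dc_ports <dc-ip>`; seeing 636 and 3269 rather than 389 and 3268 indicates LDAPS and Global Catalog over TLS are in use.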
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Verification for sssd on Focal:

The customer tested sssd from -updates, version 2.2.3-3, and the package from -proposed, version 2.2.3-3ubuntu0.1.

Begins:

I have successfully tested the [package from -proposed] on Ubuntu 20.04.1. Before applying the patch [package from -proposed] I confirmed open ports to our domain controllers using ss and grepping for the DC IPs. Before the patch, ports 389 and 3268 were being actively used. After the patch [installing the package from -proposed] (and after running a few user queries with `id`), ports 636 and 3269 were being used.

Ends.

This matches my testing and the testing Tobias has done, so I am happy to mark sssd as verified for Focal.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
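For reference, the option being verified here is enabled on the sssd side roughly as below. The domain stanza and providers are placeholders; ad_use_ldaps is the option name this bug backports.

```
# /etc/sssd/sssd.conf (placeholder domain; illustrative only)
[domain/testing.local]
id_provider = ad
access_provider = ad
ad_domain = testing.local
ad_use_ldaps = True
```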
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Performing verification for Bionic. Since Benjamin hasn't responded, I will try and verify the best I can.

I made an instance on AWS. I used a c5d.large instance type, and added 8GB of extra EBS storage. I installed the latest kernel from -updates to get a performance baseline; the kernel is 4.15.0-124-generic.

I made a bcache disk with the following. Note, the 8GB disk was used as the cache disk, and the 50GB disk as the backing disk. Keeping the cache small is to force cache evictions often, and hopefully trigger the bug.

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0 46.6G  0 disk
nvme0n1     259:1    0    8G  0 disk
nvme2n1     259:2    0    8G  0 disk
└─nvme2n1p1 259:3    0    8G  0 part /
$ sudo apt install bcache-tools
$ sudo dd if=/dev/zero of=/dev/nvme0n1 bs=512 count=8
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=8
$ sudo wipefs -a /dev/nvme0n1
$ sudo wipefs -a /dev/nvme1n1
$ sudo make-bcache -C /dev/nvme0n1 -B /dev/nvme1n1
UUID:         3f28ca5d-856b-42e9-bbb7-54cae12b5538
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      0
nbuckets:     16384
block_size:   1
bucket_size:  1024
nr_in_set:    1
nr_this_dev:  0
first_bucket: 1
UUID:         cc3e36fd-3694-4c50-aeac-0b79d2faab4a
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      1
block_size:   1
data_offset:  16
$ sudo mkfs.ext4 /dev/bcache0
$ sudo mkdir /media/bcache
$ sudo mount /dev/bcache0 /media/bcache
$ echo "/dev/bcache0 /media/bcache ext4 rw 0 0" | sudo tee -a /etc/fstab

From there, I installed fio to run some benchmarks and to apply some IO pressure to the cache.

$ sudo apt install fio

I used the following fio jobfile: https://paste.ubuntu.com/p/RNBmXdy3zG/

It is based on the ssd test in:
https://github.com/axboe/fio/blob/master/examples/ssd-test.fio

Running the fio job gives us the following output:
https://paste.ubuntu.com/p/ghkQcyT2sv/

Now that we have the baseline, I enabled -proposed, installed 4.15.0-125-generic and rebooted.
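The exact jobfile is only available behind the paste link, but since it is described as based on fio's upstream ssd-test example, a jobfile of that shape looks roughly like this. The sizes, runtime, and depths below are illustrative guesses, not the values actually used in the verification:

```
; illustrative fio jobfile in the style of examples/ssd-test.fio
[global]
ioengine=libaio
direct=1
iodepth=4
bs=4k
size=1g
runtime=60
directory=/media/bcache

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
```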
I started the fio job again, and got the following output:

# uname -rv
4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020

https://paste.ubuntu.com/p/DSTnKvXMGZ/

If you compare the two outputs, there really isn't much difference in latencies / read / write speeds. The bcache patches don't seem to cause any large impacts.

I managed to set up a bcache disk, and did some IO stress tests. Things seem to be okay. Since we had positive test results on the test kernel on the Launchpad git server, and the above shows we don't appear to have any regressions, I will mark this bug as verified for Bionic.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Performing verification for Focal. Since Benjamin hasn't responded, I will try and verify the best I can.

I made an instance on AWS. I used a c5d.large instance type, and added 8GB of extra EBS storage. I installed the latest kernel from -updates to get a performance baseline; the kernel is 5.4.0-54-generic.

I made a bcache disk with the following. Note, the 8GB disk was used as the cache disk, and the 50GB disk as the backing disk. Keeping the cache small is to force cache evictions often, and hopefully trigger the bug.

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme2n1     259:0    0 46.6G  0 disk
nvme1n1     259:1    0    8G  0 disk
nvme0n1     259:2    0    8G  0 disk
└─nvme0n1p1 259:3    0    8G  0 part /
$ sudo apt install bcache-tools
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=8
$ sudo dd if=/dev/zero of=/dev/nvme2n1 bs=512 count=8
$ sudo wipefs -a /dev/nvme1n1
$ sudo wipefs -a /dev/nvme2n1
$ sudo make-bcache -C /dev/nvme1n1 -B /dev/nvme2n1
UUID:         3f28ca5d-856b-42e9-bbb7-54cae12b5538
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      0
nbuckets:     16384
block_size:   1
bucket_size:  1024
nr_in_set:    1
nr_this_dev:  0
first_bucket: 1
UUID:         cc3e36fd-3694-4c50-aeac-0b79d2faab4a
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      1
block_size:   1
data_offset:  16
$ sudo mkfs.ext4 /dev/bcache0
$ sudo mkdir /media/bcache
$ sudo mount /dev/bcache0 /media/bcache
$ echo "/dev/bcache0 /media/bcache ext4 rw 0 0" | sudo tee -a /etc/fstab

From there, I installed fio to run some benchmarks and to apply some IO pressure to the cache.

$ sudo apt install fio

I used the following fio jobfile: https://paste.ubuntu.com/p/RNBmXdy3zG/

It is based on the ssd test in:
https://github.com/axboe/fio/blob/master/examples/ssd-test.fio

Running the fio job gives us the following output:
https://paste.ubuntu.com/p/HrWGNDJPfv/

Now that we have the baseline, I enabled -proposed, installed 5.4.0-55-generic and rebooted.
I started the fio job again, and got the following output:

# uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

https://paste.ubuntu.com/p/pDVXnspmvs/

If you compare the two outputs, there really isn't much difference in latencies / read / write speeds. The bcache patches don't seem to cause any large impacts.

I managed to set up a bcache disk, and did some IO stress tests. Things seem to be okay. Since we had positive test results on the test kernel on the Launchpad git server, and the above shows we don't appear to have any regressions, I will mark this bug as verified for Focal.

** Tags removed: verification-needed-focal
** Tags added: verification-done-focal

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Performing verification of adcli on Bionic

The patches for Bionic are a bit more involved, as they add the whole --use-ldaps ecosystem.

Firstly, I installed adcli 0.8.2-1 from -updates. The manpage did not have any mention of --use-ldaps, and if I ran a command with --use-ldaps, it would complain that it was unrecognized.

# adcli join --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
join: unrecognized option '--use-ldaps'
usage: adcli join

I then enabled -proposed and installed adcli 0.8.2-1ubuntu1. The man page now documents --use-ldaps:

$ man adcli | grep -i ldaps
--use-ldaps
Connect to the domain controller with LDAPS. By default the LDAP port is used and SASL GSS-SPNEGO or GSSAPI is used for authentication and to establish encryption. This should satisfy all requirements set on the server side and LDAPS should only be used if the LDAP port is not accessible due to firewalls or other reasons.

$ LDAPTLS_CACERT=/path/to/ad_dc_ca_cert.pem adcli join --use-ldaps -D domain.example.com

I then enabled a firewall rule to block ldap connections:

# ufw deny 389
# ufw deny 3268

And tried the join command.
# adcli join --use-ldaps --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-ihG1h9/krb5.d/adcli-krb5-conf-bt9nd8
Password for Administrator@TESTING.LOCAL:
 * Authenticated as user: Administrator@TESTING.LOCAL
 * Using GSS-API for SASL bind
 * Looked up short domain name: TESTING
 * Looked up domain SID: S-1-5-21-960071060-1417404557-720088570
 * Using fully qualified name: ubuntu
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Using computer account name: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Generated 120 character computer password
 * Using keytab: FILE:/etc/krb5.keytab
 * Found computer account for UBUNTU$ at: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Set computer password
 * Retrieved kvno '13' for computer account in directory: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Checking RestrictedKrbHost/ubuntu.testing.local
 * Added RestrictedKrbHost/ubuntu.testing.local
 * Checking host/ubuntu.testing.local
 * Added host/ubuntu.testing.local
 * Checking RestrictedKrbHost/UBUNTU
 * Added RestrictedKrbHost/UBUNTU
 * Checking host/UBUNTU
 * Added host/UBUNTU
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Discovered which keytab salt to use
 * Added the entries to the keytab: UBUNTU$@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab

I couldn't catch the open port with netstat, so I used strace, and 636 was being used:

connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0

I then went through all the other sub commands and did a quick test to ensure they all took --use-ldaps and did not complain about it being unrecognized. All commands except "info" took the flag fine, and "info" was never intended to use --use-ldaps anyway.

Everything seems okay. Happy to mark adcli for Bionic verified.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
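Since netstat missed the short-lived connection, strace was the right tool. A sketch of pulling the destination port out of a connect() trace line; the sample line is the one quoted in the verification above, and the strace invocation in the comment is an assumption about how such a line would be captured:

```shell
# Hypothetical capture (not run here):
#   strace -f -e trace=connect adcli join --use-ldaps ... 2>&1 | grep htons
htons_port() {
  # Extract the port number from an strace connect() line on stdin.
  sed -n 's/.*htons(\([0-9]*\)).*/\1/p'
}

line='connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0'
printf '%s\n' "$line" | htons_port   # 636 here confirms LDAPS was used
```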
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Performing verification of adcli on Focal

The patches for Focal are a bit more involved, as they add the whole --use-ldaps ecosystem.

Firstly, I installed adcli 0.9.0-1 from -updates. The manpage did not have any mention of --use-ldaps, and if I ran a command with --use-ldaps, it would complain that it was unrecognized.

# adcli join --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
join: unrecognized option '--use-ldaps'
usage: adcli join

I then enabled -proposed and installed adcli 0.9.0-1ubuntu0.20.04.1. The man page now documents --use-ldaps:

$ man adcli | grep -i ldaps
--use-ldaps
Connect to the domain controller with LDAPS. By default the LDAP port is used and SASL GSS-SPNEGO or GSSAPI is used for authentication and to establish encryption. This should satisfy all requirements set on the server side and LDAPS should only be used if the LDAP port is not accessible due to firewalls or other reasons.
$ LDAPTLS_CACERT=/path/to/ad_dc_ca_cert.pem adcli join --use-ldaps -D domain.example.com

I then enabled a firewall rule to block ldap connections:

# ufw deny 389
# ufw deny 3268

And tried the join command:

# adcli join --use-ldaps --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-ihG1h9/krb5.d/adcli-krb5-conf-bt9nd8
Password for Administrator@TESTING.LOCAL:
 * Authenticated as user: Administrator@TESTING.LOCAL
 * Using GSS-API for SASL bind
 * Looked up short domain name: TESTING
 * Looked up domain SID: S-1-5-21-960071060-1417404557-720088570
 * Using fully qualified name: ubuntu
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Using computer account name: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Generated 120 character computer password
 * Using keytab: FILE:/etc/krb5.keytab
 * Found computer account for UBUNTU$ at: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Set computer password
 * Retrieved kvno '13' for computer account in directory: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Checking RestrictedKrbHost/ubuntu.testing.local
 * Added RestrictedKrbHost/ubuntu.testing.local
 * Checking host/ubuntu.testing.local
 * Added host/ubuntu.testing.local
 * Checking RestrictedKrbHost/UBUNTU
 * Added RestrictedKrbHost/UBUNTU
 * Checking host/UBUNTU
 * Added host/UBUNTU
 * Cleared old entries
from keytab: FILE:/etc/krb5.keytab
 * Discovered which keytab salt to use
 * Added the entries to the keytab: UBUNTU$@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab

I couldn't catch the open port with netstat, so I used strace, and 636 was being used:

connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0

I then went through all the other subcommands and did a quick test to ensure they all took --use-ldaps and did not complain about it being unrecognized. All commands except "info" took the flag fine, and "info" was never intended to use --use-ldaps anyway.

Everything looks good. Happy to mark adcli for Focal verified.
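The strace check above can be scripted rather than eyeballed. A minimal sketch (the connect() line is taken from the transcript above; the helper name is hypothetical, not part of adcli):

```python
import re

def connect_port(strace_line):
    """Pull the destination port out of an strace connect() line, or None."""
    m = re.search(r"sin_port=htons\((\d+)\)", strace_line)
    return int(m.group(1)) if m else None

line = ('connect(3, {sa_family=AF_INET, sin_port=htons(636), '
        'sin_addr=inet_addr("192.168.122.66")}, 16) = 0')
print(connect_port(line))  # → 636, i.e. LDAPS rather than plain LDAP (389)
```

Filtering `strace -f -e trace=connect` output through a check like this makes it easy to confirm every subcommand really goes out over port 636.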
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Performing verification of adcli on Groovy.

Groovy only required one patch, which fixed a missed enablement of --use-ldaps for the testjoin and update commands. So, just testing those two.

I installed adcli 0.9.0-1ubuntu1 from -updates, and I set everything up by issuing a join command. After that, I tried the --use-ldaps flag with the testjoin and update commands:

# adcli testjoin --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
testjoin: unrecognized option '--use-ldaps'
usage: adcli testjoin

# adcli update --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
update: unrecognized option '--use-ldaps'
usage: adcli update

I then enabled -proposed, installed adcli 0.9.0-1ubuntu1.2 and tried again. We block port 389 on the firewall:

# ufw deny 389
# ufw deny 3268

Then try testjoin and update:

# adcli testjoin --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
 * Found realm in keytab: TESTING.LOCAL
 * Found computer name in keytab: UBUNTU
 * Found service principal in keytab: host/UBUNTU
 * Found service principal in keytab: host/ubuntu.testing.local
 * Found host qualified name in keytab: ubuntu.testing.local
 * Found service principal in keytab: RestrictedKrbHost/UBUNTU
 * Found service principal in keytab: RestrictedKrbHost/ubuntu.testing.local
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-6SRtqJ/krb5.d/adcli-krb5-conf-YGzgnK
 * Authenticated as default/reset computer account: UBUNTU
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Looked up short domain name: TESTING
 * Looked up domain SID:
S-1-5-21-960071060-1417404557-720088570
Sucessfully validated join to domain WIN-SB6JAS7PH22.testing.local

# adcli update --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
 * Found realm in keytab: TESTING.LOCAL
 * Found computer name in keytab: UBUNTU
 * Found service principal in keytab: host/UBUNTU
 * Found service principal in keytab: host/ubuntu.testing.local
 * Found host qualified name in keytab: ubuntu.testing.local
 * Found service principal in keytab: RestrictedKrbHost/UBUNTU
 * Found service principal in keytab: RestrictedKrbHost/ubuntu.testing.local
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-6FQ1ZS/krb5.d/adcli-krb5-conf-LHowkP
 * Authenticated as default/reset computer account: UBUNTU
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Looked up short domain name: TESTING
 * Looked up domain SID: S-1-5-21-960071060-1417404557-720088570
 * Using fully qualified name: ubuntu
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Using computer account name: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Using fully qualified name: ubuntu.testing.local
 * Enrolling computer name: UBUNTU
 * Generated 120 character computer password
 * Using keytab: FILE:/etc/krb5.keytab
 * Found computer account for UBUNTU$ at: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Retrieved kvno '12' for computer account in directory: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Password not too old, no change needed
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Modifying computer account: dNSHostName
 * Checking
RestrictedKrbHost/ubuntu.testing.local
 * Added RestrictedKrbHost/ubuntu.testing.local
 * Checking host/ubuntu.testing.local
 * Added host/ubuntu.testing.local
 * Checking RestrictedKrbHost/UBUNTU
 * Added RestrictedKrbHost/UBUNTU
 * Checking host/UBUNTU
 * Added host/UBUNTU

Everything seems fine. Happy to mark Groovy as verified for adcli.
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Hi Tobias, thanks for testing and verifying! I really appreciate it, and it's good to hear that everything works. I'll just add some of my own test output below, and we should be good to go for a release to -updates in about a week's time.
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin, The kernel team have built the next kernel update, and they have placed it in -proposed for verification. The versions are 4.15.0-125-generic for Bionic, and 5.4.0-55-generic for Focal. Can you please schedule a maintenance window for the Launchpad git server, to install the new kernel in -proposed, and reboot into it, so we can verify that it fixes the problem. Instructions to install (On a Bionic system): Enable -proposed by running the following command to make a new sources.list.d entry: 1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-bionic-proposed.list # Enable Ubuntu proposed archive deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main EOF 2) sudo apt update 3) sudo apt install linux-image-4.15.0-125-generic linux-modules-4.15.0-125-generic \ linux-modules-extra-4.15.0-125-generic linux-headers-4.15.0-125-generic linux-headers-4.15.0-125 4) sudo reboot 5) uname -rv 4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020 6) sudo rm /etc/apt/sources.list.d/ubuntu-bionic-proposed.list 7) sudo apt update If you get a different uname, you may need to adjust your grub configuration to boot into the correct kernel. Also, since this is a production machine, make sure you remove the -proposed software source once you have installed the kernel. Let me know how this kernel performs, and if everything seems fine after a week we will mark the Launchpad bug as verified. The timeline for release to -updates is still set for the 30th of November, give or take a few days if any CVEs turn up. I believe this kernel should be live-patchable, although this may not be the case if the kernel is respun before release. Hopefully you will only have to schedule the maintenance window just the once. Thanks, Matthew -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. 
https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin,

The kernel team have built the next kernel update, and they have placed it in -proposed for verification. The versions are 4.15.0-125-generic for Bionic, and 5.4.0-55-generic for Focal.

Can you please schedule a maintenance window for the Launchpad git server, to install the new kernel in -proposed, and reboot into it, so we can verify that it fixes the problem.

Instructions to install (on a Bionic system):

1) Enable -proposed by running the following command to make a new sources.list.d entry:

cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main
EOF

2) sudo apt update
3) sudo apt install linux-image-4.15.0-125-generic linux-modules-4.15.0-125-generic \
   linux-modules-extra-4.15.0-125-generic linux-headers-4.15.0-125-generic linux-headers-4.15.0-125
4) sudo reboot
5) uname -rv
   4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020
6) sudo rm /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
7) sudo apt update

If you get a different uname, you may need to adjust your grub configuration to boot into the correct kernel. Also, since this is a production machine, make sure you remove the -proposed software source once you have installed the kernel.

Let me know how this kernel performs, and if everything seems fine after a week we will mark the Launchpad bug as verified. The timeline for release to -updates is still set for the 30th of November, give or take a few days if any CVEs turn up.

I believe this kernel should be live-patchable, although this may not be the case if the kernel is respun before release. Hopefully you will only have to schedule the maintenance window just the once.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed

Bug description:
BugLink: https://bugs.launchpad.net/bugs/1898786

[Impact]

Systems that utilise bcache can experience extremely high IO wait times when under constant IO pressure. The IO wait times seem to stay at a consistent 1 second, and never drop as long as the bcache shrinker is enabled. If you disable the shrinker, then IO wait drops significantly to normal levels.

We did some perf analysis, and it seems we spend a huge amount of time in bch_mca_scan(), likely waiting for the mutex "->bucket_lock".

Looking at the recent commits in Bionic, we found the following commit merged in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li
Date: Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
> and releasing mutex c->bucket_lock.

This seems to be what is going on here, but the above commit only addresses when nr is 0. From what I can see, the problems we are experiencing are when nr is 1 or 2, and again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to release c->bucket_lock since the shrinker loop never executes since there is no work to do.
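The truncation described in the quoted commit message can be illustrated with a small sketch. This is Python standing in for the kernel arithmetic, not the actual bcache code, and the btree_pages value is illustrative only (in the kernel it is device-dependent):

```python
BTREE_PAGES = 64  # illustrative stand-in for c->btree_pages

def nodes_to_scan_before(nr_to_scan):
    # Pre-9fcc34b1 behaviour: integer division truncates small requests to
    # zero, so bch_mca_scan() only acquires and releases c->bucket_lock.
    return nr_to_scan // BTREE_PAGES

def nodes_to_scan_after(nr_to_scan):
    # Post-9fcc34b1 behaviour: at least try to shrink one node per call.
    return max(nr_to_scan // BTREE_PAGES, 1)

print(nodes_to_scan_before(2))  # → 0: nothing shrunk, pure lock churn
print(nodes_to_scan_after(2))   # → 1
```

With sc->nr_to_scan typically 1 or 2, the old calculation never reached the shrinker loop, which matches the perf profile of time wasted contending on c->bucket_lock.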
[Fix]

The following commits fix the problem, and all landed in 5.6-rc1:

commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
Author: Coly Li
Date: Fri Jan 24 01:01:40 2020 +0800
Subject: bcache: remove member accessed from struct btree
Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
Author: Coly Li
Date: Fri Jan 24 01:01:41 2020 +0800
Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

commit e3de04469a49ee09c89e80bf821508df458ccee6
Author: Coly Li
Date: Fri Jan 24 01:01:42 2020 +0800
Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

The first commit is a dependency of the other two. The first commit removes a "recently accessed" marker, used to indicate if a particular cache has been used recently, and if it has been, not consider it for
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Bionic.

I enabled -proposed and installed 4.15.0-125-generic to an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme2n1 259:2    0  1.7T  0 disk
nvme3n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    0m3.615s
user    0m0.002s
sys     0m0.179s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.898s
user    0m0.002s
sys     0m0.015s

We can see that mkfs.xfs took 3.6 seconds, and fstrim only 2 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance, and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions to disks which do not support block discard.

I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual.
The 4.15.0-125-generic kernel in -proposed fixes the issue, and I am happy to mark it as verified.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1896578

Title:
  raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
Status in linux source package in Groovy: Fix Committed

Bug description:
BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0 takes 4 seconds. The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

Where the Raid10 md device only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
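The splitting described above can be checked with back-of-envelope arithmetic. This is a rough sketch using the numbers from the transcripts (member size from mdadm, limits from sysfs), not kernel code:

```python
member_size_bytes = 1855336448 * 1024  # mdadm: "size set to 1855336448K"
old_max_bytes = 524288                 # md0 discard_max_bytes (512 KiB chunk)
new_max_bytes = 2199023255040          # NVMe discard_max_hw_bytes (~2.2 TB)

def bios_needed(total_bytes, max_per_bio):
    # Ceiling division: number of discard bios the block layer must issue.
    return -(-total_bytes // max_per_bio)

print(bios_needed(member_size_bytes, old_max_bytes))  # → 3623704
print(bios_needed(member_size_bytes, new_max_bytes))  # → 1
```

Roughly 3.6 million 512 KiB bios versus a single large request explains the difference between an 11-minute and a few-second mkfs.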
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Focal.

I enabled -proposed and installed 5.4.0-55-generic to an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme3n1 259:2    0  1.7T  0 disk
nvme2n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    0m5.350s
user    0m0.022s
sys     0m0.179s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m2.944s
user    0m0.006s
sys     0m0.013s

We can see that mkfs.xfs took 5.3 seconds, and fstrim only 3 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance, and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions to disks which do not support block discard.
I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual.

The 5.4.0-55-generic kernel in -proposed fixes the issue, and I am happy to mark it as verified.

** Tags removed: verification-needed-focal
** Tags added: verification-done-focal
[Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Groovy.

I enabled -proposed and installed 5.8.0-30-generic to an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
5.8.0-30-generic #32-Ubuntu SMP Mon Nov 9 21:03:15 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme3n1 259:2    0  1.7T  0 disk
nvme2n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.

real    0m4.413s
user    0m0.022s
sys     0m0.245s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.973s
user    0m0.000s
sys     0m0.037s

We can see that mkfs.xfs took 4.4 seconds, and fstrim only 2 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance, and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions to disks which do not support block discard.
I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual. The 5.8.0-30-generic kernel in -proposed fixes the issue, and I am happy to mark as verified. ** Tags removed: verification-needed-groovy ** Tags added: verification-done-groovy -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1896578 Title: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Groovy. I enabled -proposed and installed 5.8.0-30-generic on an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
5.8.0-30-generic #32-Ubuntu SMP Mon Nov 9 21:03:15 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme3n1 259:2    0  1.7T  0 disk
nvme2n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.

real    0m4.413s
user    0m0.022s
sys     0m0.245s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.973s
user    0m0.000s
sys     0m0.037s

We can see that mkfs.xfs took 4.4 seconds, and fstrim only 2 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions on disks which do not support block discard.
I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual.

The 5.8.0-30-generic kernel in -proposed fixes the issue, and I am happy to mark as verified.

** Tags removed: verification-needed-groovy
** Tags added: verification-done-groovy

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1896578

Title:
  raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
Status in linux source package in Groovy: Fix Committed

Bug description:

BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on raid10, which causes common use cases that invoke block discard, such as mkfs and fstrim operations, to take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on raid10 takes between 8 and 11 minutes, where the same mkfs.xfs operation on raid0 takes 4 seconds. The bigger the devices, the longer it takes.

The cause is that raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

where the raid10 md device only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>]
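The mismatch above can be put in numbers with a quick back-of-the-envelope calculation (a sketch using the sysfs values just quoted; the bio count is arithmetic, not a figure from the bug report):

```shell
# How many 512 KiB discard bios must the kernel issue to trim a whole
# NVMe member, given the limits quoted above?
member_max=2199023255040   # discard_max_hw_bytes of one NVMe device
md_max=524288              # discard_max_bytes of the raid10 md device
echo $(( member_max / md_max ))   # prints 4194303, i.e. ~4.2 million bios
```

Raising discard_max_bytes on the md device lets the same trim complete in a handful of large requests instead.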
[Kernel-packages] [Bug 1896154] Re: btrfs: trimming a btrfs device which has been shrunk previously fails and fills root disk with garbage data
Performing verification for Focal. I created an i3.large instance on AWS, since it has 1x NVMe drive that supports trim and block discard. I ensured that I could reproduce the problem with 5.4.0-54-generic from -updates: I followed the instructions in the Testcase section, and the final fstrim after shrinking locked up the instance and filled up the root disk. I terminated the instance.

I then created a new instance, enabled -proposed, installed 5.4.0-55-generic, and rebooted. From there, I ran through the test steps again:

$ uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

$ sudo -s
# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  28.1M  1 loop /snap/amazon-ssm-agent/2012
loop1     7:1    0  97.8M  1 loop /snap/core/10185
loop2     7:2    0  55.3M  1 loop /snap/core18/1885
loop3     7:3    0  70.6M  1 loop /snap/lxd/16922
xvda    202:0    0     8G  0 disk
└─xvda1 202:1    0     8G  0 part /
nvme0n1 259:0    0 442.4G  0 disk

# dev=/dev/nvme0n1
# mnt=/mnt
# mkfs.btrfs -f $dev -b 10G
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               db9dd9f5-7993-4827-9a43-93a72a73aa3a
Node size:          16384
Sector size:        4096
Filesystem size:    10.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Checksum:           crc32c
Number of devices:  1
Devices:
  ID        SIZE  PATH
   1    10.00GiB  /dev/nvme0n1

# mount $dev $mnt
# fstrim $mnt
# btrfs filesystem resize 1:-1G $mnt
Resize '/mnt' of '1:-1G'
# fstrim $mnt
#

The final fstrim completed almost immediately, at the same speed as the initial fstrim. The instance did not lock up, and the root disk did not get filled with any garbage data. The kernel in -proposed fixes the problem, happy to mark as verified.
** Tags removed: verification-needed-focal
** Tags added: verification-done-focal

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1896154

Title:
  btrfs: trimming a btrfs device which has been shrunk previously fails and fills root disk with garbage data

Status in linux package in Ubuntu: Fix Released
Status in linux-azure package in Ubuntu: New
Status in linux source package in Focal: Fix Committed
Status in linux-azure source package in Focal: Fix Released

Bug description:

BugLink: https://bugs.launchpad.net/bugs/1896154

[Impact]

Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit"), which landed in 5.3, btrfs won't trim a range that has already been trimmed, and will instead go looking for a range where the CHUNK_TRIMMED and CHUNK_ALLOCATED bits aren't set.

If a device has been shrunk, the CHUNK_TRIMMED and CHUNK_ALLOCATED bits are never cleared, which means that btrfs can go looking for a range to trim which is beyond the new device size. This leads to an underflow in a length calculation for the range to trim, and we end up trimming past the device's boundary. This has the unfortunate side effect of filling the root disk with garbage data, and it will not stop until the root disk is totally full, making the instance unusable.

[Fix]

The issue was fixed by the following commit in 5.9-rc1:

commit c57dd1f2f6a7cd1bb61802344f59ccdc5278c983
Author: Qu Wenruo
Date:   Fri Jul 31 19:29:11 2020 +0800
Subject: btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
Link: https://github.com/torvalds/linux/commit/c57dd1f2f6a7cd1bb61802344f59ccdc5278c983

The fix clears the CHUNK_TRIMMED and CHUNK_ALLOCATED bits when a device is being shrunk, and performs some additional checks to ensure we do not trim past the device size boundary.
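The underflow described in [Impact] can be illustrated with plain shell arithmetic (the sizes mirror the 10 GiB filesystem from the testcase, but the calculation is an illustrative sketch, not the kernel's actual code):

```shell
# A trim range start left over from before the shrink can lie past the
# new device size; the remaining-length calculation then goes negative,
# and in the kernel's unsigned 64-bit arithmetic that wraps to a huge
# value, so the trim runs far past the device boundary.
old_size=$(( 10 * 1024 * 1024 * 1024 ))   # 10 GiB, before the shrink
new_size=$((  9 * 1024 * 1024 * 1024 ))   #  9 GiB, after "resize 1:-1G"
start=$old_size                           # stale range start beyond new_size
echo $(( new_size - start ))              # prints -1073741824; as a u64
                                          # this wraps to an enormous length
```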
The fix was backported to the 5.7.17 and 5.8.3 upstream stable releases, but it seems 5.4 was skipped. The patch required a minor backport to 5.4, with the CHUNK_STATE_MASK #define moving back to fs/btrfs/extent_io.h, as the file had been renamed in later kernels.

[Testcase]

The easiest way to reproduce is to use a cloud instance that supplies a real NVMe drive that supports TRIM and block discard. Warning: this will fill the root disk with garbage data. ONLY run this on a throwaway instance!

Run the following commands:

$ dev=/dev/nvme0n1
$ mnt=/mnt
$ mkfs.btrfs -f $dev -b 10G
$ mount $dev $mnt
$ fstrim $mnt
$ btrfs filesystem resize 1:-1G $mnt
$ fstrim $mnt

The last command will appear to hang, while the root filesystem begins filling with garbage data. Once the root filesystem fills, you will see the
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
** Description changed:

[Impact]

Microsoft has released a new security advisory for Active Directory (AD) which outlines that man-in-the-middle attacks can be performed against an LDAP server, such as AD DS: an attacker forwards an authentication request to a Windows LDAP server that does not enforce LDAP channel binding or LDAP signing for incoming connections. To address this, Microsoft has announced new Active Directory requirements in ADV190023 [1][2].

[1] https://msrc.microsoft.com/update-guide/en-us/vulnerability/ADV190023
[2] https://support.microsoft.com/en-us/help/4520412/2020-ldap-channel-binding-and-ldap-signing-requirements-for-windows

These new requirements strongly encourage system administrators to require LDAP signing and authenticated channel binding in their AD environments. The effect is to stop unauthenticated and unencrypted traffic over LDAP port 389, and to force authenticated and encrypted traffic instead, over LDAPS port 636 and Global Catalog SSL port 3269. Microsoft will not force this change via updates to their servers; system administrators must opt in and change their own configuration.

To support these new requirements in Ubuntu, changes need to be made to the sssd and adcli packages. Upstream has added a new flag "ad_use_ldaps" to sssd, and "use-ldaps" has been added to adcli. If "ad_use_ldaps = True", sssd will send all communication over port 636, authenticated and encrypted. For adcli, if the server supports GSS-SPNEGO, it will now be used by default, over the normal LDAP port 389. If the LDAP port is blocked, "use-ldaps" can be used, which will use LDAPS port 636 instead.

Without these changes, Ubuntu 18.04/20.04 LTS machines report the following error:

"[sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_domain_options]: Attribute 'ad_use_ldaps' is not allowed in section 'domain/test.com'. Check for typos."
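Once the patched sssd is installed, the opt-in is a single setting. A minimal sketch of /etc/sssd/sssd.conf follows; the domain name and provider lines are illustrative placeholders, not taken from the bug report:

```ini
# Minimal sketch -- "example.com" and the provider lines are placeholders
[sssd]
domains = example.com
services = nss, pam

[domain/example.com]
id_provider = ad
access_provider = ad

# Opt in to the ADV190023 hardening: all AD traffic moves to LDAPS
# port 636 and the Global Catalog SSL port 3269.
ad_use_ldaps = True
```

After editing, restart sssd so the new setting takes effect.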
These patches are needed to stay in line with Microsoft security advisories, since security-conscious system administrators would like to firewall off LDAP port 389 in their environments and use LDAPS port 636 only.

[Testcase]

To test these changes, you will need to set up a Windows Server 2019 box, install and configure Active Directory, import the AD certificate to the Ubuntu clients, and create some users in Active Directory. From there, you can do a user search from the client to the AD server, and check what ports are used for communication.

Currently, you should see LDAP port 389 and Global Catalog port 3268 in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:43954 x.x.x.x:389  ESTABLISHED 27614/sssd_be
tcp 0 0 x.x.x.x:54381 x.x.x.x:3268 ESTABLISHED 27614/sssd_be

Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf294530-test

Instructions to install (on a bionic or focal system):
1) sudo add-apt-repository ppa:mruffell/sf294530-test
2) sudo apt update
3) sudo apt install adcli sssd

Then modify /etc/sssd/sssd.conf to add "ad_use_ldaps = True", and restart sssd. Add firewall rules to block traffic to LDAP port 389 and Global Catalog port 3268:

$ sudo ufw deny 389
$ sudo ufw deny 3268

Then do another user lookup, and check the ports in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:44586 x.x.x.x:636  ESTABLISHED 28474/sssd_be
tcp 0 0 x.x.x.x:56136 x.x.x.x:3269 ESTABLISHED 28474/sssd_be

We see LDAPS port 636 and Global Catalog SSL port 3269 in use. The user lookup succeeds even with ports 389 and 3268 blocked, since it uses their authenticated and encrypted variants instead.

[Where problems could occur]

Firstly, the adcli and sssd packages will continue to work with AD servers that haven't had LDAP signing or authenticated channel binding enforced, since the measures are optional.
For both sssd and adcli, the changes don't implement anything new; instead, they add configuration and logic to select which protocol to use to talk to the AD server. LDAP and LDAPS are already implemented in both sssd and adcli; the changes just add some logic to select LDAPS over LDAP.

For sssd, the changes are hidden behind configuration parameters, such as "ldap_sasl_mech" and "ad_use_ldaps". If a regression were to occur, it would be limited to systems where the system administrator had enabled these configuration options in the /etc/sssd/sssd.conf file.

For adcli, the changes are more immediate. adcli will now use GSS-SPNEGO by default if the server supports it, which is a behaviour change. The "use-ldaps" option is a flag on the command line, e.g. "--use-ldaps", and if a regression were to occur, users can remove "--use-ldaps" from their command to fall back to the new GSS-SPNEGO defaults on
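Circling back to the [Testcase] section: the netstat check can be wrapped in a small helper. This is a hedged sketch; the 10.0.0.x addresses below are placeholders standing in for the x.x.x.x output quoted above:

```shell
# Hedged sketch: flag any sssd connection that is NOT on LDAPS (636)
# or Global Catalog SSL (3269). Feed it "netstat -tanp | grep sssd"
# style lines; it prints the offenders, so empty output means all
# traffic is on the encrypted ports.
check_encrypted_only() {
  awk '{ split($5, peer, ":"); if (peer[2] != "636" && peer[2] != "3269") print }'
}

# Example with placeholder addresses (a real run would pipe netstat in):
printf '%s\n' \
  'tcp 0 0 10.0.0.5:44586 10.0.0.9:636  ESTABLISHED 28474/sssd_be' \
  'tcp 0 0 10.0.0.5:56136 10.0.0.9:3269 ESTABLISHED 28474/sssd_be' \
  | check_encrypted_only    # prints nothing: only encrypted ports in use
```

A line such as the pre-fix "x.x.x.x:389" connection would be printed, making the regression easy to spot in a script.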
Re: [Sts-sponsors] Please review and potentially sponsor LP1868703 Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Hi Eric,

I have revised the patches and fixed the issues you found. The revised debdiffs are attached to the Launchpad bug. Please review and potentially sponsor.

Thanks,
Matthew

On Tue, Nov 10, 2020 at 2:28 AM Eric Desrochers wrote:
>
> I'll review it today or tomorrow,
>
> Thanks for the very detailed SRU template.
>
> On Sun, Nov 8, 2020 at 11:06 PM Matthew Ruffell wrote:
>>
>> Hello Dan, Eric and Mauricio,
>>
>> Can you please review and consider sponsoring LP1868703 [1]?
>>
>> [1] https://bugs.launchpad.net/bugs/1868703
>>
>> Debdiffs for adcli and sssd are attached to the bug, and are for
>> Bionic and Focal. Groovy has all the fixes already.
>>
>> Myself, the customer and the bug reporter have done some testing, and
>> things are looking good.
>>
>> Let me know if I need to make any changes or fix anything.
>>
>> Thanks,
>> Matthew

-- 
Mailing list: https://launchpad.net/~sts-sponsors
Post to     : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help   : https://help.launchpad.net/ListHelp
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for adcli for Focal.

** Patch added: "adcli debdiff for Focal v2"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432871/+files/lp1868703_adcli_focal_v2.debdiff

-- 
You received this bug notification because you are a member of STS Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

Status in Cyrus-sasl2: Unknown
Status in sssd package in Ubuntu: Fix Released
Status in adcli source package in Bionic: In Progress
Status in sssd source package in Bionic: In Progress
Status in adcli source package in Disco: Won't Fix
Status in sssd source package in Disco: Won't Fix
Status in adcli source package in Eoan: Won't Fix
Status in sssd source package in Eoan: Won't Fix
Status in adcli source package in Focal: In Progress
Status in sssd source package in Focal: In Progress
Status in adcli source package in Groovy: Fix Released
Status in sssd source package in Groovy: Fix Released
Status in sssd source package in Hirsute: Fix Released
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for adcli in Bionic.

** Patch added: "adcli debdiff for Bionic v2"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432874/+files/lp1868703_adcli_bionic_v2.debdiff

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli in Groovy.

** Patch added: "adcli debdiff for groovy"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432870/+files/lp1868703_adcli_groovy.debdiff
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli for Hirsute.

** Patch added: "adcli debdiff for hirsute"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432869/+files/lp1868703_adcli_hirsute.debdiff
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for adcli in Bionic. ** Patch added: "adcli debdiff for Bionic v2" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432874/+files/lp1868703_adcli_bionic_v2.debdiff -- You received this bug notification because you are a member of STS Sponsors, which is subscribed to the bug report. https://bugs.launchpad.net/bugs/1868703 Title: Support "ad_use_ldaps" flag for new AD requirements (ADV190023) Status in Cyrus-sasl2: Unknown Status in sssd package in Ubuntu: Fix Released Status in adcli source package in Bionic: In Progress Status in sssd source package in Bionic: In Progress Status in adcli source package in Disco: Won't Fix Status in sssd source package in Disco: Won't Fix Status in adcli source package in Eoan: Won't Fix Status in sssd source package in Eoan: Won't Fix Status in adcli source package in Focal: In Progress Status in sssd source package in Focal: In Progress Status in adcli source package in Groovy: Fix Released Status in sssd source package in Groovy: Fix Released Status in sssd source package in Hirsute: Fix Released Bug description: [Impact] Microsoft has released a new security advisory for Active Directory (AD) which outlines that man-in-the-middle attacks can be performed on a LDAP server, such as AD DS, that works by an attacker forwarding an authentication request to a Windows LDAP server that does not enforce LDAP channel binding or LDAP signing for incoming connections. To address this, Microsoft has announced new Active Directory requirements in ADV190023 [1][2]. [1] https://msrc.microsoft.com/update-guide/en-us/vulnerability/ADV190023 [2] https://support.microsoft.com/en-us/help/4520412/2020-ldap-channel-binding-and-ldap-signing-requirements-for-windows These new requirements strongly encourage system administrators to require LDAP signing and authenticated channel binding in their AD environments. 
The effect of this is to stop unauthenticated and unencrypted traffic from communicating over LDAP port 389, and to force authenticated and encrypted traffic instead, over LDAPS port 636 and Global Catalog over SSL port 3269. Microsoft will not be forcing this change via updates to their servers; system administrators must opt in and change their own configuration.

To support these new requirements in Ubuntu, changes need to be made to the sssd and adcli packages. Upstream has added a new flag "ad_use_ldaps" to sssd, and "use-ldaps" has been added to adcli. If "ad_use_ldaps = True", then sssd will send all communication over port 636, authenticated and encrypted. For adcli, if the server supports GSS-SPNEGO, it will now be used by default over the normal LDAP port 389. If the LDAP port is blocked, "use-ldaps" can now be used, which uses the LDAPS port 636 instead.

Without these patches, Ubuntu 18.04/20.04 LTS machines currently report the following error:

"[sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_domain_options]: Attribute 'ad_use_ldaps' is not allowed in section 'domain/test.com'. Check for typos."

These patches are needed to stay in line with Microsoft security advisories, since security-conscious system administrators would like to firewall off the LDAP port 389 in their environments and use LDAPS port 636 only.

[Testcase]

To test these changes, you will need to set up a Windows Server 2019 box, install and configure Active Directory, import the AD certificate to the Ubuntu clients, and create some users in Active Directory. From there, you can try doing a user search from the client to the AD server, and check which ports are used for communication.
Currently, you should see port 389 in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:43954 x.x.x.x:389 ESTABLISHED 27614/sssd_be
tcp 0 0 x.x.x.x:54381 x.x.x.x:3268 ESTABLISHED 27614/sssd_be

Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf294530-test

Instructions to install (on a bionic or focal system):
1) sudo add-apt-repository ppa:mruffell/sf294530-test
2) sudo apt update
3) sudo apt install adcli sssd

Then, modify /etc/sssd/sssd.conf to add "ad_use_ldaps = True", and restart sssd. Add firewall rules to block traffic to LDAP port 389 and Global Catalog port 3268:

$ sudo ufw deny 389
$ sudo ufw deny 3268

Then do another user lookup, and check the ports in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:44586 x.x.x.x:636 ESTABLISHED 28474/sssd_be
tcp 0 0 x.x.x.x:56136 x.x.x.x:3269 ESTABLISHED 28474/sssd_be

We see LDAPS port 636 and Global Catalog over SSL port 3269 in use. The user lookup succeeds even with ports 389 and 3268 blocked, since their authenticated and encrypted variants are used instead.

[Where problems could occur]

Firstly, the adcli and sssd packages will continue to work with AD servers that
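For reference, the sssd.conf change described in the testcase can be sketched as a minimal fragment. This is an assumption-laden illustration, not a configuration from the bug report: the domain name "test.com" is a placeholder (matching the error message above), and the provider settings are a typical AD setup, not mandated by the patch.

```ini
; /etc/sssd/sssd.conf -- minimal sketch; "test.com" is a placeholder domain
[sssd]
config_file_version = 2
services = nss, pam
domains = test.com

[domain/test.com]
; hypothetical AD-backed domain section
id_provider = ad
access_provider = ad
; With the patched sssd, send all AD communication over LDAPS port 636
; (and Global Catalog over SSL, port 3269)
ad_use_ldaps = True
```

The adcli side is opted into per invocation rather than via a config file, e.g. `adcli join --use-ldaps test.com` (exact join arguments depend on your environment).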
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli for Hirsute.

** Patch added: "adcli debdiff for hirsute" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432869/+files/lp1868703_adcli_hirsute.debdiff
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli in Groovy.

** Patch added: "adcli debdiff for groovy" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432870/+files/lp1868703_adcli_groovy.debdiff
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for sssd for Bionic.

** Patch added: "sssd debdiff for Bionic v2" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432867/+files/lp1868703_sssd_bionic_v2.debdiff