[Kernel-packages] [Bug 1909062] Re: Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure
** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux (Ubuntu Groovy)
       Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Groovy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Groovy)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Summary changed:

- Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure
+ qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP tx csum offload

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1909062

Title:
  qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not
  supporting IPIP tx csum offload

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Focal:
  In Progress
Status in linux source package in Groovy:
  In Progress

Bug description:
  With QL41xxx NICs and an Ubuntu DNS server, DNS failures are seen
  after updating to the latest Ubuntu 20.04.1 LTS kernel,
  5.4.0-52-generic. The issue was not observed with 4.5 ubuntu-linux.

  Problem Definition:

  OS Version: /etc/os-release shows Ubuntu 18.04.4 LTS, but the booted
  kernel is the latest Ubuntu 20.04.1 LTS version, 5.4.0-52-generic.

  NIC: 2 dual-port (4 ports) QLogic Corp. FastLinQ QL41000 Series
  10/25/40/50GbE Controller [1077:8070] (rev 02), inbox driver qede
  v8.37.0.20.

  Complete Detailed Problem Description:

  Anything that uses the internal Kubernetes DNS server fails. If an
  external DNS server is used, resolution works for non-Kubernetes IPs.
  A similar issue is described here:
  https://github.com/kubernetes/kubernetes/issues/95365

  The patch below, recently merged upstream, fixes this. (Note that the
  issue was introduced by the driver's tunnel offload support, which
  was added after the 4.5 kernel.)

  commit 5d5647dad259bb416fd5d3d87012760386d97530
  Author: Manish Chopra
  Date:   Mon Dec 21 06:55:30 2020 -0800
  Subject: qede: fix offload for IPIP tunnel packets

    IPIP tunnel packets are unknown to the device, hence these packets
    are incorrectly parsed and cause packet corruption, so disable
    offloads for such packets at run time.

  Signed-off-by: Manish Chopra
  Signed-off-by: Sudarsana Kalluru
  Signed-off-by: Igor Russkikh
  Link: https://lore.kernel.org/r/20201221145530.7771-1-mani...@marvell.com
  Signed-off-by: Jakub Kicinski

  Thanks,
  Manish

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1909062/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
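Until a kernel carrying the patch is available, one possible stopgap (my own assumption, not something stated in the bug report) is to disable tx checksum offload on the affected qede ports, so the NIC never mis-parses the IPIP packets. A dry-run sketch, where the interface name is a placeholder:

```shell
# Dry run: print the ethtool commands that would disable tx checksum
# offload on a qede port. IFACE=ens1f0 is a placeholder; drop the echo
# and run as root to actually apply it. This trades some CPU for
# uncorrupted IPIP/tunnel traffic.
IFACE=ens1f0
echo "ethtool -K $IFACE tx off"
echo "ethtool -k $IFACE | grep tx-checksumming"   # verify it then reads: off
```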
[Kernel-packages] [Bug 1909062] Re: Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure
** Also affects: linux (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

-- 
https://bugs.launchpad.net/bugs/1909062

Title:
  Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS
  failure

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Focal:
  New
Status in linux source package in Groovy:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1909062/+subscriptions
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Thanks Tobias for the testing. Good to hear it functions as intended.

Performing verification for Bionic.

I installed adcli 0.8.2-1ubuntu1.2 from -proposed, and joined a domain
without using the --use-ldaps flag:
https://paste.ubuntu.com/p/RByVZRPhCK/

Next, I added the firewall rules from the test section:

# ufw deny out 389
# ufw deny out 3268
# ufw enable

Now I tried to join, again without --use-ldaps:
https://paste.ubuntu.com/p/KMPNtS5SYK/

I got rejected, due to the firewall. Now, let's try connecting with
--use-ldaps:
https://paste.ubuntu.com/p/bKzx6K6PXd/

The realm join works, and I checked with strace to see which port is
being used:

connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0

We see port 636, as expected. I am happy with the packages in
-proposed: they implement the new feature properly and, more
importantly, fix the regression from bug 1906627. Happy to mark as
verified.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1908473] [NEW] rsyslog-relp: imrelp module leaves sockets in CLOSE_WAIT state which leads to file descriptor leak
Public bug reported:

[Impact]

In recent versions of rsyslog and librelp, the imrelp module leaks
file descriptors due to a bug where it does not correctly close
sockets, and instead leaves them in the CLOSE_WAIT state.

This causes rsyslogd on busy servers to eventually hit the limit of
maximum open files allowed, which locks rsyslogd up until it is
restarted.

A workaround is to restart rsyslogd every month or so to manually
close all of the open sockets.

Only users of the imrelp module are affected, and not rsyslog users in
general.

[Testcase]

Install the rsyslog-relp module like so:

$ sudo apt install rsyslog rsyslog-relp

Next, create a working directory, and make a config file that loads
the relp module:

$ sudo mkdir /workdir
$ cat << EOF >> ./spool.conf
\$LocalHostName spool
\$AbortOnUncleanConfig on
\$PreserveFQDN on

global(
  workDirectory="/workdir"
  maxMessageSize="256k"
)

main_queue(queue.type="Direct")

module(load="imrelp")
input(
  type="imrelp"
  name="imrelp"
  port="601"
  ruleset="spool"
  MaxDataSize="256k"
)

ruleset(name="spool" queue.type="direct") {
}

# Just so rsyslog doesn't whine that we do not have outputs
ruleset(name="noop" queue.type="direct") {
  action(
    type="omfile"
    name="omfile"
    file="/workdir/spool.log"
  )
}
EOF

Verify that the config is valid, then start an rsyslog server:

$ sudo rsyslogd -f ./spool.conf -N9
$ sudo rsyslogd -f ./spool.conf -i /workdir/rsyslogd.pid

Fetch the rsyslogd PID and check for open files:

$ RLOGPID=$(cat /workdir/rsyslogd.pid)
$ sudo ls -l /proc/$RLOGPID/fd
total 0
lr-x------ 1 root root 64 Dec 17 01:22 0 -> /dev/urandom
lrwx------ 1 root root 64 Dec 17 01:22 1 -> 'socket:[41228]'
lrwx------ 1 root root 64 Dec 17 01:22 3 -> 'socket:[41222]'
lrwx------ 1 root root 64 Dec 17 01:22 4 -> 'socket:[41223]'
lrwx------ 1 root root 64 Dec 17 01:22 7 -> 'anon_inode:[eventpoll]'

We have 3 sockets open by default.

Next, use netcat to open 100 connections:

$ for i in {1..100} ; do nc -z 127.0.0.1 601 ; done

Now check for open file descriptors, and there will be an extra 100
sockets in the list:

$ sudo ls -l /proc/$RLOGPID/fd
https://paste.ubuntu.com/p/f6NQVNbZcR/

We can check the state of these sockets with:

$ ss -t
https://paste.ubuntu.com/p/7Ts2FbxJrg/

The listening sockets will be in CLOSE-WAIT, and the netcat sockets
will be in FIN-WAIT-2.

If you install the test package available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf299578-test
then when you open connections with netcat, they will be closed
properly, and the file descriptor leak will be fixed.

[Where problems could occur]

If a regression were to occur, it would be limited to users of the
imrelp module, which is part of the rsyslog-relp package and depends
on librelp.

rsyslog-relp is not part of a default installation of rsyslog, and is
opt-in by changing a configuration file to enable imrelp.

The changes to rsyslog implement a testcase which exercises the
problematic code to ensure things are working as expected, and should
run during autopkgtest time.

[Other]

Upstream bug list:
https://github.com/rsyslog/rsyslog/issues/4350
https://github.com/rsyslog/rsyslog/issues/4005
https://github.com/rsyslog/librelp/issues/188

The following commits fix the problem:

rsyslogd
===
commit baee0bd5420649329793746f0daf87c4f59fe6a6
Author: Andre Lorbach
Date:   Thu Apr 9 13:00:35 2020 +0200
Subject: testbench: Add test for imrelp to check broken session handling.
Link: https://github.com/rsyslog/rsyslog/commit/baee0bd5420649329793746f0daf87c4f59fe6a6

librelp
===
commit 7907c9c57f6ed94c8ce5a4e63c3c4e019f71cff0
Author: Andre Lorbach
Date:   Mon May 11 14:59:55 2020 +0200
Subject: fix memory leak on session break.
Link: https://github.com/rsyslog/librelp/commit/7907c9c57f6ed94c8ce5a4e63c3c4e019f71cff0

commit 4a6ad8637c244fd3a1caeb9a93950826f58e956a
Author: Andre Lorbach
Date:   Wed Apr 8 15:55:32 2020 +0200
Subject: replsess: fix double free of sendbuf in some cases.
Link: https://github.com/rsyslog/librelp/commit/4a6ad8637c244fd3a1caeb9a93950826f58e956a

** Affects: librelp (Ubuntu)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: rsyslog (Ubuntu)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: librelp (Ubuntu Focal)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: rsyslog (Ubuntu Focal)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: librelp (Ubuntu Groovy)
   Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
       Status: In Progress

** Affects: rsyslog
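While running the testcase above, the CLOSE_WAIT sockets on the relp port can also be counted straight from /proc, without ss. This is my own sketch, not part of the original testcase; it assumes imrelp is listening on port 601 (0x0259 in hex), as configured above, and that CLOSE_WAIT is state 08 in /proc/net/tcp:

```shell
# Count sockets in CLOSE_WAIT (state code 08 in /proc/net/tcp) whose
# local port is 601 (hex 0259), i.e. the imrelp listener from the
# testcase config. Field 2 is the local address, field 4 the state.
port_hex="0259"
close_wait=$(awk -v p=":${port_hex}" '$2 ~ p"$" && $4 == "08"' /proc/net/tcp | wc -l)
echo "$close_wait"
```

On a patched librelp this count stays flat while the netcat loop runs; on a leaking one it climbs by one per connection.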
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Hi Tobias,

If you have a moment, could you please help test the new adcli package
in -proposed? Mainly focusing on testing Bionic, to ensure the
regression has been fixed. Can you run through some tests with and
without the --use-ldaps flag?

You can install the new adcli package from -proposed like so:

1) Enable -proposed by running the following command to make a new
sources.list.d entry:

cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF

2) sudo apt update
3) sudo apt install adcli
4) sudo apt-cache policy adcli | grep Installed
   Installed: 0.8.2-1ubuntu1.2
5) sudo apt-cache policy libsasl2-modules-gssapi-mit | grep Installed
   Installed: 2.1.27~101-g0780600+dfsg-3ubuntu2.3
6) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
7) sudo apt update

In my testing, everything works as intended. This new version fixes
the regression from bug 1906627, as GSS-SPNEGO is now compatible with
the implementation in Active Directory.

I will be marking this bug as verified in the coming days, once I am
satisfied with my own testing.

Thanks,
Matthew

** Tags removed: verification-done verification-failed-bionic
** Tags added: verification-needed verification-needed-bionic

https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions
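Step 1 above just writes a one-line apt source entry. As a sketch, here is the line it produces for a fixed release codename (bionic is assumed purely for illustration; the real command substitutes $(lsb_release -cs) for the running release):

```shell
# Render the -proposed sources.list entry for a given codename.
# "bionic" is an assumption for illustration; normally this comes from
# $(lsb_release -cs).
codename=bionic
entry="deb http://archive.ubuntu.com/ubuntu/ ${codename}-proposed main universe"
echo "$entry"
```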
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
To anyone following this bug:

As we get ready to re-release the new adcli package which implements
the --use-ldaps flag, if you are happy to spend a few moments testing
the new package, I would really appreciate it. I really don't want to
cause another regression.

You can install the new adcli package from -proposed like so:

1) Enable -proposed by running the following command to make a new
sources.list.d entry:

cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF

2) sudo apt update
3) sudo apt install adcli
4) sudo apt-cache policy adcli | grep Installed
   Installed: 0.8.2-1ubuntu1.2
5) sudo apt-cache policy libsasl2-modules-gssapi-mit | grep Installed
   Installed: 2.1.27~101-g0780600+dfsg-3ubuntu2.3
6) sudo rm /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
7) sudo apt update

From there, join your domain like normal, and if you like, try out
other adcli or realm commands to ensure they work.

Let me know how the new adcli package in -proposed goes. In my
testing, it fixes the regression and works as intended.

To Jason Alavaliant, thanks! I really appreciate the help testing.

Thanks,
Matthew

https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions
[Touch-packages] [Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu.
https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

Status in adcli package in Ubuntu:
  Fix Released
Status in cyrus-sasl2 package in Ubuntu:
  Fix Released
Status in adcli source package in Bionic:
  Fix Committed
Status in cyrus-sasl2 source package in Bionic:
  Fix Committed

Bug description:
  [Impact]

  A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a
  regression for some users when attempting to join an Active
  Directory realm.

  adcli introduced a default behaviour change, moving from GSS-API to
  GSS-SPNEGO as the default channel encryption algorithm. adcli uses
  the GSS-SPNEGO implementation from libsasl2-modules-gssapi-mit, a
  part of cyrus-sasl2.

  The implementation seems to have some compatibility issues with
  particular configurations of Active Directory on recent Windows
  Server systems. In particular, adcli sends an LDAP query to the
  domain controller, which responds with a TCP ACK, but never returns
  an LDAP response. The connection just hangs at this point and no
  more traffic is sent. You can see it in the packet trace below:
  https://paste.ubuntu.com/p/WRnnRMGBPm/

  On Focal, where the implementation of GSS-SPNEGO is working, we see
  a full exchange, and adcli works as expected:
  https://paste.ubuntu.com/p/8668pJrr2m/

  The fix is to not assume use of confidentiality and integrity modes,
  and instead use the flags negotiated by GSS-API during the initial
  handshake, as required by Microsoft's implementation.

  [Testcase]

  You will need to set up a Windows Server 2019 system, install and
  configure Active Directory, enable LDAP extensions, configure LDAPS,
  and import the AD SSL certificate to the Ubuntu client. Create some
  users in Active Directory.

  On the Ubuntu client, set up /etc/hosts with the hostname of the
  Windows Server machine, if your system isn't configured for AD DNS.

  From there, install adcli 0.8.2-1 from -release:

  $ sudo apt install adcli

  Set up a packet trace with tcpdump:

  $ sudo tcpdump -i any port '(389 or 3268 or 636 or 3269)'

  Next, join the AD realm using the normal GSS-API:

  # adcli join --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL

  You will be prompted for Administrator's password. The output should
  look like the below:
  https://paste.ubuntu.com/p/NWHGQn746D/

  Next, enable -proposed, and install adcli 0.8.2-1ubuntu1, which
  caused the regression. Repeat the above steps. Now you should see
  the connection hang:
  https://paste.ubuntu.com/p/WRnnRMGBPm/

  Finally, install the fixed cyrus-sasl2 packages from the test ppa:
  https://launchpad.net/~mruffell/+archive/ubuntu/lp1906627-test

  $ sudo apt-get update
  $ sudo apt install libsasl2-2 libsasl2-modules libsasl2-modules-db libsasl2-modules-gssapi-mit

  Repeat the steps. GSS-SPNEGO should be working as intended, and you
  should get output like below:
  https://paste.ubuntu.com/p/W5cJNGvCsx/

  [Where problems could occur]

  Since we are changing the implementation of GSS-SPNEGO, and
  cyrus-sasl2 is the library which provides it, we can potentially
  break any package which depends on libsasl2-modules-gssapi-mit for
  GSS-SPNEGO.

  $ apt rdepends libsasl2-modules-gssapi-mit
  libsasl2-modules-gssapi-mit
  Reverse Depends:
   |Suggests: ldap-utils
    Depends: adcli
    Conflicts: libsasl2-modules-gssapi-heimdal
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Performing verification for Bionic.

Firstly, I installed adcli and libsasl2-modules-gssapi-mit from
-updates:

adcli 0.8.2-1
libsasl2-modules-gssapi-mit 2.1.27~101-g0780600+dfsg-3ubuntu2.1

From there, I joined an Active Directory realm:
https://paste.ubuntu.com/p/zJhvpRzktk/

Next, I enabled -proposed and installed the fixed cyrus-sasl2 and
adcli packages:
https://paste.ubuntu.com/p/cRrbkjjFmw/

We see that installing adcli 0.8.2-1ubuntu1.2 automatically pulls in
the fixed cyrus-sasl2 2.1.27~101-g0780600+dfsg-3ubuntu2.3 packages
because of the depends we set.

Next, I joined an Active Directory realm using the same commands as
before, i.e. not using the new --use-ldaps flag, but instead falling
back to GSS-API and the new GSS-SPNEGO changes:
https://paste.ubuntu.com/p/WdKYxxDBQm/

The join succeeds, and does not get stuck. This shows that the
implementation of GSS-SPNEGO is now compatible with Active Directory,
and that the new adcli package is using the new implementation.

Looking at the packet trace, we see the full 30 or so packets
exchanged, which matches the expected count:
https://paste.ubuntu.com/p/k9njh3jYHh/

With these changes, the adcli and cyrus-sasl2 packages in -proposed
can join realms in the same ways that the initial packages in -updates
can. These changes fix the recent adcli regression. Happy to mark
verified.

** Tags removed: regression-update verification-needed verification-needed-bionic
** Tags added: verification-done-bionic

https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi @hloeung, these patches are available in 4.15.0-128-generic, and 5.4.0-58-generic. They are both re-spins of 4.15.0-126-generic and 5.4.0-56-generic, respectively. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Status in linux source package in Focal: Fix Released Bug description: BugLink: https://bugs.launchpad.net/bugs/1898786 [Impact] Systems that utilise bcache can experience extremely high IO wait times when under constant IO pressure. The IO wait times stay at a consistent 1 second, and never drop as long as the bcache shrinker is enabled. If you disable the shrinker, IO wait drops significantly, to normal levels. We did some perf analysis, and it seems we spend a huge amount of time in bch_mca_scan(), likely waiting for the mutex "c->bucket_lock". Looking at the recent commits in Bionic, we found the following commit, merged in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li
Date: Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
> and releasing mutex c->bucket_lock.

This seems to be what is going on here, but the above commit only addresses the case when nr is 0.
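The arithmetic that the quoted commit message describes can be sketched as follows. This is a hedged sketch, not the kernel source: nr_to_shrink is a hypothetical helper that mirrors the integer division bch_mca_scan() applies to sc->nr_to_scan.

```python
# Hedged sketch (not the kernel code): bch_mca_scan() scales the
# requested scan count by c->btree_pages using integer division, so
# small scan requests round down to zero and nothing is shrunk.
def nr_to_shrink(nr_to_scan: int, btree_pages: int) -> int:
    return nr_to_scan // btree_pages

print(nr_to_shrink(2, 4))   # small request rounds down: 0 nodes shrunk
print(nr_to_shrink(16, 4))  # larger request: 4 nodes considered
```

With nr_to_scan commonly arriving as 1 or 2, the result is 0, and the call does nothing but take and release c->bucket_lock.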
From what I can see, the problems we are experiencing are when nr is 1 or 2: again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to release it, since the shrinker loop never executes because there is no work to do. [Fix] The following commits fix the problem, and all landed in 5.6-rc1:

commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
Author: Coly Li
Date: Fri Jan 24 01:01:40 2020 +0800
Subject: bcache: remove member accessed from struct btree
Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
Author: Coly Li
Date: Fri Jan 24 01:01:41 2020 +0800
Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

commit e3de04469a49ee09c89e80bf821508df458ccee6
Author: Coly Li
Date: Fri Jan 24 01:01:42 2020 +0800
Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

The first commit is a dependency of the other two. It removes a "recently accessed" marker, used to indicate that a particular cache entry has been used recently and, if so, to exclude it from cache eviction. The commit mentions that under heavy IO, all caches end up being recently accessed, and nothing is ever shrunk. The second commit changes a previous design decision of skipping the first 3 caches when shrinking, since it is common to call bch_mca_scan() with nr being 1 or 2, just as 0 was common in the very first commit I mentioned. In that case the loop exits and nothing happens, and we waste time waiting on locks, just as before. The fix is to try to shrink caches from the tail of the list, not the beginning.
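The reap-from-tail behaviour can be illustrated with a small sketch. This is a hedged illustration only: a plain Python list stands in for c->btree_cache (head holding the most recently used entries), and reap_from_tail is a hypothetical helper, not the kernel function.

```python
# Hedged illustration: reaping from the tail frees the oldest cache
# entries even when nr is small, without reordering the rest of the
# list - which is what the third commit preserves.
def reap_from_tail(cache: list, nr: int) -> list:
    freed = []
    for _ in range(min(nr, len(cache))):
        freed.append(cache.pop())  # take the last (oldest) entry
    return freed

cache = ["newest", "recent", "old", "oldest"]  # head = most recent
print(reap_from_tail(cache, 2))  # frees 'oldest', then 'old'
print(cache)                     # remaining entries keep their order
```

Reaping from the head, by contrast, would evict exactly the entries most likely to be needed again under heavy IO.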
The third commit fixes a minor issue where we don't want to re-arrange the linked list c->btree_cache, which is what the second commit ended up doing, and instead, just shrink the cache at the end of the linked list, and not change the order. One minor backport / context adjustment was required in the first commit for Bionic, and the rest are all clean cherry picks to Bionic and Focal. [Testcase] This is kind of hard to test, since the problem shows up in production environments that are under constant IO pressure, with many different items entering and leaving the cache. The Launchpad git server is currently suffering this issue, and has been sitting at a constant IO wait of 1 second / slightly over 1 second which was causing slow response times, which was leading to build failures when git clones ended up timing out. We installed a test kernel, which is available in the following PPA:
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin, The respun kernel has now landed in -updates, and is version 4.15.0-128-generic. Please re-schedule the maintenance window for the Launchpad git server, and re-attempt moving to the fixed kernel. Thanks, Matthew -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Performing verification for Focal. I spun up an m5d.4xlarge instance on AWS, to utilise the 2x 300GB NVMe drives that support block discard. I enabled -proposed, and installed the 5.4.0-58-generic kernel. The following is the repro session running through the full testcase: https://paste.ubuntu.com/p/Zr4C2pMbrk/ A 2-disk Raid10 array was created, LVM created and formatted ext4. I let the consistency checks finish, then created and deleted a file. Did another consistency check, then performed an fstrim. After another consistency check, we unmount and perform a fsck on each individual disk.

root@ip-172-31-1-147:/home/ubuntu# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

root@ip-172-31-1-147:/home/ubuntu# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

Both of them pass; there is no corruption to the filesystem. 5.4.0-58-generic fixes the problem, and the revert is effective. Marking bug as verified for Focal. ** Tags removed: verification-needed-focal ** Tags added: verification-done-focal -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262 Title: raid10: discard leads to corrupted file system Status in linux package in Ubuntu: Confirmed Status in linux source package in Trusty: Invalid Status in linux source package in Xenial: Invalid Status in linux source package in Bionic: Fix Committed Status in linux source package in Focal: Fix Committed Status in linux source package in Groovy: Fix Committed Bug description: Seems to be closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578 After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a severe number of mismatches between two RAID10 component devices. This bug affects several machines in our company with different HW configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected. How to reproduce:

- Create a RAID10 LVM and filesystem on two SSDs:
  mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
  pvcreate -ff -y /dev/md0
  vgcreate -f -y VolGroup /dev/md0
  lvcreate -n root -L 100G -ay -y VolGroup
  mkfs.ext4 /dev/VolGroup/root
  mount /dev/VolGroup/root /mnt
- Write some data, sync and delete it:
  dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
  sync
  rm /mnt/data.raw
- Check the RAID device:
  echo check >/sys/block/md0/md/sync_action
- After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
  cat /sys/block/md0/md/mismatch_cnt
- Trigger the bug:
  fstrim /mnt
- Re-check the RAID device:
  echo check >/sys/block/md0/md/sync_action
- After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
  cat /sys/block/md0/md/mismatch_cnt

After investigating this issue on several machines it *seems* that the first drive does the trim correctly while the second one goes wild. At least the number and severity of errors found by a USB stick live session fsck.ext4 suggests this.
To perform the single drive evaluation, the RAID10 was started using a single drive at a time:

mdadm --assemble /dev/md127 /dev/nvme0n1p2
mdadm --run /dev/md127
fsck.ext4 -n -f /dev/VolGroup/root
vgchange -a n /dev/VolGroup
mdadm --stop /dev/md127
mdadm --assemble /dev/md127 /dev/nvme1n1p2
mdadm --run /dev/md127
fsck.ext4 -n -f /dev/VolGroup/root

When starting these fscks without -n, on the first device the directory structure seems OK, while on the second device there is only the lost+found folder left. Side-note: another machine using HWE kernel 5.4.0-56 (after using -53 before) seems to have a quite similar issue. Unfortunately the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issues during FS creation. This bug, on the other hand, is triggered by a weekly service (fstrim), causing severe file system corruption. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe :
[Bug 1907262] Re: raid10: discard leads to corrupted file system
Performing verification for Bionic. I spun up an m5d.4xlarge instance on AWS, to utilise the 2x 300GB NVMe drives that support block discard. I enabled -proposed, and installed the 4.15.0-128-generic kernel. The following is the repro session running through the full testcase: https://paste.ubuntu.com/p/VpwjbRRcy6/ A 2-disk Raid10 array was created, LVM created and formatted ext4. I let the consistency checks finish, then created and deleted a file. Did another consistency check, then performed an fstrim. After another consistency check, we unmount and perform a fsck on each individual disk.

root@ip-172-31-10-77:~# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

root@ip-172-31-10-77:~# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks

Both of them pass; there is no corruption to the filesystem. 4.15.0-128-generic fixes the problem, and the revert is effective. Marking bug as verified for Bionic. ** Tags removed: verification-needed-bionic ** Tags added: verification-done-bionic -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1907262 Title: raid10: discard leads to corrupted file system To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Re: [Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
Hi Lukasz, I think you understand the plan correctly. Here it is in bullet points:

1) Re-instate Bionic sssd 1.16.1-1ubuntu1.7 and Focal sssd 2.2.3-3ubuntu0.1 to -updates. Their [what could go wrong] still holds, as their changes are behind an opt-in configuration file option, and they have been tested by me, the customer, and the original bug reporter. They are unlikely to cause regressions, and if they do, the changes are opt-in via an intentional configuration file change.

2) Re-instate Groovy adcli 0.9.0-1ubuntu1.2 to -updates. The changes to adcli on Groovy are minimal, and will not cause any problems.

3) Build (likely in the special security PPA), and accept the cyrus-sasl2 upload to bionic-proposed. We need to start the ball rolling on fixing the root cause, which is the bad GSS-SPNEGO implementation in Bionic.

4) Delete adcli 0.8.2-1ubuntu2 from the bionic-proposed upload queue. It is likely a bit late for a revert package now; affected users would have downgraded to adcli from -release. We will push for a fix instead.

5) Go with option one from the previous email: build, and accept adcli 0.8.2-1ubuntu2.1 to bionic-proposed. This builds on 0.8.2-1ubuntu1 with the SRU changes, and depends on the fixed cyrus-sasl2 package. https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

6) Although adcli for Focal should be safe for release, we will play it safe and only release it when adcli for Bionic is ready.

7) I will re-test and verify adcli on both Bionic and Focal, as well as test and verify cyrus-sasl2. I will also get the customer to perform some testing.

8) Once all testing has been completed, we will release adcli for Bionic and Focal, and cyrus-sasl2, to -updates.

I hope this action plan is okay. Feel free to ask for clarifications before we put the plan into action. Thanks, Matthew On Thu, Dec 10, 2020 at 5:29 AM Lukasz Zemczak wrote: > > Ok, thanks for the clarification!
> > So, if I understand correctly, we should reinstate the reverted sssd > for all the series, and adcli for focal and groovy? Then for bionic > accept the cyrus-sasl2 upload + possibly an adcli with the changes > that were reverted? I suppose adcli would need a breaks statement in > that case. > > Anyway, I'm around if any SRU reviews or package copying is needed. > Let me reach out to Eric. > > Cheers, > > On Wed, 9 Dec 2020 at 05:13, Matthew Ruffell > wrote: > > > > > Ok, so there was a LOT happening in this thread, so I'd use some quick > > > summary. > > > Since what I'd like to know: > > > > > 1) Does this cyrus-sasl2 fix both the adcli and sssd regressions? > > > Since we reverted both as people were reporting regressions first for sssd > > > and then for adcli - not sure which one was the actual cause of it though > > > > The cyrus-sasl2 fix fixes the adcli regression, due to adcli changing to > > using > > GSS-SPNEGO by default, which was broken. > > > > sssd never had a regression in the first place, due to the changes having > > nothing to do with GSS-SPNEGO. > > > > The confusion with sssd came from confused users who did not know that adcli > > is the program under the hood of realm, and thought that sssd had broken, > > when > > in reality, it was adcli. > > > > > 2) Does it need fixing for all the stable series where we updated adcli > > > and > > > (additionally) sssd? > > > > cyrus-sasl2 is only broken in Bionic. Focal onward already have the patch > > and > > work fine. > > > > Let me know if you have any more questions, happy to answer. > > > > Thanks, > > Matthew > > > > On Tue, Dec 8, 2020 at 4:57 PM Matthew Ruffell > > wrote: > > > > > > Hello Eric and Lukasz, > > > > > > I have created new debdiffs for adcli. Please review and also sponsor one > > > of them to -proposed. > > > > > > Since there are multiple versions of adcli floating around I made two > > > debdiffs. > > > > > > Please choose the one most convenient / cleanest to apply. 
> > > > > > The first simply builds ontop of 0.8.2-1ubuntu1 currently in -proposed, > > > and is > > > the version pull-lp-source pulls down. It simply adds the dependency > > > to the fixed > > > libsasl2-modules-gssapi-mit package with a greater than or equal to > > > relationship. > > > > > > Use of this debdiff requires 0.8.2-1ubuntu2 to be deleted from the upload > > > queue, > > > and treated as 0.8.2-1ubuntu2 never existed. > > > > > > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachm
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Hi Markus, I am deeply sorry for causing the regression. We are aware, and are tracking the issue in bug 1907262. The kernel team have started an emergency revert, and you can expect fixed kernels to be released in the next day or so. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1896578 Title: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: Fix Released Status in linux source package in Focal: Fix Released Status in linux source package in Groovy: Fix Released Bug description: BugLink: https://bugs.launchpad.net/bugs/1896578 [Impact] Block discard is very slow on Raid10, which causes common use cases that invoke block discard, such as mkfs and fstrim operations, to take a very long time. For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices that support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and 11 minutes, where the same mkfs.xfs operation on Raid 0 takes 4 seconds. The bigger the devices, the longer it takes. The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
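The scale of that splitting can be checked with quick arithmetic (a sketch; 1.9 TB is taken as a round decimal figure):

```python
# Hedged arithmetic sketch: discarding ~1.9 TB in 512 KiB bios.
total_bytes = 1_900_000_000_000   # ~1.9 TB to discard
chunk = 512 * 1024                # raid10 discard_max_bytes (512 KiB)
bios = -(-total_bytes // chunk)   # ceiling division
print(bios)  # roughly 3.6 million bio requests
```

Each of those millions of bios has to pass through the raid10 write path individually, which is where the minutes-long mkfs and fstrim times come from.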
  For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

  $ cat /sys/block/nvme0n1/queue/discard_max_bytes
  2199023255040
  $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
  2199023255040

  Where the Raid10 md device only supports 512k:

  $ cat /sys/block/md0/queue/discard_max_bytes
  524288
  $ cat /sys/block/md0/queue/discard_max_hw_bytes
  524288

  If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

  $ sudo cat /proc/1626/stack
  [<0>] wait_barrier+0x14c/0x230 [raid10]
  [<0>] regular_request_wait+0x39/0x150 [raid10]
  [<0>] raid10_write_request+0x11e/0x850 [raid10]
  [<0>] raid10_make_request+0xd7/0x150 [raid10]
  [<0>] md_handle_request+0x123/0x1a0
  [<0>] md_submit_bio+0xda/0x120
  [<0>] __submit_bio_noacct+0xde/0x320
  [<0>] submit_bio_noacct+0x4d/0x90
  [<0>] submit_bio+0x4f/0x1b0
  [<0>] __blkdev_issue_discard+0x154/0x290
  [<0>] blkdev_issue_discard+0x5d/0xc0
  [<0>] blk_ioctl_discard+0xc4/0x110
  [<0>] blkdev_common_ioctl+0x56c/0x840
  [<0>] blkdev_ioctl+0xeb/0x270
  [<0>] block_ioctl+0x3d/0x50
  [<0>] __x64_sys_ioctl+0x91/0xc0
  [<0>] do_syscall_64+0x38/0x90
  [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [Fix]

  Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.10-rc1.
  commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
  Author: Xiao Ni
  Date: Tue Aug 25 13:42:59 2020 +0800
  Subject: md: add md_submit_discard_bio() for submitting discard bio
  Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0

  commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
  Author: Xiao Ni
  Date: Tue Aug 25 13:43:00 2020 +0800
  Subject: md/raid10: extend r10bio devs to raid disks
  Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3

  commit f046f5d0d79cdb968f219ce249e497fd1accf484
  Author: Xiao Ni
  Date: Tue Aug 25 13:43:01 2020 +0800
  Subject: md/raid10: pull codes that wait for blocked dev into one function
  Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484

  commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
  Author: Xiao Ni
  Date: Wed Sep 2 20:00:22 2020 +0800
  Subject: md/raid10: improve raid10 discard request
  Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9

  commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
  Author: Xiao Ni
  Date: Wed Sep 2 20:00:23 2020 +0800
  Subject: md/raid10: improve discard request for far layout
  Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359

  There is also an additional commit which is required, and was merged after "md/raid10: improve raid10 discard request" was merged. The following commits enable Raid10 to use large discards, instead of splitting them into many bios, since the technical hurdles have now been removed.

  commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
  Author: Mike Snitzer
  Date: Thu Sep 24 13:14:52 2020 -0400
  Subject: dm raid: fix discard limits for raid1 and raid10
  Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512

  commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
  Author: Mike Snitzer
  Date: Thu Sep 24 16:40:12 2020 -0400
  Subject: dm raid: remove unnecessary discard limits for raid10
  Link:
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Hi Thimo,

Firstly, thank you for your bug report, we really, really appreciate it.

You are correct, the recent raid10 patches appear to cause filesystem corruption on raid10 arrays. I have spent the day reproducing, and I can confirm that the 4.15.0-126-generic, 5.4.0-56-generic and 5.8.0-31-generic kernels are affected.

The kernel team are aware of the situation, and we have begun an emergency revert of the patches, and we should have new kernels available in the next few hours / day or so.

The current mainline kernel is affected, so I have written to the raid subsystem maintainer, and the original author of the raid10 block discard patches, to aid with debugging and fixing the problem. You can follow the upstream thread here:

https://www.spinics.net/lists/kernel/msg3765302.html

As for the data corruption on your servers, I am deeply sorry for causing this regression. When I was testing the raid10 block discard patches on the Ubuntu stable kernels, I did not think to fsck each of the disks in the array; instead, I was content with the speed of creating new arrays, writing a basic dataset to the disks, and rebooting the server to ensure the array came up again with those same files.

Since the first disk seems to be okay, there is at least a small window of opportunity for you to restore any data that you have not backed up.

I will keep you informed of getting the patches reverted, and getting the root cause fixed upstream. If you have any questions, feel free to ask, and if you have any more details from your own debugging, feel free to share them in this bug, or on the upstream mailing list discussion.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Bionic: In Progress
Status in linux source package in Focal: In Progress
Status in linux source package in Groovy: In Progress

Bug description:

  Seems to be closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a severe number of mismatches between two RAID10 component devices. This bug affects several machines in our company with different HW configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected.

  How to reproduce:

  - Create a RAID10, LVM and filesystem on two SSDs:
      mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
      pvcreate -ff -y /dev/md0
      vgcreate -f -y VolGroup /dev/md0
      lvcreate -n root -L 100G -ay -y VolGroup
      mkfs.ext4 /dev/VolGroup/root
      mount /dev/VolGroup/root /mnt
  - Write some data, sync and delete it:
      dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
      sync
      rm /mnt/data.raw
  - Check the RAID device:
      echo check >/sys/block/md0/md/sync_action
  - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
      cat /sys/block/md0/md/mismatch_cnt
  - Trigger the bug:
      fstrim /mnt
  - Re-check the RAID device:
      echo check >/sys/block/md0/md/sync_action
  - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
      cat /sys/block/md0/md/mismatch_cnt

  After investigating this issue on several machines it *seems* that the first drive does the trim correctly while the second one goes wild. At least the number and severity of errors found by a USB stick live session fsck.ext4 suggests this.

  To perform the single drive evaluation, the RAID10 was started using a single drive at a time:

      mdadm --assemble /dev/md127 /dev/nvme0n1p2
      mdadm --run /dev/md127
      fsck.ext4 -n -f /dev/VolGroup/root
      vgchange -a n /dev/VolGroup
      mdadm --stop /dev/md127

      mdadm --assemble /dev/md127 /dev/nvme1n1p2
      mdadm --run /dev/md127
      fsck.ext4 -n -f /dev/VolGroup/root

  When starting these fscks without -n, on the first device it seems the directory structure is OK, while on the second device there is only the lost+found folder left.

  Side-note: Another machine using HWE kernel 5.4.0-56 (after using -53 before) seems to have a quite similar issue.

  Unfortunately the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issues during FS creation. This bug, on the other hand, is triggered by a weekly service (fstrim), causing severe file system corruption.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help :
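The before/after mismatch check in the reproduction steps can be wrapped up as two small helpers. This is an editorial sketch, not part of the bug report; SYSFS and MD are overridable so the logic can be read, or exercised against a mock sysfs tree, without touching a live array:

```shell
SYSFS="${SYSFS:-/sys}"
MD="${MD:-md0}"

# Kick off a raid consistency check (completion is visible in /proc/mdstat).
request_check() {
    echo check > "$SYSFS/block/$MD/md/sync_action"
}

# Read the mismatch counter; non-zero after fstrim indicates the bug.
mismatch_count() {
    cat "$SYSFS/block/$MD/md/mismatch_cnt"
}
```

Usage follows the steps above: run request_check, wait for the check to finish, record mismatch_count, run fstrim on the mounted filesystem, then repeat; a jump from 0 reproduces the corruption described in this report.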
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Bionic)
   Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
   Status: New => In Progress

** Changed in: linux (Ubuntu Groovy)
   Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Groovy)
   Importance: Undecided => High

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Bionic: In Progress
Status in linux source package in Focal: In Progress
Status in linux source package in Groovy: In Progress
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
Re: [Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
> Ok, so there was a LOT happening in this thread, so I'd use some quick
> summary. Since what I'd like to know:
> 1) Does this cyrus-sasl2 fix both the adcli and sssd regressions?
> Since we reverted both as people were reporting regressions first for sssd
> and then for adcli - not sure which one was the actual cause of it though

The cyrus-sasl2 fix fixes the adcli regression, due to adcli changing to using GSS-SPNEGO by default, which was broken. sssd never had a regression in the first place, since its changes had nothing to do with GSS-SPNEGO. The confusion with sssd came from users who did not know that adcli is the program under the hood of realm, and thought that sssd had broken, when in reality it was adcli.

> 2) Does it need fixing for all the stable series where we updated adcli and
> (additionally) sssd?

cyrus-sasl2 is only broken in Bionic. Focal onward already have the patch and work fine.

Let me know if you have any more questions, happy to answer.

Thanks,
Matthew

On Tue, Dec 8, 2020 at 4:57 PM Matthew Ruffell wrote:
>
> Hello Eric and Lukasz,
>
> I have created new debdiffs for adcli. Please review and also sponsor one
> of them to -proposed.
>
> Since there are multiple versions of adcli floating around I made two
> debdiffs. Please choose the one most convenient / cleanest to apply.
>
> The first simply builds on top of 0.8.2-1ubuntu1 currently in -proposed, and
> is the version pull-lp-source pulls down. It simply adds the dependency on
> the fixed libsasl2-modules-gssapi-mit package with a greater than or equal
> to relationship.
>
> Use of this debdiff requires 0.8.2-1ubuntu2 to be deleted from the upload
> queue, and treated as 0.8.2-1ubuntu2 never existed.
> https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff
>
> Option two builds upon 0.8.2-1ubuntu2, and re-applies all of the --use-ldaps
> patches from the previous SRU which 0.8.2-1ubuntu2 reverts. It also adds the
> dependency on the fixed libsasl2-modules-gssapi-mit package with a greater
> than or equal to relationship.
>
> https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441873/+files/lp1906627_adcli_option_two.debdiff
>
> My preference is for option one, but use whatever is required. I only made
> both of these to lower round trip time due to timezones if you don't like
> the option one idea.
>
> Thanks,
> Matthew
>
> On Mon, Dec 7, 2020 at 3:25 PM Matthew Ruffell wrote:
> >
> > Hi Eric, Lukasz,
> >
> > Please review and potentially sponsor the cyrus-sasl2 debdiff attached
> > to LP1906627.
> >
> > [1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
> >
> > It fixes the root cause of the GSS-SPNEGO implementation being
> > incompatible with Microsoft's implementation in Active Directory.
> >
> > If you are still planning to re-release adcli and sssd to -security, then
> > you should also build cyrus-sasl2 in the same way:
> >
> > https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
> >
> > Again, I am sorry for causing the regression and these patches should fix
> > the underlying cause.
> >
> > Thanks,
> > Matthew

--
Mailing list: https://launchpad.net/~sts-sponsors
Post to : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help : https://help.launchpad.net/ListHelp
PROBLEM: Recent raid10 block discard patchset causes filesystem corruption on fstrim
# cat /sys/block/md0/md/mismatch_cnt
205324928

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 nvme1n1[1] nvme2n1[0]
      292836352 blocks super 1.2 2 near-copies [2/2] [UU]
      bitmap: 0/3 pages [0KB], 65536KB chunk

unused devices:

# cat /sys/block/md0/md/mismatch_cnt
205324928

Now, we need to take the raid10 array down, and perform a fsck on one disk at a time:

# umount /mnt
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

Let's do the first disk:

# mdadm --assemble /dev/md127 /dev/nvme1n1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
# mdadm --run /dev/md127
mdadm: started array /dev/md/lv-raid
# vgchange -a y /dev/VolGroup
  1 logical volume(s) in volume group "VolGroup" now active
# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/VolGroup/root: 11/6553600 files (0.0% non-contiguous), 557848/26214400 blocks
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127

The second disk:

# mdadm --assemble /dev/md127 /dev/nvme2n1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).
# mdadm --run /dev/md127
mdadm: started array /dev/md/lv-raid
# vgchange -a y /dev/VolGroup
  1 logical volume(s) in volume group "VolGroup" now active
# fsck.ext4 -n -f /dev/VolGroup/root
e2fsck 1.44.1 (24-Mar-2018)
Resize inode not valid. Recreate? no
Pass 1: Checking inodes, blocks, and sizes
Inode 7 has illegal block(s). Clear? no
Illegal indirect block (1714656753) in inode 7. IGNORED.
Error while iterating over blocks in inode 7: Illegal indirect block found
/dev/VolGroup/root: ** WARNING: Filesystem still has errors **
e2fsck: aborted
/dev/VolGroup/root: ** WARNING: Filesystem still has errors **
# vgchange -a n /dev/VolGroup
  0 logical volume(s) in volume group "VolGroup" now active
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127

There are no panics or anything in dmesg. The directory structure of the first disk is intact, but the second disk only has lost+found present.

I can confirm it is the patches listed at the top of the email, but I have not had an opportunity to bisect to find the exact root cause. I will do that once we confirm which Ubuntu stable kernels are affected and begin reverting the patches.

Let me know if you need any more details.

Thanks,
Matthew Ruffell
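For reference, the single-member fsck procedure used in the session above can be expressed as a loop. This is an editorial sketch rather than the exact commands run: the device names are the ones from this session, and RUN defaults to echo so the script only prints what it would do until RUN is set empty and it is run as root:

```shell
RUN="${RUN:-echo}"   # dry run by default; set RUN= (empty) to actually execute

for dev in /dev/nvme1n1 /dev/nvme2n1; do
    $RUN mdadm --assemble /dev/md127 "$dev"
    $RUN mdadm --run /dev/md127                 # start the degraded array
    $RUN vgchange -a y /dev/VolGroup
    $RUN fsck.ext4 -n -f /dev/VolGroup/root     # -n: report only, never fix
    $RUN vgchange -a n /dev/VolGroup
    $RUN mdadm --stop /dev/md127
done
```

The -n flag matters here: it keeps fsck from "repairing" a member whose mirror copy may still hold the intact data.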
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Hi Thimo,

Thank you for the very detailed bug report. I will start investigating this immediately.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
Re: Bug Triage - Friday 4th December
Hi Christian,

> Maybe when you go for adcli and sssd in LP #1868703 again - they might
> have their dependency to libsasl2-modules-gssapi-mit be versioned to
> be greater or equal the fixed cyrus_sasl2?

That is an excellent idea. I will do exactly that. I have prepared a new debdiff for adcli which adds a dependency on libsasl2-modules-gssapi-mit at the new upload version of 2.1.27~101-g0780600+dfsg-3ubuntu2.2.

https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

Thanks for suggesting!
Matthew

On Tue, Dec 8, 2020 at 12:28 AM Christian Ehrhardt wrote:
>
> On Mon, Dec 7, 2020 at 3:45 AM Matthew Ruffell wrote:
> >
> > ...
> > Again, I apologise for the regression, and things are on their way to
> > being fixed.
>
> Thanks for jumping on it once it was identified.
>
> One suggestion for the coming related uploads.
> Do you think it would make sense to ensure that the now-known-bad
> combinations of packages won't be allowed together?
> Maybe when you go for adcli and sssd in LP #1868703 again - they might
> have their dependency to libsasl2-modules-gssapi-mit be versioned to
> be greater or equal the fixed cyrus_sasl2?
>
> > [1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff
> > [2] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627

--
ubuntu-server mailing list
ubuntu-server@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server
More info: https://wiki.ubuntu.com/ServerTeam
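Concretely, Christian's suggestion would look something like this in adcli's debian/control stanza (an illustrative sketch, not the actual packaging; only the versioned libsasl2-modules-gssapi-mit entry is the point, the other fields are abbreviated):

```
Package: adcli
Depends: ${shlibs:Depends}, ${misc:Depends},
         libsasl2-modules-gssapi-mit (>= 2.1.27~101-g0780600+dfsg-3ubuntu2.2)
```

With a versioned relation like this, apt upgrades the SASL GSSAPI module alongside adcli instead of leaving the known-bad combination installed.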
Re: [Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
Hello Eric and Lukasz,

I have created new debdiffs for adcli. Please review and also sponsor one of them to -proposed.

Since there are multiple versions of adcli floating around I made two debdiffs. Please choose the one most convenient / cleanest to apply.

The first simply builds on top of 0.8.2-1ubuntu1 currently in -proposed, and is the version pull-lp-source pulls down. It simply adds the dependency on the fixed libsasl2-modules-gssapi-mit package with a greater than or equal to relationship.

Use of this debdiff requires 0.8.2-1ubuntu2 to be deleted from the upload queue, and treated as 0.8.2-1ubuntu2 never existed.

https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

Option two builds upon 0.8.2-1ubuntu2, and re-applies all of the --use-ldaps patches from the previous SRU which 0.8.2-1ubuntu2 reverts. It also adds the dependency on the fixed libsasl2-modules-gssapi-mit package with a greater than or equal to relationship.

https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441873/+files/lp1906627_adcli_option_two.debdiff

My preference is for option one, but use whatever is required. I only made both of these to lower round trip time due to timezones if you don't like the option one idea.

Thanks,
Matthew

On Mon, Dec 7, 2020 at 3:25 PM Matthew Ruffell wrote:
>
> Hi Eric, Lukasz,
>
> Please review and potentially sponsor the cyrus-sasl2 debdiff attached
> to LP1906627.
>
> [1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
>
> It fixes the root cause of the GSS-SPNEGO implementation being incompatible
> with Microsoft's implementation in Active Directory.
>
> If you are still planning to re-release adcli and sssd to -security, then you
> should also build cyrus-sasl2 in the same way:
>
> https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
>
> Again, I am sorry for causing the regression and these patches should fix the
> underlying cause.
>
> Thanks,
> Matthew

--
Mailing list: https://launchpad.net/~sts-sponsors
Post to : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help : https://help.launchpad.net/ListHelp
[Touch-packages] [Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is option two: a debdiff for adcli which builds on 0.8.2-1ubuntu2, re-introduces all of the --use-ldaps patches, and adds a versioned Depends on the fixed libsasl2-modules-gssapi-mit (greater than or equal). Use this if option one is a no go.

** Patch added: "debdiff for adcli on Bionic option two"
   https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441873/+files/lp1906627_adcli_option_two.debdiff

--
You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu.
https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression

Status in adcli package in Ubuntu: Fix Released
Status in cyrus-sasl2 package in Ubuntu: Fix Released
Status in adcli source package in Bionic: In Progress
Status in cyrus-sasl2 source package in Bionic: In Progress

Bug description:

  [Impact]

  A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a regression for some users when attempting to join an Active Directory realm. adcli introduced a default behaviour change, moving from GSS-API to GSS-SPNEGO as the default channel encryption algorithm.

  adcli uses the GSS-SPNEGO implementation from libsasl2-modules-gssapi-mit, a part of cyrus-sasl2. The implementation seems to have some compatibility issues with particular configurations of Active Directory on recent Windows Server systems. In particular, adcli sends an ldap query to the domain controller, which responds with a tcp ack, but never returns an ldap response. The connection just hangs at this point and no more traffic is sent.
  You can see it in the packet trace below:

  https://paste.ubuntu.com/p/WRnnRMGBPm/

  On Focal, where the implementation of GSS-SPNEGO is working, we see a full exchange, and adcli works as expected:

  https://paste.ubuntu.com/p/8668pJrr2m/

  The fix is to not assume use of confidentiality and integrity modes, and instead use the flags negotiated by GSS-API during the initial handshake, as required by Microsoft's implementation.

  [Testcase]

  You will need to set up a Windows Server 2019 system, install and configure Active Directory, enable LDAP extensions, configure LDAPS, and import the AD SSL certificate to the Ubuntu client. Create some users in Active Directory.

  On the Ubuntu client, set up /etc/hosts with the hostname of the Windows Server machine, if your system isn't configured for AD DNS.

  From there, install adcli 0.8.2-1 from -release:

  $ sudo apt install adcli

  Set up a packet trace with tcpdump:

  $ sudo tcpdump -i any port '(389 or 3268 or 636 or 3269)'

  Next, join the AD realm using the normal GSS-API:

  # adcli join --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL

  You will be prompted for Administrator's password. The output should look like the below:

  https://paste.ubuntu.com/p/NWHGQn746D/

  Next, enable -proposed, and install adcli 0.8.2-1ubuntu1, which caused the regression. Repeat the above steps. Now you should see the connection hang.

  https://paste.ubuntu.com/p/WRnnRMGBPm/

  Finally, install the fixed cyrus-sasl2 package, which is available from the below ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp1906627-test

  $ sudo add-apt-repository ppa:mruffell/lp1906627-test
  $ sudo apt-get update
  $ sudo apt install libsasl2-2 libsasl2-modules libsasl2-modules-db libsasl2-modules-gssapi-mit

  Repeat the steps.
GSS-SPNEGO should be working as intended, and you should get output like below: https://paste.ubuntu.com/p/W5cJNGvCsx/ [Where problems could occur] Since we are changing the implementation of GSS-SPNEGO, and cyrus- sasl2 is the library which provides it, we can potentially break any package which depends on libsasl2-modules-gssapi-mit for GSS-SPNEGO. $ apt rdepends libsasl2-modules-gssapi-mit libsasl2-modules-gssapi-mit Reverse Depends: |Suggests: ldap-utils Depends: adcli Conflicts: libsasl2-modules-gssapi-heimdal |Suggests: libsasl2-modules Conflicts: libsasl2-modules-gssapi-heimdal |Recommends: sssd-krb5-common |Suggests: slapd |Suggests: libsasl2-modules |Suggests: ldap-utils |Depends: msktutil Conflicts: libsasl2-modules-gssapi-heimdal |Depends: libapache2-mod-webauthldap Depends: freeipa-server Depends: freeipa-client Depends: adcli Depends: 389-ds-base |Recommends: sssd-krb5-common |Suggests: slapd |Suggests: libsasl2-modules While this SRU makes cyrus-sasl2 work with Microsoft implementations of GSS-SPNEGO, which will be the more common usecase, it may change the behaviour when connecting to a MIT krb5 server with the GSS-SPNEGO protocol, as krb5 assumes use of confidentiality and integrity
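The essence of the fix described in the bug report above — use the security
layers actually negotiated by the GSS-API handshake instead of assuming
confidentiality and integrity — can be reduced to a small sketch. The
following Python snippet is purely illustrative (the real fix lives in
cyrus-sasl2's C plugin); the constants and function names are invented for
the example, and the values only conceptually mirror flags like
GSS_C_CONF_FLAG / GSS_C_INTEG_FLAG:

```python
# Illustrative sketch, NOT the cyrus-sasl2 source. The regression boiled
# down to the client *assuming* the confidentiality and integrity security
# layers, rather than using what the GSS-API handshake negotiated.

CONF = 0x1    # confidentiality layer negotiated (think GSS_C_CONF_FLAG)
INTEG = 0x2   # integrity layer negotiated (think GSS_C_INTEG_FLAG)

def old_behaviour(negotiated_flags: int) -> int:
    """Buggy: always claim conf+integ, ignoring the handshake result."""
    return CONF | INTEG

def fixed_behaviour(negotiated_flags: int) -> int:
    """Fixed: advertise only the layers GSS-API actually negotiated."""
    return negotiated_flags & (CONF | INTEG)

# Against a domain controller that negotiated neither layer, the old
# behaviour disagrees with the server and the LDAP exchange stalls; the
# fixed behaviour stays consistent with the handshake.
print(old_behaviour(0))               # 3 -> mismatch with the server
print(fixed_behaviour(0))             # 0 -> matches the server
print(fixed_behaviour(CONF | INTEG))  # 3 when both layers were negotiated
```

Broadly speaking, the actual patches consult the flags produced during
GSS-API context establishment to make this decision, rather than
hard-coding the layers.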
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is option one: a debdiff for adcli, which builds on 0.8.2-1ubuntu1
and simply adds a versioned Depends on the fixed libsasl2-modules-gssapi-mit
(greater than or equal to the fixed version). This will require the
0.8.2-1ubuntu2 package in the -unapproved queue to be deleted.

** Patch added: "debdiff for adcli on Bionic"
   https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441872/+files/lp1906627_adcli_option_one.debdiff

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1906627

Title:
  GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active
  Directory, causing recent adcli regression

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
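The versioned dependency described in the option-one debdiff could be
expressed in debian/control roughly as below. This is a sketch under
assumptions: the exact fixed version of libsasl2-modules-gssapi-mit is not
stated in this message, so a placeholder is used instead of a real version
number.

```
Package: adcli
Depends: libsasl2-modules-gssapi-mit (>= <fixed version>),
         ${shlibs:Depends}, ${misc:Depends}
```

A versioned Depends like this makes apt pull in the fixed SASL plugin
whenever the patched adcli is installed, so the new GSS-SPNEGO default can
never run against the broken plugin.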
Re: Bug Triage - Friday 4th December
Status update:

- There is a new build of adcli, version 0.8.2-1ubuntu2, which reverts the
  patches introduced in the previous build, on the -unapproved queue in
  -proposed. This is likely to be released to fix anyone using the faulty
  0.8.2-1ubuntu1 package.

- As mentioned in previous messages, I have determined the root cause of
  the failure to be an incompatible implementation of GSS-SPNEGO in
  cyrus-sasl2, and I have created a debdiff which fixes the problem [1].

- I have added an SRU template for cyrus-sasl2 in [2], and asked for the
  changes to be sponsored and placed into -proposed.

This regression will be resolved when either the cyrus-sasl2 fixes have
made their way to -updates, likely in a week's time, or when the adcli
package with the reverted patches is released. Once the fixed cyrus-sasl2
is released, we will re-perform verification on the changes to adcli and
sssd in LP #1868703, and hopefully go for release again.

Again, I apologise for the regression; things are on their way to being
fixed.

Thanks,
Matthew

[1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff
[2] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627

On Sat, Dec 5, 2020 at 3:32 PM Matthew Ruffell wrote:
>
> Status update:
>
> - all recent releases of sssd and adcli have been pulled from -updates
>   and -security, and placed back into -proposed.
>
> - I made a debdiff to revert the problematic patches for adcli in Bionic,
>   Lukasz has built it in
>   https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
>
> - Currently waiting for adcli - 0.8.2-1ubuntu2 to be bin-synced from the
>   above ppa to bionic-proposed for testing.
>
> - We need to release adcli - 0.8.2-1ubuntu2 to -updates and -security
>   after.
>
> - I have written to customers and confirmed the regression to be limited
>   to adcli on Bionic, and given them instructions to downgrade to the
>   version in the -release pocket.
>
> Again, I am sorry for causing the regression. On Monday I will begin
> fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO
> implementation.
>
> Thanks,
> Matthew
>
> On Sat, Dec 5, 2020 at 12:33 PM Matthew Ruffell wrote:
> >
> > Hi everyone,
> >
> > Firstly, I deeply apologise for causing the regression.
> >
> > Even with three separate people testing the test packages and the
> > packages in -proposed, the failure still went unnoticed. I should have
> > considered the impacts of changing the default behaviour of adcli a
> > little more deeply than treating it like a normal SRU.
> >
> > Here are the facts:
> >
> > The failure is limited to adcli, version 0.8.2-1ubuntu1 on Bionic. At
> > the time of writing, it is still in the archive. To archive admins,
> > this needs to be pulled.
> >
> > adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in
> > Groovy and 0.9.0-1ubuntu2 in Hirsute are not affected.
> >
> > sssd 1.16.1-1ubuntu1.7 in Bionic, and 2.2.3-3ubuntu0.1 in Focal are
> > not affected.
> >
> > Bug Reports:
> >
> > There are two launchpad bugs open:
> >
> > LP #1906627 "adcli fails, can't contact LDAP server"
> > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
> >
> > LP #1906673 "Realm join hangs"
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673
> >
> > Customer Cases:
> >
> > SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain"
> > https://canonical.my.salesforce.com/5004K03u9EW
> >
> > SF 00299039 "Regression Issue due to
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673"
> > https://canonical.my.salesforce.com/5004K03uAkL
> >
> > Root Cause:
> >
> > The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD
> > requirements (ADV190023)"
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703
> >
> > introduced two changes for adcli on Bionic. The first was to change
> > from GSS-API to GSS-SPNEGO, and the second was to implement support
> > for the --use-ldaps flag.
> >
> > I built an upstream master of adcli, and it still fails on Ubuntu.
> > This indicates that the failure is not actually in the adcli package.
> > adcli does not implement GSS-SPNEGO; it is linked in from the
> > libsasl2-modules-gssapi-mit package, which is a part of cyrus-sasl2.
> >
> > I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic,
> > and it works with the proble
Re: [Sts-sponsors] sssd/adcli regression after last upload
>
> Again, I am sorry for causing the regression. On Monday I will begin
> fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO
> implementation.
>
> Thanks,
> Matthew
>
> On Sat, Dec 5, 2020 at 12:23 PM Sergio Durigan Junior wrote:
> >
> > On Friday, December 04 2020, Matthew Ruffell wrote:
> >
> > > Hi everyone,
> > >
> > > Firstly, I deeply apologise for causing the regression.
> >
> > Thanks for working on this and for the detailed analysis, Matthew.
> >
> > --
> > Sergio
> > GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14

--
Mailing list: https://launchpad.net/~sts-sponsors
Post to     : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help   : https://help.launchpad.net/ListHelp
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
** Tags added: sts-sponsor -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Sts-sponsors] Please review and consider sponsoring LP #1906627 for cyrus-sasl2, which fixes adcli regression
Hi Eric, Lukasz,

Please review and potentially sponsor the cyrus-sasl2 debdiff attached to
LP #1906627 [1].

[1] https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627

It fixes the root cause of the GSS-SPNEGO implementation being incompatible
with Microsoft's implementation in Active Directory.

If you are still planning to re-release adcli and sssd to -security, then
you should also build cyrus-sasl2 in the same way:
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages

Again, I am sorry for causing the regression; these patches should fix the
underlying cause.

Thanks,
Matthew
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is a debdiff for cyrus-sasl2 on Bionic, which resolves the incompatibilities of the GSS-SPNEGO implementation with the one in Active Directory. ** Patch added: "cyrus-sasl2 debdiff for Bionic" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Touch-packages] [Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
Attached is a debdiff for cyrus-sasl2 on Bionic, which resolves the incompatibilities of the GSS-SPNEGO implementation with the one in Active Directory.

** Patch added: "cyrus-sasl2 debdiff for Bionic" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441530/+files/lp1906627_cyrus_sasl2_bionic.debdiff

-- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu. https://bugs.launchpad.net/bugs/1906627

Title: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression

Status in adcli package in Ubuntu: Fix Released
Status in cyrus-sasl2 package in Ubuntu: Fix Released
Status in adcli source package in Bionic: In Progress
Status in cyrus-sasl2 source package in Bionic: In Progress

Bug description:

[Impact]

A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a regression for some users when attempting to join an Active Directory realm. adcli introduced a default behaviour change, moving from GSS-API to GSS-SPNEGO as the default channel encryption algorithm.

adcli uses the GSS-SPNEGO implementation from libsasl2-modules-gssapi-mit, a part of cyrus-sasl2. The implementation seems to have some compatibility issues with particular configurations of Active Directory on recent Windows Server systems.

Particularly, adcli sends an ldap query to the domain controller, which responds with a tcp ack, but never returns an ldap response. The connection just hangs at this point and no more traffic is sent.
You can see it on the packet trace below: https://paste.ubuntu.com/p/WRnnRMGBPm/

On Focal, where the implementation of GSS-SPNEGO is working, we see a full exchange, and adcli works as expected: https://paste.ubuntu.com/p/8668pJrr2m/

The fix is to not assume use of confidentiality and integrity modes, and instead use the flags negotiated by GSS-API during the initial handshake, as required by Microsoft's implementation.

[Testcase]

You will need to set up a Windows Server 2019 system, install and configure Active Directory, enable LDAP extensions, configure LDAPS, and import the AD SSL certificate to the Ubuntu client. Create some users in Active Directory. On the Ubuntu client, set up /etc/hosts with the hostname of the Windows Server machine if your system isn't configured for AD DNS.

From there, install adcli 0.8.2-1 from -release:

$ sudo apt install adcli

Set up a packet trace with tcpdump:

$ sudo tcpdump -i any port '(389 or 3268 or 636 or 3269)'

Next, join the AD realm using the normal GSS-API:

# adcli join --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL

You will be prompted for Administrator's password. The output should look like the below: https://paste.ubuntu.com/p/NWHGQn746D/

Next, enable -proposed and install adcli 0.8.2-1ubuntu1, which caused the regression. Repeat the above steps. Now you should see the connection hang: https://paste.ubuntu.com/p/WRnnRMGBPm/

Finally, install the fixed cyrus-sasl2 package, which is available from the below ppa: https://launchpad.net/~mruffell/+archive/ubuntu/lp1906627-test

$ sudo add-apt-repository ppa:mruffell/lp1906627-test
$ sudo apt-get update
$ sudo apt install libsasl2-2 libsasl2-modules libsasl2-modules-db libsasl2-modules-gssapi-mit

Repeat the steps.
GSS-SPNEGO should be working as intended, and you should get output like below: https://paste.ubuntu.com/p/W5cJNGvCsx/

[Where problems could occur]

Since we are changing the implementation of GSS-SPNEGO, and cyrus-sasl2 is the library which provides it, we can potentially break any package which depends on libsasl2-modules-gssapi-mit for GSS-SPNEGO.

$ apt rdepends libsasl2-modules-gssapi-mit
libsasl2-modules-gssapi-mit
Reverse Depends:
 |Suggests: ldap-utils
  Depends: adcli
  Conflicts: libsasl2-modules-gssapi-heimdal
 |Suggests: libsasl2-modules
  Conflicts: libsasl2-modules-gssapi-heimdal
 |Recommends: sssd-krb5-common
 |Suggests: slapd
 |Suggests: libsasl2-modules
 |Suggests: ldap-utils
 |Depends: msktutil
  Conflicts: libsasl2-modules-gssapi-heimdal
 |Depends: libapache2-mod-webauthldap
  Depends: freeipa-server
  Depends: freeipa-client
  Depends: adcli
  Depends: 389-ds-base
 |Recommends: sssd-krb5-common
 |Suggests: slapd
 |Suggests: libsasl2-modules

While this SRU makes cyrus-sasl2 work with Microsoft implementations of GSS-SPNEGO, which is the more common use case, it may change the behaviour when connecting to an MIT krb5 server with the GSS-SPNEGO protocol, as krb5 assumes use of confidentiality and integrity modes. This shouldn't be a problem, as the krb5 implementation signals its intentions by setting the correct flags
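The fix described in the bug (honouring the flags GSS-API actually negotiated instead of assuming confidentiality and integrity are always granted) can be sketched as follows. This is an illustrative shell sketch, not the actual cyrus-sasl2 code; the flag bit values follow RFC 2744, and the function name is invented:

```shell
# Illustration only: pick the SASL security layer from the flags that
# GSS-API negotiated during the handshake, rather than assuming
# confidentiality + integrity are available. Bit values per RFC 2744.
GSS_C_CONF_FLAG=16    # confidentiality (encryption) granted
GSS_C_INTEG_FLAG=32   # integrity (MIC) granted

select_layer() {
    flags=$1
    if [ $(( flags & GSS_C_CONF_FLAG )) -ne 0 ]; then
        echo "confidentiality"
    elif [ $(( flags & GSS_C_INTEG_FLAG )) -ne 0 ]; then
        echo "integrity"
    else
        # An AD domain controller may grant neither flag for SASL
        # GSS-SPNEGO over LDAP; code that assumes both here
        # desynchronises the connection, matching the observed hang.
        echo "none"
    fi
}

select_layer $(( GSS_C_CONF_FLAG | GSS_C_INTEG_FLAG ))   # prints: confidentiality
select_layer 0                                           # prints: none
```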
[Bug 1906627] Re: GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression
** Summary changed: - adcli fails, can't contact LDAP server + GSS-SPNEGO implementation in cyrus-sasl2 is incompatible with Active Directory, causing recent adcli regression ** Description changed: - Package: adcli - Version: 0.8.2-1ubuntu1 - Release: Ubuntu 18.04 LTS + [Impact] - When trying to join the domain with this new version of adcli, it gets - to the point of 'Using GSS-SPNEGO for SASL bind' and then it will not do - anything for 10 minutes. It will then fail, complaining it can't reach - the LDAP server. + A recent release of adcli 0.8.2-1ubuntu1 to bionic-updates caused a + regression for some users when attempting to join a Active Directory + realm. adcli introduced a default behaviour change, moving from GSS-API + to GSS-SPNEGO as the default channel encryption algorithm. - Logs: - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com - Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com - Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind - Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind - Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup domain short name: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup domain short name: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 - Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab - Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server - Dec 03 01:55:27 example001.domain.com realmd[6419]: process exited: 6459 - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain - Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain + adcli uses the GSS-SPNEGO implementation from libsasl2-modules-gssapi- + mit, a part of cyrus-sasl2. The implementation seems to have some + compatibility issues with particular configurations of Active Directory + on recent Windows Server systems. - On the network level, adcli gets to the point of send an ldap query to - the domain controller and the domain controller returns an ack tcp - packet, but then there is no more traffic between the domain controller - and the server except for ntp packets until it fails. + Particularly, adcli sends a ldap query to the domain controller, which + responds with a tcp ack, but never returns a ldap response. The + connection just hangs at this point and no more traffic is sent. - The domain controller traffic also shows that it is receiving the ldap - query packet from the server but it never sends a
Re: Bug Triage - Friday 4th December
Hi everyone,

Firstly, I deeply apologise for causing the regression. Even with three separate people testing the test packages and the packages in -proposed, the failure still went unnoticed. I should have considered the impacts of changing the default behaviour of adcli a little more deeply than treating it like a normal SRU.

Here are the facts:

The failure is limited to adcli, version 0.8.2-1ubuntu1 on Bionic. At the time of writing, it is still in the archive. To archive admins, this needs to be pulled.

adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in Groovy and 0.9.0-1ubuntu2 in Hirsute are not affected. sssd 1.16.1-1ubuntu1.7 in Bionic, and 2.2.3-3ubuntu0.1 in Focal, are not affected.

Bug Reports:

There are two Launchpad bugs open:

LP #1906627 "adcli fails, can't contact LDAP server" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
LP #1906673 "Realm join hangs" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673

Customer Cases:

SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain" https://canonical.my.salesforce.com/5004K03u9EW
SF 00299039 "Regression Issue due to https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673" https://canonical.my.salesforce.com/5004K03uAkL

Root Cause:

The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD requirements (ADV190023)" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703 introduced two changes for adcli on Bionic. The first was to change from GSS-API to GSS-SPNEGO, and the second was to implement support for the flag --use-ldaps.

I built an upstream master of adcli, and it still fails on Ubuntu. This indicates that the failure is not actually in the adcli package. adcli does not implement GSS-SPNEGO; it is linked in from the libsasl2-modules-gssapi-mit package, which is a part of cyrus-sasl2.

I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic, and it works with the problematic adcli package.
The root cause is that the implementation of GSS-SPNEGO in cyrus-sasl2 on Bionic is broken, and has never worked. There are more details about the commits the cyrus-sasl2 package in Bionic is missing in comment #5 of LP #1906627: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/comments/5

Steps taken yesterday:

I added regression-update to LP #1906627, and I pinged ubuntu-archive in #ubuntu-release with these details, but they seem to have been lost in the noise. Located root cause to cyrus-sasl2 on Bionic.

Next steps:

We don't need to revert any changes for adcli or sssd on Focal onward. We don't need to revert any changes on sssd on Bionic. We need to push a new adcli into Bionic with the recent patches reverted. We need to fix the GSS-SPNEGO implementation in cyrus-sasl2 in Bionic. We need to re-release all the SRUs from LP #1868703 after some very thorough testing and validation.

Again, I am deeply sorry for causing this regression. I will fix it, starting with getting adcli removed from the Bionic archive.

Thanks,
Matthew

On Fri, Dec 4, 2020 at 10:40 PM Lukasz Zemczak wrote: > > Hey! > > I prefer broken upgrades to get pulled anyway. Besides, packages are > updated by unattended-upgrades in up-to 24 hours, so some users might > have not gotten it yet. And there's also those not using > undattended-upgrades. Let me demote it back to -proposed from -updates > as well. > > On Fri, 4 Dec 2020 at 10:00, Christian Ehrhardt > wrote: > > > > On Fri, Dec 4, 2020 at 9:49 AM Lukasz Zemczak > > wrote: > > > > > > Hey Christian! > > > > > > This sounds bad indeed, let's see what Matthew has to say. In the > > > meantime I have backed it out from both bionic-security and > > > focal-security. > > > > Thank you > > > > > Should we also consider dropping it from -updates? > > > > Well, compared to other cases in this case we don't even yet have a > > "ok this is a mess, but this is how you can resolve it afterwards to > > work again".
> > Therefore I think pulling it from -updates as well makes sense until > > Matthew had time to look at it in detail and give all-clear (or not). > > > > P.S.: you slightly raced vorlon who had a different assessment > > [09:30] cpaelzer: well, by this point almost everyone will > > have picked it up from security via unattended-upgrades so there's not > > much point > > But having it pulled for now is on the safe-side and we can re-instate > > it at any time once we know more. > > > > > Cheers, > > > > > > On Fri, 4 Dec 2020 at 09:01, Christian Ehrhardt > > > wrote: > > > > > > > > I was looking at 16 recently touched bugs. Of these a few needed a > > > > comment or > > > > task update but not a lot of work. Worth to mention are two of them. > > > > > > > > First we've had "one more" kind of conflicting mysql packages from > > > > third party breaking install/upgrade of the one provided by Ubuntu. I > > > > dupped it onto bug 1771630 which is our single place to unite all > > > > those. > > > > > > > >
Re: Bug Triage - Friday 4th December
Status update:

- all recent releases of sssd and adcli have been pulled from -updates and -security, and placed back into -proposed.
- I made a debdiff to revert the problematic patches for adcli in Bionic; Lukasz has built it in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
- Currently waiting for adcli 0.8.2-1ubuntu2 to be bin-synced from the above ppa to bionic-proposed for testing.
- We need to release adcli 0.8.2-1ubuntu2 to -updates and -security after.
- I have written to customers and confirmed the regression to be limited to adcli on Bionic, and given them instructions to downgrade to the version in the -release pocket.

Again, I am sorry for causing the regression. On Monday I will begin fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO implementation.

Thanks,
Matthew

On Sat, Dec 5, 2020 at 12:33 PM Matthew Ruffell wrote: > > Hi everyone, > > Firstly, I deeply apologise for causing the regression. > > Even with three separate people testing the test packages and the packages in > -proposed, the failure still went unnoticed. I should have considered > the impacts > of changing the default behaviour of adcli a little more deeply than treating > it > like a normal SRU. > > Here are the facts: > > The failure is limited to adcli, version 0.8.2-1ubuntu1 on Bionic. At the time > of writing, it is still in the archive. To archive admins, this needs > to be pulled. > > adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in Groovy and > 0.9.0-1ubuntu2 in Hirsute are not affected. > > sssd 1.16.1-1ubuntu1.7 in Bionic, and 2.2.3-3ubuntu0.1 in Focal are > not affected.
> > Bug Reports: > > There are two launchpad bugs open: > > LP #1906627 "adcli fails, can't contact LDAP server" > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627 > > LP #1906673 "Realm join hangs" > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673 > > Customer Cases: > > SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain" > https://canonical.my.salesforce.com/5004K03u9EW > > SF 00299039 "Regression Issue due to > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673; > https://canonical.my.salesforce.com/5004K03uAkL > > Root Cause: > > The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD > requirements (ADV190023)" > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703 > > introduced two changes for adcli on Bionic. The first, was to change from > GSS-API to GSS-SPNEGO, and the second was to implement support for the flag > --use-ldaps. > > I built a upstream master of adcli, and it still fails on Ubuntu. This > indicates > that the failure is not actually in the adcli package. adcli does not > implement > GSS-SPNEGO, it is linked in from the libsasl2-modules-gssapi-mit package, > which is a part of cyrus-sasl2. > > I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic, and it > works with the problematic adcli package. > > The root cause is that the implementation of GSS-SPNEGO in cyrus-sasl2 on > Bionic is broken, and has never worked. > > There is more details about commits which the cyrus-sasl2 package in Bionic is > missing in comment #5 in LP #1906627. > > https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/comments/5 > > Steps taken yesterday: > > I added regression-update to LP #1906627, and I pinged ubuntu-archive in > #ubuntu-release with these details, but they seem to have been lost in the > noise. > > Located root cause to cryus-sasl2 on Bionic. > > Next steps: > > We don't need to revert any changes for adcli or sssd on Focal onward. 
> > We don't need to revert any changes on sssd on Bionic. > > We need to push a new adcli into Bionic with the recent patches reverted. > > We need to fix the GSS-SPNEGO implementation in cyrus-sasl2 in Bionic. > > We need to re-release all the SRUs from LP #1868703 after some very thorough > testing and validation. > > Again, I am deeply sorry for causing this regression. I will fix it, starting > with getting adcli removed from the Bionic archive. > > Thanks, > Matthew > > On Fri, Dec 4, 2020 at 10:40 PM Lukasz Zemczak > wrote: > > > > Hey! > > > > I prefer broken upgrades to get pulled anyway. Besides, packages are > > updated by unattended-upgrades in up-to 24 hours, so some users might > > have not gotten it yet. And there's also those not using > > undattended-upgrades. Let me demote it back to -proposed from -updates > > as well. > > > > On Fri, 4 Dec 2020 at 10:00, Christian Ehrhardt > > wrote: > &
Re: [Sts-sponsors] sssd/adcli regression after last upload
Status update:

- all recent releases of sssd and adcli have been pulled from -updates and -security, and placed back into -proposed.
- I made a debdiff to revert the problematic patches for adcli in Bionic; Lukasz has built it in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4336/+packages
- Currently waiting for adcli 0.8.2-1ubuntu2 to be bin-synced from the above ppa to bionic-proposed for testing.
- We need to release adcli 0.8.2-1ubuntu2 to -updates and -security after.
- I have written to customers and confirmed the regression to be limited to adcli on Bionic, and given them instructions to downgrade to the version in the -release pocket.

Again, I am sorry for causing the regression. On Monday I will begin fixing up cyrus-sasl2 on Bionic to have a working GSS-SPNEGO implementation.

Thanks,
Matthew

On Sat, Dec 5, 2020 at 12:23 PM Sergio Durigan Junior wrote: > > On Friday, December 04 2020, Matthew Ruffell wrote: > > > Hi everyone, > > > > Firstly, I deeply apologise for causing the regression. > > Thanks for working on this and for the detailed analysis, Matthew. > > -- > Sergio > GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14 -- Mailing list: https://launchpad.net/~sts-sponsors Post to : sts-sponsors@lists.launchpad.net Unsubscribe : https://launchpad.net/~sts-sponsors More help : https://help.launchpad.net/ListHelp
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu) Status: Confirmed => Fix Released -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Touch-packages] [Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu) Status: Confirmed => Fix Released -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server Status in adcli package in Ubuntu: Fix Released Status in cyrus-sasl2 package in Ubuntu: Fix Released Status in adcli source package in Bionic: In Progress Status in cyrus-sasl2 source package in Bionic: In Progress Bug description: Package: adcli Version: 0.8.2-1ubuntu1 Release: Ubuntu 18.04 LTS When trying to join the domain with this new version of adcli, it gets to the point of 'Using GSS-SPNEGO for SASL bind' and then it will not do anything for 10 minutes. It will then fail, complaining it can't reach the LDAP server. Logs: Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: process exited: 6459 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain

On the network level, adcli gets to the point of sending an ldap query to the domain controller, and the domain controller returns a tcp ack packet, but then there is no more traffic between the domain controller and the server except for ntp packets until it fails.

The domain controller traffic also shows that it is receiving the ldap query packet from the server, but it never sends a reply, and there is no log in directory services regarding the query. We also couldn't find anything in procmon regarding this query either.

Workaround/Fix: Downgrading the adcli package back to version 0.8.2-1 fixes the issue and domain join works properly again.

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions --
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu Bionic) Status: Confirmed => In Progress ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Importance: Undecided => Medium ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Assignee: (unassigned) => Matthew Ruffell (mruffell) -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Touch-packages] [Bug 1906627] Re: adcli fails, can't contact LDAP server
** Changed in: cyrus-sasl2 (Ubuntu Bionic) Status: Confirmed => In Progress ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Importance: Undecided => Medium ** Changed in: cyrus-sasl2 (Ubuntu Bionic) Assignee: (unassigned) => Matthew Ruffell (mruffell) -- You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to cyrus-sasl2 in Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server Status in adcli package in Ubuntu: Fix Released Status in cyrus-sasl2 package in Ubuntu: Confirmed Status in adcli source package in Bionic: In Progress Status in cyrus-sasl2 source package in Bionic: In Progress Bug description: Package: adcli Version: 0.8.2-1ubuntu1 Release: Ubuntu 18.04 LTS When trying to join the domain with this new version of adcli, it gets to the point of 'Using GSS-SPNEGO for SASL bind' and then it will not do anything for 10 minutes. It will then fail, complaining it can't reach the LDAP server. Logs: Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Authenticated as user: domain-join-acco...@domain.com Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com realmd[6419]: * Using GSS-SPNEGO for SASL bind Dec 03 01:39:50 example001.domain.com adcli[6459]: GSSAPI client step 1 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup domain short name: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using fully qualified name: example001.domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain name: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using computer account name: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using domain realm: domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * Calculated computer account name from fqdn: EXAMPLE001 Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * With user principal: host/example001.domain@domain.com Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Generated 120 character computer password Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: * Using keytab: FILE:/etc/krb5.keytab Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: ! 
Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: adcli: joining domain domain.com failed: Couldn't lookup computer account: EXAMPLE001$: Can't contact LDAP server Dec 03 01:55:27 example001.domain.com realmd[6419]: process exited: 6459 Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain Dec 03 01:55:27 example001.domain.com realmd[6419]: ! Failed to join the domain On the network level, adcli gets to the point of sending an LDAP query to the domain controller, and the domain controller returns a TCP ACK packet, but then there is no more traffic between the domain controller and the server except for NTP packets until it fails. The domain controller traffic also shows that it is receiving the LDAP query packet from the server, but it never sends a reply, and there is no log in directory services regarding the query. We also couldn't find anything in procmon regarding this query either. Workaround/Fix: Downgrading the adcli package back to version 0.8.2-1 fixes the issues and domain join works properly again. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions --
[Bug 1906673] Re: Realm join hangs
*** This bug is a duplicate of bug 1906627 *** https://bugs.launchpad.net/bugs/1906627 ** This bug has been marked a duplicate of bug 1906627 adcli fails, can't contact LDAP server -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906673 Title: Realm join hangs To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1906627] Re: adcli fails, can't contact LDAP server
Attached is a debdiff to revert the changes we made to adcli to restore functionality to GSS-API. ** Patch added: "Debdiff for adcli on Bionic" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+attachment/5441133/+files/lp1906627_adcli_bionic.debdiff -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
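For anyone who wants to verify the revert locally before it lands in the archive, a test build with the attached debdiff applied might look like the following sketch. The debdiff filename is the one attached above; the unpacked directory name and the presence of deb-src entries for bionic are assumptions.

```shell
# Hedged sketch: rebuild Bionic's adcli with the attached debdiff applied.
# Assumes deb-src lines for bionic are enabled in sources.list.
sudo apt install -y devscripts build-essential
apt-get source adcli                          # unpacks the bionic source tree
sudo apt-get build-dep -y adcli
cd adcli-0.8.2                                # directory name may differ
patch -p1 < ../lp1906627_adcli_bionic.debdiff
debuild -us -uc                               # unsigned local build
sudo dpkg -i ../adcli_*.deb
```

After installing, retry the realm/adcli join that previously hung to confirm the revert restores the old GSS-API behaviour.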
Re: [Sts-sponsors] sssd/adcli regression after last upload
Hi everyone,

Firstly, I deeply apologise for causing the regression. Even with three separate people testing the test packages and the packages in -proposed, the failure still went unnoticed. I should have considered the impact of changing the default behaviour of adcli more deeply, rather than treating it like a normal SRU.

Here are the facts: the failure is limited to adcli version 0.8.2-1ubuntu1 on Bionic. At the time of writing, it is still in the archive. To archive admins: this needs to be pulled. adcli versions 0.9.0-1ubuntu0.20.04.1 in Focal, 0.9.0-1ubuntu1.2 in Groovy and 0.9.0-1ubuntu2 in Hirsute are not affected. sssd 1.16.1-1ubuntu1.7 in Bionic and 2.2.3-3ubuntu0.1 in Focal are not affected.

Bug Reports: There are two Launchpad bugs open:
LP #1906627 "adcli fails, can't contact LDAP server" https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627
LP #1906673 "Realm join hangs" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673

Customer Cases:
SF 00298839 "Ubuntu Client Not Joining the Nasdaq AD Domain" https://canonical.my.salesforce.com/5004K03u9EW
SF 00299039 "Regression Issue due to https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673; https://canonical.my.salesforce.com/5004K03uAkL

Root Cause: The recent SRU in LP #1868703 "Support "ad_use_ldaps" flag for new AD requirements (ADV190023)" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703 introduced two changes for adcli on Bionic. The first was to change from GSS-API to GSS-SPNEGO, and the second was to implement support for the --use-ldaps flag. I built an upstream master of adcli, and it still fails on Ubuntu. This indicates that the failure is not actually in the adcli package. adcli does not implement GSS-SPNEGO; it is linked in from the libsasl2-modules-gssapi-mit package, which is part of cyrus-sasl2. I built the source of cyrus-sasl2 2.1.27+dfsg-2 from Focal on Bionic, and it works with the problematic adcli package.

The root cause is that the implementation of GSS-SPNEGO in cyrus-sasl2 on Bionic is broken, and has never worked. There are more details about the commits the Bionic cyrus-sasl2 package is missing in comment #5 of LP #1906627: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/comments/5

Steps taken yesterday: I added regression-update to LP #1906627, and I pinged ubuntu-archive in #ubuntu-release with these details, but they seem to have been lost in the noise. Located the root cause in cyrus-sasl2 on Bionic.

Next steps:
We don't need to revert any changes for adcli or sssd on Focal onward.
We don't need to revert any changes on sssd on Bionic.
We need to push a new adcli into Bionic with the recent patches reverted.
We need to fix the GSS-SPNEGO implementation in cyrus-sasl2 in Bionic.
We need to re-release all the SRUs from LP #1868703 after some very thorough testing and validation.

Again, I am deeply sorry for causing this regression. I will fix it, starting with getting adcli removed from the Bionic archive.

Thanks,
Matthew

On Sat, Dec 5, 2020 at 10:37 AM Jamie Strandboge wrote:
> Looping in security@
> On Fri, 04 Dec 2020, Sergio Durigan Junior wrote:
> > Hi Matthew,
> >
> > How are things? I'm writing to you because the last upload to sssd/adcli introduced a regression that is causing "realm join" to hang. The bug in question is this one:
> >
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1906673
> >
> > There is also a SalesForce case opened from AWS:
> >
> > https://canonical.my.salesforce.com/5004K03uAkLQAU
> >
> > (I don't have access to it, but cnewcomer said it's basically the same issue, but that AWS is actually reporting it against adcli).
> >
> > I am not entirely sure whether this bug affects both sssd and adcli, or just one of them. It is possible that this is just affecting adcli, based on input from Tobias Karnat, but we have to investigate this further.
> >
> > This regression was introduced because of the work done here:
> >
> > https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703
> >
> > Lukasz (sil2100) has already pulled the sssd package from the -security/-update pockets. I've asked him to also pull the adcli package. At the time of this writing, he hasn't done that yet (he had to go AFK), but he told me he would. In any case, this is not going to help much because by now most systems probably have the updates already because of unattended-upgrades.
> >
> > Having said all that, would it be possible for you to handle this issue? I can offer any help you need, of course, but I feel like you already have all the context in your head and would be able to make progress much faster.
> >
> > Thanks in advance,
> >
> > --
> > Sergio
> > GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14
>
> --
> Jamie Strandboge | http://www.canonical.com

-- Mailing list: https://launchpad.net/~sts-sponsors Post
[Bug 1906627] Re: adcli fails, can't contact LDAP server
Yes, when --use-ldaps is specified, adcli will make a TLS connection to the domain controller and speak LDAPS. This works, and is the reason why this bug slipped through our regression testing; I should have tested without the --use-ldaps flag as well. Regardless, this bug seems to be caused by the GSS-SPNEGO implementation in the cyrus-sasl2 package being broken. adcli links to libsasl2-modules-gssapi-mit, which is part of cyrus-sasl2, since adcli does not implement GSS-SPNEGO itself and relies on the cyrus-sasl libraries. I downloaded the source package of cyrus-sasl2 2.1.27+dfsg-2 from Focal, built it on Bionic, and installed it. I then tried an adcli join, and it worked: https://paste.ubuntu.com/p/R8PyHJMNtT/

Looking at the cyrus-sasl2 source repo, it seems the Bionic version is missing a lot of commits related to GSS-SPNEGO support.

Commit 816e529043de08f3f9dcc4097380de39478b0b16
From: Simo Sorce
Date: Thu, 16 Feb 2017 15:25:56 -0500
Subject: Fix GSS-SPNEGO mechanism's incompatible behavior
Link: https://github.com/cyrusimap/cyrus-sasl/commit/816e529043de08f3f9dcc4097380de39478b0b16

Commit 4b0306dcd76031460246b2dabcb7db766d6b04d8
From: Simo Sorce
Date: Mon, 10 Apr 2017 19:54:19 -0400
Subject: Add support for retrieving the mech_ssf
Link: https://github.com/cyrusimap/cyrus-sasl/commit/4b0306dcd76031460246b2dabcb7db766d6b04d8

Commit 31b68a9438c24fc9e3e52f626462bf514de31757
From: Ryan Tandy
Date: Mon, 24 Dec 2018 15:07:02 -0800
Subject: Restore LIBS after checking gss_inquire_sec_context_by_oid
Link: https://github.com/cyrusimap/cyrus-sasl/commit/31b68a9438c24fc9e3e52f626462bf514de31757

This doesn't even seem to be a complete list, and if we backport these patches to the Bionic cyrus-sasl2 package, it fails to build for numerous reasons.
I also found a similar bug report in Debian, which references the third commit above: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917129 From what I can tell, GSS-SPNEGO in cyrus-sasl2 for Bionic has never worked, and changing it to the default was a bad idea. So, we have a decision to make. If supporting the new Active Directory requirements in ADV190023 [1][2], which adds --use-ldaps to adcli as part of bug 1868703, is important and something the community wants, we need to fix up cyrus-sasl2 to have a working GSS-SPNEGO implementation. [1] https://msrc.microsoft.com/update-guide/en-us/vulnerability/ADV190023 [2] https://support.microsoft.com/en-us/help/4520412/2020-ldap-channel-binding-and-ldap-signing-requirements-for-windows If we don't want --use-ldaps for adcli, then we can revert the patches for adcli on Bionic and go back to what was working previously, with GSS-API. ** Bug watch added: Debian Bug tracker #917129 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917129 ** Also affects: cyrus-sasl2 (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
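The cross-build test described in this comment (Focal's cyrus-sasl2 built and installed on a Bionic box) can be reproduced roughly as follows. pull-lp-source comes from the ubuntu-dev-tools package; the unpacked directory name and the exact set of binary packages to install are assumptions.

```shell
# Hedged sketch: build Focal's cyrus-sasl2 2.1.27+dfsg-2 on Bionic and
# install the rebuilt SASL libraries for testing an adcli join.
sudo apt install -y ubuntu-dev-tools devscripts
pull-lp-source cyrus-sasl2 focal              # fetch the Focal source package
sudo apt-get build-dep -y cyrus-sasl2
cd cyrus-sasl2-2.1.27+dfsg                    # directory name may differ
debuild -us -uc                               # unsigned local build
sudo dpkg -i ../libsasl2-2_*.deb ../libsasl2-modules-gssapi-mit_*.deb
```

With the rebuilt libraries installed, retrying the join with the problematic adcli 0.8.2-1ubuntu1 should succeed if the GSS-SPNEGO breakage is indeed in cyrus-sasl2.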
[Bug 1906627] Re: adcli fails, can't contact LDAP server
I built the current upstream master branch of adcli, and it too fails on Bionic: https://paste.ubuntu.com/p/vsgfxyb9X7/ This must be why the exact same patches work on Focal. The problem probably isn't adcli itself, but more likely a library it depends on.

# apt depends adcli
adcli
  Depends: libsasl2-modules-gssapi-mit
  Depends: libc6 (>= 2.14)
  Depends: libgssapi-krb5-2 (>= 1.6.dfsg.2)
  Depends: libk5crypto3 (>= 1.7+dfsg)
  Depends: libkrb5-3 (>= 1.10+dfsg~alpha1)
  Depends: libldap-2.4-2 (>= 2.4.7)

I will try upgrading each of these one at a time to see if it improves the situation. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1906627] Re: adcli fails, can't contact LDAP server
Hi Rolf, I sincerely apologise for causing this regression; it seems my testing was not good enough during the recent SRU. I recently made a change to adcli in bug 1868703 to add the --use-ldaps flag, so adcli can communicate with a domain controller over LDAPS. It also introduced a change where adcli uses GSS-SPNEGO by default and enforces channel signing, instead of doing everything in cleartext, which was the old default. The good news is that the breakage seems to be limited to Bionic only, and even though Focal got the exact same patches, Focal seems unaffected. For anyone experiencing this bug, you can downgrade to a working adcli with: $ sudo apt install adcli=0.8.2-1 I am working to fix this now. Comparison of logging and packet traces from various versions:

Bionic adcli 0.8.2-1 https://paste.ubuntu.com/p/NWHGQn746D/
Bionic adcli 0.8.2-1ubuntu1 https://paste.ubuntu.com/p/WRnnRMGBPm/
Focal adcli 0.9.0-1ubuntu0.20.04.1 https://paste.ubuntu.com/p/8668pJrr2m/

We can see that Bionic 0.8.2-1ubuntu1 stops at "Couldn't lookup computer account: BIONIC$: Can't contact LDAP server". Starting debugging now. Will update soon. ** Changed in: adcli (Ubuntu) Status: Confirmed => Fix Released ** Changed in: adcli (Ubuntu Bionic) Status: New => In Progress ** Changed in: adcli (Ubuntu Bionic) Importance: Undecided => High ** Changed in: adcli (Ubuntu Bionic) Assignee: (unassigned) => Matthew Ruffell (mruffell) -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
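Since unattended-upgrades will otherwise reinstall the broken 0.8.2-1ubuntu1, it may be worth pinning the package after the downgrade. The version string is the one given in the comment above; the apt-mark hold step is my suggestion, not part of the original workaround.

```shell
# Downgrade to the last known-good adcli and hold it so unattended-upgrades
# does not pull the broken version back in.
sudo apt install adcli=0.8.2-1
sudo apt-mark hold adcli
# Once a fixed package is published:
#   sudo apt-mark unhold adcli && sudo apt upgrade
```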
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Tags added: regression-update -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1906627] Re: adcli fails, can't contact LDAP server
** Also affects: adcli (Ubuntu Bionic) Importance: Undecided Status: New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1906627 Title: adcli fails, can't contact LDAP server To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/adcli/+bug/1906627/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin, I have good news. The SRU has completed, and the new kernels have now been released to -updates. Their versions are: Bionic: 4.15.0-126-generic Focal: 5.4.0-56-generic You can go ahead and schedule that maintenance window now, to install the latest kernel from -updates. These kernels also have full livepatch support, which is good news for you. Let me know how the 4.15.0-126-generic kernel goes on the Launchpad git server, since it should perform just as well as the test kernel you are currently running. Thanks, Matthew -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
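A quick way to confirm a machine actually rebooted into the released kernel is to compare `uname -r` against the versions given above; the version strings come from the comment, and the commands themselves are generic.

```shell
# Check whether the running kernel is one of the fixed versions from -updates
# (4.15.0-126 on Bionic, 5.4.0-56 on Focal, per the comment above).
kernel="$(uname -r)"
echo "running: $kernel"
case "$kernel" in
  4.15.0-126-*|5.4.0-56-*) echo "fixed kernel from -updates" ;;
  *) echo "not one of the fixed versions; install updates and reboot" ;;
esac
```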
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Status in linux source package in Focal: Fix Released Bug description: BugLink: https://bugs.launchpad.net/bugs/1898786 [Impact] Systems that utilise bcache can experience extremely high IO wait times when under constant IO pressure. The IO wait times seem to stay at a consistent 1 second, and never drop as long as the bcache shrinker is enabled. If you disable the shrinker, then IO wait drops significantly to normal levels. We did some perf analysis, and it seems we spend a huge amount of time in bch_mca_scan(), likely waiting for the mutex c->bucket_lock.
Looking at the recent commits in Bionic, we found the following commit merged in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li
Date: Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more than acquiring
> and releasing mutex c->bucket_lock.

This seems to be what is going on here, but the above commit only addresses the case when nr is 0. From what I can see, the problems we are experiencing are when nr is 1 or 2: again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to release it, since the shrinker loop never executes because there is no work to do.
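The arithmetic the commit message describes can be spelled out with shell integer division; the btree_pages value below is illustrative (the real value depends on the cache configuration), but the truncation to zero is the point.

```shell
# Sketch of the truncation described in the quoted commit message: when the
# VM asks the shrinker to scan only 1 or 2 objects and btree_pages is larger,
# integer division makes nr 0, so bch_mca_scan() does nothing except take
# and drop c->bucket_lock.
nr_to_scan=2      # typical small value passed in sc->nr_to_scan
btree_pages=64    # illustrative; depends on bucket size
nr=$((nr_to_scan / btree_pages))
echo "nr=$nr"     # prints nr=0 -> no btree nodes shrunk
```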
  [Fix]

  The following commits fix the problem, and all landed in 5.6-rc1:

  commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
  Author: Coly Li
  Date: Fri Jan 24 01:01:40 2020 +0800
  Subject: bcache: remove member accessed from struct btree
  Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

  commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
  Author: Coly Li
  Date: Fri Jan 24 01:01:41 2020 +0800
  Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

  commit e3de04469a49ee09c89e80bf821508df458ccee6
  Author: Coly Li
  Date: Fri Jan 24 01:01:42 2020 +0800
  Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
  Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

  The first commit is a dependency of the other two. It removes a "recently accessed" marker, used to indicate that a particular cache has been used recently and should therefore not be considered for cache eviction. The commit mentions that under heavy IO, all caches end up being recently accessed, and nothing is ever shrunk.

  The second commit changes a previous design decision of skipping the first 3 caches to shrink, since it is common for bch_mca_scan() to be called with nr being 1 or 2, just as 0 was common in the very first commit I mentioned. In that case the loop exits and nothing happens, and we waste time waiting on locks, just like before. The fix is to try to shrink caches from the tail of the list, not the beginning.

  The third commit fixes a minor issue where we don't want to re-arrange the linked list c->btree_cache, which is what the second commit ended up doing. Instead, we just shrink the cache at the end of the linked list and leave the order unchanged.
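The effect of the second and third commits can be sketched the same way: reap entries from the tail of the list and leave the survivors' order alone. This is purely illustrative, not the kernel code (which walks struct btree entries under c->bucket_lock):

```shell
# Toy model of the fixed behaviour: reap nr entries from the tail of
# the cache list without reordering the remaining entries.
cache='n0
n1
n2
n3
n4'
nr=2
total=$(printf '%s\n' "$cache" | wc -l)
keep=$(( total - nr ))
printf '%s\n' "$cache" | tail -n "$nr"    # entries reclaimed: n3, n4
printf '%s\n' "$cache" | head -n "$keep"  # survivors keep their order
```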
One minor backport / context adjustment was required in the first commit for Bionic, and the rest are all clean cherry picks to Bionic and Focal. [Testcase] This is kind of hard to test, since the problem shows up in production environments that are under constant IO pressure, with many different items entering and leaving the cache. The
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin,

No worries about being busy. Now, the kernel is scheduled to be released early next week, around the 30th of November. I think at this stage it is best to wait it out and install the kernel once it reaches -updates. That way you will have a fixed kernel that is supported by livepatch, and you don't have to justify a reboot twice.

I did some regression testing in my comments above, and everything looks okay. These patches also worked great in your test kernel. We have done the best we can to verify the kernel in the time given, so don't worry about testing at this stage.

I'll let you know once the kernel has reached -updates, likely Monday or Tuesday next week.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Verification for sssd on Bionic:

The customer tested sssd from -updates, version 1.16.1-1ubuntu1.6, and the package from -proposed, version 1.16.1-1ubuntu1.7.

Begins:

Before applying the patch [package from -proposed] I confirmed open ports to our domain controllers using ss and grepping for the DC IPs. Before the patch, ports 389 and 3268 were being actively used. After the patch [installing the package from -proposed] (and after running a few user queries with `id`), ports 636 and 3269 were being used.

Ends.

This matches my testing and the testing Tobias has done, so I am happy to mark sssd as verified for Bionic.

** Tags removed: verification-needed

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
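The port check described in the customer's test can be sketched as a small helper. Everything below is illustrative: the DC address and the ss output lines are fabricated samples, not output from this verification.

```shell
# Hypothetical helper: given `ss -tn` output on stdin, print the
# distinct remote ports in use toward a domain controller address.
dc_ports() {  # usage: ss -tn | dc_ports <dc-ip>
  grep "$1" | sed -n "s/.*$1:\([0-9]*\).*/\1/p" | sort -un
}

# Fabricated sample of ss output, for demonstration only:
printf '%s\n' \
  'ESTAB 0 0 10.0.0.5:51514 192.0.2.10:636' \
  'ESTAB 0 0 10.0.0.5:51520 192.0.2.10:3269' |
  dc_ports 192.0.2.10
```

On a real host you would pipe live output instead, e.g. `ss -tn | dc_ports <dc-ip>`; seeing 636 and 3269 rather than 389 and 3268 indicates LDAPS and Global Catalog over TLS are in use.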
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Verification for sssd on Focal:

The customer tested sssd from -updates, version 2.2.3-3, and the package from -proposed, version 2.2.3-3ubuntu0.1.

Begins:

I have successfully tested the [package from -proposed] on Ubuntu 20.04.1. Before applying the patch [package from -proposed] I confirmed open ports to our domain controllers using ss and grepping for the DC IPs. Before the patch, ports 389 and 3268 were being actively used. After the patch [installing the package from -proposed] (and after running a few user queries with `id`), ports 636 and 3269 were being used.

Ends.

This matches my testing and the testing Tobias has done, so I am happy to mark sssd as verified for Focal.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
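For reference, the option being verified here is enabled on the sssd side roughly as below. The domain stanza and providers are placeholders; ad_use_ldaps is the option name this bug backports.

```
# /etc/sssd/sssd.conf (placeholder domain; illustrative only)
[domain/testing.local]
id_provider = ad
access_provider = ad
ad_domain = testing.local
ad_use_ldaps = True
```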
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Performing verification for Bionic. Since Benjamin hasn't responded, I will try and verify the best I can.

I made an instance on AWS. I used a c5d.large instance type, and added 8GB of extra EBS storage. I installed the latest kernel from -updates to get a performance baseline; the kernel is 4.15.0-124-generic.

I made a bcache disk with the following. Note, the 8GB disk was used as the cache disk, and the 50GB disk as the backing disk. Keeping the cache small is to force cache evictions often, and hopefully trigger the bug.

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0 46.6G  0 disk
nvme0n1     259:1    0    8G  0 disk
nvme2n1     259:2    0    8G  0 disk
└─nvme2n1p1 259:3    0    8G  0 part /
$ sudo apt install bcache-tools
$ sudo dd if=/dev/zero of=/dev/nvme0n1 bs=512 count=8
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=8
$ sudo wipefs -a /dev/nvme0n1
$ sudo wipefs -a /dev/nvme1n1
$ sudo make-bcache -C /dev/nvme0n1 -B /dev/nvme1n1
UUID:         3f28ca5d-856b-42e9-bbb7-54cae12b5538
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      0
nbuckets:     16384
block_size:   1
bucket_size:  1024
nr_in_set:    1
nr_this_dev:  0
first_bucket: 1
UUID:         cc3e36fd-3694-4c50-aeac-0b79d2faab4a
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      1
block_size:   1
data_offset:  16
$ sudo mkfs.ext4 /dev/bcache0
$ sudo mkdir /media/bcache
$ sudo mount /dev/bcache0 /media/bcache
$ echo "/dev/bcache0 /media/bcache ext4 rw 0 0" | sudo tee -a /etc/fstab

From there, I installed fio to run some benchmarks and to apply some IO pressure to the cache.

$ sudo apt install fio

I used the following fio jobfile: https://paste.ubuntu.com/p/RNBmXdy3zG/

It is based on the ssd test in:
https://github.com/axboe/fio/blob/master/examples/ssd-test.fio

Running the fio job gives us the following output:
https://paste.ubuntu.com/p/ghkQcyT2sv/

Now that we have the baseline, I enabled -proposed, installed 4.15.0-125-generic and rebooted.
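The exact jobfile is only available behind the paste link, but since it is described as based on fio's upstream ssd-test example, a jobfile of that shape looks roughly like this. The sizes, runtime, and depths below are illustrative guesses, not the values actually used in the verification:

```
; illustrative fio jobfile in the style of examples/ssd-test.fio
[global]
ioengine=libaio
direct=1
iodepth=4
bs=4k
size=1g
runtime=60
directory=/media/bcache

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
```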
I started the fio job again, and got the following output:

# uname -rv
4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020

https://paste.ubuntu.com/p/DSTnKvXMGZ/

If you compare the two outputs, there really isn't much difference in latencies / read / write speeds. The bcache patches don't seem to cause any large impacts.

I managed to set up a bcache disk, and did some IO stress tests. Things seem to be okay. Since we had positive test results on the test kernel on the Launchpad git server, and the above shows we don't appear to have any regressions, I will mark this bug as verified for Bionic.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Performing verification for Focal. Since Benjamin hasn't responded, I will try and verify the best I can.

I made an instance on AWS. I used a c5d.large instance type, and added 8GB of extra EBS storage. I installed the latest kernel from -updates to get a performance baseline; the kernel is 5.4.0-54-generic.

I made a bcache disk with the following. Note, the 8GB disk was used as the cache disk, and the 50GB disk as the backing disk. Keeping the cache small is to force cache evictions often, and hopefully trigger the bug.

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme2n1     259:0    0 46.6G  0 disk
nvme1n1     259:1    0    8G  0 disk
nvme0n1     259:2    0    8G  0 disk
└─nvme0n1p1 259:3    0    8G  0 part /
$ sudo apt install bcache-tools
$ sudo dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=8
$ sudo dd if=/dev/zero of=/dev/nvme2n1 bs=512 count=8
$ sudo wipefs -a /dev/nvme1n1
$ sudo wipefs -a /dev/nvme2n1
$ sudo make-bcache -C /dev/nvme1n1 -B /dev/nvme2n1
UUID:         3f28ca5d-856b-42e9-bbb7-54cae12b5538
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      0
nbuckets:     16384
block_size:   1
bucket_size:  1024
nr_in_set:    1
nr_this_dev:  0
first_bucket: 1
UUID:         cc3e36fd-3694-4c50-aeac-0b79d2faab4a
Set UUID:     756747bc-f27c-44ca-a9b9-dbd132722838
version:      1
block_size:   1
data_offset:  16
$ sudo mkfs.ext4 /dev/bcache0
$ sudo mkdir /media/bcache
$ sudo mount /dev/bcache0 /media/bcache
$ echo "/dev/bcache0 /media/bcache ext4 rw 0 0" | sudo tee -a /etc/fstab

From there, I installed fio to run some benchmarks and to apply some IO pressure to the cache.

$ sudo apt install fio

I used the following fio jobfile: https://paste.ubuntu.com/p/RNBmXdy3zG/

It is based on the ssd test in:
https://github.com/axboe/fio/blob/master/examples/ssd-test.fio

Running the fio job gives us the following output:
https://paste.ubuntu.com/p/HrWGNDJPfv/

Now that we have the baseline, I enabled -proposed, installed 5.4.0-55-generic and rebooted.
I started the fio job again, and got the following output:

# uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

https://paste.ubuntu.com/p/pDVXnspmvs/

If you compare the two outputs, there really isn't much difference in latencies / read / write speeds. The bcache patches don't seem to cause any large impacts.

I managed to set up a bcache disk, and did some IO stress tests. Things seem to be okay. Since we had positive test results on the test kernel on the Launchpad git server, and the above shows we don't appear to have any regressions, I will mark this bug as verified for Focal.

** Tags removed: verification-needed-focal
** Tags added: verification-done-focal

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Performing verification of adcli on Bionic

The patches for Bionic are a bit more involved, as they add the whole --use-ldaps ecosystem.

Firstly, I installed adcli 0.8.2-1 from -updates. The manpage did not have any mention of --use-ldaps, and if I ran a command with --use-ldaps, it would complain that it was unrecognized.

# adcli join --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
join: unrecognized option '--use-ldaps'
usage: adcli join

I then enabled -proposed and installed adcli 0.8.2-1ubuntu1. The man page now documents --use-ldaps:

$ man adcli | grep -i ldaps
--use-ldaps
Connect to the domain controller with LDAPS. By default the LDAP port is used and SASL GSS-SPNEGO or GSSAPI is used for authentication and to establish encryption. This should satisfy all requirements set on the server side and LDAPS should only be used if the LDAP port is not accessible due to firewalls or other reasons.

$ LDAPTLS_CACERT=/path/to/ad_dc_ca_cert.pem adcli join --use-ldaps -D domain.example.com

I then enabled a firewall rule to block ldap connections:

# ufw deny 389
# ufw deny 3268

And tried the join command.
# adcli join --use-ldaps --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-ihG1h9/krb5.d/adcli-krb5-conf-bt9nd8
Password for Administrator@TESTING.LOCAL:
 * Authenticated as user: Administrator@TESTING.LOCAL
 * Using GSS-API for SASL bind
 * Looked up short domain name: TESTING
 * Looked up domain SID: S-1-5-21-960071060-1417404557-720088570
 * Using fully qualified name: ubuntu
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Using computer account name: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Generated 120 character computer password
 * Using keytab: FILE:/etc/krb5.keytab
 * Found computer account for UBUNTU$ at: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Set computer password
 * Retrieved kvno '13' for computer account in directory: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Checking RestrictedKrbHost/ubuntu.testing.local
 * Added RestrictedKrbHost/ubuntu.testing.local
 * Checking host/ubuntu.testing.local
 * Added host/ubuntu.testing.local
 * Checking RestrictedKrbHost/UBUNTU
 * Added RestrictedKrbHost/UBUNTU
 * Checking host/UBUNTU
 * Added host/UBUNTU
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Discovered which keytab salt to use
 * Added the entries to the keytab: UBUNTU$@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab

I couldn't catch the open port with netstat, so I used strace, and 636 was being used:

connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0

I then went through all the other sub commands and did a quick test to ensure they all took --use-ldaps and did not complain about it being unrecognized. All commands except "info" took the flag fine, and "info" was never intended to use --use-ldaps anyway.

Everything seems okay. Happy to mark adcli for Bionic verified.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
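Since netstat missed the short-lived connection, strace was the right tool. A sketch of pulling the destination port out of a connect() trace line; the sample line is the one quoted in the verification above, and the strace invocation in the comment is an assumption about how such a line would be captured:

```shell
# Hypothetical capture (not run here):
#   strace -f -e trace=connect adcli join --use-ldaps ... 2>&1 | grep htons
htons_port() {
  # Extract the port number from an strace connect() line on stdin.
  sed -n 's/.*htons(\([0-9]*\)).*/\1/p'
}

line='connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0'
printf '%s\n' "$line" | htons_port   # 636 here confirms LDAPS was used
```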
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Performing verification of adcli on Focal

The patches for Focal are a bit more involved, as they add the whole --use-ldaps ecosystem.

Firstly, I installed adcli 0.9.0-1 from -updates. The manpage did not have any mention of --use-ldaps, and if I ran a command with --use-ldaps, it would complain that it was unrecognized.

# adcli join --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
join: unrecognized option '--use-ldaps'
usage: adcli join

I then enabled -proposed and installed adcli 0.9.0-1ubuntu0.20.04.1. The man page now documents --use-ldaps:

$ man adcli | grep -i ldaps
--use-ldaps
Connect to the domain controller with LDAPS. By default the LDAP port is used and SASL GSS-SPNEGO or GSSAPI is used for authentication and to establish encryption. This should satisfy all requirements set on the server side and LDAPS should only be used if the LDAP port is not accessible due to firewalls or other reasons.
$ LDAPTLS_CACERT=/path/to/ad_dc_ca_cert.pem adcli join --use-ldaps -D domain.example.com

I then enabled a firewall rule to block ldap connections:

# ufw deny 389
# ufw deny 3268

And tried the join command:

# adcli join --use-ldaps --verbose -U Administrator --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local --domain-realm TESTING.LOCAL
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-ihG1h9/krb5.d/adcli-krb5-conf-bt9nd8
Password for Administrator@TESTING.LOCAL:
 * Authenticated as user: Administrator@TESTING.LOCAL
 * Using GSS-API for SASL bind
 * Looked up short domain name: TESTING
 * Looked up domain SID: S-1-5-21-960071060-1417404557-720088570
 * Using fully qualified name: ubuntu
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Using computer account name: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Generated 120 character computer password
 * Using keytab: FILE:/etc/krb5.keytab
 * Found computer account for UBUNTU$ at: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Set computer password
 * Retrieved kvno '13' for computer account in directory: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Checking RestrictedKrbHost/ubuntu.testing.local
 * Added RestrictedKrbHost/ubuntu.testing.local
 * Checking host/ubuntu.testing.local
 * Added host/ubuntu.testing.local
 * Checking RestrictedKrbHost/UBUNTU
 * Added RestrictedKrbHost/UBUNTU
 * Checking host/UBUNTU
 * Added host/UBUNTU
 * Cleared old entries
from keytab: FILE:/etc/krb5.keytab
 * Discovered which keytab salt to use
 * Added the entries to the keytab: UBUNTU$@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/UBUNTU@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: RestrictedKrbHost/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab
 * Cleared old entries from keytab: FILE:/etc/krb5.keytab
 * Added the entries to the keytab: host/ubuntu.testing.local@TESTING.LOCAL: FILE:/etc/krb5.keytab

I couldn't catch the open port with netstat, so I used strace, and 636 was being used:

connect(3, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("192.168.122.66")}, 16) = 0

I then went through all the other subcommands and did a quick test to ensure they all took --use-ldaps and did not complain about it being unrecognized. All commands except "info" took the flag fine, and "info" was never intended to use --use-ldaps anyway.

Everything looks good. Happy to mark adcli for Focal verified.
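The strace check above can be scripted rather than eyeballed. A minimal sketch (the connect() line is taken from the transcript above; the helper name is hypothetical, not part of adcli):

```python
import re

def connect_port(strace_line):
    """Pull the destination port out of an strace connect() line, or None."""
    m = re.search(r"sin_port=htons\((\d+)\)", strace_line)
    return int(m.group(1)) if m else None

line = ('connect(3, {sa_family=AF_INET, sin_port=htons(636), '
        'sin_addr=inet_addr("192.168.122.66")}, 16) = 0')
print(connect_port(line))  # → 636, i.e. LDAPS rather than plain LDAP (389)
```

Filtering `strace -f -e trace=connect` output through a check like this makes it easy to confirm every subcommand really goes out over port 636.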
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Performing verification of adcli on Groovy.

Groovy only required one patch, which fixed a missed enablement of --use-ldaps for the testjoin and update commands. So, just testing those two.

I installed adcli 0.9.0-1ubuntu1 from -updates, and I set everything up by issuing a join command. After that, I tried the --use-ldaps flag with the testjoin and update commands:

# adcli testjoin --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
testjoin: unrecognized option '--use-ldaps'
usage: adcli testjoin

# adcli update --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
update: unrecognized option '--use-ldaps'
usage: adcli update

I then enabled -proposed, installed adcli 0.9.0-1ubuntu1.2 and tried again. We block port 389 on the firewall:

# ufw deny 389
# ufw deny 3268

Then try testjoin and update:

# adcli testjoin --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
 * Found realm in keytab: TESTING.LOCAL
 * Found computer name in keytab: UBUNTU
 * Found service principal in keytab: host/UBUNTU
 * Found service principal in keytab: host/ubuntu.testing.local
 * Found host qualified name in keytab: ubuntu.testing.local
 * Found service principal in keytab: RestrictedKrbHost/UBUNTU
 * Found service principal in keytab: RestrictedKrbHost/ubuntu.testing.local
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-6SRtqJ/krb5.d/adcli-krb5-conf-YGzgnK
 * Authenticated as default/reset computer account: UBUNTU
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Looked up short domain name: TESTING
 * Looked up domain SID:
S-1-5-21-960071060-1417404557-720088570
Sucessfully validated join to domain WIN-SB6JAS7PH22.testing.local

# adcli update --use-ldaps --verbose --domain WIN-SB6JAS7PH22.testing.local --domain-controller WIN-SB6JAS7PH22.testing.local
 * Found realm in keytab: TESTING.LOCAL
 * Found computer name in keytab: UBUNTU
 * Found service principal in keytab: host/UBUNTU
 * Found service principal in keytab: host/ubuntu.testing.local
 * Found host qualified name in keytab: ubuntu.testing.local
 * Found service principal in keytab: RestrictedKrbHost/UBUNTU
 * Found service principal in keytab: RestrictedKrbHost/ubuntu.testing.local
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Calculated computer account name from fqdn: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Wrote out krb5.conf snippet to /tmp/adcli-krb5-6FQ1ZS/krb5.d/adcli-krb5-conf-LHowkP
 * Authenticated as default/reset computer account: UBUNTU
 * Using LDAPS to connect to WIN-SB6JAS7PH22.testing.local
 * Looked up short domain name: TESTING
 * Looked up domain SID: S-1-5-21-960071060-1417404557-720088570
 * Using fully qualified name: ubuntu
 * Using domain name: WIN-SB6JAS7PH22.testing.local
 * Using computer account name: UBUNTU
 * Using domain realm: WIN-SB6JAS7PH22.testing.local
 * Using fully qualified name: ubuntu.testing.local
 * Enrolling computer name: UBUNTU
 * Generated 120 character computer password
 * Using keytab: FILE:/etc/krb5.keytab
 * Found computer account for UBUNTU$ at: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Retrieved kvno '12' for computer account in directory: CN=UBUNTU,CN=Computers,DC=testing,DC=local
 * Password not too old, no change needed
 * Sending NetLogon ping to domain controller: WIN-SB6JAS7PH22.testing.local
 * Received NetLogon info from: WIN-SB6JAS7PH22.testing.local
 * Modifying computer account: dNSHostName
 * Checking
RestrictedKrbHost/ubuntu.testing.local
 * Added RestrictedKrbHost/ubuntu.testing.local
 * Checking host/ubuntu.testing.local
 * Added host/ubuntu.testing.local
 * Checking RestrictedKrbHost/UBUNTU
 * Added RestrictedKrbHost/UBUNTU
 * Checking host/UBUNTU
 * Added host/UBUNTU

Everything seems fine. Happy to mark Groovy as verified for adcli.
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Hi Tobias, thanks for testing and verifying! I really appreciate it, and it's good to hear that everything works. I'll just add some of my own test output below, and we should be good to go for a release to -updates in about a week's time.
[Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin, The kernel team have built the next kernel update, and they have placed it in -proposed for verification. The versions are 4.15.0-125-generic for Bionic, and 5.4.0-55-generic for Focal. Can you please schedule a maintenance window for the Launchpad git server, to install the new kernel in -proposed, and reboot into it, so we can verify that it fixes the problem. Instructions to install (On a Bionic system): Enable -proposed by running the following command to make a new sources.list.d entry: 1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-bionic-proposed.list # Enable Ubuntu proposed archive deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main EOF 2) sudo apt update 3) sudo apt install linux-image-4.15.0-125-generic linux-modules-4.15.0-125-generic \ linux-modules-extra-4.15.0-125-generic linux-headers-4.15.0-125-generic linux-headers-4.15.0-125 4) sudo reboot 5) uname -rv 4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020 6) sudo rm /etc/apt/sources.list.d/ubuntu-bionic-proposed.list 7) sudo apt update If you get a different uname, you may need to adjust your grub configuration to boot into the correct kernel. Also, since this is a production machine, make sure you remove the -proposed software source once you have installed the kernel. Let me know how this kernel performs, and if everything seems fine after a week we will mark the Launchpad bug as verified. The timeline for release to -updates is still set for the 30th of November, give or take a few days if any CVEs turn up. I believe this kernel should be live-patchable, although this may not be the case if the kernel is respun before release. Hopefully you will only have to schedule the maintenance window just the once. Thanks, Matthew -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. 
https://bugs.launchpad.net/bugs/1898786 Title: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1898786] Re: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
Hi Benjamin,

The kernel team have built the next kernel update, and they have placed it in -proposed for verification. The versions are 4.15.0-125-generic for Bionic, and 5.4.0-55-generic for Focal.

Can you please schedule a maintenance window for the Launchpad git server, to install the new kernel in -proposed, and reboot into it, so we can verify that it fixes the problem.

Instructions to install (on a Bionic system):

1) Enable -proposed by running the following command to make a new sources.list.d entry:

cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ bionic-proposed main
EOF

2) sudo apt update
3) sudo apt install linux-image-4.15.0-125-generic linux-modules-4.15.0-125-generic \
   linux-modules-extra-4.15.0-125-generic linux-headers-4.15.0-125-generic linux-headers-4.15.0-125
4) sudo reboot
5) uname -rv
   4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020
6) sudo rm /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
7) sudo apt update

If you get a different uname, you may need to adjust your grub configuration to boot into the correct kernel. Also, since this is a production machine, make sure you remove the -proposed software source once you have installed the kernel.

Let me know how this kernel performs, and if everything seems fine after a week we will mark the Launchpad bug as verified. The timeline for release to -updates is still set for the 30th of November, give or take a few days if any CVEs turn up.

I believe this kernel should be live-patchable, although this may not be the case if the kernel is respun before release. Hopefully you will only have to schedule the maintenance window just the once.

Thanks,
Matthew

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed

Bug description:
BugLink: https://bugs.launchpad.net/bugs/1898786

[Impact]

Systems that utilise bcache can experience extremely high IO wait times when under constant IO pressure. The IO wait times seem to stay at a consistent 1 second, and never drop as long as the bcache shrinker is enabled. If you disable the shrinker, then IO wait drops significantly to normal levels.

We did some perf analysis, and it seems we spend a huge amount of time in bch_mca_scan(), likely waiting for the mutex "->bucket_lock".

Looking at the recent commits in Bionic, we found the following commit merged in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li
Date: Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
> and releasing mutex c->bucket_lock.

This seems to be what is going on here, but the above commit only addresses when nr is 0. From what I can see, the problems we are experiencing are when nr is 1 or 2, and again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to release c->bucket_lock since the shrinker loop never executes since there is no work to do.
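The truncation described in the quoted commit message can be illustrated with a small sketch. This is Python standing in for the kernel arithmetic, not the actual bcache code, and the btree_pages value is illustrative only (in the kernel it is device-dependent):

```python
BTREE_PAGES = 64  # illustrative stand-in for c->btree_pages

def nodes_to_scan_before(nr_to_scan):
    # Pre-9fcc34b1 behaviour: integer division truncates small requests to
    # zero, so bch_mca_scan() only acquires and releases c->bucket_lock.
    return nr_to_scan // BTREE_PAGES

def nodes_to_scan_after(nr_to_scan):
    # Post-9fcc34b1 behaviour: at least try to shrink one node per call.
    return max(nr_to_scan // BTREE_PAGES, 1)

print(nodes_to_scan_before(2))  # → 0: nothing shrunk, pure lock churn
print(nodes_to_scan_after(2))   # → 1
```

With sc->nr_to_scan typically 1 or 2, the old calculation never reached the shrinker loop, which matches the perf profile of time wasted contending on c->bucket_lock.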
[Fix]

The following commits fix the problem, and all landed in 5.6-rc1:

commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
Author: Coly Li
Date: Fri Jan 24 01:01:40 2020 +0800
Subject: bcache: remove member accessed from struct btree
Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
Author: Coly Li
Date: Fri Jan 24 01:01:41 2020 +0800
Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

commit e3de04469a49ee09c89e80bf821508df458ccee6
Author: Coly Li
Date: Fri Jan 24 01:01:42 2020 +0800
Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

The first commit is a dependency of the other two. The first commit removes a "recently accessed" marker, used to indicate if a particular cache has been used recently, and if it has been, not consider it for
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Bionic.

I enabled -proposed and installed 4.15.0-125-generic to an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
4.15.0-125-generic #128-Ubuntu SMP Mon Nov 9 20:51:00 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme2n1 259:2    0  1.7T  0 disk
nvme3n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    0m3.615s
user    0m0.002s
sys     0m0.179s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.898s
user    0m0.002s
sys     0m0.015s

We can see that mkfs.xfs took 3.6 seconds, and fstrim only 2 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance, and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions to disks which do not support block discard.

I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual.
The 4.15.0-125-generic kernel in -proposed fixes the issue, and I am happy to mark it as verified.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1896578

Title:
  raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
Status in linux source package in Groovy: Fix Committed

Bug description:
BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, which causes common use cases which invoke block discard, such as mkfs and fstrim operations, to take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 to 11 minutes, where the same mkfs.xfs operation on Raid 0 takes 4 seconds. The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

Where the Raid10 md device only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
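The splitting described above can be checked with back-of-envelope arithmetic. This is a rough sketch using the numbers from the transcripts (member size from mdadm, limits from sysfs), not kernel code:

```python
member_size_bytes = 1855336448 * 1024  # mdadm: "size set to 1855336448K"
old_max_bytes = 524288                 # md0 discard_max_bytes (512 KiB chunk)
new_max_bytes = 2199023255040          # NVMe discard_max_hw_bytes (~2.2 TB)

def bios_needed(total_bytes, max_per_bio):
    # Ceiling division: number of discard bios the block layer must issue.
    return -(-total_bytes // max_per_bio)

print(bios_needed(member_size_bytes, old_max_bytes))  # → 3623704
print(bios_needed(member_size_bytes, new_max_bytes))  # → 1
```

Roughly 3.6 million 512 KiB bios versus a single large request explains the difference between an 11-minute and a few-second mkfs.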
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Focal.

I enabled -proposed and installed 5.4.0-55-generic to an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme3n1 259:2    0  1.7T  0 disk
nvme2n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    0m5.350s
user    0m0.022s
sys     0m0.179s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m2.944s
user    0m0.006s
sys     0m0.013s

We can see that mkfs.xfs took 5.3 seconds, and fstrim only 3 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance, and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions to disks which do not support block discard.
I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual.

The 5.4.0-55-generic kernel in -proposed fixes the issue, and I am happy to mark it as verified.

** Tags removed: verification-needed-focal
** Tags added: verification-done-focal
[Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Groovy.

I enabled -proposed and installed 5.8.0-30-generic to an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
5.8.0-30-generic #32-Ubuntu SMP Mon Nov 9 21:03:15 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme3n1 259:2    0  1.7T  0 disk
nvme2n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.

real    0m4.413s
user    0m0.022s
sys     0m0.245s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.973s
user    0m0.000s
sys     0m0.037s

We can see that mkfs.xfs took 4.4 seconds, and fstrim only 2 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance, and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions to disks which do not support block discard.
I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual. The 5.8.0-30-generic kernel in -proposed fixes the issue, and I am happy to mark as verified. ** Tags removed: verification-needed-groovy ** Tags added: verification-done-groovy -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1896578 Title: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Kernel-packages] [Bug 1896578] Re: raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Performing verification for Groovy. I enabled -proposed and installed 5.8.0-30-generic on an i3.8xlarge AWS instance. From there, I followed the testcase steps:

$ uname -rv
5.8.0-30-generic #32-Ubuntu SMP Mon Nov 9 21:03:15 UTC 2020

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:0    0  1.7T  0 disk
nvme1n1 259:1    0  1.7T  0 disk
nvme3n1 259:2    0  1.7T  0 disk
nvme2n1 259:3    0  1.7T  0 disk

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855336448K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

$ time sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=28989568 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=927666176, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=452968, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Discarding blocks...Done.

real    0m4.413s
user    0m0.022s
sys     0m0.245s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.973s
user    0m0.000s
sys     0m0.037s

We can see that mkfs.xfs took 4.4 seconds, and fstrim only 2 seconds. This is a significant improvement over the current 11 minutes.

I started up a c5.large instance and attached 4x EBS drives, which do not support block discard, and went through the testcase steps. Everything worked fine, and the changes have not caused any regressions on disks which do not support block discard.
I also started another i3.8xlarge instance and tested raid0, to check for regressions around the refactoring. raid0 deployed fine, and was as performant as usual.

The 5.8.0-30-generic kernel in -proposed fixes the issue, and I am happy to mark as verified.

** Tags removed: verification-needed-groovy
** Tags added: verification-done-groovy

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1896578

Title:
  raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Focal: Fix Committed
Status in linux source package in Groovy: Fix Committed

Bug description:

BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on raid10, which causes common use cases that invoke block discard, such as mkfs and fstrim operations, to take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices which support block discard, a mkfs.xfs operation on raid10 takes between 8 and 11 minutes, where the same mkfs.xfs operation on raid0 takes 4 seconds. The bigger the devices, the longer it takes.

The cause is that raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

where the raid10 md device only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>]
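The mismatch above can be put in numbers with a quick back-of-the-envelope calculation (a sketch using the sysfs values just quoted; the bio count is arithmetic, not a figure from the bug report):

```shell
# How many 512 KiB discard bios must the kernel issue to trim a whole
# NVMe member, given the limits quoted above?
member_max=2199023255040   # discard_max_hw_bytes of one NVMe device
md_max=524288              # discard_max_bytes of the raid10 md device
echo $(( member_max / md_max ))   # prints 4194303, i.e. ~4.2 million bios
```

Raising discard_max_bytes on the md device lets the same trim complete in a handful of large requests instead.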
[Kernel-packages] [Bug 1896154] Re: btrfs: trimming a btrfs device which has been shrunk previously fails and fills root disk with garbage data
Performing verification for Focal. I created an i3.large instance on AWS, since it has 1x NVMe drive that supports trim and block discard. I ensured that I could reproduce the problem with 5.4.0-54-generic from -updates: I followed the instructions in the Testcase section, and the final fstrim after shrinking locked up the instance and filled up the root disk. I terminated the instance.

I then created a new instance, enabled -proposed, installed 5.4.0-55-generic, and rebooted. From there, I ran through the test steps again:

$ uname -rv
5.4.0-55-generic #61-Ubuntu SMP Mon Nov 9 20:49:56 UTC 2020

$ sudo -s
# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0     7:0    0  28.1M  1 loop /snap/amazon-ssm-agent/2012
loop1     7:1    0  97.8M  1 loop /snap/core/10185
loop2     7:2    0  55.3M  1 loop /snap/core18/1885
loop3     7:3    0  70.6M  1 loop /snap/lxd/16922
xvda    202:0    0     8G  0 disk
└─xvda1 202:1    0     8G  0 part /
nvme0n1 259:0    0 442.4G  0 disk

# dev=/dev/nvme0n1
# mnt=/mnt
# mkfs.btrfs -f $dev -b 10G
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               db9dd9f5-7993-4827-9a43-93a72a73aa3a
Node size:          16384
Sector size:        4096
Filesystem size:    10.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata
Checksum:           crc32c
Number of devices:  1
Devices:
  ID        SIZE  PATH
   1    10.00GiB  /dev/nvme0n1

# mount $dev $mnt
# fstrim $mnt
# btrfs filesystem resize 1:-1G $mnt
Resize '/mnt' of '1:-1G'
# fstrim $mnt
#

The final fstrim completed almost immediately, at the same speed as the initial fstrim. The instance did not lock up, and the root disk did not get filled with any garbage data. The kernel in -proposed fixes the problem, happy to mark as verified.
** Tags removed: verification-needed-focal
** Tags added: verification-done-focal

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1896154

Title:
  btrfs: trimming a btrfs device which has been shrunk previously fails and fills root disk with garbage data

Status in linux package in Ubuntu: Fix Released
Status in linux-azure package in Ubuntu: New
Status in linux source package in Focal: Fix Committed
Status in linux-azure source package in Focal: Fix Released

Bug description:

BugLink: https://bugs.launchpad.net/bugs/1896154

[Impact]

Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit"), which landed in 5.3, btrfs won't trim a range that has already been trimmed, and will instead go looking for a range where the CHUNK_TRIMMED and CHUNK_ALLOCATED bits aren't set.

If a device has been shrunk, the CHUNK_TRIMMED and CHUNK_ALLOCATED bits are never cleared, which means that btrfs can go looking for a range to trim which is beyond the new device size. This leads to an underflow in a length calculation for the range to trim, and we end up trimming past the device's boundary. This has the unfortunate side effect of filling the root disk with garbage data, and it will not stop until the root disk is totally full, making the instance unusable.

[Fix]

The issue was fixed by the following commit in 5.9-rc1:

commit c57dd1f2f6a7cd1bb61802344f59ccdc5278c983
Author: Qu Wenruo
Date:   Fri Jul 31 19:29:11 2020 +0800
Subject: btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
Link: https://github.com/torvalds/linux/commit/c57dd1f2f6a7cd1bb61802344f59ccdc5278c983

The fix clears the CHUNK_TRIMMED and CHUNK_ALLOCATED bits when a device is being shrunk, and performs some additional checks to ensure we do not trim past the device size boundary.
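The underflow described in [Impact] can be illustrated with plain shell arithmetic (the sizes mirror the 10 GiB filesystem from the testcase, but the calculation is an illustrative sketch, not the kernel's actual code):

```shell
# A trim range start left over from before the shrink can lie past the
# new device size; the remaining-length calculation then goes negative,
# and in the kernel's unsigned 64-bit arithmetic that wraps to a huge
# value, so the trim runs far past the device boundary.
old_size=$(( 10 * 1024 * 1024 * 1024 ))   # 10 GiB, before the shrink
new_size=$((  9 * 1024 * 1024 * 1024 ))   #  9 GiB, after "resize 1:-1G"
start=$old_size                           # stale range start beyond new_size
echo $(( new_size - start ))              # prints -1073741824; as a u64
                                          # this wraps to an enormous length
```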
The fix was backported to the 5.7.17 and 5.8.3 upstream stable releases, but it seems 5.4 was skipped. The patch required a minor backport to 5.4, with the CHUNK_STATE_MASK #define moving back to fs/btrfs/extent_io.h, as the file had been renamed in later kernels.

[Testcase]

The easiest way to reproduce is to use a cloud instance that supplies a real NVMe drive that supports TRIM and block discard. Warning: this will fill the root disk with garbage data. ONLY run this on a throwaway instance!

Run the following commands:

$ dev=/dev/nvme0n1
$ mnt=/mnt
$ mkfs.btrfs -f $dev -b 10G
$ mount $dev $mnt
$ fstrim $mnt
$ btrfs filesystem resize 1:-1G $mnt
$ fstrim $mnt

The last command will appear to hang, while the root filesystem begins filling with garbage data. Once the root filesystem fills, you will see the
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
** Description changed:

[Impact]

Microsoft has released a new security advisory for Active Directory (AD) which outlines that man-in-the-middle attacks can be performed against an LDAP server, such as AD DS: an attacker forwards an authentication request to a Windows LDAP server that does not enforce LDAP channel binding or LDAP signing for incoming connections. To address this, Microsoft has announced new Active Directory requirements in ADV190023 [1][2].

[1] https://msrc.microsoft.com/update-guide/en-us/vulnerability/ADV190023
[2] https://support.microsoft.com/en-us/help/4520412/2020-ldap-channel-binding-and-ldap-signing-requirements-for-windows

These new requirements strongly encourage system administrators to require LDAP signing and authenticated channel binding in their AD environments. The effect is to stop unauthenticated and unencrypted traffic over LDAP port 389, and to force authenticated and encrypted traffic instead, over LDAPS port 636 and Global Catalog SSL port 3269. Microsoft will not force this change via updates to their servers; system administrators must opt in and change their own configuration.

To support these new requirements in Ubuntu, changes need to be made to the sssd and adcli packages. Upstream has added a new flag "ad_use_ldaps" to sssd, and "use-ldaps" has been added to adcli. If "ad_use_ldaps = True", sssd will send all communication over port 636, authenticated and encrypted. For adcli, if the server supports GSS-SPNEGO, it will now be used by default, over the normal LDAP port 389. If the LDAP port is blocked, "use-ldaps" can be used, which will use LDAPS port 636 instead.

Without these changes, Ubuntu 18.04/20.04 LTS machines report the following error:

"[sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_domain_options]: Attribute 'ad_use_ldaps' is not allowed in section 'domain/test.com'. Check for typos."
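Once the patched sssd is installed, the opt-in is a single setting. A minimal sketch of /etc/sssd/sssd.conf follows; the domain name and provider lines are illustrative placeholders, not taken from the bug report:

```ini
# Minimal sketch -- "example.com" and the provider lines are placeholders
[sssd]
domains = example.com
services = nss, pam

[domain/example.com]
id_provider = ad
access_provider = ad

# Opt in to the ADV190023 hardening: all AD traffic moves to LDAPS
# port 636 and the Global Catalog SSL port 3269.
ad_use_ldaps = True
```

After editing, restart sssd so the new setting takes effect.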
These patches are needed to stay in line with Microsoft security advisories, since security-conscious system administrators would like to firewall off LDAP port 389 in their environments and use LDAPS port 636 only.

[Testcase]

To test these changes, you will need to set up a Windows Server 2019 box, install and configure Active Directory, import the AD certificate to the Ubuntu clients, and create some users in Active Directory. From there, you can do a user search from the client to the AD server, and check what ports are used for communication.

Currently, you should see LDAP port 389 and Global Catalog port 3268 in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:43954 x.x.x.x:389  ESTABLISHED 27614/sssd_be
tcp 0 0 x.x.x.x:54381 x.x.x.x:3268 ESTABLISHED 27614/sssd_be

Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf294530-test

Instructions to install (on a bionic or focal system):
1) sudo add-apt-repository ppa:mruffell/sf294530-test
2) sudo apt update
3) sudo apt install adcli sssd

Then modify /etc/sssd/sssd.conf to add "ad_use_ldaps = True", and restart sssd. Add firewall rules to block traffic to LDAP port 389 and Global Catalog port 3268:

$ sudo ufw deny 389
$ sudo ufw deny 3268

Then do another user lookup, and check the ports in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:44586 x.x.x.x:636  ESTABLISHED 28474/sssd_be
tcp 0 0 x.x.x.x:56136 x.x.x.x:3269 ESTABLISHED 28474/sssd_be

We see LDAPS port 636 and Global Catalog SSL port 3269 in use. The user lookup succeeds even with ports 389 and 3268 blocked, since it uses their authenticated and encrypted variants instead.

[Where problems could occur]

Firstly, the adcli and sssd packages will continue to work with AD servers that haven't had LDAP signing or authenticated channel binding enforced, since the measures are optional.
For both sssd and adcli, the changes don't implement anything new; instead, they add configuration and logic to select which protocol to use to talk to the AD server. LDAP and LDAPS are already implemented in both sssd and adcli; the changes just add some logic to select LDAPS over LDAP.

For sssd, the changes are hidden behind configuration parameters, such as "ldap_sasl_mech" and "ad_use_ldaps". If a regression were to occur, it would be limited to systems where the system administrator had enabled these configuration options in the /etc/sssd/sssd.conf file.

For adcli, the changes are more immediate. adcli will now use GSS-SPNEGO by default if the server supports it, which is a behaviour change. The "use-ldaps" option is a flag on the command line, e.g. "--use-ldaps", and if a regression were to occur, users can remove "--use-ldaps" from their command to fall back to the new GSS-SPNEGO defaults on
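Circling back to the [Testcase] section: the netstat check can be wrapped in a small helper. This is a hedged sketch; the 10.0.0.x addresses below are placeholders standing in for the x.x.x.x output quoted above:

```shell
# Hedged sketch: flag any sssd connection that is NOT on LDAPS (636)
# or Global Catalog SSL (3269). Feed it "netstat -tanp | grep sssd"
# style lines; it prints the offenders, so empty output means all
# traffic is on the encrypted ports.
check_encrypted_only() {
  awk '{ split($5, peer, ":"); if (peer[2] != "636" && peer[2] != "3269") print }'
}

# Example with placeholder addresses (a real run would pipe netstat in):
printf '%s\n' \
  'tcp 0 0 10.0.0.5:44586 10.0.0.9:636  ESTABLISHED 28474/sssd_be' \
  'tcp 0 0 10.0.0.5:56136 10.0.0.9:3269 ESTABLISHED 28474/sssd_be' \
  | check_encrypted_only    # prints nothing: only encrypted ports in use
```

A line such as the pre-fix "x.x.x.x:389" connection would be printed, making the regression easy to spot in a script.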
Re: [Sts-sponsors] Please review and potentially sponsor LP1868703 Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Hi Eric,

I have revised the patches and fixed the issues you found. The revised debdiffs are attached to the Launchpad bug. Please review and potentially sponsor.

Thanks,
Matthew

On Tue, Nov 10, 2020 at 2:28 AM Eric Desrochers wrote:
>
> I'll review it today or tomorrow,
>
> Thanks for the very detailed SRU template.
>
> On Sun, Nov 8, 2020 at 11:06 PM Matthew Ruffell wrote:
>>
>> Hello Dan, Eric and Mauricio,
>>
>> Can you please review and consider sponsoring LP1868703 [1]?
>>
>> [1] https://bugs.launchpad.net/bugs/1868703
>>
>> Debdiffs for adcli and sssd are attached to the bug, and are for
>> Bionic and Focal. Groovy has all the fixes already.
>>
>> Myself, the customer and the bug reporter have done some testing, and
>> things are looking good.
>>
>> Let me know if I need to make any changes or fix anything.
>>
>> Thanks,
>> Matthew

-- 
Mailing list: https://launchpad.net/~sts-sponsors
Post to     : sts-sponsors@lists.launchpad.net
Unsubscribe : https://launchpad.net/~sts-sponsors
More help   : https://help.launchpad.net/ListHelp
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for adcli for Focal.

** Patch added: "adcli debdiff for Focal v2"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432871/+files/lp1868703_adcli_focal_v2.debdiff

-- 
You received this bug notification because you are a member of STS Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

Status in Cyrus-sasl2: Unknown
Status in sssd package in Ubuntu: Fix Released
Status in adcli source package in Bionic: In Progress
Status in sssd source package in Bionic: In Progress
Status in adcli source package in Disco: Won't Fix
Status in sssd source package in Disco: Won't Fix
Status in adcli source package in Eoan: Won't Fix
Status in sssd source package in Eoan: Won't Fix
Status in adcli source package in Focal: In Progress
Status in sssd source package in Focal: In Progress
Status in adcli source package in Groovy: Fix Released
Status in sssd source package in Groovy: Fix Released
Status in sssd source package in Hirsute: Fix Released
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for adcli in Bionic.

** Patch added: "adcli debdiff for Bionic v2"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432874/+files/lp1868703_adcli_bionic_v2.debdiff

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1868703

Title:
  Support "ad_use_ldaps" flag for new AD requirements (ADV190023)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cyrus-sasl2/+bug/1868703/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli in Groovy.

** Patch added: "adcli debdiff for groovy"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432870/+files/lp1868703_adcli_groovy.debdiff
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli for Hirsute.

** Patch added: "adcli debdiff for hirsute"
https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432869/+files/lp1868703_adcli_hirsute.debdiff
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for adcli in Bionic. ** Patch added: "adcli debdiff for Bionic v2" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432874/+files/lp1868703_adcli_bionic_v2.debdiff -- You received this bug notification because you are a member of STS Sponsors, which is subscribed to the bug report. https://bugs.launchpad.net/bugs/1868703 Title: Support "ad_use_ldaps" flag for new AD requirements (ADV190023) Status in Cyrus-sasl2: Unknown Status in sssd package in Ubuntu: Fix Released Status in adcli source package in Bionic: In Progress Status in sssd source package in Bionic: In Progress Status in adcli source package in Disco: Won't Fix Status in sssd source package in Disco: Won't Fix Status in adcli source package in Eoan: Won't Fix Status in sssd source package in Eoan: Won't Fix Status in adcli source package in Focal: In Progress Status in sssd source package in Focal: In Progress Status in adcli source package in Groovy: Fix Released Status in sssd source package in Groovy: Fix Released Status in sssd source package in Hirsute: Fix Released Bug description: [Impact] Microsoft has released a new security advisory for Active Directory (AD) which outlines that man-in-the-middle attacks can be performed on a LDAP server, such as AD DS, that works by an attacker forwarding an authentication request to a Windows LDAP server that does not enforce LDAP channel binding or LDAP signing for incoming connections. To address this, Microsoft has announced new Active Directory requirements in ADV190023 [1][2]. [1] https://msrc.microsoft.com/update-guide/en-us/vulnerability/ADV190023 [2] https://support.microsoft.com/en-us/help/4520412/2020-ldap-channel-binding-and-ldap-signing-requirements-for-windows These new requirements strongly encourage system administrators to require LDAP signing and authenticated channel binding in their AD environments. 
The effect of this is to stop unauthenticated and unencrypted traffic from communicating over LDAP port 389, and to force authenticated and encrypted traffic instead, over LDAPS port 636 and Global Catalog over SSL port 3269. Microsoft will not be forcing this change via updates to their servers; system administrators must opt in and change their own configuration.

To support these new requirements in Ubuntu, changes need to be made to the sssd and adcli packages. Upstream has added a new flag "ad_use_ldaps" to sssd, and "use-ldaps" has been added to adcli. If "ad_use_ldaps = True", then sssd will send all communication over port 636, authenticated and encrypted. For adcli, if the server supports GSS-SPNEGO, it will now be used by default over the normal LDAP port 389. If the LDAP port is blocked, "use-ldaps" can now be used, which uses the LDAPS port 636 instead.

Without these patches, Ubuntu 18.04/20.04 LTS machines currently report the following error:

"[sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_domain_options]: Attribute 'ad_use_ldaps' is not allowed in section 'domain/test.com'. Check for typos."

These patches are needed to stay in line with Microsoft security advisories, since security-conscious system administrators would like to firewall off the LDAP port 389 in their environments and use LDAPS port 636 only.

[Testcase]

To test these changes, you will need to set up a Windows Server 2019 box, install and configure Active Directory, import the AD certificate to the Ubuntu clients, and create some users in Active Directory. From there, you can try doing a user search from the client to the AD server, and check which ports are used for communication.
Currently, you should see port 389 in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:43954 x.x.x.x:389 ESTABLISHED 27614/sssd_be
tcp 0 0 x.x.x.x:54381 x.x.x.x:3268 ESTABLISHED 27614/sssd_be

Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf294530-test

Instructions to install (on a bionic or focal system):
1) sudo add-apt-repository ppa:mruffell/sf294530-test
2) sudo apt update
3) sudo apt install adcli sssd

Then, modify /etc/sssd/sssd.conf to add "ad_use_ldaps = True", and restart sssd. Add firewall rules to block traffic to LDAP port 389 and Global Catalog port 3268:

$ sudo ufw deny 389
$ sudo ufw deny 3268

Then do another user lookup, and check the ports in use:

$ sudo netstat -tanp | grep sssd
tcp 0 0 x.x.x.x:44586 x.x.x.x:636 ESTABLISHED 28474/sssd_be
tcp 0 0 x.x.x.x:56136 x.x.x.x:3269 ESTABLISHED 28474/sssd_be

We see LDAPS port 636 and Global Catalog over SSL port 3269 in use. The user lookup succeeds even with ports 389 and 3268 blocked, since their authenticated and encrypted variants are used instead.

[Where problems could occur]

Firstly, the adcli and sssd packages will continue to work with AD servers that
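For reference, the sssd.conf change described in the testcase can be sketched as a minimal fragment. This is an assumption-laden illustration, not a configuration from the bug report: the domain name "test.com" is a placeholder (matching the error message above), and the provider settings are a typical AD setup, not mandated by the patch.

```ini
; /etc/sssd/sssd.conf -- minimal sketch; "test.com" is a placeholder domain
[sssd]
config_file_version = 2
services = nss, pam
domains = test.com

[domain/test.com]
; hypothetical AD-backed domain section
id_provider = ad
access_provider = ad
; With the patched sssd, send all AD communication over LDAPS port 636
; (and Global Catalog over SSL, port 3269)
ad_use_ldaps = True
```

The adcli side is opted into per invocation rather than via a config file, e.g. `adcli join --use-ldaps test.com` (exact join arguments depend on your environment).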
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli for Hirsute.

** Patch added: "adcli debdiff for hirsute" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432869/+files/lp1868703_adcli_hirsute.debdiff
[Sts-sponsors] [Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a debdiff for adcli in Groovy.

** Patch added: "adcli debdiff for groovy" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432870/+files/lp1868703_adcli_groovy.debdiff
[Bug 1868703] Re: Support "ad_use_ldaps" flag for new AD requirements (ADV190023)
Attached is a revised debdiff for sssd for Bionic.

** Patch added: "sssd debdiff for Bionic v2" https://bugs.launchpad.net/ubuntu/+source/sssd/+bug/1868703/+attachment/5432867/+files/lp1868703_sssd_bionic_v2.debdiff