Re: [PATCH] libbpf: Use the correct fd when attaching to perf events
On Fri, Mar 12, 2021 at 06:33:01PM -0800, Andrii Nakryiko wrote:
> On Fri, Mar 12, 2021 at 6:22 PM Sultan Alsawaf wrote:
> >
> > On Fri, Mar 12, 2021 at 05:31:14PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Mar 12, 2021 at 1:43 PM Sultan Alsawaf wrote:
> > > >
> > > > From: Sultan Alsawaf
> > > >
> > > > We should be using the program fd here, not the perf event fd.
> > >
> > > Why? Can you elaborate on what issue you ran into with the current code?
> >
> > bpf_link__pin() would fail with -EINVAL when using tracepoints, kprobes, or
> > uprobes. The failure would happen inside the kernel, in bpf_link_get_from_fd()
> > right here:
> >
> >	if (f.file->f_op != &bpf_link_fops) {
> >		fdput(f);
> >		return ERR_PTR(-EINVAL);
> >	}
>
> kprobe/tracepoint/perf_event attachments behave like bpf_link (so
> libbpf uses user-space high-level bpf_link APIs for it), but they are
> not bpf_link-based in the kernel. So bpf_link__pin() won't work for
> such types of programs until we actually have bpf_link-backed
> attachment support in the kernel itself. I never got to implementing
> this because we already had auto-detachment properties from perf_event
> FD itself. But it would be nice to have that done as a real bpf_link
> in the kernel (with all the observability, program update,
> force-detach support).
>
> Looking for volunteers to make this happen ;)
>
> > Since bpf wasn't looking for the perf event fd, I swapped it for the
> > program fd and bpf_link__pin() worked.
>
> But you were pinning the BPF program, not a BPF link. Which is not
> what should have happen.

This is the code in question:

	link = bpf_program__attach(prog);
	/* make sure `link` is valid, blah blah... */
	bpf_link__pin(link, some_path);

Are you saying that this usage is incorrect?

Sultan
Re: [PATCH] libbpf: Use the correct fd when attaching to perf events
On Fri, Mar 12, 2021 at 05:31:14PM -0800, Andrii Nakryiko wrote:
> On Fri, Mar 12, 2021 at 1:43 PM Sultan Alsawaf wrote:
> >
> > From: Sultan Alsawaf
> >
> > We should be using the program fd here, not the perf event fd.
>
> Why? Can you elaborate on what issue you ran into with the current code?

bpf_link__pin() would fail with -EINVAL when using tracepoints, kprobes, or
uprobes. The failure would happen inside the kernel, in bpf_link_get_from_fd()
right here:

	if (f.file->f_op != &bpf_link_fops) {
		fdput(f);
		return ERR_PTR(-EINVAL);
	}

Since bpf wasn't looking for the perf event fd, I swapped it for the program fd
and bpf_link__pin() worked.

Sultan
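The kernel-side rejection quoted above can be modeled in plain C. This is a hypothetical userspace sketch, not kernel code: the kernel decides what kind of object an fd refers to by comparing the backing file's f_op pointer against a known ops table, so a perf event fd (or a raw program fd) stored in `link->fd` fails the `&bpf_link_fops` comparison no matter which of the two fds libbpf saved. The struct definitions and the `*_fops` names here are stand-ins for the real kernel objects.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's struct file_operations; only identity matters. */
struct file_operations { int id; };

static const struct file_operations bpf_link_fops   = { 0 };
static const struct file_operations perf_event_fops = { 1 };
static const struct file_operations bpf_prog_fops   = { 2 };

/*
 * Model of the check in bpf_link_get_from_fd(): anything whose file ops
 * are not &bpf_link_fops is rejected with -EINVAL, which is exactly why
 * pinning a perf-event-backed "link" fails regardless of whether the
 * perf event fd or the program fd is stored.
 */
static int link_get_from_fd_like(const struct file_operations *f_op)
{
	if (f_op != &bpf_link_fops)
		return -EINVAL;
	return 0; /* the real code would return the bpf_link object */
}
```

Swapping `pfd` for `prog_fd` in libbpf only changes which non-link fd is handed to this check, which is why the patch "works" by pinning the wrong object rather than fixing the underlying lack of kernel bpf_link support.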
[PATCH] libbpf: Use the correct fd when attaching to perf events
From: Sultan Alsawaf

We should be using the program fd here, not the perf event fd.

Fixes: 63f2f5ee856ba ("libbpf: add ability to attach/detach BPF program to perf event")
Signed-off-by: Sultan Alsawaf
---
 tools/lib/bpf/libbpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index d43cc3f29dae..3d20d57d4af5 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9538,7 +9538,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
 	if (!link)
 		return ERR_PTR(-ENOMEM);
 	link->detach = &bpf_link__detach_perf_event;
-	link->fd = pfd;
+	link->fd = prog_fd;
 
 	if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
 		err = -errno;
--
2.30.2
Re: [PATCH v2 0/4] i2c-hid: Save power by reducing i2c xfers with block reads
On Fri, Oct 16, 2020 at 01:16:18PM +0200, Hans de Goede wrote:
> Hi,
>
> On 9/22/20 11:19 AM, Jiri Kosina wrote:
> > On Wed, 16 Sep 2020, Sultan Alsawaf wrote:
> >
> >> From: Sultan Alsawaf
> >>
> >> This is a fixed resubmission of "[PATCH 0/2] i2c-hid: Save power by
> >> reducing i2c xfers with block reads". That original patchset did not have
> >> enough fixes for the designware i2c adapter's I2C_M_RECV_LEN feature,
> >> which is documented extensively in the original email thread.
> >>
> >> Here is the original cover letter, which still applies:
> >> "I noticed on my Dell Precision 15 5540 with an i9-9880H that simply
> >> putting my finger on the touchpad would increase my system's power
> >> consumption by 4W, which is quite considerable. Resting my finger on the
> >> touchpad would generate roughly 4000 i2c irqs per second, or roughly 20
> >> i2c irqs per touchpad irq.
> >>
> >> Upon closer inspection, I noticed that the i2c-hid driver would always
> >> transfer the maximum report size over i2c (which is 60 bytes for my
> >> touchpad), but all of my touchpad's normal touch events are only 32 bytes
> >> long according to the length byte contained in the buffer sequence.
> >>
> >> Therefore, I was able to save about 2W of power by passing the
> >> I2C_M_RECV_LEN flag in i2c-hid, which says to look for the payload length
> >> in the first byte of the transfer buffer and adjust the i2c transaction
> >> accordingly. The only problem though is that my i2c controller's driver
> >> allows bytes other than the first one to be used to retrieve the payload
> >> length, which is incorrect according to the SMBus spec, and would break
> >> my i2c-hid change since not *all* of the reports from my touchpad are
> >> conforming SMBus block reads.
> >>
> >> This patchset fixes the I2C_M_RECV_LEN behavior in the designware i2c
> >> driver and modifies i2c-hid to use I2C_M_RECV_LEN to save quite a bit of
> >> power. Even if the peripheral controlled by i2c-hid doesn't support block
> >> reads, the i2c controller drivers should cope with this and proceed with
> >> the i2c transfer using the original requested length."
> >>
> >> Sultan
> >>
> >> Sultan Alsawaf (4):
> >>   i2c: designware: Fix transfer failures for invalid SMBus block reads
> >>   i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads
> >>   i2c: designware: Allow SMBus block reads up to 255 bytes in length
> >>   HID: i2c-hid: Use block reads when possible to save power
> >>
> >>  drivers/hid/i2c-hid/i2c-hid-core.c         |  5 ++++-
> >>  drivers/i2c/busses/i2c-designware-master.c | 15 +++++++++------
> >>  2 files changed, 13 insertions(+), 7 deletions(-)
> >
> > Hans, Benjamin, could you please give this patchset some smoke-testing? It
> > looks good to me, but I'd like it to get some testing from your testing
> > machinery before merging.
>
> Sorry for being slow to respond to this. I have not gotten around to testing
> this, but I saw another email that this breaks things on at least AMD
> platforms, so I guess that this is on hold for now ?
>
> Regards,
>
> Hans

Hi,

Only the i2c-hid-core.c hunk is on hold. It is on hold until every i2c adapter
gets updated for proper block read functionality up to the new block read limit
of 255 bytes. This is not really a surprise.

The designware patches are good to go. Please let me know what you think of
them.

Sultan
Re: [PATCH v2 0/4] i2c-hid: Save power by reducing i2c xfers with block reads
On Tue, Sep 22, 2020 at 09:59:44PM +0200, Jiri Kosina wrote:
> On Tue, 22 Sep 2020, Wolfram Sang wrote:
>
> > > Hans, Benjamin, could you please give this patchset some smoke-testing?
> > > It looks good to me, but I'd like it to get some testing from your
> > > testing machinery before merging.
> >
> > Please give me some more days. I am not fully convinced yet that this
> > use of I2C_M_RECV_LEN is not broken on some controllers.
> >
> > Plus, I'd favor if this could go via I2C tree. It is within I2C where
> > the non-trivial changes are. The HID part is just the final bit. Can we
> > agree on that?
>
> Absolutely no problem with that. But I'd like to have this ran through
> Benjamin/Hans first too.
>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs

I suppose the HID part does need to be held off until all the adapters are
updated with functional I2C_M_RECV_LEN bits. I just got a Ryzen laptop which
panics when using I2C_M_RECV_LEN. So it looks like only the designware changes
can be considered for merging now.

Sultan
Re: [PATCH v2 3/4] i2c: designware: Allow SMBus block reads up to 255 bytes in length
On Thu, Sep 17, 2020 at 10:57:04PM +0200, Wolfram Sang wrote:
> On Wed, Sep 16, 2020 at 10:22:55PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf
> >
> > According to the SMBus 3.0 protocol specification, block transfer limits
> > were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte
> > limitation.
>
> Sadly, it is not that easy. We are trying to extend BLOCK_MAX to 255
> (SMBus 3 specs) but there are various things to be considered,
> especially with buffers and when passing it to userspace. Check here for
> the discussion (and you are welcome to join, of course):
>
> http://patchwork.ozlabs.org/project/linux-i2c/list/?submitter=79741&state=*
>
> > Signed-off-by: Sultan Alsawaf
> > ---
> >  drivers/i2c/busses/i2c-designware-master.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
> > index 22f28516bca7..5bd64bd17d94 100644
> > --- a/drivers/i2c/busses/i2c-designware-master.c
> > +++ b/drivers/i2c/busses/i2c-designware-master.c
> > @@ -433,7 +433,7 @@ i2c_dw_read(struct dw_i2c_dev *dev)
> >  		regmap_read(dev->map, DW_IC_DATA_CMD, &tmp);
> >  		if (flags & I2C_M_RECV_LEN) {
> >  			/* Ensure length byte is a valid value */
> > -			if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
> > +			if (tmp > 0)
> >  				len = i2c_dw_recv_len(dev, tmp);
> >  			else
> >  				len = i2c_dw_recv_len(dev, len);
> > --
> > 2.28.0

Yes, it is not that easy to make the change on a global scale. However, in the
case of the designware adapter, it really *is* that easy. This change covers
the designware adapter, and others can follow later with the much more
invasive changes that are needed.

Sultan
[PATCH v2 3/4] i2c: designware: Allow SMBus block reads up to 255 bytes in length
From: Sultan Alsawaf

According to the SMBus 3.0 protocol specification, block transfer limits
were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte
limitation.

Signed-off-by: Sultan Alsawaf
---
 drivers/i2c/busses/i2c-designware-master.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index 22f28516bca7..5bd64bd17d94 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -433,7 +433,7 @@ i2c_dw_read(struct dw_i2c_dev *dev)
 		regmap_read(dev->map, DW_IC_DATA_CMD, &tmp);
 		if (flags & I2C_M_RECV_LEN) {
 			/* Ensure length byte is a valid value */
-			if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+			if (tmp > 0)
 				len = i2c_dw_recv_len(dev, tmp);
 			else
 				len = i2c_dw_recv_len(dev, len);
--
2.28.0
[PATCH v2 2/4] i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads
From: Sultan Alsawaf

The point of adding a byte to len in i2c_dw_recv_len() is to make sure
that tx_buf_len is nonzero, so that i2c_dw_xfer_msg() can let the i2c
controller know that the i2c transaction can end. Otherwise, the i2c
controller will think that the transaction can never end for block
reads, which results in the stop-detection bit never being set and thus
the transaction timing out.

Adding a byte to len is not a reliable way to do this though; sometimes
it lets tx_buf_len become zero, which results in the scenario described
above. Therefore, just directly ensure tx_buf_len cannot be zero to fix
the issue.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read and write")
Signed-off-by: Sultan Alsawaf
---
 drivers/i2c/busses/i2c-designware-master.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index d78f48ca4886..22f28516bca7 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -395,8 +395,9 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 	 * Adjust the buffer length and mask the flag
 	 * after receiving the first byte.
 	 */
-	len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
-	dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
+	if (flags & I2C_CLIENT_PEC)
+		len++;
+	dev->tx_buf_len = len - min_t(u8, len - 1, dev->rx_outstanding);
 	msgs[dev->msg_read_idx].len = len;
 	msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
--
2.28.0
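The arithmetic behind this fix can be shown with a small standalone model. This is a sketch, not driver code: the two helpers mirror the pre-patch and post-patch tx_buf_len computations from i2c_dw_recv_len(), with MIN_U8 standing in for min_t(u8, ...). With the old code, once rx_outstanding reaches len plus the added byte, tx_buf_len comes out as zero, so i2c_dw_xfer_msg() never signals that the transaction can end; the new computation can never produce zero.

```c
#include <assert.h>

/* Stand-in for the kernel's min_t(u8, a, b). */
#define MIN_U8(a, b) \
	((unsigned char)(a) < (unsigned char)(b) ? \
	 (unsigned char)(a) : (unsigned char)(b))

/* Pre-patch bookkeeping: add a byte (two with PEC) and subtract the
 * outstanding reads. If rx_outstanding >= len + 1, this returns 0. */
static unsigned char old_tx_buf_len(unsigned char len, int pec,
				    unsigned char rx_outstanding)
{
	len += pec ? 2 : 1;
	return len - MIN_U8(len, rx_outstanding);
}

/* Patched bookkeeping: clamp the subtrahend to len - 1 so the result
 * is always at least 1. */
static unsigned char new_tx_buf_len(unsigned char len, int pec,
				    unsigned char rx_outstanding)
{
	if (pec)
		len++;
	return len - MIN_U8(len - 1, rx_outstanding);
}
```

For example, a 32-byte block read with 60 reads still outstanding (a max-length request whose length byte turned out to be 32) hits the zero case in the old code but yields 1 in the new code.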
[PATCH v2 1/4] i2c: designware: Fix transfer failures for invalid SMBus block reads
From: Sultan Alsawaf

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

In addition, when the length byte is invalid, the original transfer
length still needs to be adjusted to avoid a controller timeout.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read and write")
Signed-off-by: Sultan Alsawaf
---
 drivers/i2c/busses/i2c-designware-master.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..d78f48ca4886 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -430,10 +430,12 @@ i2c_dw_read(struct dw_i2c_dev *dev)
 		u32 flags = msgs[dev->msg_read_idx].flags;
 
 		regmap_read(dev->map, DW_IC_DATA_CMD, &tmp);
-		/* Ensure length byte is a valid value */
-		if (flags & I2C_M_RECV_LEN &&
-		    tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-			len = i2c_dw_recv_len(dev, tmp);
+		if (flags & I2C_M_RECV_LEN) {
+			/* Ensure length byte is a valid value */
+			if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+				len = i2c_dw_recv_len(dev, tmp);
+			else
+				len = i2c_dw_recv_len(dev, len);
 		}
 		*buf++ = tmp;
 		dev->rx_outstanding--;
--
2.28.0
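The length-byte handling after this patch can be modeled in a few lines of standalone C. This is a simplified sketch, not driver code: it ignores the PEC/extra-byte accounting that i2c_dw_recv_len() also does, and just shows the decision the hunk above makes. Because I2C_M_RECV_LEN is cleared by i2c_dw_recv_len() after the first byte, only the first byte of a message can ever be interpreted as a length, and an out-of-range length byte now falls back to the originally requested transfer length instead of being skipped.

```c
#include <assert.h>

#define I2C_M_RECV_LEN		0x0400	/* real flag value from i2c.h */
#define I2C_SMBUS_BLOCK_MAX	32	/* pre-SMBus-3.0 block limit */

/* Hypothetical helper mirroring the patched branch in i2c_dw_read():
 * given the message flags, the byte just read (tmp), and the originally
 * requested length, return the length the transfer is adjusted to. */
static unsigned char adjusted_len(unsigned short flags, unsigned char tmp,
				  unsigned char requested_len)
{
	if (flags & I2C_M_RECV_LEN) {
		/* Ensure length byte is a valid value */
		if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
			return tmp;		/* conforming block read */
		return requested_len;		/* invalid: keep original */
	}
	return requested_len;			/* not the first byte */
}
```

Before the patch, the invalid-byte path did no adjustment at all, leaving the controller waiting for an end-of-transfer that never came.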
[PATCH v2 0/4] i2c-hid: Save power by reducing i2c xfers with block reads
From: Sultan Alsawaf

This is a fixed resubmission of "[PATCH 0/2] i2c-hid: Save power by reducing
i2c xfers with block reads". That original patchset did not have enough fixes
for the designware i2c adapter's I2C_M_RECV_LEN feature, which is documented
extensively in the original email thread.

Here is the original cover letter, which still applies:

"I noticed on my Dell Precision 15 5540 with an i9-9880H that simply putting
my finger on the touchpad would increase my system's power consumption by 4W,
which is quite considerable. Resting my finger on the touchpad would generate
roughly 4000 i2c irqs per second, or roughly 20 i2c irqs per touchpad irq.

Upon closer inspection, I noticed that the i2c-hid driver would always
transfer the maximum report size over i2c (which is 60 bytes for my touchpad),
but all of my touchpad's normal touch events are only 32 bytes long according
to the length byte contained in the buffer sequence.

Therefore, I was able to save about 2W of power by passing the I2C_M_RECV_LEN
flag in i2c-hid, which says to look for the payload length in the first byte
of the transfer buffer and adjust the i2c transaction accordingly. The only
problem though is that my i2c controller's driver allows bytes other than the
first one to be used to retrieve the payload length, which is incorrect
according to the SMBus spec, and would break my i2c-hid change since not *all*
of the reports from my touchpad are conforming SMBus block reads.

This patchset fixes the I2C_M_RECV_LEN behavior in the designware i2c driver
and modifies i2c-hid to use I2C_M_RECV_LEN to save quite a bit of power. Even
if the peripheral controlled by i2c-hid doesn't support block reads, the i2c
controller drivers should cope with this and proceed with the i2c transfer
using the original requested length."

Sultan

Sultan Alsawaf (4):
  i2c: designware: Fix transfer failures for invalid SMBus block reads
  i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads
  i2c: designware: Allow SMBus block reads up to 255 bytes in length
  HID: i2c-hid: Use block reads when possible to save power

 drivers/hid/i2c-hid/i2c-hid-core.c         |  5 ++++-
 drivers/i2c/busses/i2c-designware-master.c | 15 +++++++++------
 2 files changed, 13 insertions(+), 7 deletions(-)

--
2.28.0
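The savings described in the cover letter come down to how many bytes are clocked per interrupt. This back-of-the-envelope model (a hypothetical helper, not part of the patchset) captures the idea: without I2C_M_RECV_LEN the adapter always reads the maximum report size; with it, the transfer stops after the length announced in the first byte, falling back to the full length when that byte is not a usable block-read length.

```c
#include <assert.h>

/* Model of bytes transferred per touchpad interrupt. A real block read
 * also clocks the length byte itself and possibly a PEC byte; that
 * detail is omitted here for clarity. */
static unsigned int bytes_read(unsigned int max_report_len,
			       unsigned int first_byte_len,
			       int use_recv_len)
{
	if (!use_recv_len || first_byte_len == 0 ||
	    first_byte_len > max_report_len)
		return max_report_len;	/* full-length fallback */
	return first_byte_len;		/* early-terminated transfer */
}
```

With the cover letter's numbers (60-byte max reports, 32-byte touch events), the common-case transfer is roughly halved, which at ~4000 i2c irqs per second is where the measured ~2W comes from.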
[PATCH v2 4/4] HID: i2c-hid: Use block reads when possible to save power
From: Sultan Alsawaf

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Acked-by: Jiri Kosina
Signed-off-by: Sultan Alsawaf
---
 drivers/hid/i2c-hid/i2c-hid-core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
index dbd04492825d..66950f472122 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -476,11 +476,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
 	int ret;
 	u32 ret_size;
 	int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
+	u16 flags;
 
 	if (size > ihid->bufsize)
 		size = ihid->bufsize;
 
-	ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+	/* Try to do a block read if the size fits in one byte */
+	flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
+	ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
 	if (ret != size) {
 		if (ret < 0)
 			return;
--
2.28.0
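The flag choice in the hunk above is small but worth spelling out: a block read announces its payload length in a single byte, so I2C_M_RECV_LEN can only be attempted when the maximum expected size fits in one byte. Factored into a standalone helper for illustration (the helper itself is made up; the flag values are the real ones from i2c.h):

```c
#include <assert.h>

#define I2C_M_RD	0x0001	/* read transfer */
#define I2C_M_RECV_LEN	0x0400	/* first received byte is the length */

/* Mirror of the patched flag selection in i2c_hid_get_input(): only
 * request a length-prefixed (block) read when the max size fits in the
 * one-byte length field. */
static unsigned short read_flags(int size)
{
	return size > 255 ? I2C_M_RD : (I2C_M_RD | I2C_M_RECV_LEN);
}
```

Devices with reports larger than 255 bytes simply keep the old full-length read, which is why the change is safe for them.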
Re: [PATCH v3] i2c: Squash of SMBus block read patchset to save power
On Tue, Sep 15, 2020 at 02:55:48PM +0300, Jarkko Nikula wrote: > Hi > > On 9/14/20 3:15 AM, Sultan Alsawaf wrote: > > From: Sultan Alsawaf > > > > This is a squash of the following: > > > > i2c: designware: Fix transfer failures for invalid SMBus block reads > > > > SMBus block reads can be broken because the read function will just skip > > over bytes it doesn't like until reaching a byte that conforms to the > > length restrictions for block reads. This is problematic when it isn't > > known if the incoming payload is indeed a conforming block read. > > > > According to the SMBus specification, block reads will only send the > > payload length in the first byte, so we can fix this by only considering > > the first byte in a sequence for block read length purposes. > > > > In addition, when the length byte is invalid, the original transfer > > length still needs to be adjusted to avoid a controller timeout. > > > > Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block > > read and write") > > Signed-off-by: Sultan Alsawaf > > > > i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads > > > > The point of adding a byte to len in i2c_dw_recv_len() is to make sure > > that tx_buf_len is nonzero, so that i2c_dw_xfer_msg() can let the i2c > > controller know that the i2c transaction can end. Otherwise, the i2c > > controller will think that the transaction can never end for block > > reads, which results in the stop-detection bit never being set and thus > > the transaction timing out. > > > > Adding a byte to len is not a reliable way to do this though; sometimes > > it lets tx_buf_len become zero, which results in the scenario described > > above. Therefore, just directly ensure tx_buf_len cannot be zero to fix > > the issue. 
> > > > Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block > > read and write") > > Signed-off-by: Sultan Alsawaf > > > > i2c: designware: Allow SMBus block reads up to 255 bytes in length > > > > According to the SMBus 3.0 protocol specification, block transfer limits > > were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte > > limitation. > > > > Signed-off-by: Sultan Alsawaf > > > > HID: i2c-hid: Use block reads when possible to save power > > > > We have no way of knowing how large an incoming payload is going to be, > > so the only strategy available up until now has been to always retrieve > > the maximum possible report length over i2c, which can be quite > > inefficient. For devices that send reports in block read format, the i2c > > controller driver can read the payload length on the fly and terminate > > the i2c transaction early, resulting in considerable power savings. > > > > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the > > touchpad causes psys power readings to go up by about 4W and hover there > > until I remove my finger. With this patch, my psys readings go from 4.7W > > down to 3.1W, yielding about 1.6W in savings. This is because my > > touchpad's max report length is 60 bytes, but all of the regular reports > > it sends for touch events are only 32 bytes, so the i2c transfer is > > roughly halved for the common case. > > > > Signed-off-by: Sultan Alsawaf > > --- > > Hi Jarkko, > > > > Sorry for the delayed response. Life gets in the way of the things that > > really > > matter, like kernel hacking ;) > > > > I fixed the issue with the i2c block reads on 5.8. I've squashed all 4 of > > my i2c > > commits into this email for simplicity; please apply this patch on either > > 5.8 or > > 5.9 (it applies cleanly to both) and let me know if it works with your > > i2c-hid > > touchscreen. If all is well, I will resubmit these patches individually in > > one > > patchset, in a new thread. 
> >
>
> I tested this on top of fc4f28bb3daf ("Merge tag 'for-5.9-rc5-tag' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux") and seems to be
> working fine. What was the key change compared to previous version that was
> regressing for me?

This change fixed your issue (and my issue with 5.8):

--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -395,8 +395,9 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 	 * Adjust the buffer length and mask the flag
 	 * after receiving the first byte.
 	 */
-	len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
-	dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
+	if (flags & I2C_CLIENT_PEC)
+		len++;
+	dev->tx_buf_len = len - min_t(u8, len - 1, dev->rx_outstanding);
 	msgs[dev->msg_read_idx].len = len;
 	msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
Re: [PATCH 1/2] locking/mutex: Don't hog RCU read lock while optimistically spinning
On Mon, Sep 07, 2020 at 05:20:31PM +0100, Will Deacon wrote:
> On Fri, Aug 07, 2020 at 12:16:35PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf
> >
> > There's no reason to hold an RCU read lock the entire time while
> > optimistically spinning for a mutex lock. This can needlessly lengthen
> > RCU grace periods and slow down synchronize_rcu() when it doesn't brute
> > force the RCU grace period via rcupdate.rcu_expedited=1.
>
> Would be good to demonstrate this with numbers if you can.

I could simulate the worst possible case, which would stall synchronize_rcu()
by one jiffy, which could be 10ms with CONFIG_HZ=100. The way that would
happen is when the mutex owner does a lot of non-sleeping work while the lock
is held, and while another CPU tries to acquire the lock. This is a dummy
example of the scenario I have in mind:

	CPU0                    CPU1
	----                    ----
	mutex_lock(locky)
	mdelay(100)             mutex_lock(locky)
	mutex_unlock(locky)

In this case, CPU1 could spin in mutex_lock() for up to a jiffy (until CPU0
releases locky, which won't happen for 100ms, or until CPU1's task needs to
reschedule). While the spinning occurs, the RCU read lock will be held the
whole time, and then synchronize_rcu() will be stalled.

One could argue that most mutex-locked critical sections probably wouldn't
spend so long working on something without scheduling (at least, not
intentionally), but on slower SMP CPUs I suspect that this is common.
> > Signed-off-by: Sultan Alsawaf
> > ---
> >  kernel/locking/mutex.c | 25 +++++++++++++++++--------
> >  1 file changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 5352ce50a97e..cc5676712458 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -552,21 +552,31 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
> >  {
> >  	bool ret = true;
> >
> > -	rcu_read_lock();
> > -	while (__mutex_owner(lock) == owner) {
> > +	for (;;) {
> > +		unsigned int cpu;
> > +		bool same_owner;
> > +
> >  		/*
> > -		 * Ensure we emit the owner->on_cpu, dereference _after_
> > -		 * checking lock->owner still matches owner. If that fails,
> > +		 * Ensure lock->owner still matches owner. If that fails,
> >  		 * owner might point to freed memory. If it still matches,
> >  		 * the rcu_read_lock() ensures the memory stays valid.
> >  		 */
> > -		barrier();
> > +		rcu_read_lock();
> > +		same_owner = __mutex_owner(lock) == owner;
> > +		if (same_owner) {
> > +			ret = owner->on_cpu;
> > +			if (ret)
> > +				cpu = task_cpu(owner);
> > +		}
> > +		rcu_read_unlock();
>
> Are you sure this doesn't break the ww mutex spinning? That thing also goes
> and looks at the owner, but now it's called outside of the read-side
> critical section.

Yes, it's safe because it's not dereferencing the owner pointer.

Sultan
[PATCH v3] i2c: Squash of SMBus block read patchset to save power
From: Sultan Alsawaf This is a squash of the following: i2c: designware: Fix transfer failures for invalid SMBus block reads SMBus block reads can be broken because the read function will just skip over bytes it doesn't like until reaching a byte that conforms to the length restrictions for block reads. This is problematic when it isn't known if the incoming payload is indeed a conforming block read. According to the SMBus specification, block reads will only send the payload length in the first byte, so we can fix this by only considering the first byte in a sequence for block read length purposes. In addition, when the length byte is invalid, the original transfer length still needs to be adjusted to avoid a controller timeout. Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read and write") Signed-off-by: Sultan Alsawaf i2c: designware: Ensure tx_buf_len is nonzero for SMBus block reads The point of adding a byte to len in i2c_dw_recv_len() is to make sure that tx_buf_len is nonzero, so that i2c_dw_xfer_msg() can let the i2c controller know that the i2c transaction can end. Otherwise, the i2c controller will think that the transaction can never end for block reads, which results in the stop-detection bit never being set and thus the transaction timing out. Adding a byte to len is not a reliable way to do this though; sometimes it lets tx_buf_len become zero, which results in the scenario described above. Therefore, just directly ensure tx_buf_len cannot be zero to fix the issue. Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read and write") Signed-off-by: Sultan Alsawaf i2c: designware: Allow SMBus block reads up to 255 bytes in length According to the SMBus 3.0 protocol specification, block transfer limits were increased from 32 bytes to 255 bytes. Remove the obsolete 32-byte limitation. 
Signed-off-by: Sultan Alsawaf HID: i2c-hid: Use block reads when possible to save power We have no way of knowing how large an incoming payload is going to be, so the only strategy available up until now has been to always retrieve the maximum possible report length over i2c, which can be quite inefficient. For devices that send reports in block read format, the i2c controller driver can read the payload length on the fly and terminate the i2c transaction early, resulting in considerable power savings. On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the touchpad causes psys power readings to go up by about 4W and hover there until I remove my finger. With this patch, my psys readings go from 4.7W down to 3.1W, yielding about 1.6W in savings. This is because my touchpad's max report length is 60 bytes, but all of the regular reports it sends for touch events are only 32 bytes, so the i2c transfer is roughly halved for the common case. Signed-off-by: Sultan Alsawaf --- Hi Jarkko, Sorry for the delayed response. Life gets in the way of the things that really matter, like kernel hacking ;) I fixed the issue with the i2c block reads on 5.8. I've squashed all 4 of my i2c commits into this email for simplicity; please apply this patch on either 5.8 or 5.9 (it applies cleanly to both) and let me know if it works with your i2c-hid touchscreen. If all is well, I will resubmit these patches individually in one patchset, in a new thread. 
Thanks,
Sultan

 drivers/hid/i2c-hid/i2c-hid-core.c         |  5 ++++-
 drivers/i2c/busses/i2c-designware-master.c | 15 +++++++++------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
index dbd04492825d..66950f472122 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -476,11 +476,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
 	int ret;
 	u32 ret_size;
 	int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
+	u16 flags;
 
 	if (size > ihid->bufsize)
 		size = ihid->bufsize;
 
-	ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+	/* Try to do a block read if the size fits in one byte */
+	flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
+	ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
 	if (ret != size) {
 		if (ret < 0)
 			return;
diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..5bd64bd17d94 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -395,8 +395,9 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 	 * Adjust the buffer length and mask the flag
 	 * after receiving the first byte.
 	 */
-	len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
-	dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
+	if (flags & I2C_CLIENT_PEC)
+		len++;
+	dev->tx_buf_len = len - min_t(u8, len - 1, dev->rx_outstanding);
 	msgs[dev->msg_read_idx].len = len;
 	msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
@@ -430,10 +431,12 @@ i2c_dw_read(struct dw_i2c_dev *dev)
 		u32 flags = msgs[dev->msg_read_idx].flags;
 
 		regmap_read(dev->map, DW_IC_DATA_CMD, &tmp);
-		/* Ensure length byte is a valid value */
-		if (flags & I2C_M_RECV_LEN &&
-		    tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-			len = i2c_dw_recv_len(dev, tmp);
+		if (flags & I2C_M_RECV_LEN) {
+			/* Ensure length byte is a valid value */
+			if (tmp > 0)
+				len = i2c_dw_recv_len(dev, tmp);
+			else
+				len = i2c_dw_recv_len(dev, len);
 		}
 		*buf++ = tmp;
 		dev->rx_outstanding--;
--
2.28.0
Re: [PATCH] x86/msr: do not warn on writes to OC_MAILBOX
On Tue, Sep 08, 2020 at 08:01:12PM +0200, Borislav Petkov wrote:
> On Tue, Sep 08, 2020 at 07:42:12PM +0200, Jason A. Donenfeld wrote:
> > Are you prepared to track down all the MSRs that might maybe do
> > something naughty?
>
> I'm not prepared - that's why this MSR filtering. To block *all* direct
> MSR accesses from userspace in the future.
>
> > Does `dd` warn when you run `dd if=/dev/zero of=/dev/sda`?
>
> Yah, because that's the same as bricking your hardware. Geez.
>
> > Probably not possible. Optimal values are related to the "silicon
> > lottery" that occurs when you buy a new CPU. Different optimal values
> > for different individual chips.
>
> Let's wait for what Srinivas finds out. I'd let Intel decide what they
> wanna do.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

I'd like to point out that on Intel's recent 14nm parts, undervolting is
not so much for squeezing every last drop of performance out of the SoC
as it is for necessity. I have an i9-9880H laptop which, at 4000 MHz on
all CPUs while compiling a kernel, uses 107W with the stock
configuration. Upon undervolting, this figure goes down to 88W, a ~22%
reduction in power consumption at the maximum pstate.

The point of this is not primarily to reduce the power footprint of my
machine, but to get the thermals of this CPU under control. Without
undervolting, my i9-9880H is quite useless, and quickly reaches 100 degC
with the fans at max speed when placed under load.

The 22% reduction in power consumption also makes it clear that these
parts use very heavy-handed voltages out-of-the-box. I consider this to
be a defect in the part; I've never encountered any other CPUs that run,
by default, at a voltage so high that I could achieve a 22% reduction in
power consumption by undervolting. I've tested multiple 14nm Intel CPUs,
8th gen and above, and they all exhibit this behavior.

On my i5-8265U laptop, compiling a kernel causes the CPU to hit its 35W
power limit and run at about 3300 MHz instead of its maximum all-core
frequency of 3700 MHz. After undervolting, when compiling a kernel, my
i5-8265U runs at 3700 MHz and uses 27W; not only was power consumption
reduced by ~30%, but also performance was improved. Not to mention, my
i5-8265U laptop's thermal design cannot dissipate 35W for very long, but
it can sustain 27W without reaching 100 degC quite well.

If Intel's recent 14nm CPUs didn't guzzle power, then we would have no
need to undervolt. But they do, and here we are.

Sultan
[PATCH 1/2] locking/mutex: Don't hog RCU read lock while optimistically spinning
From: Sultan Alsawaf

There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a mutex lock. This can needlessly lengthen
RCU grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.

Signed-off-by: Sultan Alsawaf
---
 kernel/locking/mutex.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5352ce50a97e..cc5676712458 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -552,21 +552,31 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 {
 	bool ret = true;
 
-	rcu_read_lock();
-	while (__mutex_owner(lock) == owner) {
+	for (;;) {
+		unsigned int cpu;
+		bool same_owner;
+
 		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking lock->owner still matches owner. If that fails,
+		 * Ensure lock->owner still matches owner. If that fails,
 		 * owner might point to freed memory. If it still matches,
 		 * the rcu_read_lock() ensures the memory stays valid.
 		 */
-		barrier();
+		rcu_read_lock();
+		same_owner = __mutex_owner(lock) == owner;
+		if (same_owner) {
+			ret = owner->on_cpu;
+			if (ret)
+				cpu = task_cpu(owner);
+		}
+		rcu_read_unlock();
+
+		if (!ret || !same_owner)
+			break;
 
 		/*
 		 * Use vcpu_is_preempted to detect lock holder preemption issue.
 		 */
-		if (!owner->on_cpu || need_resched() ||
-		    vcpu_is_preempted(task_cpu(owner))) {
+		if (need_resched() || vcpu_is_preempted(cpu)) {
 			ret = false;
 			break;
 		}
@@ -578,7 +588,6 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 		cpu_relax();
 	}
 
-	rcu_read_unlock();
 	return ret;
 }
-- 
2.28.0
[PATCH 2/2] locking/rwsem: Don't hog RCU read lock while optimistically spinning
From: Sultan Alsawaf

There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a rwsem. This can needlessly lengthen RCU
grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.

Signed-off-by: Sultan Alsawaf
---
 kernel/locking/rwsem.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f11b9bd3431d..a1e3ceb254d1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -723,8 +723,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 	if (state != OWNER_WRITER)
 		return state;
 
-	rcu_read_lock();
 	for (;;) {
+		bool same_owner;
+
+		rcu_read_lock();
 		/*
 		 * When a waiting writer set the handoff flag, it may spin
 		 * on the owner as well. Once that writer acquires the lock,
@@ -732,27 +734,32 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 		 * handoff bit is set.
 		 */
 		new = rwsem_owner_flags(sem, &new_flags);
-		if ((new != owner) || (new_flags != flags)) {
-			state = rwsem_owner_state(new, new_flags, nonspinnable);
-			break;
-		}
 
 		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking sem->owner still matches owner, if that fails,
+		 * Ensure sem->owner still matches owner. If that fails,
 		 * owner might point to free()d memory, if it still matches,
 		 * the rcu_read_lock() ensures the memory stays valid.
 		 */
-		barrier();
+		same_owner = new == owner && new_flags == flags;
+		if (same_owner && !owner_on_cpu(owner))
+			state = OWNER_NONSPINNABLE;
+		rcu_read_unlock();
 
-		if (need_resched() || !owner_on_cpu(owner)) {
+		if (!same_owner) {
+			state = rwsem_owner_state(new, new_flags, nonspinnable);
+			break;
+		}
+
+		if (state == OWNER_NONSPINNABLE)
+			break;
+
+		if (need_resched()) {
 			state = OWNER_NONSPINNABLE;
 			break;
 		}
 
 		cpu_relax();
 	}
 
-	rcu_read_unlock();
 	return state;
 }
-- 
2.28.0
Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power
On Wed, Jul 01, 2020 at 11:04:01AM +0300, Jarkko Nikula wrote:
> On 6/29/20 8:43 PM, Sultan Alsawaf wrote:
> > Hmm, for some reason in 5.8 I get the same problem, but 5.7 is fine. Could you
> > try this on 5.7 and see if it works?
> >
> > In the meantime I'll bisect 5.8 to see why it's causing problems for me...
> >
> I see the same issue on top of v5.7:

Try reverting my "i2c: designware: Only check the first byte for SMBus
block read length" patch and apply the following change instead:

--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -394,10 +394,12 @@ i2c_dw_read(dev)
 		u32 flags = msgs[dev->msg_read_idx].flags;
 
 		*buf = dw_readl(dev, DW_IC_DATA_CMD);
-		/* Ensure length byte is a valid value */
-		if (flags & I2C_M_RECV_LEN &&
-		    *buf <= I2C_SMBUS_BLOCK_MAX && *buf > 0) {
-			len = i2c_dw_recv_len(dev, *buf);
+		if (flags & I2C_M_RECV_LEN) {
+			/* Ensure length byte is a valid value */
+			if (*buf <= I2C_SMBUS_BLOCK_MAX && *buf > 0)
+				len = i2c_dw_recv_len(dev, *buf);
+			else
+				len = i2c_dw_recv_len(dev, len);
 		}
 		buf++;
 		dev->rx_outstanding--;
Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power
On Wed, Jun 17, 2020 at 02:17:19PM +0300, Jarkko Nikula wrote: > On 6/16/20 6:49 PM, Sultan Alsawaf wrote: > > From: Sultan Alsawaf > > > > We have no way of knowing how large an incoming payload is going to be, > > so the only strategy available up until now has been to always retrieve > > the maximum possible report length over i2c, which can be quite > > inefficient. For devices that send reports in block read format, the i2c > > controller driver can read the payload length on the fly and terminate > > the i2c transaction early, resulting in considerable power savings. > > > > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the > > touchpad causes psys power readings to go up by about 4W and hover there > > until I remove my finger. With this patch, my psys readings go from 4.7W > > down to 3.1W, yielding about 1.6W in savings. This is because my > > touchpad's max report length is 60 bytes, but all of the regular reports > > it sends for touch events are only 32 bytes, so the i2c transfer is > > roughly halved for the common case. > > > > Signed-off-by: Sultan Alsawaf > > --- > > Jarkko, could you try this? > > drivers/hid/i2c-hid/i2c-hid-core.c | 5 - > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c > > b/drivers/hid/i2c-hid/i2c-hid-core.c > > index 294c84e136d7..739dccfc57e1 100644 > > --- a/drivers/hid/i2c-hid/i2c-hid-core.c > > +++ b/drivers/hid/i2c-hid/i2c-hid-core.c > > @@ -472,11 +472,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid) > > int ret; > > u32 ret_size; > > int size = le16_to_cpu(ihid->hdesc.wMaxInputLength); > > + u16 flags; > > if (size > ihid->bufsize) > > size = ihid->bufsize; > > - ret = i2c_master_recv(ihid->client, ihid->inbuf, size); > > + /* Try to do a block read if the size fits in one byte */ > > + flags = size > 255 ? 
I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN; > > + ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags); > > if (ret != size) { > > if (ret < 0) > > return; > > This still causes a regression for me. Hmm, for some reason in 5.8 I get the same problem, but 5.7 is fine. Could you try this on 5.7 and see if it works? In the meantime I'll bisect 5.8 to see why it's causing problems for me... Sultan
Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power
On Tue, Jun 16, 2020 at 09:02:54PM +0300, Andi Shyti wrote: > Hi Sultan, > > > > > > so the only strategy available up until now has been to always > > > > > retrieve > > > > > the maximum possible report length over i2c, which can be quite > > > > > inefficient. For devices that send reports in block read format, the > > > > > i2c > > > > > controller driver can read the payload length on the fly and terminate > > > > > the i2c transaction early, resulting in considerable power savings. > > > > > > > > > > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the > > > > > touchpad causes psys power readings to go up by about 4W and hover > > > > > there > > > > > until I remove my finger. With this patch, my psys readings go from > > > > > 4.7W > > > > > down to 3.1W, yielding about 1.6W in savings. This is because my > > > > > touchpad's max report length is 60 bytes, but all of the regular > > > > > reports > > > > > it sends for touch events are only 32 bytes, so the i2c transfer is > > > > > roughly halved for the common case. > > > > > > > > > + /* Try to do a block read if the size fits in one byte */ > > > > > + flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN; > > > > > > > > AFAIR SMBus specification tells about 256. Why 255? > > > > > > > > Andi, am I correct? > > > > > > Actually the SMBUS 3.0 protocol from 2015[*] says 255: > > > > > > " > > > D.6 255 Bytes in Process Call > > > > > > The maximum number of bytes allowed in the Block Write-Block Read > > > Process Call (Section 6.5.8) was increased from 32 to 255. > > > " > > > > > > But why does it matter... I see the patch is detatching itself > > > from smbus. > > > > > > And, actually, I wonder if this is the right way to fix it, isn't > > > it better to fix smbus instead? 
> > > > I think the best solution would be to modify the i2c api to allow passing > > in a > > function pointer and a payload size length, to specify how to interpret the > > size > > of the incoming payload, so the adapter could handle both the HID over i2c > > transfer spec and SMBus block reads without needing to read more bytes than > > needed. > > Can't you do that by specifying the xfer function? > > When you use smbus_read/write in block or byte or whatever, smbus > always checks if there is an xfer function specified and uses > that. > > If it's not specified it uses the default smbus functions with > the limitations that come with it. The xfer functions are specified on a per-adapter basis. In the case of i2c-hid, we need to tell the adapter to interpret the payload size in a specific way, which I *think* is only specific to HID over i2c (i.e., using 16 bits to store the length and then checking it for device quirks). > > For example, for an SMBus block read, the payload size is specified in the > > first > > byte and it is limited to 32 bytes. However, for HID over i2c, the payload > > size > > is specified in the first two bytes, and there are also some device quirks > > involved to reinterpret the reported size. > > which is wrong. The 32 bytes limitation is outdated: in the link > that I gave before (i.e. this one [*]), the new SMBUS specifies > 255 maximum for read/write block. Oops. But still, for SMBus block reads, the size is limited to 8 bits. For HID over i2c, it can be 16 bits. I don't see how we can handle this without some api cooperation to tell the adapter what the caller is expecting to see. > > A nice solution would be to pass in how many bytes the i2c payload size can > > contain, as well as a function pointer to evaluate the reported payload > > size in > > a way that the caller wants. 
This would require modifying every i2c adapter > > driver to add this functionality, but it would fix the efficiency problem > > faced > > by i2c-hid and perhaps others. > > > > > I have a patch ready that fixes the smbus transfer size, perhaps > > > I should rebase, test and send it. > > > > For the i2c-hid driver? > > No, sorry, for smbus. > > Now... here you are replacing "i2c_master_recv" with > "i2c_transfer_buffer_flags". I do not really like this change, > although I understand it's necessary, because we are bypassing > the real issue that is that the smbus implementation is outdated. > > I have a patch for that that for a matter of time I never sent. Can it handle block reads (8 bit size) and HID over i2c (16 bit size)? Sultan
Re: [PATCH v2] HID: i2c-hid: Use block reads when possible to save power
On Tue, Jun 16, 2020 at 08:18:54PM +0300, Andi Shyti wrote: > Hi Andy, > > > > so the only strategy available up until now has been to always retrieve > > > the maximum possible report length over i2c, which can be quite > > > inefficient. For devices that send reports in block read format, the i2c > > > controller driver can read the payload length on the fly and terminate > > > the i2c transaction early, resulting in considerable power savings. > > > > > > On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the > > > touchpad causes psys power readings to go up by about 4W and hover there > > > until I remove my finger. With this patch, my psys readings go from 4.7W > > > down to 3.1W, yielding about 1.6W in savings. This is because my > > > touchpad's max report length is 60 bytes, but all of the regular reports > > > it sends for touch events are only 32 bytes, so the i2c transfer is > > > roughly halved for the common case. > > > > > + /* Try to do a block read if the size fits in one byte */ > > > + flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN; > > > > AFAIR SMBus specification tells about 256. Why 255? > > > > Andi, am I correct? > > Actually the SMBUS 3.0 protocol from 2015[*] says 255: > > " > D.6 255 Bytes in Process Call > > The maximum number of bytes allowed in the Block Write-Block Read > Process Call (Section 6.5.8) was increased from 32 to 255. > " > > But why does it matter... I see the patch is detatching itself > from smbus. > > And, actually, I wonder if this is the right way to fix it, isn't > it better to fix smbus instead? I think the best solution would be to modify the i2c api to allow passing in a function pointer and a payload size length, to specify how to interpret the size of the incoming payload, so the adapter could handle both the HID over i2c transfer spec and SMBus block reads without needing to read more bytes than needed. 
For example, for an SMBus block read, the payload size is specified in the first byte and it is limited to 32 bytes. However, for HID over i2c, the payload size is specified in the first two bytes, and there are also some device quirks involved to reinterpret the reported size. A nice solution would be to pass in how many bytes the i2c payload size can contain, as well as a function pointer to evaluate the reported payload size in a way that the caller wants. This would require modifying every i2c adapter driver to add this functionality, but it would fix the efficiency problem faced by i2c-hid and perhaps others. > I have a patch ready that fixes the smbus transfer size, perhaps > I should rebase, test and send it. For the i2c-hid driver? Sultan
[PATCH v2] HID: i2c-hid: Use block reads when possible to save power
From: Sultan Alsawaf

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Signed-off-by: Sultan Alsawaf
---
Jarkko, could you try this?

 drivers/hid/i2c-hid/i2c-hid-core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
index 294c84e136d7..739dccfc57e1 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -472,11 +472,14 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
 	int ret;
 	u32 ret_size;
 	int size = le16_to_cpu(ihid->hdesc.wMaxInputLength);
+	u16 flags;
 
 	if (size > ihid->bufsize)
 		size = ihid->bufsize;
 
-	ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+	/* Try to do a block read if the size fits in one byte */
+	flags = size > 255 ? I2C_M_RD : I2C_M_RD | I2C_M_RECV_LEN;
+	ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size, flags);
 	if (ret != size) {
 		if (ret < 0)
 			return;
-- 
2.27.0
[PATCH v2] i2c: designware: Only check the first byte for SMBus block read length
From: Sultan Alsawaf

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read and write")
Signed-off-by: Sultan Alsawaf
---
 drivers/i2c/busses/i2c-designware-master.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..d22271438869 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -391,14 +391,10 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 	struct i2c_msg *msgs = dev->msgs;
 	u32 flags = msgs[dev->msg_read_idx].flags;
 
-	/*
-	 * Adjust the buffer length and mask the flag
-	 * after receiving the first byte.
-	 */
+	/* Adjust the buffer length */
 	len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
 	dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
 	msgs[dev->msg_read_idx].len = len;
-	msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 
 	return len;
 }
@@ -430,10 +426,12 @@ i2c_dw_read(struct dw_i2c_dev *dev)
 		u32 flags = msgs[dev->msg_read_idx].flags;
 
 		regmap_read(dev->map, DW_IC_DATA_CMD, &tmp);
-		/* Ensure length byte is a valid value */
-		if (flags & I2C_M_RECV_LEN &&
-		    tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-			len = i2c_dw_recv_len(dev, tmp);
+		if (flags & I2C_M_RECV_LEN) {
+			/* Ensure length byte is a valid value */
+			if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+				len = i2c_dw_recv_len(dev, tmp);
+			/* Mask the flag after receiving the first byte */
+			msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 		}
 		*buf++ = tmp;
 		dev->rx_outstanding--;
-- 
2.27.0
Re: [PATCH 1/2] i2c: designware: Only check the first byte for SMBus block read length
On Mon, Jun 15, 2020 at 07:07:42PM +0300, Andy Shevchenko wrote: > On Mon, Jun 15, 2020 at 7:06 PM Sultan Alsawaf wrote: > > On Mon, Jun 15, 2020 at 12:40:19PM +0300, Andy Shevchenko wrote: > > > On Sun, Jun 14, 2020 at 02:02:54PM -0700, Sultan Alsawaf wrote: > > > > From: Sultan Alsawaf > > > > > > > > SMBus block reads can be broken because the read function will just skip > > > > over bytes it doesn't like until reaching a byte that conforms to the > > > > length restrictions for block reads. This is problematic when it isn't > > > > known if the incoming payload is indeed a conforming block read. > > > > > > > > According to the SMBus specification, block reads will only send the > > > > payload length in the first byte, so we can fix this by only considering > > > > the first byte in a sequence for block read length purposes. > > > > > > I'm wondering if this overlaps with [1]. AFAIU that one is also makes > > > sure that > > > the length is not a garbage. > > > > > > [1]: > > > https://lore.kernel.org/linux-i2c/20200613104109.2989-1-m...@mansr.com/T/#u > > > > No overlap. > > Thanks for clarifying. > > > That looks like a similar bug for a different driver. In my case, > > the adapter provides native SMBus support, so emulation is never used. This > > is > > clear to see by looking at i2c_transfer_buffer_flags(), which only uses the > > master_xfer functions provided by the adapter; it doesn't call the emulation > > path at all. > > But do we get an advantage if this can be done in the i2c core instead > (once for all)? We can't, because the adapter driver needs to know mid-transfer to look for the payload length in the first byte, and then alter the transfer size on-the-fly. That can't be done in the i2c core, sadly. The problem is that we don't know if a transfer is going to be a block read or not beforehand. And altering the transfer size mid-transfer is definitely a controller specific task. Sultan
Re: [PATCH 1/2] i2c: designware: Only check the first byte for SMBus block read length
On Mon, Jun 15, 2020 at 12:40:19PM +0300, Andy Shevchenko wrote: > On Sun, Jun 14, 2020 at 02:02:54PM -0700, Sultan Alsawaf wrote: > > From: Sultan Alsawaf > > > > SMBus block reads can be broken because the read function will just skip > > over bytes it doesn't like until reaching a byte that conforms to the > > length restrictions for block reads. This is problematic when it isn't > > known if the incoming payload is indeed a conforming block read. > > > > According to the SMBus specification, block reads will only send the > > payload length in the first byte, so we can fix this by only considering > > the first byte in a sequence for block read length purposes. > > I'm wondering if this overlaps with [1]. AFAIU that one is also makes sure > that > the length is not a garbage. > > [1]: > https://lore.kernel.org/linux-i2c/20200613104109.2989-1-m...@mansr.com/T/#u No overlap. That looks like a similar bug for a different driver. In my case, the adapter provides native SMBus support, so emulation is never used. This is clear to see by looking at i2c_transfer_buffer_flags(), which only uses the master_xfer functions provided by the adapter; it doesn't call the emulation path at all. Sultan
[PATCH 0/2] i2c-hid: Save power by reducing i2c xfers with block reads
From: Sultan Alsawaf

Hi,

I noticed on my Dell Precision 15 5540 with an i9-9880H that simply
putting my finger on the touchpad would increase my system's power
consumption by 4W, which is quite considerable. Resting my finger on the
touchpad would generate roughly 4000 i2c irqs per second, or roughly 20
i2c irqs per touchpad irq.

Upon closer inspection, I noticed that the i2c-hid driver would always
transfer the maximum report size over i2c (which is 60 bytes for my
touchpad), but all of my touchpad's normal touch events are only 32
bytes long according to the length byte contained in the buffer
sequence. Therefore, I was able to save about 2W of power by passing the
I2C_M_RECV_LEN flag in i2c-hid, which says to look for the payload
length in the first byte of the transfer buffer and adjust the i2c
transaction accordingly.

The only problem though is that my i2c controller's driver allows bytes
other than the first one to be used to retrieve the payload length,
which is incorrect according to the SMBus spec, and would break my
i2c-hid change since not *all* of the reports from my touchpad are
conforming SMBus block reads.

This patchset fixes the I2C_M_RECV_LEN behavior in the designware i2c
driver and modifies i2c-hid to use I2C_M_RECV_LEN to save quite a bit of
power. Even if the peripheral controlled by i2c-hid doesn't support
block reads, the i2c controller drivers should cope with this and
proceed with the i2c transfer using the original requested length.

Sultan

Sultan Alsawaf (2):
  i2c: designware: Only check the first byte for SMBus block read length
  HID: i2c-hid: Use block reads when possible to save power

 drivers/hid/i2c-hid/i2c-hid-core.c         |  3 ++-
 drivers/i2c/busses/i2c-designware-master.c | 10 +-
 2 files changed, 7 insertions(+), 6 deletions(-)

-- 
2.27.0
[PATCH 2/2] HID: i2c-hid: Use block reads when possible to save power
From: Sultan Alsawaf

We have no way of knowing how large an incoming payload is going to be,
so the only strategy available up until now has been to always retrieve
the maximum possible report length over i2c, which can be quite
inefficient. For devices that send reports in block read format, the i2c
controller driver can read the payload length on the fly and terminate
the i2c transaction early, resulting in considerable power savings.

On a Dell Precision 15 5540 with an i9-9880H, resting my finger on the
touchpad causes psys power readings to go up by about 4W and hover there
until I remove my finger. With this patch, my psys readings go from 4.7W
down to 3.1W, yielding about 1.6W in savings. This is because my
touchpad's max report length is 60 bytes, but all of the regular reports
it sends for touch events are only 32 bytes, so the i2c transfer is
roughly halved for the common case.

Signed-off-by: Sultan Alsawaf
---
 drivers/hid/i2c-hid/i2c-hid-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
index 294c84e136d7..4b507de48d70 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -476,7 +476,8 @@ static void i2c_hid_get_input(struct i2c_hid *ihid)
 	if (size > ihid->bufsize)
 		size = ihid->bufsize;
 
-	ret = i2c_master_recv(ihid->client, ihid->inbuf, size);
+	ret = i2c_transfer_buffer_flags(ihid->client, ihid->inbuf, size,
+					I2C_M_RD | I2C_M_RECV_LEN);
 	if (ret != size) {
 		if (ret < 0)
 			return;
-- 
2.27.0
[PATCH 1/2] i2c: designware: Only check the first byte for SMBus block read length
From: Sultan Alsawaf

SMBus block reads can be broken because the read function will just skip
over bytes it doesn't like until reaching a byte that conforms to the
length restrictions for block reads. This is problematic when it isn't
known if the incoming payload is indeed a conforming block read.

According to the SMBus specification, block reads will only send the
payload length in the first byte, so we can fix this by only considering
the first byte in a sequence for block read length purposes.

Fixes: c3ae106050b9 ("i2c: designware: Implement support for SMBus block read and write")
Signed-off-by: Sultan Alsawaf
---
 drivers/i2c/busses/i2c-designware-master.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index d6425ad6e6a3..16d38b8fc19a 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -398,7 +398,6 @@ i2c_dw_recv_len(struct dw_i2c_dev *dev, u8 len)
 	len += (flags & I2C_CLIENT_PEC) ? 2 : 1;
 	dev->tx_buf_len = len - min_t(u8, len, dev->rx_outstanding);
 	msgs[dev->msg_read_idx].len = len;
-	msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 
 	return len;
 }
@@ -430,10 +429,11 @@ i2c_dw_read(struct dw_i2c_dev *dev)
 		u32 flags = msgs[dev->msg_read_idx].flags;
 
 		regmap_read(dev->map, DW_IC_DATA_CMD, &tmp);
-		/* Ensure length byte is a valid value */
-		if (flags & I2C_M_RECV_LEN &&
-		    tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0) {
-			len = i2c_dw_recv_len(dev, tmp);
+		if (flags & I2C_M_RECV_LEN) {
+			/* Ensure length byte is a valid value */
+			if (tmp <= I2C_SMBUS_BLOCK_MAX && tmp > 0)
+				len = i2c_dw_recv_len(dev, tmp);
+			msgs[dev->msg_read_idx].flags &= ~I2C_M_RECV_LEN;
 		}
 		*buf++ = tmp;
 		dev->rx_outstanding--;
-- 
2.27.0
[PATCH v2] locking/Kconfig: Don't forcefully uninline spin_unlock for PREEMPT
From: Sultan Alsawaf

This change was originally done in 2005 without any justification in
commit bda98685b855 ("[PATCH] x86: inline spin_unlock if
!CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT"). Perhaps the reasoning at
the time was that PREEMPT was still considered unstable and needed extra
debugging; however, this is no longer the case, so remove the artificial
limitation.

Signed-off-by: Sultan Alsawaf
---
 kernel/Kconfig.locks   | 10 +-
 kernel/Kconfig.preempt |  1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3de8fd11873b..2d992f5c6ec3 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -139,7 +139,7 @@ config INLINE_SPIN_UNLOCK_BH
 
 config INLINE_SPIN_UNLOCK_IRQ
 	def_bool y
-	depends on !PREEMPTION || ARCH_INLINE_SPIN_UNLOCK_IRQ
+	depends on ARCH_INLINE_SPIN_UNLOCK_IRQ
 
 config INLINE_SPIN_UNLOCK_IRQRESTORE
 	def_bool y
@@ -168,7 +168,7 @@ config INLINE_READ_LOCK_IRQSAVE
 
 config INLINE_READ_UNLOCK
 	def_bool y
-	depends on !PREEMPTION || ARCH_INLINE_READ_UNLOCK
+	depends on ARCH_INLINE_READ_UNLOCK
 
 config INLINE_READ_UNLOCK_BH
 	def_bool y
@@ -176,7 +176,7 @@ config INLINE_READ_UNLOCK_BH
 
 config INLINE_READ_UNLOCK_IRQ
 	def_bool y
-	depends on !PREEMPTION || ARCH_INLINE_READ_UNLOCK_IRQ
+	depends on ARCH_INLINE_READ_UNLOCK_IRQ
 
 config INLINE_READ_UNLOCK_IRQRESTORE
 	def_bool y
@@ -205,7 +205,7 @@ config INLINE_WRITE_LOCK_IRQSAVE
 
 config INLINE_WRITE_UNLOCK
 	def_bool y
-	depends on !PREEMPTION || ARCH_INLINE_WRITE_UNLOCK
+	depends on ARCH_INLINE_WRITE_UNLOCK
 
 config INLINE_WRITE_UNLOCK_BH
 	def_bool y
@@ -213,7 +213,7 @@ config INLINE_WRITE_UNLOCK_BH
 
 config INLINE_WRITE_UNLOCK_IRQ
 	def_bool y
-	depends on !PREEMPTION || ARCH_INLINE_WRITE_UNLOCK_IRQ
+	depends on ARCH_INLINE_WRITE_UNLOCK_IRQ
 
 config INLINE_WRITE_UNLOCK_IRQRESTORE
 	def_bool y
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..5a9e0409c844 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -39,7 +39,6 @@ config PREEMPT
 	bool "Preemptible Kernel (Low-Latency Desktop)"
 	depends on !ARCH_NO_PREEMPT
 	select PREEMPTION
-	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
 	help
 	  This option reduces the latency of the kernel by making
 	  all kernel code (that is not executing in a critical section)
-- 
2.26.2
[PATCH] locking/Kconfig: Don't forcefully uninline spin_unlock for PREEMPT
From: Sultan Alsawaf

This change was originally done in 2005 without any justification in
commit bda98685b855 ("[PATCH] x86: inline spin_unlock if
!CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT"). Perhaps the reasoning at
the time was that PREEMPT was still considered unstable and needed extra
debugging; however, this is no longer the case, so remove the artificial
limitation.

Signed-off-by: Sultan Alsawaf
---
 kernel/Kconfig.preempt | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..5a9e0409c844 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -39,7 +39,6 @@ config PREEMPT
 	bool "Preemptible Kernel (Low-Latency Desktop)"
 	depends on !ARCH_NO_PREEMPT
 	select PREEMPTION
-	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
 	help
 	  This option reduces the latency of the kernel by making
 	  all kernel code (that is not executing in a critical section)
-- 
2.26.2
[PATCH] drm/i915: Don't enable WaIncreaseLatencyIPCEnabled when IPC is disabled
From: Sultan Alsawaf

In commit 5a7d202b1574, a logical AND was erroneously changed to an OR,
causing WaIncreaseLatencyIPCEnabled to be enabled unconditionally for
kabylake and coffeelake, even when IPC is disabled. Fix the logic so
that WaIncreaseLatencyIPCEnabled is only used when IPC is enabled.

Fixes: 5a7d202b1574 ("drm/i915: Drop WaIncreaseLatencyIPCEnabled/1140 for cnl")
Cc: sta...@vger.kernel.org # 5.3.x+
Signed-off-by: Sultan Alsawaf
---
 drivers/gpu/drm/i915/intel_pm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 8375054ba27d..a52986a9e7a6 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4992,7 +4992,7 @@ static void skl_compute_plane_wm(const struct intel_crtc_state *crtc_state,
 	 * WaIncreaseLatencyIPCEnabled: kbl,cfl
 	 * Display WA #1141: kbl,cfl
 	 */
-	if ((IS_KABYLAKE(dev_priv) || IS_COFFEELAKE(dev_priv)) ||
+	if ((IS_KABYLAKE(dev_priv) || IS_COFFEELAKE(dev_priv)) &&
 	    dev_priv->ipc_enabled)
 		latency += 4;
-- 
2.26.2
Re: [PATCH] rt-tests: backfire: Don't include asm/uaccess.h directly
On Mon, Sep 16, 2019 at 11:57:32PM +0200, John Kacur wrote:
> Signed-off-by: John Kacur
>
> But please in the future
> 1. Don't cc lkml on this
> 2. Include the maintainers in your patch

Hi,

Thanks for the sign-off. I was following the instructions listed here:
https://wiki.linuxfoundation.org/realtime/communication/send_rt_patches

I couldn't find any documentation of how to send patches for rt-tests.
Is there a different set of patch submission instructions on a wiki
somewhere I missed?

Thanks,
Sultan
[PATCH] rt-tests: backfire: Don't include asm/uaccess.h directly
From: Sultan Alsawaf

Architecture-specific uaccess.h headers can have dependencies on
linux/uaccess.h (i.e., VERIFY_WRITE), so it cannot be included
directly. Since linux/uaccess.h includes asm/uaccess.h, just do that
instead. This fixes compile errors with certain kernels and
architectures.

Signed-off-by: Sultan Alsawaf
---
 src/backfire/backfire.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backfire/backfire.c b/src/backfire/backfire.c
index aaf9c4a..a8ac9f5 100644
--- a/src/backfire/backfire.c
+++ b/src/backfire/backfire.c
@@ -30,8 +30,8 @@
 #include
 #include
 #include
+#include <linux/uaccess.h>
 
-#include <asm/uaccess.h>
 #include
 
 #define BACKFIRE_MINOR MISC_DYNAMIC_MINOR
-- 
2.23.0
Re: [PATCH] mbcache: Speed up cache entry creation
On Tue, Jul 23, 2019 at 10:56:05AM -0600, Andreas Dilger wrote:
> Do you have any kind of performance metrics that show this is an actual
> improvement in performance? This would be either macro-level benchmarks
> (e.g. fio, but this seems unlikely to show any benefit), or micro-level
> measurements (e.g. flame graph) that show a net reduction in CPU cycles,
> lock contention, etc. in this part of the code.

Hi Andreas,

Here are some basic micro-benchmark results:

Before:
[3.162896] mb_cache_entry_create: AVG cycles: 75
[3.054701] mb_cache_entry_create: AVG cycles: 78
[3.152321] mb_cache_entry_create: AVG cycles: 77

After:
[3.043380] mb_cache_entry_create: AVG cycles: 68
[3.194321] mb_cache_entry_create: AVG cycles: 71
[3.038100] mb_cache_entry_create: AVG cycles: 69

The performance difference is probably more drastic when free memory is
low, since an unnecessary call to kmem_cache_alloc() can result in a
long wait for pages to be freed.

The micro-benchmark code is attached.

Thanks,
Sultan

---
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 289f3664061e..e0f22ff8fab8 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -82,7 +82,7 @@ static inline struct mb_bucket *mb_cache_entry_bucket(struct mb_cache *cache,
  * -EBUSY if entry with the same key and value already exists in cache.
  * Otherwise 0 is returned.
  */
-int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
+static int __mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 			  u64 value, bool reusable)
 {
 	struct mb_cache_entry *entry, *dup;
@@ -148,6 +148,29 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 	return 0;
 }
 
+int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
+			  u64 value, bool reusable)
+{
+	static unsigned long count, sum;
+	static DEFINE_MUTEX(lock);
+	volatile cycles_t start, delta;
+	int ret;
+
+	mutex_lock(&lock);
+	local_irq_disable();
+	start = get_cycles();
+	ret = __mb_cache_entry_create(cache, mask, key, value, reusable);
+	delta = get_cycles() - start;
+	local_irq_enable();
+
+	sum += delta;
+	if (++count == 1000)
+		printk("%s: AVG cycles: %lu\n", __func__, sum / count);
+	mutex_unlock(&lock);
+
+	return ret;
+}
 EXPORT_SYMBOL(mb_cache_entry_create);
 
 void __mb_cache_entry_free(struct mb_cache_entry *entry)
[PATCH] mbcache: Speed up cache entry creation
From: Sultan Alsawaf

In order to prevent redundant entry creation by racing against itself,
mb_cache_entry_create scans through a hash-list of all current entries
in order to see if another allocation for the requested new entry has
been made. Furthermore, it allocates memory for a new entry before
scanning through this hash-list, which results in that allocated memory
being discarded when the requested new entry is already present.

Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. And don't allocate memory for a new
entry until it's known that the allocated memory will be used.

Signed-off-by: Sultan Alsawaf
---
 fs/mbcache.c | 82
 1 file changed, 57 insertions(+), 25 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 97c54d3a2227..289f3664061e 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -25,9 +25,14 @@
  * size hash table is used for fast key lookups.
  */
 
+struct mb_bucket {
+	struct hlist_bl_head hash;
+	struct list_head req_list;
+};
+
 struct mb_cache {
 	/* Hash table of entries */
-	struct hlist_bl_head	*c_hash;
+	struct mb_bucket	*c_bucket;
 	/* log2 of hash table size */
 	int			c_bucket_bits;
 	/* Maximum entries in cache to avoid degrading hash too much */
@@ -42,15 +47,21 @@ struct mb_cache {
 	struct work_struct	c_shrink_work;
 };
 
+struct mb_cache_req {
+	struct list_head lnode;
+	u32 key;
+	u64 value;
+};
+
 static struct kmem_cache *mb_entry_cache;
 
 static unsigned long mb_cache_shrink(struct mb_cache *cache,
 				     unsigned long nr_to_scan);
 
-static inline struct hlist_bl_head *mb_cache_entry_head(struct mb_cache *cache,
-							u32 key)
+static inline struct mb_bucket *mb_cache_entry_bucket(struct mb_cache *cache,
+						      u32 key)
 {
-	return &cache->c_hash[hash_32(key, cache->c_bucket_bits)];
+	return &cache->c_bucket[hash_32(key, cache->c_bucket_bits)];
 }
 
 /*
@@ -77,6 +88,8 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 	struct mb_cache_entry *entry, *dup;
 	struct hlist_bl_node *dup_node;
 	struct hlist_bl_head *head;
+	struct mb_cache_req *tmp_req, req;
+	struct mb_bucket *bucket;
 
 	/* Schedule background reclaim if there are too many entries */
 	if (cache->c_entry_count >= cache->c_max_entries)
@@ -85,9 +98,33 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 	if (cache->c_entry_count >= 2*cache->c_max_entries)
 		mb_cache_shrink(cache, SYNC_SHRINK_BATCH);
 
+	bucket = mb_cache_entry_bucket(cache, key);
+	head = &bucket->hash;
+	hlist_bl_lock(head);
+	list_for_each_entry(tmp_req, &bucket->req_list, lnode) {
+		if (tmp_req->key == key && tmp_req->value == value) {
+			hlist_bl_unlock(head);
+			return -EBUSY;
+		}
+	}
+	hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
+		if (dup->e_key == key && dup->e_value == value) {
+			hlist_bl_unlock(head);
+			return -EBUSY;
+		}
+	}
+	req.key = key;
+	req.value = value;
+	list_add(&req.lnode, &bucket->req_list);
+	hlist_bl_unlock(head);
+
 	entry = kmem_cache_alloc(mb_entry_cache, mask);
-	if (!entry)
+	if (!entry) {
+		hlist_bl_lock(head);
+		list_del(&req.lnode);
+		hlist_bl_unlock(head);
 		return -ENOMEM;
+	}
 
 	INIT_LIST_HEAD(&entry->e_list);
 	/* One ref for hash, one ref returned */
@@ -96,15 +133,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
 	entry->e_value = value;
 	entry->e_reusable = reusable;
 	entry->e_referenced = 0;
-	head = mb_cache_entry_head(cache, key);
+
 	hlist_bl_lock(head);
-	hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
-		if (dup->e_key == key && dup->e_value == value) {
-			hlist_bl_unlock(head);
-			kmem_cache_free(mb_entry_cache, entry);
-			return -EBUSY;
-		}
-	}
+	list_del(&req.lnode);
 	hlist_bl_add_head(&entry->e_hash_list, head);
 	hlist_bl_unlock(head);
 
@@ -133,7 +164,7 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
 	struct hlist_bl_node *node;
 	struct hlist_bl_head *head;
 -
[PATCH] scatterlist: Don't allocate sg lists using __get_free_page
From: Sultan Alsawaf

Allocating pages with __get_free_page is slower than going through the
slab allocator to grab free pages out from a pool. These are the
results from running the code at the bottom of this message:

[1.278602] speedtest: __get_free_page: 9 us
[1.278606] speedtest: kmalloc: 4 us
[1.278609] speedtest: kmem_cache_alloc: 4 us
[1.278611] speedtest: vmalloc: 13 us

kmalloc and kmem_cache_alloc (which is what kmalloc uses for common
sizes behind the scenes) are the fastest choices. Use kmalloc to speed
up sg list allocation.

This is the code used to produce the above measurements:

#include
#include
#include

static int speedtest(void *data)
{
	static const struct sched_param sched_max_rt_prio = {
		.sched_priority = MAX_RT_PRIO - 1
	};
	volatile s64 ctotal = 0, gtotal = 0, ktotal = 0, vtotal = 0;
	struct kmem_cache *page_pool;
	int i, j, trials = 1000;
	volatile ktime_t start;
	void *ptr[100];

	sched_setscheduler_nocheck(current, SCHED_FIFO, &sched_max_rt_prio);

	page_pool = kmem_cache_create("pages", PAGE_SIZE, PAGE_SIZE,
				      SLAB_PANIC, NULL);
	for (i = 0; i < trials; i++) {
		start = ktime_get();
		for (j = 0; j < ARRAY_SIZE(ptr); j++)
			while (!(ptr[j] = kmem_cache_alloc(page_pool, GFP_KERNEL)));
		ctotal += ktime_us_delta(ktime_get(), start);
		for (j = 0; j < ARRAY_SIZE(ptr); j++)
			kmem_cache_free(page_pool, ptr[j]);

		start = ktime_get();
		for (j = 0; j < ARRAY_SIZE(ptr); j++)
			while (!(ptr[j] = (void *)__get_free_page(GFP_KERNEL)));
		gtotal += ktime_us_delta(ktime_get(), start);
		for (j = 0; j < ARRAY_SIZE(ptr); j++)
			free_page((unsigned long)ptr[j]);

		start = ktime_get();
		for (j = 0; j < ARRAY_SIZE(ptr); j++)
			while (!(ptr[j] = kmalloc(PAGE_SIZE, GFP_KERNEL)));
		ktotal += ktime_us_delta(ktime_get(), start);
		for (j = 0; j < ARRAY_SIZE(ptr); j++)
			kfree(ptr[j]);

		start = ktime_get();
		*ptr = vmalloc(ARRAY_SIZE(ptr) * PAGE_SIZE);
		vtotal += ktime_us_delta(ktime_get(), start);
		vfree(*ptr);
	}
	kmem_cache_destroy(page_pool);

	printk("%s: __get_free_page: %lld us\n", __func__, gtotal / trials);
	printk("%s: kmalloc: %lld us\n", __func__, ktotal / trials);
	printk("%s: kmem_cache_alloc: %lld us\n", __func__, ctotal / trials);
	printk("%s: vmalloc: %lld us\n", __func__, vtotal / trials);

	complete(data);
	return 0;
}

static int __init start_test(void)
{
	DECLARE_COMPLETION_ONSTACK(done);

	BUG_ON(IS_ERR(kthread_run(speedtest, &done, "malloc_test")));
	wait_for_completion(&done);
	return 0;
}
late_initcall(start_test);

Signed-off-by: Sultan Alsawaf
---
 lib/scatterlist.c | 23 ++-
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c2cf2c311b7d..d3093e8f8978 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -148,31 +148,12 @@ EXPORT_SYMBOL(sg_init_one);
  */
 static struct scatterlist *sg_kmalloc(unsigned int nents, gfp_t gfp_mask)
 {
-	if (nents == SG_MAX_SINGLE_ALLOC) {
-		/*
-		 * Kmemleak doesn't track page allocations as they are not
-		 * commonly used (in a raw form) for kernel data structures.
-		 * As we chain together a list of pages and then a normal
-		 * kmalloc (tracked by kmemleak), in order to for that last
-		 * allocation not to become decoupled (and thus a
-		 * false-positive) we need to inform kmemleak of all the
-		 * intermediate allocations.
-		 */
-		void *ptr = (void *) __get_free_page(gfp_mask);
-		kmemleak_alloc(ptr, PAGE_SIZE, 1, gfp_mask);
-		return ptr;
-	} else
-		return kmalloc_array(nents, sizeof(struct scatterlist),
-				     gfp_mask);
+	return kmalloc_array(nents, sizeof(struct scatterlist), gfp_mask);
 }
 
 static void sg_kfree(struct scatterlist *sg, unsigned int nents)
 {
-	if (nents == SG_MAX_SINGLE_ALLOC) {
-		kmemleak_free(sg);
-		free_page((unsigned long) sg);
-	} else
-		kfree(sg);
+	kfree(sg);
 }
 
 /**
-- 
2.22.0
Re: [PATCH] scatterlist: Allocate a contiguous array instead of chaining
On Fri, Jul 12, 2019 at 09:06:40AM +0200, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Ming Lei wrote:
> > vmalloc() may sleep, so it is impossible to be called in atomic context.
>
> Allocations from atomic context should be avoided wherever possible and you
> really have to have a very convincing argument why an atomic allocation is
> absolutely necessary. I cleaned up quite some GFP_ATOMIC users over the
> last couple of years and all of them were doing it for the very wrong
> reasons and mostly just to silence the warning which is triggered with
> GFP_KERNEL when called from a non-sleepable context.
>
> So I suggest to audit all call sites first and figure out whether they
> really must use GFP_ATOMIC and if possible clean them up, remove the GFP
> argument and then do the vmalloc thing on top.

Hello Thomas and Ming,

It looks like the following call sites are atomic:
drivers/crypto/qce/ablkcipher.c:92:	ret = sg_alloc_table(&rctx->dst_tbl, rctx->dst_nents, gfp);
drivers/crypto/ccp/ccp-crypto-aes-cmac.c:110:	ret = sg_alloc_table(&rctx->data_sg, sg_count, gfp);
drivers/crypto/ccp/ccp-crypto-sha.c:103:	ret = sg_alloc_table(&rctx->data_sg, sg_count, gfp);
drivers/spi/spi-pl022.c:1035:	ret = sg_alloc_table(&pl022->sgt_rx, pages, GFP_ATOMIC);
drivers/spi/spi-pl022.c:1039:	ret = sg_alloc_table(&pl022->sgt_tx, pages, GFP_ATOMIC);

The crypto ones are conditionally made atomic depending on the presence
of CRYPTO_TFM_REQ_MAY_SLEEP. Additionally, the following allocation
could be problematic with kvmalloc:
net/ceph/crypto.c:180:	ret = sg_alloc_table(sgt, chunk_cnt, GFP_NOFS);

This is a snippet from kvmalloc:
	/*
	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
	 * so the given set of flags has to be compatible.
	 */
	if ((flags & GFP_KERNEL) != GFP_KERNEL)
		return kmalloc_node(size, flags, node);

Use of GFP_NOFS in net/ceph/crypto.c would cause kvmalloc to fall back
to kmalloc_node, which could cause problems if the allocation size is
too large for kmalloc_node to reasonably accommodate.

Also, it looks like the vmalloc family doesn't have kvmalloc's
GFP_KERNEL check. Is this intentional, or does vmalloc really not
require GFP_KERNEL context?

Thanks,
Sultan
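The fallback Sultan is pointing at can be sketched in plain userspace C. The flag bits below are illustrative stand-ins, not the real kernel constants; only the relationship matters: GFP_NOFS and GFP_ATOMIC are both missing bits that GFP_KERNEL has, so the subset test fails.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's gfp bits -- the real values
 * differ, but the structure is the same: GFP_KERNEL is a union of
 * reclaim/IO/FS bits, and the restricted contexts drop some of them. */
#define MOCK___GFP_IO      0x1u
#define MOCK___GFP_FS      0x2u
#define MOCK___GFP_RECLAIM 0x4u
#define MOCK_GFP_KERNEL (MOCK___GFP_RECLAIM | MOCK___GFP_IO | MOCK___GFP_FS)
#define MOCK_GFP_NOFS   (MOCK___GFP_RECLAIM | MOCK___GFP_IO)
#define MOCK_GFP_ATOMIC 0x8u

/* Mirrors the quoted kvmalloc check: any flag set that is not a
 * superset of GFP_KERNEL cannot reach vmalloc, so the allocation
 * silently degrades to kmalloc_node(). */
static int kvmalloc_falls_back_to_kmalloc(unsigned int flags)
{
	return (flags & MOCK_GFP_KERNEL) != MOCK_GFP_KERNEL;
}
```

So a GFP_NOFS caller like net/ceph/crypto.c never gets the vmalloc path at all, which is exactly the large-allocation concern raised above.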
[PATCH] scatterlist: Allocate a contiguous array instead of chaining
From: Sultan Alsawaf

Typically, drivers allocate sg lists of sizes up to a few MiB in size.
The current algorithm deals with large sg lists by splitting them into
several smaller arrays and chaining them together. But if the sg list
allocation is large, and we know the size ahead of time, sg chaining is
both inefficient and unnecessary.

Rather than calling kmalloc hundreds of times in a loop for chaining
tiny arrays, we can simply do it all at once with kvmalloc, which has
the proper tradeoff on when to stop using kmalloc and instead use
vmalloc. Abusing repeated kmallocs to produce a large allocation puts
strain on the slab allocator, when kvmalloc can be used instead. The
single kvmalloc allocation for all sg lists reduces the burden on the
slab and page allocators, since for large sg list allocations, this
commit replaces numerous kmalloc calls with one kvmalloc call.

The sg chaining is effectively disabled by changing
SG_MAX_SINGLE_ALLOC to UINT_MAX, which causes sg list allocations to
never be split into chains, since no allocation is larger than
UINT_MAX. We then plumb kvmalloc into the allocation functions so that
it is used.

Signed-off-by: Sultan Alsawaf
---
 include/linux/scatterlist.h |  2 +-
 lib/scatterlist.c           | 23 ++-
 2 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6eec50fb36c8..e2e26c53c441 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -310,7 +310,7 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
  */
-#define SG_MAX_SINGLE_ALLOC	(PAGE_SIZE / sizeof(struct scatterlist))
+#define SG_MAX_SINGLE_ALLOC	(UINT_MAX)
 
 /*
  * The maximum number of SG segments that we will put inside a
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c2cf2c311b7d..bf76854a34aa 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -148,31 +148,12 @@ EXPORT_SYMBOL(sg_init_one);
  */
 static struct scatterlist *sg_kmalloc(unsigned int nents, gfp_t gfp_mask)
 {
-	if (nents == SG_MAX_SINGLE_ALLOC) {
-		/*
-		 * Kmemleak doesn't track page allocations as they are not
-		 * commonly used (in a raw form) for kernel data structures.
-		 * As we chain together a list of pages and then a normal
-		 * kmalloc (tracked by kmemleak), in order to for that last
-		 * allocation not to become decoupled (and thus a
-		 * false-positive) we need to inform kmemleak of all the
-		 * intermediate allocations.
-		 */
-		void *ptr = (void *) __get_free_page(gfp_mask);
-		kmemleak_alloc(ptr, PAGE_SIZE, 1, gfp_mask);
-		return ptr;
-	} else
-		return kmalloc_array(nents, sizeof(struct scatterlist),
-				     gfp_mask);
+	return kvmalloc_array(nents, sizeof(struct scatterlist), gfp_mask);
 }
 
 static void sg_kfree(struct scatterlist *sg, unsigned int nents)
 {
-	if (nents == SG_MAX_SINGLE_ALLOC) {
-		kmemleak_free(sg);
-		free_page((unsigned long) sg);
-	} else
-		kfree(sg);
+	kvfree(sg);
 }
 
 /**
-- 
2.22.0
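To make the chaining cost in this patch concrete, here is a hedged userspace sketch of how many separate array allocations the old PAGE_SIZE-based limit implies versus the new UINT_MAX one. The 4 KiB page and 32-byte scatterlist entry sizes are assumptions for illustration, as is the rule that every chunk except the last gives up one entry to the chain link.

```c
#include <limits.h>

/* Assumed sizes, for illustration only. */
#define MOCK_PAGE_SIZE 4096u
#define MOCK_SG_SIZE   32u
#define OLD_SG_MAX_SINGLE_ALLOC (MOCK_PAGE_SIZE / MOCK_SG_SIZE) /* 128 */

/* Number of separate array allocations needed for nents entries when
 * at most max_per_alloc entries fit in one piece; each chunk except
 * the last loses one usable entry to the chain pointer. */
static unsigned int sg_alloc_count(unsigned int nents,
				   unsigned int max_per_alloc)
{
	unsigned int chunks = 0;

	while (nents > max_per_alloc) {
		nents -= max_per_alloc - 1;
		chunks++;
	}
	return chunks + 1;
}
```

Under these assumptions a 10,000-entry list needs dozens of kmalloc calls with the old limit, but always exactly one kvmalloc with SG_MAX_SINGLE_ALLOC raised to UINT_MAX, which is the tradeoff the commit message describes.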
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Wed, May 15, 2019 at 02:32:48PM -0400, Steven Rostedt wrote:
> I'm confused why you did this?

Oleg said that debug_locks_off() could've been called and thus
prevented lockdep complaints about simple_lmk from appearing. To
eliminate any possibility of that, I disabled debug_locks_off().

Oleg also said that __lock_acquire() could return early if lock
debugging were somehow turned off after lockdep reported one bug. To
mitigate any possibility of that as well, I threw in the BUG_ON() for
good measure.

I think at this point it's pretty clear that lockdep truly isn't
complaining about simple_lmk's locking pattern, and that lockdep's lack
of complaints isn't due to it being mysteriously turned off...

Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 14, 2019 at 12:44:53PM -0400, Steven Rostedt wrote:
> OK, this has gotten my attention.
>
> This thread is quite long, do you have a git repo I can look at, and
> also where is the first task_lock() taken before the
> find_lock_task_mm()?
>
> -- Steve

Hi Steve,

This is the git repo I work on:
https://github.com/kerneltoast/android_kernel_google_wahoo

With the newest simple_lmk iteration being this commit:
https://github.com/kerneltoast/android_kernel_google_wahoo/commit/6b145b8c28b39f7047393169117f72ea7387d91c

This repo is based off the 4.4 kernel that Google ships on the Pixel
2/2XL.

simple_lmk iterates through the entire task list more than once and
locks potential victims using find_lock_task_mm(). It keeps these
potential victims locked across the multiple times that the task list
is iterated.

The locking pattern that Oleg said should cause lockdep to complain is
that iterating through the entire task list more than once can lead to
locking the same task that was locked earlier with
find_lock_task_mm(), and thus deadlock. But there is a check in
simple_lmk that avoids locking potential victims that were already
found, which avoids the deadlock, but lockdep doesn't know about the
check (which is done with vtsk_is_duplicate()) and should therefore
complain.

Lockdep does not complain though.

Sultan
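The duplicate check described in this message can be sketched in isolation. The struct and names below are simplified stand-ins for the actual driver code, which operates on task_struct pointers:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified victim record: only the task pointer matters here. */
struct victim_info {
	const void *tsk;
};

/* Before locking a candidate task, scan the victims already locked in
 * earlier passes over the task list and skip any task that is still
 * held. This is the check that prevents taking the same task lock
 * twice across iterations -- a guarantee lockdep cannot see, which is
 * why a complaint was expected. */
static bool vtsk_is_duplicate(const struct victim_info *varr, int vlen,
			      const void *tsk)
{
	int i;

	for (i = 0; i < vlen; i++) {
		if (varr[i].tsk == tsk)
			return true;
	}
	return false;
}
```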
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
to 1876:3090
[ 58.736761] binder: undelivered TRANSACTION_COMPLETE
[ 58.736766] binder: undelivered transaction 58752, process died.
[ 58.762060] simple_lmk: Killing ndroid.settings with adj 700 to free 74708 KiB
[ 58.763238] simple_lmk: Killing roid.apps.turbo with adj 700 to free 54660 KiB
[ 58.863590] binder: release 1876:3089 transaction 58117 out, still active
[ 58.863606] binder: undelivered TRANSACTION_COMPLETE
[ 58.873337] simple_lmk: Killing d.process.acore with adj 700 to free 48540 KiB
[ 58.873513] simple_lmk: Killing d.process.media with adj 500 to free 46188 KiB
[ 58.873713] simple_lmk: Killing putmethod.latin with adj 200 to free 67304 KiB
[ 59.014046] simple_lmk: Killing android.vending with adj 201 to free 54900 KiB
[ 59.017623] simple_lmk: Killing rak.mibandtools with adj 201 to free 44552 KiB
[ 59.018423] simple_lmk: Killing eport.trebuchet with adj 100 to free 106208 KiB
[ 59.028460] binder: 1206:1206 transaction failed 29189/-22, size 100-0 line 3052
[ 59.142592] binder_alloc: 2814: binder_alloc_buf, no vma
[ 59.142620] binder: 1206:1218 transaction failed 29189/-3, size 76-0 line 3189
[ 59.223633] simple_lmk: Killing id.printspooler with adj 900 to free 39664 KiB
[ 59.223762] simple_lmk: Killing gle.android.gms with adj 100 to free 64176 KiB
[ 59.540176] binder: undelivered transaction 59447, process died.
[ 59.540763] binder: undelivered transaction 59446, process died.
[ 59.815404] binder: 1206:3140 transaction failed 29189/0, size 12-0 line 2992
[ 59.815418] binder: send failed reply for transaction 58117, target dead
[ 60.977609] binder: 2105:2308 transaction failed 29189/-22, size 168-0 line 3052
[ 63.040202] FG: fg_get_battery_temp: batt temperature original:350, tuned:309
[ 63.040219] lge_battery: bm_watch_work: PRESENT:1, CHG_STAT:1, THM_STAT:2, BAT_TEMP:309, BAT_VOLT:4148182, VOTE_CUR:100, SET_CUR:100
[ 63.076086] msm-dwc3 a80.ssusb: Avail curr from USB = 2
[ 63.077278] PMI: smblib_handle_switcher_power_ok: Weak charger detected: voting 500mA ICL
[ 63.080014] PMI: smblib_handle_switcher_power_ok: Reverse boost detected: voting 0mA to suspend input
[ 63.081886] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.093639] fts_touch 5-0049: [FTS] Received Charger Disconnected Event
[ 63.104656] healthd: battery l=100 v=4148 t=31.0 h=2 st=3 c=748 fc=3229000 cc=289 chg=
[ 63.122546] healthd: battery l=100 v=4148 t=31.0 h=2 st=3 c=748 fc=3229000 cc=289 chg=
[ 63.135620] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.156383] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.156897] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.160481] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.185029] healthd: battery l=100 v=4148 t=31.0 h=2 st=3 c=748 fc=3229000 cc=289 chg=
[ 63.189015] healthd: battery l=100 v=4148 t=31.0 h=2 st=3 c=748 fc=3229000 cc=289 chg=
[ 63.212484] lge_battery: bm_check_status: wake_unlocked: present[0] chg_state[2] vbus[0]
[ 63.213039] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.231096] FG: fg_get_battery_temp: batt temperature original:350, tuned:310
[ 63.233981] healthd: battery l=100 v=4148 t=31.0 h=2 st=3 c=748 fc=3229000 cc=289 chg=
[ 63.234663] msm-dwc3 a80.ssusb: dwc3_msm_suspend: Calling suspend 1996
[ 63.249755] msm-dwc3 a80.ssusb: DWC3 in low power mode
[ 63.250247] healthd: battery l=100 v=4148 t=31.0 h=2 st=3 c=748 fc=3229000 cc=289 chg=
[ 63.250430] android_work: sent uevent USB_STATE=DISCONNECTED
[ 63.294456] msm-dwc3 a80.ssusb: Avail curr from USB = 0
[ 70.492114] binder: undelivered transaction 86938, process died.
[ 70.955204] simple_lmk: Killing .apps.translate with adj 906 to free 47564 KiB
[ 70.955633] simple_lmk: Killing cloudprint:sync with adj 906 to free 31932 KiB
[ 70.955787] simple_lmk: Killing oid.apps.photos with adj 904 to free 50204 KiB
[ 71.060789] simple_lmk: Killing ecamera.android with adj 906 to free 32232 KiB
[ 71.061074] simple_lmk: Killing webview_service with adj 906 to free 26028 KiB
[ 71.061199] simple_lmk: Killing com.whatsapp with adj 904 to free 49484 KiB
[ 71.164996] binder: undelivered transaction 87881, process died.
[ 71.190625] simple_lmk: Killing rbandroid.sleep with adj 906 to free 54724 KiB
[ 71.190775] simple_lmk: Killing android.vending with adj 906 to free 39848 KiB
[ 71.191303] simple_lmk: Killing m.facebook.orca with adj 904 to free 72296 KiB
[ 71.342042] simple_lmk: Killing droid.deskclock with adj 902 to free 49284 KiB
[ 71.342240] simple_lmk: Killing .apps.wellbeing with adj 900 to free 47632 KiB
[ 71.342529] simple_lmk: Killing le.modemservice with adj 800 to free 33648 KiB
[ 71.482391] simple_lmk: Killing d.process.media with adj 800 to free 40676 KiB
[ 71.482511] simple_lmk: Killing rdog.challegram with adj 700 to free 71920 KiB

No lockdep warnings!

Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, May 09, 2019 at 05:56:46PM +0200, Oleg Nesterov wrote:
> Impossible ;) I bet lockdep should report the deadlock as soon as find_victims()
> calls find_lock_task_mm() when you already have a locked victim.

I hope you're not a betting man ;)

With the following configured:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
# CONFIG_DEBUG_SPINLOCK_BITE_ON_BUG is not set
CONFIG_DEBUG_SPINLOCK_PANIC_ON_BUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set

And a printk added in vtsk_is_duplicate() to print when a duplicate is
detected, and my phone's memory cut in half to make simple_lmk do
something, this is what I observed:

taimen:/ # dmesg | grep lockdep
[0.00] 	RCU lockdep checking is enabled.
taimen:/ # dmesg | grep simple_lmk
[ 23.211091] simple_lmk: Killing android.carrier with adj 906 to free 37420 kiB
[ 23.211160] simple_lmk: Killing oadcastreceiver with adj 906 to free 36784 kiB
[ 23.248457] simple_lmk: Killing .apps.translate with adj 904 to free 45884 kiB
[ 23.248720] simple_lmk: Killing ndroid.settings with adj 904 to free 42868 kiB
[ 23.313417] simple_lmk: DUPLICATE VTSK!
[ 23.313440] simple_lmk: Killing ndroid.keychain with adj 906 to free 33680 kiB
[ 23.313513] simple_lmk: Killing com.whatsapp with adj 904 to free 51436 kiB
[ 34.646695] simple_lmk: DUPLICATE VTSK!
[ 34.646717] simple_lmk: Killing ndroid.apps.gcs with adj 906 to free 37956 kiB
[ 34.646792] simple_lmk: Killing droid.apps.maps with adj 904 to free 63600 kiB
taimen:/ # dmesg | grep lockdep
[0.00] 	RCU lockdep checking is enabled.
taimen:/ #

> As for
> https://github.com/kerneltoast/android_kernel_google_wahoo/commit/afc8c9bf2dbde95941253c168d1adb64cfa2e3ad
>
> Well,
>
> 	mmdrop(mm);
> 	simple_lmk_mm_freed(mm);
>
> looks racy because mmdrop(mm) can free this mm_struct. Yes, simple_lmk_mm_freed()
> does not dereference this pointer, but the same memory can be re-allocated as
> another ->mm for the new task which can be found by find_victims(), and _in theory_
> this all can happen in between, so the "victims[i].mm == mm" can be false positive.
>
> And this also means that simple_lmk_mm_freed() should clear victims[i].mm when
> it detects "victims[i].mm == mm", otherwise we have the same theoretical race,
> victims_to_kill is only cleared when the last victim goes away.

Fair point. Putting simple_lmk_mm_freed(mm) right before mmdrop(mm),
and sprinkling in a cmpxchg in simple_lmk_mm_freed() should fix that
up.

> Another nit... you can drop tasklist_lock right after the 1st "find_victims" loop.

True!

> And it seems that you do not really need to walk the "victims" array twice after that,
> you can do everything in a single loop, but this is cosmetic.

Won't this result in potentially holding the task lock way longer than
necessary for multiple tasks that aren't going to be killed?

Thanks,
Sultan
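The cmpxchg fix discussed here can be sketched with C11 atomics in userspace. The function name and slot representation are hypothetical; the real driver would use the kernel's cmpxchg() on its victims array:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One saved victim mm slot. A compare-and-swap clears the slot only
 * if it still holds the mm being freed, so the slot can never match
 * twice -- the clearing itself closes the window in which a recycled
 * mm_struct at the same address could be spuriously matched later. */
static bool victim_mm_freed(_Atomic(void *) *slot, void *mm)
{
	void *expected = mm;

	return atomic_compare_exchange_strong(slot, &expected, NULL);
}
```

Called right before mmdrop(), this also addresses the ordering half of Oleg's objection: the slot is cleared while the mm_struct is still guaranteed to be alive.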
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 12:58:27PM +0200, Christian Brauner wrote:
> This is work that is ongoing and requires kernel changes to make it
> feasible. One of the things that I have been working on for quite a
> while is the whole file descriptor for processes thing that is important
> for LMKD (Even though I never thought about this use-case when I started
> pitching this.). Joel and Daniel have joined in and are working on
> making LMKD possible.
>
> What I find odd is that every couple of weeks different solutions to the
> low memory problem are pitched. There is simple_lkml, there is LMKD, and
> there was a patchset that wanted to speed up memory reclaim at process
> kill-time by adding a new flag to the new pidfd_send_signal() syscall.
> That all seems - though related - rather uncoordinated. Now granted,
> coordinated is usually not how kernel development necessarily works but
> it would probably be good to have some sort of direction and from what I
> have seen LMKD seems to be the most coordinated effort. But that might
> just be my impression.

LMKD is just Android's userspace low-memory-killer daemon. It's been
around for a while. This patch (simple_lmk) is meant to serve as an
immediate solution for the devices that'll never see a single line of
PSI code running on them, which amounts to... well, most Android
devices currently in existence. I'm more of a cowboy who made this
patch after waiting a few years for memory management improvements on
Android that never happened. Though it looks like it's going to happen
soon(ish?) for super new devices that'll have the privilege of
shipping with PSI in use.

On Tue, May 07, 2019 at 01:09:21PM +0200, Greg Kroah-Hartman wrote:
> > It's even more odd that although a userspace solution is touted as the proper
> > way to go on LKML, almost no Android OEMs are using it, and even in that commit
> > I linked in the previous message, Google made a rather large set of
> > modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
> > What's going on?
>
> "almost no"? Again, Android Go is doing that, right?

I'd check for myself, but I can't seem to find kernel source for an
Android Go device... This seems more confusing though. Why would the
ultra-low-end devices use LMKD while other devices use the broken
lowmemorykiller driver?

> > Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845.
>
> Qualcomm should never be used as an example of a company that has any
> idea of what to do in their kernel :)

Agreed, but nearly all OEMs that use Qualcomm chipsets roll with
Qualcomm's kernel decisions, so Qualcomm has a bit of influence here.

> > If PSI were backported to 4.4, or even 3.18, would it really be used?
>
> Why wouldn't it, if it worked properly?

For the same mysterious reason that Qualcomm and others cling to
lowmemorykiller, I presume. This is part of what's been confusing me
for quite some time...

> Please see the work that went into PSI and the patches around it.
> There's also a lwn.net article last week about the further work ongoing
> in this area. With all of that, you should see how in-kernel memory
> killers are NOT the way to go.
>
> > Regardless, even if PSI were backported, a full-fledged LMKD using it has yet to
> > be made, so it wouldn't be of much use now.
>
> "LMKD"? Again, PSI is in the 4.9.y android-common tree, so the
> userspace side should be in AOSP, right?

LMKD as in Android's low-memory-killer daemon. It is in AOSP, but it
looks like it's still a work in progress.

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:28:47AM -0700, Suren Baghdasaryan wrote:
> Hi Sultan,
> Looks like you are posting this patch for devices that do not use
> userspace LMKD solution due to them using older kernels or due to
> their vendors sticking to in-kernel solution. If so, I see couple
> logistical issues with this patch. I don't see it being adopted in
> upstream kernel 5.x since it re-implements a deprecated mechanism even
> though vendors still use it. Vendors on the other hand, will not adopt
> it until you show evidence that it works way better than what
> lowmemorykilled driver does now. You would have to provide measurable
> data and explain your tests before they would consider spending time
> on this.

Yes, this is mostly for the devices already produced that are forced to
suffer with poor memory management. I can't even convince vendors to
fix kernel memory leaks, so there's no way I'd be able to convince them
of trying this patch, especially since it seems like you're having
trouble convincing vendors to stop using lowmemorykiller in the first
place. And thankfully, convincing vendors isn't my job :)

> On the implementation side I'm not convinced at all that this would
> work better on all devices and in all circumstances. We had cases when
> a new mechanism would show very good results until one usecase
> completely broke it. Bulk killing of processes that you are doing in
> your patch was a very good example of such a decision which later on
> we had to rethink. That's why baking these policies into kernel is
> very problematic. Another problem I see with the implementation that
> it ties process killing with the reclaim scan depth. It's very similar
> to how vmpressure works and vmpressure in my experience is very
> unpredictable.

Could you elaborate a bit on why bulk killing isn't good?

> > > I linked in the previous message, Google made a rather large set of
> > > modifications to the supposedly-defunct lowmemorykiller.c not one month ago.
> > > What's going on?
>
> If you look into that commit, it adds ability to report kill stats. If
> that was a change in how that driver works it would be rejected.

Fair, though it was quite strange seeing something that was supposedly
totally abandoned receiving a large chunk of code for reporting stats.

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote:
> I am not going to comment the intent, but to be honest I am skeptical too.

The general sentiment has been that this is a really bad idea, but I'm just a frustrated Android user who wants his phone to not require mountains of zRAM only to still manage memory poorly. Until I can go out and buy a non-Pixel phone that uses PSI to make these decisions (and does a good job of it), I'm going to stick to my hacky little driver for my personal devices. Many others who like to hack their Android devices to make them last longer will probably find value in this as well, since there are millions of people who use devices that'll never see any of PSI ported to their ancient 3.x kernels. And yes, I know this would never be accepted to upstream in a million years. I mostly wanted some code review and guidance, since mm code is pretty tricky :)

> On 05/06, Sultan Alsawaf wrote:
> >
> > +static unsigned long find_victims(struct victim_info *varr, int *vindex,
> > +				  int vmaxlen, int min_adj, int max_adj)
> > +{
> > +	unsigned long pages_found = 0;
> > +	int old_vindex = *vindex;
> > +	struct task_struct *tsk;
> > +
> > +	for_each_process(tsk) {
> > +		struct task_struct *vtsk;
> > +		unsigned long tasksize;
> > +		short oom_score_adj;
> > +
> > +		/* Make sure there's space left in the victim array */
> > +		if (*vindex == vmaxlen)
> > +			break;
> > +
> > +		/* Don't kill current, kthreads, init, or duplicates */
> > +		if (same_thread_group(tsk, current) ||
> > +		    tsk->flags & PF_KTHREAD ||
> > +		    is_global_init(tsk) ||
> > +		    vtsk_is_duplicate(varr, *vindex, tsk))
> > +			continue;
> > +
> > +		vtsk = find_lock_task_mm(tsk);
>
> Did you test this patch with lockdep enabled?
>
> If I read the patch correctly, lockdep should complain. vtsk_is_duplicate()
> ensures that we do not take the same ->alloc_lock twice or more, but lockdep
> can't know this.

Yeah, lockdep is fine with this, at least on 4.4.
> > +static void scan_and_kill(unsigned long pages_needed) > > +{ > > + static DECLARE_WAIT_QUEUE_HEAD(victim_waitq); > > + struct victim_info victims[MAX_VICTIMS]; > > + int i, nr_to_kill = 0, nr_victims = 0; > > + unsigned long pages_found = 0; > > + atomic_t victim_count; > > + > > + /* > > +* Hold the tasklist lock so tasks don't disappear while scanning. This > > +* is preferred to holding an RCU read lock so that the list of tasks > > +* is guaranteed to be up to date. Keep preemption disabled until the > > +* SIGKILLs are sent so the victim kill process isn't interrupted. > > +*/ > > + read_lock(&tasklist_lock); > > + preempt_disable(); > > read_lock() disables preemption, every task_lock() too, so this looks > unnecessary. Good point. > > + for (i = 1; i < ARRAY_SIZE(adj_prio); i++) { > > + pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS, > > + adj_prio[i], adj_prio[i - 1]); > > + if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS) > > + break; > > + } > > + > > + /* > > +* Calculate the number of tasks that need to be killed and quickly > > +* release the references to those that'll live. > > +*/ > > + for (i = 0, pages_found = 0; i < nr_victims; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + /* The victims' mm lock is taken in find_victims; release it */ > > + if (pages_found >= pages_needed) { > > + task_unlock(vtsk); > > + continue; > > + } > > + > > + /* > > +* Grab a reference to the victim so it doesn't disappear after > > +* the tasklist lock is released. > > +*/ > > + get_task_struct(vtsk); > > The comment doesn't look correct. the victim can't dissapear until > task_unlock() > below, it can't pass exit_mm(). I was always unsure about this and decided to hold a reference to the task_struct to be safe. Thanks for clearing that up. > > + pages_found += victim->size; > > + nr_to_kill++; > > + } > > + read_u
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:43:34AM +0200, Greg Kroah-Hartman wrote: > Given that any "new" android device that gets shipped "soon" should be > using 4.9.y or newer, is this a real issue? It's certainly a real issue for those who can't buy brand new Android devices without software bugs every six months :) > And if it is, I'm sure that asking for those patches to be backported to > 4.4.y would be just fine, have you asked? > > Note that I know of Android Go devices, running 3.18.y kernels, do NOT > use the in-kernel memory killer, but instead use the userspace solution > today. So trying to get another in-kernel memory killer solution added > anywhere seems quite odd. It's even more odd that although a userspace solution is touted as the proper way to go on LKML, almost no Android OEMs are using it, and even in that commit I linked in the previous message, Google made a rather large set of modifications to the supposedly-defunct lowmemorykiller.c not one month ago. What's going on? Qualcomm still uses lowmemorykiller.c [1] on the Snapdragon 845. If PSI were backported to 4.4, or even 3.18, would it really be used? I don't really understand the aversion to an in-kernel memory killer on LKML despite the rest of the industry's attraction to it. Perhaps there's some inherently great cost in using the userspace solution that I'm unaware of? Regardless, even if PSI were backported, a full-fledged LMKD using it has yet to be made, so it wouldn't be of much use now. Thanks, Sultan [1] https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/configs/sdm845_defconfig?h=LA.UM.7.3.r1-07400-sdm845.0#n492
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, May 07, 2019 at 09:04:30AM +0200, Greg Kroah-Hartman wrote: > Um, why can't "all" Android devices take the same patches that the Pixel > phones are using today? They should all be in the public android-common > kernel repositories that all Android devices should be syncing with on a > weekly/monthly basis anyway, right? > > thanks, > > greg k-h Hi Greg, I only see PSI present in the android-common kernels for 4.9 and above. The vast majority of Android devices do not run a 4.9+ kernel. It seems unreasonable to expect OEMs to toil with backporting PSI themselves to get decent memory management. But even if they did backport PSI, it wouldn't help too much because a PSI-enabled LMKD solution is not ready yet. It looks like a PSI-based LMKD is still under heavy development and won't be ready for all Android devices for quite some time. Additionally, it looks like the supposedly-dead lowmemorykiller.c is still being actively tweaked by Google [1], which does not instill confidence that a definitive LMK solution a la PSI is coming any time soon. Thanks, Sultan [1] https://android.googlesource.com/kernel/common/+/152bacdd85c46f0c76b00c4acc253e414513634c
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
This is a complete low memory killer solution for Android that is small and simple. Processes are killed according to the priorities that Android gives them, so that the least important processes are always killed first. Processes are killed until memory deficits are satisfied, as observed from kswapd struggling to free up pages. Simple LMK stops killing processes when kswapd finally goes back to sleep. The only tunables are the desired amount of memory to be freed per reclaim event and desired frequency of reclaim events. Simple LMK tries to free at least the desired amount of memory per reclaim and waits until all of its victims' memory is freed before proceeding to kill more processes. Signed-off-by: Sultan Alsawaf --- Hello everyone, I've addressed some of the concerns that were brought up with the first version of the Simple LMK patch. I understand that a kernel-based solution like this that contains policy decisions for a specific userspace is not the way to go, but the Android ecosystem still has a pressing need for a low memory killer that works well. Most Android devices still use the ancient and deprecated lowmemorykiller.c kernel driver; Simple LMK seeks to replace that, at the very least until PSI and a userspace daemon utilizing PSI are ready for *all* Android devices, and not just the privileged Pixel phone line. The feedback I received last time was quite helpful. This Simple LMK patch works significantly better than the first, improving memory management by a large margin while being immune to transient spikes in memory usage (since the signal to start killing processes is determined by how hard kswapd tries to free up pages, which is something that occurs over a span of time and not a single point in time). I'd love to hear some feedback on this new patch. I do encourage those who are interested to take it for a spin on an Android device. This patch has been tested successfully on Android 4.4 and 4.9 kernels. 
For the sake of review here, I have adapted the patch to 5.1.

Thanks,
Sultan

 drivers/android/Kconfig      |  33
 drivers/android/Makefile     |   1 +
 drivers/android/simple_lmk.c | 315 +++
 include/linux/mm_types.h     |   4 +
 include/linux/simple_lmk.h   |  11 ++
 kernel/fork.c                |  13 ++
 mm/vmscan.c                  |  12 ++
 7 files changed, 389 insertions(+)
 create mode 100644 drivers/android/simple_lmk.c
 create mode 100644 include/linux/simple_lmk.h

diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
index 6fdf2abe4..bdd429338 100644
--- a/drivers/android/Kconfig
+++ b/drivers/android/Kconfig
@@ -54,6 +54,39 @@ config ANDROID_BINDER_IPC_SELFTEST
 	  exhaustively with combinations of various buffer sizes and
 	  alignments.
 
+config ANDROID_SIMPLE_LMK
+	bool "Simple Android Low Memory Killer"
+	depends on !ANDROID_LOW_MEMORY_KILLER && !MEMCG
+	---help---
+	  This is a complete low memory killer solution for Android that is
+	  small and simple. Processes are killed according to the priorities
+	  that Android gives them, so that the least important processes are
+	  always killed first. Processes are killed until memory deficits are
+	  satisfied, as observed from kswapd struggling to free up pages. Simple
+	  LMK stops killing processes when kswapd finally goes back to sleep.
+
+if ANDROID_SIMPLE_LMK
+
+config ANDROID_SIMPLE_LMK_AGGRESSION
+	int "Reclaim frequency selection"
+	range 1 3
+	default 1
+	help
+	  This value determines how frequently Simple LMK will perform memory
+	  reclaims. A lower value corresponds to less frequent reclaims, which
+	  maximizes memory usage. The range of values has a logarithmic
+	  correlation; 2 is twice as aggressive as 1, and 3 is twice as
+	  aggressive as 2, which makes 3 four times as aggressive as 1.
+
+config ANDROID_SIMPLE_LMK_MINFREE
+	int "Minimum MiB of memory to free per reclaim"
+	range 8 512
+	default 64
+	help
+	  Simple LMK will try to free at least this much memory per reclaim.
+
+endif
+
 endif # if ANDROID
 
 endmenu
diff --git a/drivers/android/Makefile b/drivers/android/Makefile
index c7856e320..7c91293b6 100644
--- a/drivers/android/Makefile
+++ b/drivers/android/Makefile
@@ -3,3 +3,4 @@ ccflags-y += -I$(src)			# needed for trace events
 obj-$(CONFIG_ANDROID_BINDERFS)		+= binderfs.o
 obj-$(CONFIG_ANDROID_BINDER_IPC)	+= binder.o binder_alloc.o
 obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
+obj-$(CONFIG_ANDROID_SIMPLE_LMK)	+= simple_lmk.o
diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c
new file mode 100644
index 0..a2ffb57b5
--- /dev/null
+++ b/drivers/android/simple_lmk.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Sul
Re: [RFCv2] string: Use faster alternatives when constant arguments are used
On Mon, Apr 01, 2019 at 10:43:13PM +0200, Rasmus Villemoes wrote: > Consider your patch replacing !strcmp(buf, "123") by !memcmp(buf, "123", > 4). buf is known to point to a nul-terminated string. But it may point > at, say, the second-last byte in a page, with the last byte in that page > being a nul byte, and the following page being MMIO or unmapped or all > kinds of bad things. On e.g. x86 where unaligned accesses are cheap, and > seeing that you're only comparing for equality, gcc is likely to compile > the memcmp version into > > *(u32*)buf == 0x00333231 > > because you've told the compiler that there's no problem accessing four > bytes starting at buf. Boom. Even without unaligned access being cheap > this can happen; suppose the length is 8 instead, and gcc somehow knows > that buf is four-byte aligned (and in this case it happens to point four > bytes before a page boundary), so it could compile the memcmp(,,8) into > > *(u32*)(buf+4) == secondword && *(u32*)buf == firstword > > (or do the comparisons in the "natural" order, but it might still do > both loads first). Ok, this is the first solid counter I've seen against my patch. I hadn't considered unaligned word-sized accesses. Thanks. > > And if there are concerns for some arches but not others, then couldn't > > this be > > a feasible optimization for those which would work well with it? > > No. First, these are concerns for all arches. Second, if you can find > some particular place where string parsing/matching is in any way > performance relevant and not just done once during driver init or > whatnot, maybe the maintainers of that file would take a patch > hand-optimizing some strcmps to memcmps, or, depending on what the code > does, perhaps replacing the whole *cmp logic with a custom hash table. FWIW, I didn't have a specific place in the kernel in mind that heavily relied on str* operations and needed optimization. I just thought it could be a "free" optimization, but apparently not. 
> But a patch implicitly and silently touching thousands of lines of code,
> without an analysis of why none of the above is a problem for any of
> those lines, for any .config, arch, compiler version? No.

Well, I thought it could be proven to be correct regardless of the context, which would imply that making such a global change touching thousands of lines of code would be safe. But sadly it isn't.

This does bring into question a greater problem though: usage of memcmp throughout the kernel where the size provided is not the size of the smaller container being compared. This usage of memcmp (which is quite easy to find all across the kernel) could run into the unaligned memory access across a page boundary that you gave as an example, no?

Sultan
Re: [RFCv2] string: Use faster alternatives when constant arguments are used
On Mon, Mar 25, 2019 at 10:24:00PM +0100, Rasmus Villemoes wrote: > What I'm worried about is your patch changing every single strcmp(, > "literal") into a memcmp, with absolutely no way of knowing or checking > anything about the other buffer. And actually, it doesn't have to be a > BE arch with a word-at-a-time memcmp. > If (as is usually the case) the strcmp() result is compared to zero, after you > change > > !strcmp(buf, "literal") > > into > > !memcmp(buf, "literal", 8) > > the compiler may (exactly as you want it to) change that into a single > 8-byte load (or two 4-byte loads) and comparisons to literals, no > memcmp() involved. And how do you know that _that_ is ok, for every one > of the hundreds, if not thousands, of instances in the tree? When would this not be ok though? From what I've always known, strcmp(terminated_buf1, terminated_buf2) is equivalent to memcmp(terminated_buf1, terminated_buf2, strlen(terminated_buf1)) and memcmp(terminated_buf1, terminated_buf2, strlen(terminated_buf2)) regardless of whether or not one side is a literal (my patch just leverages the compiler's ability to recognize strlen called on literals and optimize it out). The latter memcmp instances would indeed perform worse than the first strcmp when neither arguments are literals, but I don't see what makes the memcmp usage "dangerous". How can the memcmps cross a page boundary when memcmp itself will only read in large buffers of data at word boundaries? And if there are concerns for some arches but not others, then couldn't this be a feasible optimization for those which would work well with it? Sultan
Re: [RFCv2] string: Use faster alternatives when constant arguments are used
On Sun, Mar 24, 2019 at 10:17:49PM +0100, Rasmus Villemoes wrote:
> gcc already knows the semantics of these functions and can optimize
> accordingly. E.g. for strcpy() of a literal to a buffer, gcc readily
> compiles

The example you gave appears to get optimized accordingly, but there are numerous sane counterexamples that don't get optimized. On arm64, with GCC 8.3.0, let's look at the simple strcmp usage in kernel/trace/preemptirq_delay_test.c:

static int preemptirq_delay_run(void *data)
{
	unsigned long flags;

	if (!strcmp(test_mode, "irq")) {
		local_irq_save(flags);
		busy_wait(delay);
		local_irq_restore(flags);
	} else if (!strcmp(test_mode, "preempt")) {
		preempt_disable();
		busy_wait(delay);
		preempt_enable();
	}

	return 0;
}

This is what happens without my patch:

preemptirq_delay_run:
.LFB2517:
	.cfi_startproc
	stp	x29, x30, [sp, -32]!
	.cfi_def_cfa_offset 32
	.cfi_offset 29, -32
	.cfi_offset 30, -24
	adrp	x1, .LC0
	add	x1, x1, :lo12:.LC0
	mov	x29, sp
	stp	x19, x20, [sp, 16]
	.cfi_offset 19, -16
	.cfi_offset 20, -8
	adrp	x19, .LANCHOR0
	add	x19, x19, :lo12:.LANCHOR0
	mov	x0, x19
>	bl	strcmp
	cbz	w0, .L22
	adrp	x1, .LC1
	mov	x0, x19
	add	x1, x1, :lo12:.LC1
>	bl	strcmp
	cbz	w0, .L23

The calls to strcmp are not optimized out. Here is what this snippet looks like after my patch:

preemptirq_delay_run:
.LFB2517:
	.cfi_startproc
	stp	x29, x30, [sp, -32]!
	.cfi_def_cfa_offset 32
	.cfi_offset 29, -32
	.cfi_offset 30, -24
	adrp	x0, .LANCHOR0
	mov	w1, 29289
	mov	x29, sp
	ldr	w2, [x0, #:lo12:.LANCHOR0]
	movk	w1, 0x71, lsl 16
	add	x3, x0, :lo12:.LANCHOR0
	cmp	w2, w1
	beq	.L23
	ldr	x1, [x0, #:lo12:.LANCHOR0]
	mov	x0, 29296
	movk	x0, 0x6565, lsl 16
	movk	x0, 0x706d, lsl 32
	movk	x0, 0x74, lsl 48
	cmp	x1, x0
	beq	.L24

The calls to strcmp were optimized out completely, not even replaced by a call to memcmp. I can produce lots of kernel examples that don't seem to follow basic str* optimization, which is what prompted me to write this in the first place.

> Does this even compile? It's a well-known (or perhaps
> not-so-well-known?)
> pitfall that __builtin_constant_p() is not
> guaranteed to be usable in __builtin_choose_expr() - the latter only
> accepts bona fide integer constant expressions, while evaluation of
> __builtin_constant_p can be delayed until various optimization phases.

It not only compiles, but also seems to work correctly from what I've observed.

> This seems to be buggy - you don't know that src is at least as long as
> dest. And arguing "if it's shorter, there's a nul byte, which will
> differ from dest at that point, so memcmp will/should stop" means that
> the whole word-at-a-time argument would be out.

You mean reading the last word of a string could read out of bounds of the string when using memcmp? There are lots of memcmp instances using a literal string in the kernel; I don't think it would be hard to find one that violates what you've pointed out, so I'm not really sure why it's a problem. After a few minutes of grepping, it seems like the memcmp usage in drivers/md/dm-integrity.c can read out of bounds on its arguments:

	} else if (!memcmp(opt_string, "internal_hash:", strlen("internal_hash:"))) {
		r = get_alg_and_key(opt_string, &ic->internal_hash_alg, &ti->error,
				    "Invalid internal_hash argument");
		if (r)
			goto bad;
	} else if (!memcmp(opt_string, "journal_crypt:", strlen("journal_crypt:"))) {
		r = get_alg_and_key(opt_string, &ic->journal_crypt_alg, &ti->error,
				    "Invalid journal_crypt argument");
		if (r)
			goto bad;
	} else if (!memcmp(opt_string, "journal_mac:", strlen("journal_mac:"))) {

Where opt_string is a dynamically-set argument with no specified minimum length.

Sultan
Re: [RFC v3] string: Use faster alternatives when constant arguments are used
On Sat, Mar 23, 2019 at 08:31:53PM -0700, Nathan Chancellor wrote:
> Explicitly cc'ing some folks who have touched include/linux/string.h in
> the past and might want to take a look at this.
>
> Nathan

Thanks. One last revision with some nitpicks fixed is attached, though I doubt it'll influence any comments on the concept of this anyway.

---
 include/linux/string.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/include/linux/string.h b/include/linux/string.h
index 7927b875f..40a8a9436 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix
 	return strncmp(str, prefix, len) == 0 ? len : 0;
 }
 
+/*
+ * Replace some common string helpers with faster alternatives when one of the
+ * arguments is a constant (i.e., literal string). This uses strlen instead of
+ * sizeof for calculating the string length in order to silence compiler
+ * warnings that may arise due to what the compiler thinks is incorrect sizeof
+ * usage. The strlen calls on constants are folded into scalar values at compile
+ * time, so performance is not reduced by using strlen.
+ */
+#define strcpy(dest, src)					\
+	__builtin_choose_expr(__builtin_constant_p(src),	\
+		memcpy((dest), (src), strlen(src) + 1),		\
+		(strcpy)((dest), (src)))
+
+#define strcat(dest, src)					\
+	__builtin_choose_expr(__builtin_constant_p(src),	\
+	({							\
+		memcpy((dest) + strlen(dest), (src), strlen(src) + 1); \
+		(dest);						\
+	}),							\
+	(strcat)((dest), (src)))
+
+#define strcmp(left, right)					\
+	__builtin_choose_expr(__builtin_constant_p(left),	\
+		__builtin_choose_expr(__builtin_constant_p(right), \
+			(strcmp)((left), (right)),		\
+			memcmp((left), (right), strlen(left) + 1)), \
+		__builtin_choose_expr(__builtin_constant_p(right), \
+			memcmp((left), (right), strlen(right) + 1), \
+			(strcmp)((left), (right))))
+
 #endif /* _LINUX_STRING_H_ */
-- 
2.21.0
[RFCv2] string: Use faster alternatives when constant arguments are used
I messed up the return value for strcat in the first patch. Here's a fixed version, ready for some scathing reviews. From: Sultan Alsawaf When strcpy, strcat, and strcmp are used with a literal string, they can be optimized to memcpy or memcmp calls. These alternatives are faster since knowing the length of a string argument beforehand allows traversal through the string word at a time without being concerned about looking for the terminating zero character. In some cases, the replaced calls to memcpy or memcmp can even be optimized out completely for a significant speed up. Signed-off-by: Sultan Alsawaf --- include/linux/string.h | 30 ++ 1 file changed, 30 insertions(+) diff --git a/include/linux/string.h b/include/linux/string.h index 7927b875f..59c301c0e 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -476,4 +476,34 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix return strncmp(str, prefix, len) == 0 ? len : 0; } +/* + * Replace some common string helpers with faster alternatives when one of the + * arguments is a constant (i.e., literal string). This uses strlen instead of + * sizeof for calculating the string length in order to silence compiler + * warnings that may arise due to what the compiler thinks is incorrect sizeof + * usage. The strlen calls on constants are folded into scalar values at compile + * time, so performance is not reduced by using strlen. 
+ */ +#define strcpy(dest, src) \ + __builtin_choose_expr(__builtin_constant_p(src), \ + memcpy((dest), (src), strlen(src) + 1), \ + (strcpy)((dest), (src))) + +#define strcat(dest, src) \ + __builtin_choose_expr(__builtin_constant_p(src), \ + ({ \ + memcpy(strchr((dest), '\0'), (src), strlen(src) + 1); \ + (dest); \ + }), \ + (strcat)((dest), (src))) + +#define strcmp(dest, src) \ + __builtin_choose_expr(__builtin_constant_p(dest), \ + __builtin_choose_expr(__builtin_constant_p(src), \ + (strcmp)((dest), (src)), \ + memcmp((dest), (src), strlen(dest) + 1)), \ + __builtin_choose_expr(__builtin_constant_p(src), \ + memcmp((dest), (src), strlen(src) + 1), \ + (strcmp)((dest), (src + #endif /* _LINUX_STRING_H_ */ -- 2.21.0
[RFC] string: Use faster alternatives when constant arguments are used
From: Sultan Alsawaf When strcpy, strcat, and strcmp are used with a literal string, they can be optimized to memcpy or memcmp calls. These alternatives are faster since knowing the length of a string argument beforehand allows traversal through the string word at a time without being concerned about looking for the terminating zero character. In some cases, the replaced calls to memcpy or memcmp can even be optimized out completely for a significant speed up. Signed-off-by: Sultan Alsawaf --- include/linux/string.h | 27 +++ 1 file changed, 27 insertions(+) diff --git a/include/linux/string.h b/include/linux/string.h index 7927b875f..65db58342 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -476,4 +476,31 @@ static __always_inline size_t str_has_prefix(const char *str, const char *prefix return strncmp(str, prefix, len) == 0 ? len : 0; } +/* + * Replace some common string helpers with faster alternatives when one of the + * arguments is a constant (i.e., literal string). This uses strlen instead of + * sizeof for calculating the string length in order to silence compiler + * warnings that may arise due to what the compiler thinks is incorrect sizeof + * usage. The strlen calls on constants are folded into scalar values at compile + * time, so performance is not reduced by using strlen. 
+ */ +#define strcpy(dest, src) \ + __builtin_choose_expr(__builtin_constant_p(src),\ + memcpy((dest), (src), strlen(src) + 1), \ + (strcpy)((dest), (src))) + +#define strcat(dest, src) \ + __builtin_choose_expr(__builtin_constant_p(src),\ + memcpy(strchr((dest), '\0'), (src), strlen(src) + 1), \ + (strcat)((dest), (src))) + +#define strcmp(dest, src) \ + __builtin_choose_expr(__builtin_constant_p(dest), \ + __builtin_choose_expr(__builtin_constant_p(src),\ + (strcmp)((dest), (src)),\ + memcmp((dest), (src), strlen(dest) + 1)), \ + __builtin_choose_expr(__builtin_constant_p(src),\ + memcmp((dest), (src), strlen(src) + 1), \ + (strcmp)((dest), (src + #endif /* _LINUX_STRING_H_ */ -- 2.21.0
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 11:16:41PM -0400, Steven Rostedt wrote: > How would you implement such a method in userspace? kill() doesn't take > any parameters but the pid of the process you want to send a signal to, > and the signal to send. This would require a new system call, and be > quite a bit of work. If you can solve this with an ebpf program, I > strongly suggest you do that instead. This can be done by introducing a new signal number that provides SIGKILL functionality while blocking (maybe SIGKILLBLOCK?). Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 10:54:48PM -0400, Joel Fernandes wrote: > I'm not sure if that makes much semantic sense for how the signal handling is > supposed to work. Imagine a parent sends SIGKILL to its child, and then does > a wait(2). Because the SIGKILL blocks in your idea, then the wait cannot > execute, and because the wait cannot execute, the zombie task will not get > reaped and so the SIGKILL senders never gets unblocked and the whole thing > just gets locked up. No? I don't know it just feels incorrect. Block until the victim becomes a zombie instead. > Further, in your idea adding stuff to task_struct will simply bloat it - when > this task can easily be handled using eBPF without making any kernel changes. > Either by probing sched_process_free or sched_process_exit tracepoints. > Scheduler maintainers generally frown on adding stuff to task_struct > pointlessly there's a good reason since bloating it effects the performance > etc, and something like this would probably never be ifdef'd out behind a > CONFIG. Adding something to task_struct is just the easiest way to test things for experimentation. This can be avoided in my suggestion by passing the pointer to a completion via the relevant functions, and then completing it at the time the victim transitions to a zombie state. I understand it's possible to use eBPF for this, but it seems kind of messy since this functionality is something that I think others would want provided by the kernel (i.e., anyone using PSI to implement their own OOM killer daemon similar to LMKD). Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 10:47:17AM -0700, Joel Fernandes wrote: > About the 100ms latency, I wonder whether it is that high because of > the way Android's lmkd is observing that a process has died. There is > a gap between when a process memory is freed and when it disappears > from the process-table. Once a process is SIGKILLed, it becomes a > zombie. Its memory is freed instantly during the SIGKILL delivery (I > traced this so that's how I know), but until it is reaped by its > parent thread, it will still exist in /proc/ . So if testing the > existence of /proc/ is how Android is observing that the process > died, then there can be a large latency where it takes a very long > time for the parent to actually reap the child way after its memory > was long freed. A quicker way to know if a process's memory is freed > before it is reaped could be to read back /proc//maps in > userspace of the victim , and that file will be empty for zombie > processes. So then one does not need wait for the parent to reap it. I > wonder how much of that 100ms you mentioned is actually the "Waiting > while Parent is reaping the child", than "memory freeing time". So > yeah for this second problem, the procfds work will help. > > By the way another approach that can provide a quick and asynchronous > notification of when the process memory is freed, is to monitor > sched_process_exit trace event using eBPF. You can tell eBPF the PID > that you want to monitor before the SIGKILL. As soon as the process > dies and its memory is freed, the eBPF program can send a notification > to user space (using the perf_events polling infra). The > sched_process_exit fires just after the mmput() happens so it is quite > close to when the memory is reclaimed. This also doesn't need any > kernel changes. I could come up with a prototype for this and > benchmark it on Android, if you want. Just let me know. 
Perhaps I'm missing something, but if you want to know when a process has died after sending a SIGKILL to it, then why not just make the SIGKILL optionally block until the process has died completely? It'd be rather trivial to just store a pointer to an onstack completion inside the victim process' task_struct, and then complete it in free_task(). Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 10:17:43AM -0700, Tim Murray wrote:
> Knowing whether a SIGKILL'd process has finished reclaiming is as far
> as I know not possible without something like procfds. That's where
> the 100ms timeout in lmkd comes in. lowmemorykiller and lmkd both
> attempt to wait up to 100ms for reclaim to finish by checking for the
> continued existence of the thread that received the SIGKILL, but this
> really means that they wait up to 100ms for the _thread_ to finish,
> which doesn't tell you anything about the memory used by that process.
> If those threads terminate early and lowmemorykiller/lmkd get a signal
> to kill again, then there may be two processes competing for CPU time
> to reclaim memory. That doesn't reclaim any faster and may be an
> unnecessary kill.
> ...
> - offer a way to wait for process termination so lmkd can tell when
>   reclaim has finished and know when killing another process is
>   appropriate

Should be pretty easy with something like this:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1..6ac478af2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1199,6 +1199,7 @@ struct task_struct {
 	unsigned long lowest_stack;
 	unsigned long prev_lowest_stack;
 #endif
+	ktime_t sigkill_time;
 
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa2..0ae182777 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -435,6 +435,8 @@ void put_task_stack(struct task_struct *tsk)
 
 void free_task(struct task_struct *tsk)
 {
+	ktime_t sigkill_time = tsk->sigkill_time;
+	pid_t pid = tsk->pid;
 #ifndef CONFIG_THREAD_INFO_IN_TASK
 	/*
 	 * The task is finally done with both the stack and thread_info,
@@ -455,6 +457,9 @@ void free_task(struct task_struct *tsk)
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
 	free_task_struct(tsk);
+	if (sigkill_time)
+		printk("%d killed after %lld us\n", pid,
+		       ktime_us_delta(ktime_get(), sigkill_time));
 }
 EXPORT_SYMBOL(free_task);
@@ -1881,6 +1886,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
 #endif
+	p->sigkill_time = 0;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/kernel/signal.c b/kernel/signal.c
index 5d53183e2..1142c8811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1168,6 +1168,8 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc
 	}
 
 out_set:
+	if (sig == SIGKILL)
+		t->sigkill_time = ktime_get();
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
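The "something like procfds" Tim mentions later landed upstream as pidfd (pidfd_open(2), Linux 5.3), which gives a killer daemon a race-free way to wait for a killed task to exit. Below is a minimal userspace sketch — the helper names are mine, and the same caveat from this thread applies: a readable pidfd means the thread group has exited, not that all of its memory has already been reclaimed.

```c
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* asm-generic syscall number; assumption for older headers */
#endif

/* pidfd_open() has no glibc wrapper on older systems; call it directly. */
static int pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_open, pid, flags);
}

/*
 * Kill @pid and block until the kernel reports it has exited, or until
 * @timeout_ms elapses.  Returns 1 if the process exited in time, 0 on
 * timeout, -1 on error.  The pidfd becomes readable when the task
 * exits, which (as discussed above) is not the same as all of its
 * memory having been reclaimed.
 */
static int kill_and_wait(pid_t pid, int timeout_ms)
{
	struct pollfd pfd = { .events = POLLIN };
	int ret;

	pfd.fd = pidfd_open(pid, 0);
	if (pfd.fd < 0)
		return -1;

	if (kill(pid, SIGKILL)) {
		close(pfd.fd);
		return -1;
	}

	ret = poll(&pfd, 1, timeout_ms);
	close(pfd.fd);
	return ret < 0 ? -1 : ret;
}
```

With this, an lmkd-style daemon could replace its fixed 100ms wait with an event-driven one, killing the next victim only after the previous one is confirmed dead.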
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 09:05:32AM +0100, Michal Hocko wrote:
> The only way to control the OOM behavior pro-actively is to throttle
> allocation speed. We have memcg high limit for that purpose. Along with
> PSI, I can imagine a reasonably working user space early oom
> notifications and reasonable acting upon that.

The issue with pro-active memory management that prompted me to create this was poor memory utilization. All of the alternative means of reclaiming pages in the page allocator's slow path turn out to be very useful for maximizing memory utilization, which is something that we would have to forgo by relying on a purely pro-active solution. I have not had a chance to look at PSI yet, but unless a PSI-enabled solution allows allocations to reach the same point as when the OOM killer is invoked (which is contradictory to what it sets out to do), then it cannot take advantage of all of the alternative memory-reclaim means employed in the slow path, and will result in killing a process before it is _really_ necessary.

> If your design relies on the speed of killing then it is fundamentally
> flawed AFAICT. You cannot assume anything about how quickly a task dies.
> It might be blocked in an uninterruptible sleep or performing an
> operation which takes some time. Sure, oom_reaper might help here but
> still.

In theory we could instantly zap any process that is not trapped in the kernel at the time that the OOM killer is invoked without any consequences though, no?

Thanks,
Sultan
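For reference, PSI (CONFIG_PSI, Linux 4.20+) exposes memory stall percentages via /proc/pressure/memory, and userspace can additionally register poll()-able triggers by writing a threshold such as "some 150000 1000000" to the same file. A minimal read-side sketch follows — the helper name is mine, and it degrades gracefully to -1 on kernels without PSI:

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the "some" avg10 value from /proc/pressure/memory: the share of
 * wall time (over the last 10s) in which at least one task was stalled
 * waiting on memory.  Returns the percentage on success, or -1.0 if PSI
 * is unavailable (kernel built without CONFIG_PSI, or pre-4.20).
 */
static double psi_memory_some_avg10(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char line[256];
	double avg10 = -1.0;

	if (!f)
		return -1.0;

	while (fgets(line, sizeof(line), f)) {
		/* line format: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
		if (sscanf(line, "some avg10=%lf", &avg10) == 1)
			break;
	}
	fclose(f);
	return avg10;
}
```

A userspace killer built on this would act on sustained averages rather than on individual allocation failures, which is exactly the trade-off being debated in this thread.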
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 03:15:35PM -0700, Suren Baghdasaryan wrote: > This what LMKD currently is - a userspace RT process. > My point was that this page allocation queue that you implemented > can't be implemented in userspace, at least not without extensive > communication with kernel. Oh, that's easy to address. My page allocation queue and the decision on when to kill a process are orthogonal. In fact, the page allocation queue could be touched up a bit to factor in the issues Michal mentioned, and it can be implemented as an improvement to the existing OOM killer. The point of it is just to ensure that page allocation requests that have gone OOM are given priority over other allocation requests when free pages start to trickle in. Userspace doesn't need to know about the page allocation queue, and the queue is not necessary to implement the method of determining when to kill processes that I've proposed. It's an optimization, not a necessity. Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 05:11:25PM -0400, Joel Fernandes wrote: > But the point is that a transient temporary memory spike should not be a > signal to kill _any_ process. The reaction to kill shouldn't be so > spontaneous that unwanted tasks are killed because the system went into > panic mode. It should be averaged out which I believe is what PSI does. In my patch from the first email, I implemented the decision to kill a process at the same time that the existing kernel OOM killer decides to kill a process. If the kernel's OOM killer were susceptible to killing processes due to transient memory spikes, then I think there would have been several complaints about this behavior regardless of which userspace or architecture is in use. I think the existing OOM killer has this done right. The decision to kill a process occurs after the page allocator has tried _very_ hard to satisfy a page allocation via alternative means, such as utilizing compaction, flushing file-backed pages to disk via kswapd, and direct reclaim. Once all of those means have failed, it is quite reasonable to kill a process to free memory. Trying to wait out the memory starvation at this point would be futile. Thanks, Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote:
> The idea seems interesting although I need to think about this a bit
> more. Killing processes based on failed page allocation might backfire
> during transient spikes in memory usage.

This issue could be alleviated if tasks could be killed and have their pages reaped faster. Currently, Linux takes a _very_ long time to free a task's memory after an initial privileged SIGKILL is sent to a task, even with the task's priority being set to the highest possible (so unwanted scheduler preemption starving dying tasks of CPU time is not the issue at play here). I've frequently measured the difference in time between when a SIGKILL is sent for a task and when free_task() is called for that task to be hundreds of milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD suffers from as well, and perhaps any OOM killer implementation in Linux, since you cannot evaluate the effect you've had on memory pressure by killing a process for at least several tens of milliseconds.

> AFAICT the biggest issue with using this approach in userspace is that
> it's not practically implementable without heavy in-kernel support.
> How to implement such interaction between kernel and userspace would
> be an interesting discussion which I would be happy to participate in.

You could signal a lightweight userspace process that has maximum scheduler priority and have it kill the tasks it'd like.

Thanks,
Sultan
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Mon, Mar 11, 2019 at 06:43:20PM +0100, Michal Hocko wrote:
> I am sorry but we are not going to maintain two different OOM
> implementations in the kernel. From a quick look the implementation is
> quite a hack which is not really suitable for anything but a very
> specific usecase. E.g. reusing a freed page for a waiting allocation
> sounds like an interesting idea but it doesn't really work for many
> reasons. E.g. any NUMA affinity is broken, zone protection doesn't work
> either. Not to mention how the code hooks into the allocator hot paths.
> This is simply no no.
>
> Last but not least people have worked really hard to provide means (PSI)
> to do what you need in the userspace.

Hi Michal,

Thanks for the feedback. I had no doubt that this would be vehemently rejected on the mailing list, but I wanted feedback/opinions on it and thus sent it as an RFC. At best I thought perhaps the mechanisms I've employed might serve as inspiration for LMKD improvements in Android, since this hacky OOM killer I've devised does work quite well for the very specific usecase it set out to address. The NUMA affinity and zone protection bits are helpful insights too. I'll take a look at PSI, which Joel mentioned as well.

Thanks,
Sultan Alsawaf
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Sun, Mar 10, 2019 at 10:03:35PM +0100, Greg Kroah-Hartman wrote:
> On Sun, Mar 10, 2019 at 01:34:03PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf
> >
> > This is a complete low memory killer solution for Android that is small
> > and simple. It kills the largest, least-important processes it can find
> > whenever a page allocation has completely failed (right after direct
> > reclaim). Processes are killed according to the priorities that Android
> > gives them, so that the least important processes are always killed
> > first. Killing larger processes is preferred in order to free the most
> > memory possible in one go.
> >
> > Simple LMK is integrated deeply into the page allocator in order to
> > catch exactly when a page allocation fails and exactly when a page is
> > freed. Failed page allocations that have invoked Simple LMK are placed
> > on a queue and wait for Simple LMK to satisfy them. When a page is about
> > to be freed, the failed page allocations are given priority over normal
> > page allocations by Simple LMK to see if they can immediately use the
> > freed page.
> >
> > Additionally, processes are continuously killed by failed small-order
> > page allocations until they are satisfied.
> >
> > Signed-off-by: Sultan Alsawaf
>
> Wait, why? We just removed the in-kernel android memory killer, we
> don't want to add another one back again, right? Android Go devices
> work just fine with the userspace memory killer code, and those are "low
> memory" by design.
>
> Why do we need kernel code here at all?
>
> thanks,
>
> greg k-h

Hi Greg,

Thanks for replying. It has not been my experience, nor that of many others, that Android's userspace low memory killer works "just fine." On my Pixel 3 XL with a meager 4GB of memory, the userspace killer has had issues with killing too many processes, which has resulted in a noticeably poor user experience for all Pixel owners.
From the looks of lmkd on the master branch, there still isn't really any definitive solution for this, aside from a 100ms delay in between process kills. I think that the userspace low memory killer is more complex than necessary, especially since in the kernel we can detect exactly when we run out of memory and react far more quickly than any userspace daemon. The original reasoning behind why the old kernel low memory killer was removed is also a bit vague to me. It just seemed to be abandonware, and all of a sudden a userspace daemon was touted as the solution. This driver is like an Android-flavored version of the kernel's oom killer, and has proven quite effective for me on my Pixel. Processes are killed exactly when a page allocation fails, so memory use is maximized. There is no complexity to try and estimate how full memory is either. Thanks, Sultan
[RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
From: Sultan Alsawaf This is a complete low memory killer solution for Android that is small and simple. It kills the largest, least-important processes it can find whenever a page allocation has completely failed (right after direct reclaim). Processes are killed according to the priorities that Android gives them, so that the least important processes are always killed first. Killing larger processes is preferred in order to free the most memory possible in one go. Simple LMK is integrated deeply into the page allocator in order to catch exactly when a page allocation fails and exactly when a page is freed. Failed page allocations that have invoked Simple LMK are placed on a queue and wait for Simple LMK to satisfy them. When a page is about to be freed, the failed page allocations are given priority over normal page allocations by Simple LMK to see if they can immediately use the freed page. Additionally, processes are continuously killed by failed small-order page allocations until they are satisfied. Signed-off-by: Sultan Alsawaf --- drivers/android/Kconfig | 28 drivers/android/Makefile | 1 + drivers/android/simple_lmk.c | 301 +++ include/linux/sched.h| 3 + include/linux/simple_lmk.h | 11 ++ kernel/fork.c| 3 + mm/page_alloc.c | 13 ++ 7 files changed, 360 insertions(+) create mode 100644 drivers/android/simple_lmk.c create mode 100644 include/linux/simple_lmk.h diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig index 6fdf2abe4..7469d049d 100644 --- a/drivers/android/Kconfig +++ b/drivers/android/Kconfig @@ -54,6 +54,34 @@ config ANDROID_BINDER_IPC_SELFTEST exhaustively with combinations of various buffer sizes and alignments. +config ANDROID_SIMPLE_LMK + bool "Simple Android Low Memory Killer" + depends on !MEMCG + ---help--- + This is a complete low memory killer solution for Android that is + small and simple. It is integrated deeply into the page allocator to + know exactly when a page allocation hits OOM and exactly when a page + is freed. 
Processes are killed according to the priorities that + Android gives them, so that the least important processes are always + killed first. + +if ANDROID_SIMPLE_LMK + +config ANDROID_SIMPLE_LMK_MINFREE + int "Minimum MiB of memory to free per reclaim" + default "64" + help + Simple LMK will free at least this many MiB of memory per reclaim. + +config ANDROID_SIMPLE_LMK_KILL_TIMEOUT + int "Kill timeout in milliseconds" + default "50" + help + Simple LMK will only perform memory reclaim at most once per this + amount of time. + +endif # if ANDROID_SIMPLE_LMK + endif # if ANDROID endmenu diff --git a/drivers/android/Makefile b/drivers/android/Makefile index c7856e320..7c91293b6 100644 --- a/drivers/android/Makefile +++ b/drivers/android/Makefile @@ -3,3 +3,4 @@ ccflags-y += -I$(src) # needed for trace events obj-$(CONFIG_ANDROID_BINDERFS) += binderfs.o obj-$(CONFIG_ANDROID_BINDER_IPC) += binder.o binder_alloc.o obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o +obj-$(CONFIG_ANDROID_SIMPLE_LMK) += simple_lmk.o diff --git a/drivers/android/simple_lmk.c b/drivers/android/simple_lmk.c new file mode 100644 index 0..8a441650a --- /dev/null +++ b/drivers/android/simple_lmk.c @@ -0,0 +1,301 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 Sultan Alsawaf . 
+ */ + +#define pr_fmt(fmt) "simple_lmk: " fmt + +#include +#include +#include +#include +#include +#include + +#define MIN_FREE_PAGES (CONFIG_ANDROID_SIMPLE_LMK_MINFREE * SZ_1M / PAGE_SIZE) + +struct oom_alloc_req { + struct page *page; + struct completion done; + struct list_head lh; + unsigned int order; + int migratetype; +}; + +struct victim_info { + struct task_struct *tsk; + unsigned long size; +}; + +enum { + DISABLED, + STARTING, + READY, + KILLING +}; + +/* Pulled from the Android framework */ +static const short int adj_prio[] = { + 906, /* CACHED_APP_MAX_ADJ */ + 905, /* Cached app */ + 904, /* Cached app */ + 903, /* Cached app */ + 902, /* Cached app */ + 901, /* Cached app */ + 900, /* CACHED_APP_MIN_ADJ */ + 800, /* SERVICE_B_ADJ */ + 700, /* PREVIOUS_APP_ADJ */ + 600, /* HOME_APP_ADJ */ + 500, /* SERVICE_ADJ */ + 400, /* HEAVY_WEIGHT_APP_ADJ */ + 300, /* BACKUP_APP_ADJ */ + 200, /* PERCEPTIBLE_APP_ADJ */ + 100, /* VISIBLE_APP_ADJ */ + 0/* FOREGROUND_APP_ADJ */ +}; + +/* Make sure that PID_MAX_DEFAULT isn't too big, or these arrays will be huge */ +static struct victim_info victim_array[PID_MAX_DEFAULT]; +st
[PATCH] dm crypt: fix memory leak in dm_crypt_integrity_io_alloc() error path
From: Sultan Alsawaf

dm_crypt_integrity_io_alloc() allocates space for an integrity payload
but doesn't free it in the error path, leaking memory. Add a
bio_integrity_free() invocation upon error to fix the memory leak.

Signed-off-by: Sultan Alsawaf
---
 drivers/md/dm-crypt.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index dd538e6b2..f731e1fe0 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -939,8 +939,10 @@ static int dm_crypt_integrity_io_alloc(struct dm_crypt_io *io, struct bio *bio)
 	ret = bio_integrity_add_page(bio, virt_to_page(io->integrity_metadata),
 				     tag_len, offset_in_page(io->integrity_metadata));
-	if (unlikely(ret != tag_len))
+	if (unlikely(ret != tag_len)) {
+		bio_integrity_free(bio);
 		return -ENOMEM;
+	}
 
 	return 0;
 }
-- 
2.20.1
[PATCH] random: fix inconsistent spinlock usage
All users of the struct entropy_store spinlock use the irqsave spinlock
variant. Users of the same lock should be consistent in which spinlock
primitive they use, which makes add_interrupt_randomness()'s spinlock
usage incorrect. Fix the inconsistency by converting
add_interrupt_randomness()'s spinlocks to use the irqsave primitive.

Signed-off-by: Sultan Alsawaf
---
 drivers/char/random.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38c6d1af6..1365017a7 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1239,6 +1239,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	__u64 ip;
 	unsigned long seed;
 	int credit = 0;
+	unsigned long flags;
 
 	if (cycles == 0)
 		cycles = get_reg(fast_pool, regs);
@@ -1269,7 +1270,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
 		return;
 
 	r = &input_pool;
-	if (!spin_trylock(&r->lock))
+	if (!spin_trylock_irqsave(&r->lock, flags))
 		return;
 
 	fast_pool->last = now;
@@ -1285,7 +1286,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
 		__mix_pool_bytes(r, &seed, sizeof(seed));
 		credit = 1;
 	}
-	spin_unlock(&r->lock);
+	spin_unlock_irqrestore(&r->lock, flags);
 
 	fast_pool->count = 0;
-- 
2.20.1
random: remove no-op BUG_ON
container_of simply does pointer arithmetic; it's not going to spit out
NULL, so this BUG_ON is unneeded.

Signed-off-by: Sultan Alsawaf
---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38c6d1af6..5ea9cf3d3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1357,7 +1357,7 @@ static void push_to_pool(struct work_struct *work)
 {
 	struct entropy_store *r = container_of(work, struct entropy_store,
 					      push_work);
-	BUG_ON(!r);
+
 	_xfer_secondary_pool(r, random_read_wakeup_bits/8);
 	trace_push_to_pool(r->name, r->entropy_count >> ENTROPY_SHIFT,
 			   r->pull->entropy_count >> ENTROPY_SHIFT);
-- 
2.20.1
Re: [PATCH] random: remove unused argument from add_interrupt_randomness()
This is still an issue in the latest kernel version. Sultan On Mon, Apr 30, 2018 at 8:12 AM Sultan Alsawaf wrote: > > From a872cf4f0bb57a7bb3c95ea557082fb7733344f8 Mon Sep 17 00:00:00 2001 > From: Sultan Alsawaf > Date: Sun, 29 Apr 2018 20:04:35 -0700 > Subject: [PATCH] random: remove unused argument from > add_interrupt_randomness() > > The irq_flags parameter is not used. Remove it. > > Signed-off-by: Sultan Alsawaf > --- > drivers/char/random.c | 4 ++-- > drivers/hv/hv.c| 2 +- > drivers/hv/vmbus_drv.c | 2 +- > include/linux/random.h | 2 +- > kernel/irq/handle.c| 2 +- > 5 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/drivers/char/random.c b/drivers/char/random.c > index 3cd3aae24d6d..125c1f545277 100644 > --- a/drivers/char/random.c > +++ b/drivers/char/random.c > @@ -131,7 +131,7 @@ > * void add_device_randomness(const void *buf, unsigned int size); > * void add_input_randomness(unsigned int type, unsigned int code, > *unsigned int value); > - * void add_interrupt_randomness(int irq, int irq_flags); > + * void add_interrupt_randomness(int irq); > * void add_disk_randomness(struct gendisk *disk); > * > * add_device_randomness() is for adding data to the random pool that > @@ -1189,7 +1189,7 @@ static __u32 get_reg(struct fast_pool *f, struct > pt_regs *regs) > return *ptr; > } > > -void add_interrupt_randomness(int irq, int irq_flags) > +void add_interrupt_randomness(int irq) > { > struct entropy_store*r; > struct fast_pool*fast_pool = this_cpu_ptr(&irq_randomness); > diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c > index 9b82549cbbc8..4b76ba23d666 100644 > --- a/drivers/hv/hv.c > +++ b/drivers/hv/hv.c > @@ -115,7 +115,7 @@ static void hv_stimer0_isr(void) > > hv_cpu = this_cpu_ptr(hv_context.cpu_context); > hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt); > - add_interrupt_randomness(stimer0_vector, 0); > + add_interrupt_randomness(stimer0_vector); > } > > static int hv_ce_set_next_event(unsigned long delta, > diff --git 
a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index b10fe26c4891..41f4116f8c31 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -1015,7 +1015,7 @@ static void vmbus_isr(void) > tasklet_schedule(&hv_cpu->msg_dpc); > } > > - add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0); > + add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR); > } > > > diff --git a/include/linux/random.h b/include/linux/random.h > index 2ddf13b4281e..210400258fef 100644 > --- a/include/linux/random.h > +++ b/include/linux/random.h > @@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {} > > extern void add_input_randomness(unsigned int type, unsigned int code, > unsigned int value) __latent_entropy; > -extern void add_interrupt_randomness(int irq, int irq_flags) > __latent_entropy; > +extern void add_interrupt_randomness(int irq) __latent_entropy; > > extern void get_random_bytes(void *buf, int nbytes); > extern int wait_for_random_bytes(void); > diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c > index 38554bc35375..e2f7afcb1ae6 100644 > --- a/kernel/irq/handle.c > +++ b/kernel/irq/handle.c > @@ -188,7 +188,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc) > > retval = __handle_irq_event_percpu(desc, &flags); > > - add_interrupt_randomness(desc->irq_data.irq, flags); > + add_interrupt_randomness(desc->irq_data.irq); > > if (!noirqdebug) > note_interrupt(desc, retval); > -- > 2.14.1
Re: Linux messages full of `random: get_random_u32 called from`
On Tue, May 01, 2018 at 08:56:04PM -0400, Theodore Y. Ts'o wrote: > On Tue, May 01, 2018 at 05:43:17PM -0700, Sultan Alsawaf wrote: > > > > I've attached what I think is a reasonable stopgap solution until this is > > actually fixed. If you're willing to revert the CVE-2018-1108 patches > > completely, then I don't think you'll mind using this patch in the meantime. > > I would put it slightly differently; reverting the CVE-2018-1108 > patches is less dangerous than what you are proposing in your attached > patch. > > Again, I think the right answer is to fix userspace to not require > cryptographic grade entropy during early system startup, and for > people to *think* about what they are doing. I've looked at the > systemd's use of hmac in journal-authenticate, and as near as I can > tell, there isn't any kind of explanation about why it was necessary, > or what threat it was trying to protect against. > > - Ted Why is /dev/urandom so much more dangerous than /dev/random? The more I search, the more I see that many sources consider /dev/urandom to be cryptographically secure... and since I hold down a single key on the keyboard to make my computer boot without any kernel workarounds, I'm sure the NSA would eventually notice my predictable behavior and get their hands on my Richard Stallman photos. Fixing all the "broken" userspace instances of entropy usage during early system startup is a tall order. What about barebone machines used as remote servers? I feel like just "fixing userspace" isn't going to cover all of the usecases that the CVE-2018-1108 patches broke. Sultan
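Much of this urandom-versus-random debate was later addressed by getrandom(2) (kernel 3.17, glibc wrapper since 2.25): with flags of 0 it blocks only until the CRNG has been seeded once at boot, and never again afterwards — roughly the behavior being argued for here. A hedged sketch of a robust wrapper (the retry loop and helper name are mine):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/random.h>		/* getrandom(); glibc >= 2.25 */

/*
 * Fill @buf with @len bytes from the kernel CSPRNG.  Blocks only if the
 * entropy pool has not yet been initialized at boot; once seeded, it
 * never blocks, unlike the post-CVE-2018-1108 /dev/random behavior
 * discussed in this thread.  Returns 0 on success, -1 on error.
 */
static int fill_random(void *buf, size_t len)
{
	unsigned char *p = buf;

	while (len) {
		ssize_t n = getrandom(p, len, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal */
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

Userspace that needs keys at early boot can call this instead of reading /dev/random, sidestepping the boot hangs described above.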
Re: Linux messages full of `random: get_random_u32 called from`
On Tue, May 01, 2018 at 05:35:56PM -0500, Justin Forbes wrote: > > I have not reproduced in GCE myself. We did get some confirmation > that removing dracut-fips does make the problem less dire (but I > wouldn't call a 4 minute boot a win, but booting in 4 minutes is > better than not booting at all). Specifically systemd calls libgcrypt > before it even opens the log with fips there, and this is before > virtio-rng modules could even load. Right now though, we are looking > at pretty much any possible options as the majority of people are > calling for me to backout the patches completely from rawhide. I've attached what I think is a reasonable stopgap solution until this is actually fixed. If you're willing to revert the CVE-2018-1108 patches completely, then I don't think you'll mind using this patch in the meantime. Sultan >From 5be2efdde744d3c55db3df81c0493fc67dc35620 Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Tue, 1 May 2018 17:36:17 -0700 Subject: [PATCH] random: use urandom instead of random for now and speed up crng init With the fixes for CVE-2018-1108, /dev/random now requires user-provided entropy on quite a few machines lacking high levels of boot entropy in order to complete its initialization. This causes issues on environments where userspace depends on /dev/random in order to finish booting completely (i.e., userspace will remain stuck, unable to boot, waiting for entropy more-or-less indefinitely until the user provides it via something like keystrokes or mouse movements). As a temporary workaround, redirect /dev/random to /dev/urandom instead, and speed up the initialization process by slightly relaxing the threshold for interrupts to go towards adding one bit of entropy credit (only until initialization is complete). 
Signed-off-by: Sultan Alsawaf
---
 drivers/char/mem.c    | 3 ++-
 drivers/char/random.c | 9 ++++++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffeb60d3434c..cc9507f01c79 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -870,7 +870,8 @@ static const struct memdev {
 #endif
 	 [5] = { "zero", 0666, &zero_fops, 0 },
 	 [7] = { "full", 0666, &full_fops, 0 },
-	 [8] = { "random", 0666, &random_fops, 0 },
+	/* Redirect /dev/random to /dev/urandom until /dev/random is fixed */
+	 [8] = { "random", 0666, &urandom_fops, 0 },
	 [9] = { "urandom", 0666, &urandom_fops, 0 },
 #ifdef CONFIG_PRINTK
	[11] = { "kmsg", 0644, &kmsg_fops, 0 },
diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e38523b383..bce3b43cdd3b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1200,9 +1200,12 @@ void add_interrupt_randomness(int irq)
 		return;
 	}
 
-	if ((fast_pool->count < 64) &&
-	    !time_after(now, fast_pool->last + HZ))
-		return;
+	if (fast_pool->count < 64) {
+		unsigned long timeout = crng_ready() ? HZ : HZ / 4;
+
+		if (!time_after(now, fast_pool->last + timeout))
+			return;
+	}
 
 	r = &input_pool;
 	if (!spin_trylock(&r->lock))
-- 
2.14.1
Re: [PATCH] random: remove unused argument from add_interrupt_randomness()
>From a872cf4f0bb57a7bb3c95ea557082fb7733344f8 Mon Sep 17 00:00:00 2001 From: Sultan Alsawaf Date: Sun, 29 Apr 2018 20:04:35 -0700 Subject: [PATCH] random: remove unused argument from add_interrupt_randomness() The irq_flags parameter is not used. Remove it. Signed-off-by: Sultan Alsawaf --- drivers/char/random.c | 4 ++-- drivers/hv/hv.c| 2 +- drivers/hv/vmbus_drv.c | 2 +- include/linux/random.h | 2 +- kernel/irq/handle.c| 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 3cd3aae24d6d..125c1f545277 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -131,7 +131,7 @@ * void add_device_randomness(const void *buf, unsigned int size); * void add_input_randomness(unsigned int type, unsigned int code, *unsigned int value); - * void add_interrupt_randomness(int irq, int irq_flags); + * void add_interrupt_randomness(int irq); * void add_disk_randomness(struct gendisk *disk); * * add_device_randomness() is for adding data to the random pool that @@ -1189,7 +1189,7 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs *regs) return *ptr; } -void add_interrupt_randomness(int irq, int irq_flags) +void add_interrupt_randomness(int irq) { struct entropy_store*r; struct fast_pool*fast_pool = this_cpu_ptr(&irq_randomness); diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c index 9b82549cbbc8..4b76ba23d666 100644 --- a/drivers/hv/hv.c +++ b/drivers/hv/hv.c @@ -115,7 +115,7 @@ static void hv_stimer0_isr(void) hv_cpu = this_cpu_ptr(hv_context.cpu_context); hv_cpu->clk_evt->event_handler(hv_cpu->clk_evt); - add_interrupt_randomness(stimer0_vector, 0); + add_interrupt_randomness(stimer0_vector); } static int hv_ce_set_next_event(unsigned long delta, diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index b10fe26c4891..41f4116f8c31 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -1015,7 +1015,7 @@ static void vmbus_isr(void) tasklet_schedule(&hv_cpu->msg_dpc); } - 
add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0); + add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR); } diff --git a/include/linux/random.h b/include/linux/random.h index 2ddf13b4281e..210400258fef 100644 --- a/include/linux/random.h +++ b/include/linux/random.h @@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {} extern void add_input_randomness(unsigned int type, unsigned int code, unsigned int value) __latent_entropy; -extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy; +extern void add_interrupt_randomness(int irq) __latent_entropy; extern void get_random_bytes(void *buf, int nbytes); extern int wait_for_random_bytes(void); diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c index 38554bc35375..e2f7afcb1ae6 100644 --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -188,7 +188,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc) retval = __handle_irq_event_percpu(desc, &flags); - add_interrupt_randomness(desc->irq_data.irq, flags); + add_interrupt_randomness(desc->irq_data.irq); if (!noirqdebug) note_interrupt(desc, retval); -- 2.14.1
Re: Linux messages full of `random: get_random_u32 called from`
On Sun, Apr 29, 2018 at 08:11:07PM -0400, Theodore Y. Ts'o wrote: > > What your patch does is assume that there is a full bit of uncertainty > that can be obtained from the information gathered from each > interrupt. I *might* be willing to assume that to be valid on x86 > systems that have a high resolution cycle counter. But on ARM > platforms, especially during system bootup when the user isn't typing > anything and SSD's and flash storage tend to have very predictable > timing patterns? Not a bet I'd be willing to take. Even with a cycle > counter, there's a reason why we assumed that we need to mix in timing > results from 64 interrupts or one second's worth before we would give > a single bit's worth of entropy credit. > > - Ted What about abusing high-resolution timers to get entropy? Since hrtimers can't make guarantees down to the nanosecond, there's always a skew between the requested expiry time and the actual expiry time. Please see the attached patch and let me know just how horrible it is. 
Sultan

From b0d21c38558c661531d4cb46816fbb36b874a169 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf
Date: Sun, 29 Apr 2018 21:28:08 -0700
Subject: [PATCH] random: use high-res timers to generate entropy until crng
 init is done

---
 drivers/char/random.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e38523b383..af2d60bbcec3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -286,6 +286,7 @@
 #define OUTPUT_POOL_WORDS	(1 << (OUTPUT_POOL_SHIFT-5))
 #define SEC_XFER_SIZE		512
 #define EXTRACT_SIZE		10
+#define ENTROPY_GEN_INTVL_NS	(1 * NSEC_PER_MSEC)
 
 #define LONGS(x) (((x) + sizeof(unsigned long) - 1)/sizeof(unsigned long))
@@ -408,6 +409,8 @@ static struct fasync_struct *fasync;
 static DEFINE_SPINLOCK(random_ready_list_lock);
 static LIST_HEAD(random_ready_list);
 
+static struct hrtimer entropy_gen_hrtimer;
+
 struct crng_state {
 	__u32 state[16];
 	unsigned long init_time;
@@ -2287,3 +2290,47 @@ void add_hwgenerator_randomness(const char *buffer, size_t count,
 	credit_entropy_bits(poolp, entropy);
 }
 EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
+
+/*
+ * Generate entropy on init using high-res timers. Although high-res timers
+ * provide nanosecond precision, they don't actually honor requests to the
+ * nanosecond. The skew between the expected time difference in nanoseconds and
+ * the actual time difference can be used as a way to generate entropy on boot
+ * for machines that lack sufficient boot-time entropy.
+ */
+static enum hrtimer_restart entropy_timer_cb(struct hrtimer *timer)
+{
+	static u64 prev_ns;
+	u64 curr_ns, delta;
+
+	if (crng_ready())
+		return HRTIMER_NORESTART;
+
+	curr_ns = ktime_get_mono_fast_ns();
+	delta = curr_ns - prev_ns;
+
+	add_interrupt_randomness(delta);
+
+	/* Use the hrtimer skew to make the next interval more unpredictable */
+	if (likely(prev_ns))
+		hrtimer_add_expires_ns(timer, delta);
+	else
+		hrtimer_add_expires_ns(timer, ENTROPY_GEN_INTVL_NS);
+
+	prev_ns = curr_ns;
+	return HRTIMER_RESTART;
+}
+
+static int entropy_gen_hrtimer_init(void)
+{
+	if (!IS_ENABLED(CONFIG_HIGH_RES_TIMERS))
+		return 0;
+
+	hrtimer_init(&entropy_gen_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	entropy_gen_hrtimer.function = entropy_timer_cb;
+	hrtimer_start(&entropy_gen_hrtimer, ns_to_ktime(ENTROPY_GEN_INTVL_NS),
+		      HRTIMER_MODE_REL);
+	return 0;
+}
+core_initcall(entropy_gen_hrtimer_init);
-- 
2.14.1
[PATCH] random: remove unused argument from add_interrupt_randomness()
>From cdc2a03f93fdec88ad040a212605e20ab97c3e19 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf
Date: Sun, 29 Apr 2018 20:04:35 -0700
Subject: [PATCH] random: remove unused argument from add_interrupt_randomness()

The irq_flags parameter is not used. Remove it.

Signed-off-by: Sultan Alsawaf
---
 drivers/char/random.c  | 4 ++--
 drivers/hv/vmbus_drv.c | 2 +-
 include/linux/random.h | 2 +-
 kernel/irq/handle.c    | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..d9e38523b383 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -131,7 +131,7 @@
  *	void add_device_randomness(const void *buf, unsigned int size);
  *	void add_input_randomness(unsigned int type, unsigned int code,
  *				unsigned int value);
- *	void add_interrupt_randomness(int irq, int irq_flags);
+ *	void add_interrupt_randomness(int irq);
  *	void add_disk_randomness(struct gendisk *disk);
  *
  * add_device_randomness() is for adding data to the random pool that
@@ -1164,7 +1164,7 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs *regs)
 	return *ptr;
 }
 
-void add_interrupt_randomness(int irq, int irq_flags)
+void add_interrupt_randomness(int irq)
 {
 	struct entropy_store	*r;
 	struct fast_pool	*fast_pool = this_cpu_ptr(&irq_randomness);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc65c4d79c1f..777096d560cb 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1016,7 +1016,7 @@ static void vmbus_isr(void)
 		tasklet_schedule(&hv_cpu->msg_dpc);
 	}
 
-	add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
+	add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
 }
 
diff --git a/include/linux/random.h b/include/linux/random.h
index 4024f7d9c77d..b71f87dbf7cd 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -32,7 +32,7 @@ static inline void add_latent_entropy(void) {}
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
 				 unsigned int value) __latent_entropy;
-extern void add_interrupt_randomness(int irq, int irq_flags) __latent_entropy;
+extern void add_interrupt_randomness(int irq) __latent_entropy;
 
 extern void get_random_bytes(void *buf, int nbytes);
 extern int wait_for_random_bytes(void);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 79f987b942b8..1bc4dcc489d0 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -186,7 +186,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
 
 	retval = __handle_irq_event_percpu(desc, &flags);
 
-	add_interrupt_randomness(desc->irq_data.irq, flags);
+	add_interrupt_randomness(desc->irq_data.irq);
 
 	if (!noirqdebug)
 		note_interrupt(desc, retval);
-- 
2.14.1
Re: Linux messages full of `random: get_random_u32 called from`
On Mon, Apr 30, 2018 at 12:43:48AM +0200, Jason A. Donenfeld wrote:
> > -	if ((fast_pool->count < 64) &&
> > -	    !time_after(now, fast_pool->last + HZ))
> > -		return;
> > -
>
> I suspect you still want the rate-limiting in place. But if you _do_
> want to cheat like this, you could instead just modify the condition
> to only relax the rate limiting when !crng_init().

Good idea. Attached a new patch that's less intrusive. It still fixes my
issue, of course.

Sultan

>From 6870b0383b88438d842599aa8608a260e6fb0ed2 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf
Date: Sun, 29 Apr 2018 15:44:27 -0700
Subject: [PATCH] random: don't ratelimit add_interrupt_randomness() until
 crng is ready

---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..8c00c008e797 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1201,7 +1201,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	}
 
 	if ((fast_pool->count < 64) &&
-	    !time_after(now, fast_pool->last + HZ))
+	    !time_after(now, fast_pool->last + HZ) && crng_ready())
 		return;
 
 	r = &input_pool;
-- 
2.14.1
Re: Linux messages full of `random: get_random_u32 called from`
On Sun, Apr 29, 2018 at 06:05:19PM -0400, Theodore Y. Ts'o wrote:
> It's more accurate to say that using /dev/urandom is no worse than
> before (from a few years ago). There are, alas, plenty of
> distributions and user space application programmers that basically
> got lazy using /dev/urandom, and assumed that there would be plenty of
> entropy during early system startup.
>
> When they switched over to getrandom(2), the most egregious examples
> of this caused pain (and they got fixed), but due to a bug in
> drivers/char/random.c, if getrandom(2) was called after the entropy
> pool was "half initialized", it would not block, but proceed.
>
> Is that exploitable? Well, Jann and I didn't find an _obvious_ way to
> exploit the shortcoming, which is why this wasn't treated like an
> emergency situation ala the embarrassing situation we had five years
> ago [1].
>
> [1] https://factorable.net/paper.html
>
> However, it was enough to make us uncomfortable, which is why I pushed
> the changes that I did. At least on the devices we had at hand, using
> the distributions that we typically use, the impact seemed minimal.
> Unfortunately, there is no way to know for sure without rolling out
> the change and seeing who screams. In the ideal world, software would
> not require cryptographic randomness immediately after boot, before
> the user logs in. And ***really***, as in [1], software should not be
> generating long-term public keys that are essential to the security of
> the box a few seconds after the device is first unboxed and plugged
> in.
>
> What would be useful is if people gave reports that listed exactly
> what laptop and distributions they are using. Just "a high spec x86
> laptop" isn't terribly useful, because *my* brand-new Dell XPS 13
> running Debian testing is working just fine. The year, model, make,
> and CPU type plus what distribution (and distro version number) you
> are running is useful, so I can assess how widespread the unhappiness
> is going to be, and what mitigation steps make sense.
>
> What mitigation steps can be taken?
>
> If you believe in security-through-complexity (the cache architecture
> of x86 is *so* complicated no one can understand it, so
> Jitterentropy / Haveged *must* be secure), or security-through-secrecy
> (the cache architecture of x86 is only available to internal
> architects inside Intel, so Jitterentropy / Haveged *must* be secure,
> never mind that the Intel CPU architects who were asked about it were
> "nervous"), then wiring up CONFIG_JITTERENTROPY or using haveged might
> be one approach.
>
> If you believe that Intel hasn't backdoored RDRAND, then installing
> rng-tools and running rngd with --enable-drng will enable RDRAND.
> That seems to be popular with various defense contractors, perhaps on
> the assumption that if it _was_ backdoored (no one knows for sure), it
> was probably with the connivance or request of the US government, who
> doesn't need to worry about spying on itself.
>
> Or you can use some kind of open hardware design RNG, such as
> ChaosKey [2] from Altus Metrum. But that requires using specially
> ordered hardware plugged into a USB slot, and it's probably not a mass
> solution.
>
> [2] https://altusmetrum.org/ChaosKey/
>
> Personally, I prefer fixing the software to simply not require
> cryptographic grade entropy before the user has logged in. Because
> it's better than the alternatives.
>
> 	- Ted

The attached patch fixes my crng init woes. With it, crng init completes
0.86 seconds into boot, but I can't help but feel like a solution this
obvious would just expose my Richard Stallman photo collection to prying
eyes at the NSA.

Thoughts on the patch?
Sultan

>From 597b0f2b3c986f853bf1d30a7fb9d76869e47fe8 Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf
Date: Sun, 29 Apr 2018 15:22:59 -0700
Subject: [PATCH] random: remove ratelimiting from add_interrupt_randomness()

---
 drivers/char/random.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38729baed6ee..5b38277b104a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -574,7 +574,6 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 
 struct fast_pool {
 	__u32		pool[4];
-	unsigned long	last;
 	unsigned short	reg_idx;
 	unsigned char	count;
 };
@@ -1195,20 +1194,14 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	    crng_fast_load((char *) fast_pool->pool,
 			   sizeof(fast_pool->pool))) {
Re: Linux messages full of `random: get_random_u32 called from`
On Sun, Apr 29, 2018 at 11:18:55PM +0200, Pavel Machek wrote:
> So -- I'm pretty sure systemd and friends should be using
> /dev/urandom. Maybe gpg wants to use /dev/random. _Maybe_.
>
> [    2.948192] random: systemd: uninitialized urandom read (16 bytes read)
> [    2.953526] systemd[1]: systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR)
> [    2.980278] systemd[1]: Detected architecture 'x86'.
> [    3.115072] usb 5-2: New USB device found, idVendor=0483, idProduct=2016, bcdDevice= 0.01
> [    3.119633] usb 5-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> [    3.124147] usb 5-2: Product: Biometric Coprocessor
> [    3.128621] usb 5-2: Manufacturer: STMicroelectronics
> [    3.163839] systemd[1]: Failed to insert module 'ipv6'
> [    3.181266] systemd[1]: Set hostname to .
> [    3.267243] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read)
> [    3.669590] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read)
> [    3.696242] random: systemd: uninitialized urandom read (16 bytes read)
> [    3.700066] random: systemd: uninitialized urandom read (16 bytes read)
> [    3.703716] random: systemd: uninitialized urandom read (16 bytes read)
>
> Anyway, urandom should need to be seeded once, and then provide random
> data forever... which is not the impression I get from the dmesg
> output above. Boot clearly proceeds... somehow. So now I'm confused.

Hmm... Well, the attached patch (which redirects /dev/random to
/dev/urandom) didn't fix my boot issue, so I'm at a loss as well.
Sultan

>From 15f54e2756866956d8713fdec92b54c6c69eb1bb Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf
Date: Sun, 29 Apr 2018 12:53:44 -0700
Subject: [PATCH] char: mem: Link /dev/random to /dev/urandom

---
 drivers/char/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffeb60d3434c..0cd22e6100ad 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -870,7 +870,7 @@ static const struct memdev {
 #endif
 	 [5] = { "zero", 0666, &zero_fops, 0 },
 	 [7] = { "full", 0666, &full_fops, 0 },
-	 [8] = { "random", 0666, &random_fops, 0 },
+	 [8] = { "random", 0666, &urandom_fops, 0 },
 	 [9] = { "urandom", 0666, &urandom_fops, 0 },
 #ifdef CONFIG_PRINTK
 	[11] = { "kmsg", 0644, &kmsg_fops, 0 },
-- 
2.14.1
Re: Linux messages full of `random: get_random_u32 called from`
On Sun, Apr 29, 2018 at 08:41:01PM +0200, Pavel Machek wrote: > Umm. No. https://www.youtube.com/watch?v=xneBjc8z0DE Okay, but /dev/urandom isn't a solution to this problem because it isn't usable until crng init is complete, so it suffers from the same init lag as /dev/random. Sultan
Re: Linux messages full of `random: get_random_u32 called from`
I'd also like to add that my high-spec x86 laptop exhibits the same issue as my Edgar Chromebook. Here's my dmesg: https://hastebin.com/dofejolobi.go The most interesting line: [ 90.811633] random: crng init done I waited 90 seconds after boot to provide entropy myself, at which point crng init completed. In other words, crng init only completed because I provided the entropy by smashing the keyboard. I could've waited longer and crng init wouldn't have completed without my input. Mind you, this laptop has a 45W CPU, so power savings were definitely not considered in its design. Do you have any machines that can provide enough boot entropy to satisfy crng init without requiring user-provided entropy? Sultan
Re: Linux messages full of `random: get_random_u32 called from`
On Sun, Apr 29, 2018 at 04:32:05PM +0200, Pavel Machek wrote: > Hi! > > > This is why ultimately, we do need to attack this problem from both > > ends, which means teaching userspace programs to only request > > cryptographic-grade randomness when it is really needed --- and most > > of the time, if the user has not logged in yet, you probably don't > > need cryptographic-grade randomness > > IOW moving them from /dev/random to /dev/urandom? > Pavel > > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) > http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html /dev/urandom isn't cryptographically secure, so that's not an option.
Re: Linux messages full of `random: get_random_u32 called from`
> On Thu, Apr 26, 2018 at 10:20:44PM -0700, Sultan Alsawaf wrote:
>> I noted at least 20,000 mmc interrupts before I intervened in the
>> boot process to provide entropy myself. That's just for mmc, so I'm
>> sure there were even more interrupts elsewhere. Is 20k+ interrupts
>> really not sufficient?
>
> How did you determine that there were 20,000 mmc interrupts before you
> had logged in? Did you have access to the machine w/o having access
> to the login prompt?
>
> I can send a patch (see attached) that will spew large amounts of logs
> as each interrupt comes in and the entropy pool is getting initialized.
> That's how I test things on QEMU, and Jann did something similar on a
> (physical) test machine, so I'm pretty confident that if you were
> getting interrupts, it would result in them contributing into the
> pool.
>
> You will need a serial console, or build a kernel with a much larger
> dmesg buffer, since if you really are getting that many interrupts it
> will cause a lot of log spew.
>
>> There are lots of other sources of entropy available as well, like
>> the ever-changing CPU frequencies reported by any recent Intel chip
>> (i.e., they report precision down to 1 kHz).
>
> That's something we could look at, but the problem is if there is some
> systemd unit during early boot that blocks waiting for the entropy
> pool to be initialized, the system will come to a dead halt, and even
> the CPU frequency shifts will probably not move much --- just as there
> weren't any interrupts while some system startup on the boot path
> wedges the system startup waiting for entropy.
>
> This is why ultimately, we do need to attack this problem from both
> ends, which means teaching userspace programs to only request
> cryptographic-grade randomness when it is really needed --- and most
> of the time, if the user has not logged in yet, you probably don't
> need cryptographic-grade randomness.
>
> 	- Ted
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index cd888d4ee605..69bd29f039e7 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -916,6 +916,10 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
> 		__u32	key[8];
> 	} buf;
> 
> +	if (crng == &primary_crng)
> +		pr_notice("random: crng_reseed primary from %px\n", r);
> +	else
> +		pr_notice("random: crng_reseed crng %px from %px\n", crng, r);
> 	if (r) {
> 		num = extract_entropy(r, &buf, 32, 16, 0);
> 		if (num == 0)
> @@ -1241,6 +1245,10 @@ void add_interrupt_randomness(int irq, int irq_flags)
> 	fast_pool->pool[2] ^= ip;
> 	fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
> 		get_reg(fast_pool, regs);
> +	if (crng_init < 2)
> +		pr_notice("random: add_interrupt(cycles=0x%08llx, now=%ld, "
> +			  "irq=%d, ip=0x%08lx)\n",
> +			  cycles, now, irq, _RET_IP_);
> 
> 	fast_mix(fast_pool);
> 	add_interrupt_bench(cycles);
> @@ -1282,6 +1290,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
> 
> 	/* award one bit for the contents of the fast pool */
> 	credit_entropy_bits(r, credit + 1);
> +	if (crng_init < 2)
> +		pr_notice("random: batched into pool in stage %d, bits now %d",
> +			  crng_init, ENTROPY_BITS(r));
> }
> EXPORT_SYMBOL_GPL(add_interrupt_randomness);

I dumped the contents of /proc/interrupts to dmesg using the attached
patch I threw together, and then waited a sufficient amount of time
before introducing entropy myself in order to ensure that the interrupt
readings were not contaminated by user-provided interrupts.
Here is the interesting snippet from my dmesg:

[   30.689076] /proc/interrupts dump:
            CPU0       CPU1       CPU2       CPU3
   0:          6          0          0          0  IO-APIC    2-edge      timer
   8:          0          0          1          0  IO-APIC    8-edge      rtc0
   9:          0        533          0          0  IO-APIC    9-fasteoi   acpi
  10:          0          0          0          0  IO-APIC   10-edge      tpm0
  29:          0          0          0          0  IO-APIC   29-fasteoi   intel_sst_driver
  36:        203          0          0          0  IO-APIC   36-fasteoi   808622C1:04
  37:          0        264          0          0  IO-APIC   37-fasteoi   808622C1:05
Re: Linux messages full of `random: get_random_u32 called from`
> The CRNG changes were needed because we were erroneously saying that
> the entropy pool was securely initialized before it really was.
> Saying that CRNG should be able to init on its own is much like
> saying, "Ted should be able to fly wherever he wants in his own
> personal Gulfstream V." It would certainly be _nice_ if I could afford
> my personal jet. I certainly wish I were that rich. But the problem is
> that dollars (or Euro's) are like entropy, they don't just magically
> drop out of the sky.
>
> If there isn't user-provided entropy, and the hardware isn't providing
> sufficient entropy, where did you think the kernel is supposed to get
> the entropy from? Should it dial 1-800-TRUST-NSA?
>
> From the dmesg log, you have a Chromebook Acer 14. I'm guessing the
> problem is that Chromebooks have hardware that tries *very* hard not
> to issue interrupts, since that helps with power savings. The
> following from your dmesg is very interesting:
>
> [    0.526786] tpm tpm0: [Firmware Bug]: TPM interrupt not working, polling instead
>
> I suspect this isn't a firmware bug; it's the hardware working as
> intended / working as designed, for power savings reasons.
>
> So there are two ways to fix this that I can see. One is to try to
> adjust userspace so that it allows the boot to proceed. As there is
> more activity, the disk completion interrupts, the user typing their
> username/password into the login screen, etc., there will be timing
> events which can be used to harvest entropy.
>
> The other approach would be to compile the kernel with
> CONFIG_HW_RANDOM_TPM and to modify drivers/char/tpm/tpm-chip.c to
> initialize chip->hwrng.quality = 500. We've historically made this
> something that the system administrator must set via sysfs. This is
> because we wanted system administrators to explicitly say that they
> trust the hardware manufacturer that (a) they haven't been paid by
> your choice of the Chinese MSS or the US NSA to introduce a backdoor,
> and (b) they are competent to actually implement a _secure_ hardware
> random number generator. Sadly, this has not always been the case.
> Please see:
>
> https://www.chromium.org/chromium-os/tpm_firmware_update
>
> And note that your Edgar Chromebook is on the list of devices that
> have a TPM with the buggy firmware. Fortunately this particular TPM
> bug only affects RSA prime generation, so as far as I know there is no
> _known_ vulnerability in your TPM's hardware random number generator.
> But we want it to be _your_ responsibility to decide you are willing
> to trust it. I certainly don't want to be legally liable --- or even
> have the moral responsibility --- of guaranteeing that every single
> TPM out there is bug-free(tm).
>
> 	- Ted

Why don't we tell users that they need to smash their keyboards to make
their computers boot then? And if they question it, we can tell them
that it certainly would be _nice_ to not have to smash their keyboards
to make their computers boot, but alas, a part of me has a feeling that
users would not take kindly to that :)

I noted at least 20,000 mmc interrupts before I intervened in the boot
process to provide entropy myself. That's just for mmc, so I'm sure
there were even more interrupts elsewhere. Is 20k+ interrupts really not
sufficient?

There are lots of other sources of entropy available as well, like the
ever-changing CPU frequencies reported by any recent Intel chip (i.e.,
they report precision down to 1 kHz). Why are we so limited to h/w
interrupts?

Sultan
Re: Linux messages full of `random: get_random_u32 called from`
> Hmm, it looks like the multiuser startup is getting blocked on snapd:
>
> 29.060s snapd.service
>
> graphical.target @1min 32.145s
> └─multi-user.target @1min 32.145s
> └─hddtemp.service @6.512s +28ms
> └─network-online.target @6.508s
> └─NetworkManager-wait-online.service @2.428s +4.079s
> └─NetworkManager.service @2.016s +404ms
> └─dbus.service @1.869s
> └─basic.target @1.824s
> └─sockets.target @1.824s
> └─snapd.socket @1.821s +1ms
> └─sysinit.target @1.812s
> └─apparmor.service @587ms +1.224s
> └─local-fs.target @585ms
> └─local-fs-pre.target @585ms
> └─keyboard-setup.service @235ms +346ms
> └─systemd-journald.socket @226ms
> └─system.slice @225ms
> └─-.slice @220ms
>
> This appears to be some kind of new package management system for
> Ubuntu:
>
> Description-en: Tool to interact with Ubuntu Core Snappy.
>  Install, configure, refresh and remove snap packages. Snaps are
>  'universal' packages that work across many different Linux systems,
>  enabling secure distribution of the latest apps and utilities for
>  cloud, servers, desktops and the internet of things.
>
> Why the Ubuntu package believes it needs to be fully started before
> the login screen can display is unclear to me. It might be worth
> using systemctl to disable snapd.service and see if that makes things
> work better for you.
>
> 	- Ted

I removed snapd completely, which did nothing. Here are new logs:

systemd-analyze blame: https://hastebin.com/edehikuyeb.css
systemd-analyze critical-chain: https://hastebin.com/vedufafema.pl
dmesg: https://hastebin.com/zuwuwoxadu.vbs

I should also note that leaving the system untouched does not result in
it booting: I must provide a source of entropy, otherwise it just stays
stuck. In both of the dmesgs I've given, I manually provided entropy to
the system after about 5 minutes of waiting. Also, regardless of what's
hanging on CRNG init, CRNG should be able to init on its own in a timely
manner without the need for user-provided entropy.
Userspace was working fine before the recent CRNG kernel changes, so I don't think this is a userspace bug. -Sultan
Re: Linux messages full of `random: get_random_u32 called from`
> Hmm, can you let the boot hang for a while? It should continue after
> a few minutes if you wait long enough, but wait a minute or two, then
> give it entropy so the boot can continue. Then can you use
> "systemd-analyze blame" or "systemd-analyze critical-chain" and we
> can see what process was trying to get randomness during the boot
> startup and blocking waiting for the CRNG to be fully initialized.
>
> 	- Ted

systemd-analyze blame: https://hastebin.com/ikipavevew.css
systemd-analyze critical-chain: https://hastebin.com/odoyuqeges.pl
dmesg: https://hastebin.com/waracebeja.vbs
Re: Linux messages full of `random: get_random_u32 called from`
> Thanks for the report! > > I assume since you're upgrading your own kernel, you must not be > running Chrome OS on your Acer CB3-431 Chromebook (Edgar). Are you > running Chromium --- or some Linux distribution on it? > > Thanks, > > - Ted Correct, I'm running Xubuntu 18.04 with my own kernel based off linux-stable.
Re: Linux messages full of `random: get_random_u32 called from`
I noticed "systems without sufficient boot randomness" and would like to add to this. With the changes to /dev/random going from 4.16.3 to 4.16.4, my low-spec Chromebook does not reach the login screen upon boot (it stays stuck on a black screen) until I provide a source of entropy to the system via interrupts (e.g., holding down a key on the keyboard for 5 sec or moving my finger across the touchpad a lot). After providing a source of entropy for long enough, "random: crng init done" prints out in dmesg and the login screen finally pops up. Detailed information on my system can be found on this bug report I recently worked on: https://bugzilla.kernel.org/show_bug.cgi?id=199463
Re: Mis-backport in af_unix patch for Linux 3.10.95
Hello, Thank you very much for the warm welcome :-) You're right, the noblock variable is used elsewhere in the stream receive function, so nothing is needed there after the interruptible logic is restored for the dgram function. Your patch looks good to me. I had picked the Linux 3.12.52 version of the patch (where the interruptible locking was removed from the right place in the stream receive function) onto my 3.10 kernel branch a couple weeks ago and it has been working fine for me. Thanks, Sultan On Sun, Jan 24, 2016 at 3:31 AM, Willy Tarreau wrote: > Hello, > > On Sun, Jan 24, 2016 at 12:10:35AM -0500, Sultan Qasim wrote: >> Hello all, >> >> I'm an outsider to the Linux kernel community, so I apologize if this >> is not the right channel to mention this. > > The simple fact that you participate, inspect the code and report bugs > makes you part of this community :-) It's indeed the right place. > Usually when reporting an issue with a commit, we also CC the whole > signed-off-by / CC chain of that commit (which I'm doing now). For > bugs related to networking, we usually CC the netdev list as well. > >> I noticed that the >> backported version of the patch "af_unix: Revert 'lock_interruptible' >> in stream receive code" in Linux 3.10.95 seems to have removed the >> mutex_lock_interruptible from the wrong function. >> >> Here is the backported patch: >> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3a57e783016bf43ab9326172217f564941b85b17 >> >> Here is the original: >> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/net/unix/af_unix.c?id=3822b5c2fc62e3de8a0f33806ff279fb7df92432 >> >> Was it not meant to be removed from unix_stream_recvmsg instead of >> unix_dgram_recvmsg? > > You're absolutely right, good catch! Similar controls were added to > both functions resulting in the same code appearing there, which > confused the patch process, causing the change to be applied to the > wrong location. 
This happens from time to time in such circumstances > when backporting to older kernels. > >> Also, the variable called "noblock" needs to be >> removed from the function being changed to prevent unused variable >> warnings. > > If you mean this variable in function unix_dgram_recvmsg(), it would > indeed report a warning but only due to the patch being mis-applied. > In unix_stream_recvmsg(), it's still used as well. > > Does the attached patch seem better to you (not compile-tested) ? > > Greg/Ben, both 3.2.76 and 3.14.59 are OK regarding this, it seems > like only 3.10.95 was affected. > > Thanks, > Willy >
Mis-backport in af_unix patch for Linux 3.10.95
Hello all, I'm an outsider to the Linux kernel community, so I apologize if this is not the right channel to mention this. I noticed that the backported version of the patch "af_unix: Revert 'lock_interruptible' in stream receive code" in Linux 3.10.95 seems to have removed the mutex_lock_interruptible from the wrong function. Here is the backported patch: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3a57e783016bf43ab9326172217f564941b85b17 Here is the original: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/net/unix/af_unix.c?id=3822b5c2fc62e3de8a0f33806ff279fb7df92432 Was it not meant to be removed from unix_stream_recvmsg instead of unix_dgram_recvmsg? Also, the variable called "noblock" needs to be removed from the function being changed to prevent unused variable warnings. Sultan Q. Khan