Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match
Arnd Bergmann wrote on Mon, Apr 19, 2021 at 02:16:36PM +0200: > In some cases, you can use the device_link infrastructure to deal > with dependencies between devices. Not sure if this would help > in your case, but have a look at device_link_add() etc in drivers/base/core.c I'll need to actually try to convince myself but if creating the link forces driver registration then it should be workable. > > In this particular case the problem is that since 7d981405d0fd ("soc: > > imx8m: change to use platform driver") the soc probe tries to use the > > nvmem driver for ocotp fuses for imx8m devices, which isn't ready yet. > > So soc loading gets pushed back to the end of the list because it gets > > defered and other drivers relying on soc_device_match get confused > > because they wrongly think a device doesn't match a quirk when it > > actually does. > > > > If there is a way to ensure the nvmem driver gets loaded before the soc, > > that would also solve the problem nicely, and avoid the need to mess > > with all the ~50 drivers which use it. > > > > Is there a way to control in what order drivers get loaded? Something in > > the dtb perhaps? > > For built-in drivers, load order depends on the initcall level and > link order (how things are lined listed in the Makefile hierarchy). > > For loadable modules, this is up to user space in the end. > > Which of the drivers in this scenario are loadable modules? All the drivers involved in my case are built-in (nvmem, soc and final soc_device_match consumer e.g. caam_jr that crashes the kernel if soc is not identified properly). I frankly don't like the idea of moving nvmem/ above soc/ in drivers/Makefile as a "solution" to this (especially as there is one that seems to care about what soc they run on...), so I'll have a look at links first, hopefully that will work out. Thanks, -- Dominique
Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match
Geert Uytterhoeven wrote on Mon, Apr 19, 2021 at 11:03:24AM +0200: > > This is going to need quite some more work to be acceptable, in my > > opinion, but I think it should be possible. > > In general, this is very hard to do, IMHO. Some drivers may be used on > multiple platforms, some of them registering an SoC device, some of > them not registering an SoC device. So there is no way to know the > difference between "SoC device not registered, intentionally", and > "SoC device not yet registered". Hm, good point, I was probably a bit too optimistic if there are devices which don't register any soc yet have drivers which want one; I don't see how to make the difference indeed... And that does mean we can't just defer all the time. > soc_device_match() should only be used as a last resort, to identify > systems that cannot be identified otherwise. Typically this is used for > quirks, which should only be enabled on a very specific subset of > systems. IMHO such systems should make sure soc_device_match() > is available early, by registering their SoC device early. I definitely agree there, my suggestion to defer was only because I know of no other way to influence the ordering of drivers loading reliably and gave up on soc being init'd early. In this particular case the problem is that since 7d981405d0fd ("soc: imx8m: change to use platform driver") the soc probe tries to use the nvmem driver for ocotp fuses for imx8m devices, which isn't ready yet. So soc loading gets pushed back to the end of the list because it gets defered and other drivers relying on soc_device_match get confused because they wrongly think a device doesn't match a quirk when it actually does. If there is a way to ensure the nvmem driver gets loaded before the soc, that would also solve the problem nicely, and avoid the need to mess with all the ~50 drivers which use it. Is there a way to control in what order drivers get loaded? Something in the dtb perhaps? Thanks, -- Dominique
Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match
Alice Guo (OSS) wrote on Mon, Apr 19, 2021 at 12:27:22PM +0800: > From: Alice Guo > > Update all the code that use soc_device_match A single patch might be difficult to accept for all components, a each maintainer will probably want to have a say on their subsystem? I would suggest to split these for a non-RFC version; a this will really need to be case-by-case handling. > because add support for soc_device_match returning -EPROBE_DEFER. (English does not parse here for me) I've only commented a couple of places in the code itself, but this doesn't seem to add much support for errors, just sweep the problem under the rug. > Signed-off-by: Alice Guo > --- > > diff --git a/drivers/bus/ti-sysc.c b/drivers/bus/ti-sysc.c > index 5fae60f8c135..00c59aa217c1 100644 > --- a/drivers/bus/ti-sysc.c > +++ b/drivers/bus/ti-sysc.c > @@ -2909,7 +2909,7 @@ static int sysc_init_soc(struct sysc *ddata) > } > > match = soc_device_match(sysc_soc_feat_match); > - if (!match) > + if (!match || IS_ERR(match)) > return 0; This function handles errors, I would recommend returning the error as is if soc_device_match returned one so the probe can be retried later. > > if (match->data) > diff --git a/drivers/clk/renesas/r8a7795-cpg-mssr.c > b/drivers/clk/renesas/r8a7795-cpg-mssr.c > index c32d2c678046..90a18336a4c3 100644 > --- a/drivers/clk/renesas/r8a7795-cpg-mssr.c > +++ b/drivers/clk/renesas/r8a7795-cpg-mssr.c > @@ -439,6 +439,7 @@ static const unsigned int r8a7795es2_mod_nullify[] > __initconst = { > > static int __init r8a7795_cpg_mssr_init(struct device *dev) > { > + const struct soc_device_attribute *match; > const struct rcar_gen3_cpg_pll_config *cpg_pll_config; > u32 cpg_mode; > int error; > @@ -453,7 +454,8 @@ static int __init r8a7795_cpg_mssr_init(struct device > *dev) > return -EINVAL; > } > > - if (soc_device_match(r8a7795es1)) { > + match = soc_device_match(r8a7795es1); > + if (!IS_ERR(match) && match) { Same, return the error. Assuming an error means no match will just lead to hard to debug problems because the driver potentially assumed the wrong device when it's just not ready yet. > cpg_core_nullify_range(r8a7795_core_clks, > ARRAY_SIZE(r8a7795_core_clks), > R8A7795_CLK_S0D2, R8A7795_CLK_S0D12); > [...] > diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c > index eaaec0a55cc6..13a06b613379 100644 > --- a/drivers/iommu/ipmmu-vmsa.c > +++ b/drivers/iommu/ipmmu-vmsa.c > @@ -757,17 +757,20 @@ static const char * const devices_allowlist[] = { > > static bool ipmmu_device_is_allowed(struct device *dev) > { > + const struct soc_device_attribute *match1, *match2; > unsigned int i; > > /* >* R-Car Gen3 and RZ/G2 use the allow list to opt-in devices. >* For Other SoCs, this returns true anyway. >*/ > - if (!soc_device_match(soc_needs_opt_in)) > + match1 = soc_device_match(soc_needs_opt_in); > + if (!IS_ERR(match1) && !match1) I'm not sure what you intended to do, but !match1 already means there is no error so the original code is identical. In this case ipmmu_device_is_allowed does not allow errors so this is one of the "difficult" drivers that require slightly more thinking. It is only called in ipmmu_of_xlate which does return errors properly, so in this case the most straightforward approach would be to make ipmmu_device_is_allowed return an int and forward errors as well. ... This is going to need quite some more work to be acceptable, in my opinion, but I think it should be possible. Thanks, -- Dominique
Re: [RFC v1 PATCH 1/3] drivers: soc: add support for soc_device_match returning -EPROBE_DEFER
First comment overall for the whole serie: Since it is the solution I had suggested when I reported the problem[1] I have no qualm on the approach, comments for individual patches follow. [1] http://lore.kernel.org/r/YGGZJjAxA1IO+/v...@atmark-techno.com Alice Guo (OSS) wrote on Mon, Apr 19, 2021 at 12:27:20PM +0800: > From: Alice Guo > > In i.MX8M boards, the registration of SoC device is later than caam > driver which needs it. Caam driver needs soc_device_match to provide > -EPROBE_DEFER when no SoC device is registered and no > early_soc_dev_attr. This patch should be last in the set: you can't have soc_device_match return an error before its callers handle it. > Signed-off-by: Alice Guo As the one who reported the problem I would have been appreciated being at least added to Ccs... I only happened to notice you posted this by chance. There is also not a single Fixes tag -- I believe this commit should have Fixes: 7d981405d0fd ("soc: imx8m: change to use platform driver") but I'm not sure how such tags should be handled in case of multiple patches fixing something. -- Dominique
Re: [PATCH] net: 9p: free what was emitted when read count is 0
Jisheng Zhang wrote on Tue, Mar 02, 2021 at 03:39:40PM +0800: > > Rather than make an exception for 0, how about just removing the if as > > follow ? > > IMHO, we may need to keep the "if" in current logic. When count > reaches zero, we need to break the "while(iov_iter_count(to))" loop, so > removing > the "if" modifying the logic. We're not looking at the same loop, the break will happen properly without the if because it's the return value of p9_client_read_once() now. In the old code I remember what you're saying and it makes sense, I guess that was the reason for the special case. It's not longer required, let's remove it. -- Dominique
Re: [PATCH] net: 9p: free what was emitted when read count is 0
Jisheng Zhang wrote on Mon, Mar 01, 2021 at 11:01:57AM +0800: > Per my understanding of iov_iter, we need to call iov_iter_advance() > even when the read out count is 0. I believe we can see this common style > in other fs. I'm not sure where you see this style, but I don't see exceptions for 0-sized read not advancing the iov in general, and I guess this makes sense. Rather than make an exception for 0, how about just removing the if as follow ? I've checked that the non_zc case (copy_to_iter with 0 size) also works to the same effect, so I'm not sure why the check got added in the first place... But then again this is old code so maybe the semantics changed since 2015. diff --git a/net/9p/client.c b/net/9p/client.c index 4f62f299da0c..0a0039255c5b 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -1623,11 +1623,6 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to, } p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count); - if (!count) { - p9_tag_remove(clnt, req); - return 0; - } - if (non_zc) { int n = copy_to_iter(dataptr, count, to); If you're ok with that, would you mind resending that way? I'd also want the commit message to be reworded a bit, at least the first line (summary) doesn't make sense right now: I have no idea what you mean by "free what was emitted". Just "9p: advance iov on empty read" or something similar would do. > > cat version? coreutils' doesn't seem to do that on their git) > > busybox cat Ok, could reproduce with busybox cat, thanks. As expected I can't reproduce with older kernels so will run a bisect for the sake of it as time allows -- Dominique
Re: [PATCH] net: 9p: free what was emitted when read count is 0
Jisheng Zhang wrote on Mon, Mar 01, 2021 at 10:33:36AM +0800: > I met below warning when cating a small size(about 80bytes) txt file > on 9pfs(msize=2097152 is passed to 9p mount option), the reason is we > miss iov_iter_advance() if the read count is 0, so we didn't truncate > the pipe, then iov_iter_pipe() thinks the pipe is full. Fix it by > calling iov_iter_advance() on the iov_iter "to" even if read count is 0 Hm, there are plenty of other error cases that don't call iov_iter_advance() and shouldn't trigger this warning ; I'm not sure just adding one particular call to this is a good solution. How reproducible is this? From the description it should happen everytime you cat a small file? (I'm surprised cat uses sendfile, what cat version? coreutils' doesn't seem to do that on their git) What kernel version do you get this on? Bonus points if you can confirm this didn't use to happen, and full points for a bisect. (cat on a small file is something I do all the time in my tests, I'd like to be able to reproduce to understand the issue better as I'm not familiar with that part of the code) Thanks, -- Dominique
Re: [PATCH] 9p: fix: Uninitialized variable p.
YANG LI wrote on Wed, Dec 30, 2020: > The pointer p is being used but it isn't being initialized, > need to assign a NULL to it. My understanding is p has to be initialized: the only way out of the while(1) loop below sets it. I don't mind fixing false positive warnings as it makes maintenance easier for everyone, but: 1/ the commit message needs to reflect that and at least name which tool had a problem with it. I'm tempted to think this case is common enough that the tool ought to be fixed instead... 2/ the code needs to work in the p=NULL case if you set it that way (right now, it doesn't, so in the event the code changes in the future and there really comes a way to skip initialization this change would actually hinder bug detection -- it'd need to add a p == NULL check below, which is going to be useless code, but hopefully compilers will optimize it away) Thanks, -- Dominique
[GIT PULL] 9p update for 5.11-rc1
Hi Linus, here's this cycle's update, finally finished on some very old patches (originally april 2015!) to allow fixing open-unlink-fgetattr pattern. Thanks to Eric, Greg and Jianyong for the bulk of the work, and Dan for static analysis fixes on -next. The following changes since commit 3cea11cd5e3b00d91caf0b4730194039b45c5891: Linux 5.10-rc2 (2020-11-01 14:43:51 -0800) are available in the Git repository at: https://github.com/martinetd/linux tags/9p-for-5.11-rc1 for you to fetch changes up to cfd1d0f524a87b7d6d14b41a14fa4cbe522cf8cc: 9p: Remove unnecessary IS_ERR() check (2020-12-01 08:19:02 +0100) 9p for 5.11-rc1 - fix long-standing limitation on open-unlink-fop pattern - add refcount to p9_fid (fixes the above and will allow for more cleanups and simplifications in the future) Dan Carpenter (2): 9p: Uninitialized variable in v9fs_writeback_fid() 9p: Remove unnecessary IS_ERR() check Dominique Martinet (2): 9p: apply review requests for fid refcounting 9p: Fix writeback fid incorrectly being attached to dentry Eric Van Hensbergen (1): fs/9p: fix create-unlink-getattr idiom Greg Kurz (2): fs/9p: track open fids fs/9p: search open fids first Jianyong Wu (1): 9p: add refcount to p9_fid struct fs/9p/fid.c | 65 - fs/9p/fid.h | 11 ++- fs/9p/vfs_dentry.c | 2 ++ fs/9p/vfs_dir.c | 6 +- fs/9p/vfs_file.c| 7 --- fs/9p/vfs_inode.c | 47 ++- fs/9p/vfs_inode_dotl.c | 35 +-- fs/9p/vfs_super.c | 1 + fs/9p/xattr.c | 16 +--- include/net/9p/client.h | 7 +++ net/9p/client.c | 14 +- 11 files changed, 178 insertions(+), 33 deletions(-) -- Dominique
Re: [PATCH net-next] net: 9p: Fix kerneldoc warnings of missing parameters etc
Andrew Lunn wrote on Sun, Nov 01, 2020: > > > Signed-off-by: Andrew Lunn Acked-by: Dominique Martinet > > > > Thanks, LGTM I'll take this for next cycle unless someone is grabbing > > these > > I hope to turn on W=1 by default soon in most of /net. That patch is > likely to go to net-next. That would be nice! > What route do your patches normally take to Linus? Do you send a pull > request to net-next? Or straight to Linus? I normally send pull requests straight to Linus (because I also have fs/9p which isn't part of net/) ; but since it's really low volume I don't like bugging him everytime for such churn and am not really sure what to do -- that's why I asked :) > If this patch is not in net-next, i cannot enable it for 9p. So > either: > [...] > 4) Jakub takes this patch into net-next, and i can then enable W=1 in >9p along with all the other sub-directories. We will get to know >about new warnings in net-next, and next, but not in your tree. Developers should use next for development anyway; I think that's the easiest way forward if you want to enable W=1 ASAP. I mean, if I take the patch the fixes will get in next in the next few days sure but it'll make enabling W=1 difficult for the net-next tree without it. I've added Jakub to direct recipients, could you take this one? Thanks, -- Dominique
Re: [PATCH net-next] net: 9p: Fix kerneldoc warnings of missing parameters etc
Andrew Lunn wrote on Sat, Oct 31, 2020: > net/9p/client.c:420: warning: Function parameter or member 'c' not described > in 'p9_client_cb' > net/9p/client.c:420: warning: Function parameter or member 'req' not > described in 'p9_client_cb' > net/9p/client.c:420: warning: Function parameter or member 'status' not > described in 'p9_client_cb' > net/9p/client.c:568: warning: Function parameter or member 'uidata' not > described in 'p9_check_zc_errors' > net/9p/trans_common.c:23: warning: Function parameter or member 'nr_pages' > not described in 'p9_release_pages' > net/9p/trans_common.c:23: warning: Function parameter or member 'pages' not > described in 'p9_release_pages' > net/9p/trans_fd.c:132: warning: Function parameter or member 'rreq' not > described in 'p9_conn' > net/9p/trans_fd.c:132: warning: Function parameter or member 'wreq' not > described in 'p9_conn' > net/9p/trans_fd.c:56: warning: Function parameter or member 'privport' not > described in 'p9_fd_opts' > net/9p/trans_rdma.c:113: warning: Function parameter or member 'cqe' not > described in 'p9_rdma_context' > net/9p/trans_rdma.c:129: warning: Function parameter or member 'privport' not > described in 'p9_rdma_opts' > net/9p/trans_virtio.c:215: warning: Function parameter or member 'limit' not > described in 'pack_sg_list_p' > net/9p/trans_virtio.c:83: warning: Function parameter or member 'chan_list' > not described in 'virtio_chan' > net/9p/trans_virtio.c:83: warning: Function parameter or member > 'p9_max_pages' not described in 'virtio_chan' > net/9p/trans_virtio.c:83: warning: Function parameter or member > 'ring_bufs_avail' not described in 'virtio_chan' > net/9p/trans_virtio.c:83: warning: Function parameter or member 'tag' not > described in 'virtio_chan' > net/9p/trans_virtio.c:83: warning: Function parameter or member 'vc_wq' not > described in 'virtio_chan' > > Signed-off-by: Andrew Lunn Thanks, LGTM I'll take this for next cycle unless someone is grabbing these -- Dominique
[GIT PULL] 9p update for 5.10-rc1
Hi Linus, another harmless cycle. (sorry latest commit's message isn't great, I was half expecting a v2 but it didn't come and I remembered too late/didn't want to reword it myself; and it's still worth taking as is) Thanks, The following changes since commit 549738f15da0e5a00275977623be199fbbf7df50: Linux 5.9-rc8 (2020-10-04 16:04:34 -0700) are available in the Git repository at: https://github.com/martinetd/linux tags/9p-for-5.10-rc1 for you to fetch changes up to 7ca1db21ef8e0e6725b4d25deed1ca196f7efb28: net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid (2020-10-12 10:05:47 +0200) 9p pull request for inclusion in 5.10 A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized variable warning) and code cleanup (xen) Anant Thazhemadam (1): net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid Matthew Wilcox (Oracle) (1): 9P: Cast to loff_t before multiplying Ye Bin (1): 9p/xen: Fix format argument warning fs/9p/vfs_file.c | 4 ++-- net/9p/trans_fd.c | 2 +- net/9p/trans_xen.c | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) -- Dominique
Re: [PATCH net] net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
Anant Thazhemadam wrote on Mon, Oct 12, 2020: > You mentioned how a fully zeroed address isn't exactly faulty. By extension, > wouldn't that > mean that an address that simply begins with a 0 isn't faulty as well? That is correct. If you have a look at the unix(7) man page that describes AF_UNIX, it describes what 'abstract' addresses are and unix_mkname() in linux's net/unix/af_unix.c shows how it's handled. > This is an interesting point, because if the condition is modified to > checking for addr[0] directly, > addresses that simply begin with 0 (but have more non-zero content following) > wouldn't be > copied over either, right? Yes, we would reject any address that starts with a nul byte -- but that is already exactly what your patch does with strlen() already: a '\0' at the start of the string is equivalent to strlen(addr) == 0. The only difference is that checking for addr[0] won't run through all the string if it doesn't start with a nul byte; but this is a one-time thing at mount so it really doesn't matter. > In the end, it comes down to what you define as a "valid" value that sun_path > can have. > We've already agreed that a fully zeroed address wouldn't qualify as a valid > value for sun_path. > Are addresses that aren't fully zeroed, but only begin with a 0 also to be > considered as an > unacceptable value for sun_path? Yes, because the strcpy() a few lines below would copy nothing, leaving sun_server.sun_path uninitialized like your example. At that point you could ask why not "fix" that strcpy to properly copy the address passed instead but that doesn't really make sense given where 'addr' comes from: it's passed from userspace as a nul-terminated string, so nothing after the first '\0' is valid. There probably are ways to work around that (e.g. iproute's ss will display abstract addresses with a leading '@' instead) but given nobody ever seemed to care I think it's safe to just return EINVAL there like you did ; there's nothing wrong with your patch as far as I'm concerned. -- Dominique
Re: [PATCH net] net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
Anant Thazhemadam wrote on Mon, Oct 12, 2020: > In p9_fd_create_unix, checking is performed to see if the addr (passed > as an argument) is NULL or not. > However, no check is performed to see if addr is a valid address, i.e., > it doesn't entirely consist of only 0's. > The initialization of sun_server.sun_path to be equal to this faulty > addr value leads to an uninitialized variable, as detected by KMSAN. > Checking for this (faulty addr) and returning a negative error number > appropriately, resolves this issue. I'm not sure I agree a fully zeroed address is faulty but I agree we can probably refuse it given userspace can't pass useful abstract addresses here. Just one nitpick but this is otherwise fine - good catch! > Reported-by: syzbot+75d51fe5bf4ebe988...@syzkaller.appspotmail.com > Tested-by: syzbot+75d51fe5bf4ebe988...@syzkaller.appspotmail.com > Signed-off-by: Anant Thazhemadam > --- > net/9p/trans_fd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c > index c0762a302162..8f528e783a6c 100644 > --- a/net/9p/trans_fd.c > +++ b/net/9p/trans_fd.c > @@ -1023,7 +1023,7 @@ p9_fd_create_unix(struct p9_client *client, const char > *addr, char *args) > > csocket = NULL; > > - if (addr == NULL) > + if (!addr || !strlen(addr)) Since we don't care about the actual length here, how about checking for addr[0] directly? That'll spare a strlen() call in the valid case. Well, I guess it doesn't really matter -- I'll queue this up anyway and update if you resend. Thanks, -- Dominique
Re: INFO: task can't die in p9_fd_close
Mark Brown wrote on Wed, Aug 26, 2020: > On Wed, Aug 26, 2020 at 03:38:15AM -0700, syzbot wrote: > > > console output: https://syzkaller.appspot.com/x/log.txt?x=10615b3690 > > kernel config: https://syzkaller.appspot.com/x/.config?x=a61d44f28687f508 > > dashboard link: https://syzkaller.appspot.com/bug?extid=fbe34b643e462f65e542 > > compiler: gcc (GCC) 10.1.0-syz 20200507 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15920a0590 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=13a7853990 > > > > The issue was bisected to: > > > > commit af3acca3e35c01920fe476f730dca7345d0a48df > > Author: Daniel Baluta > > Date: Tue Feb 20 12:53:10 2018 + > > > > ASoC: ak5558: Fix style for SPDX identifier > > This bisection is clearly not accurate, I'm guessing the bug is > intermittent and it was just luck that landed it on this commit. It's a bug that's been present since day 1 pretty much. I have a fix that had been overcooking for a while which I had planned to take in this cycle -- I'll submit to -next during next week, so hopefully syzbot will be able to spend its time more usefully after that. -- Dominique
[GIT PULL] 9p update for 5.9-rc1
Hi Linus, the usual small stuff. Thanks! The following changes since commit 74d6a5d5662975aed7f25952f62efbb6f6dadd29: 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work (2020-07-19 14:58:47 +0200) are available in the Git repository at: https://github.com/martinetd/linux tags/9p-for-5.9-rc1 for you to fetch changes up to 2ed0b7578170c3bab10cde09d4440897b305e40c: 9p: Remove unneeded cast from memory allocation (2020-07-31 07:28:25 +0200) 9p pull request for inclusion in 5.9 - some code cleanup - a couple of static analysis fixes - setattr: try to pick a fid associated with the file rather than the dentry, which might sometimes matter Alexander Kapshuk (1): net/9p: Fix sparse endian warning in trans_fd.c Jianyong Wu (2): 9p: retrieve fid from file when file instance exist. 9p: remove unused code in 9p Li Heng (1): 9p: Remove unneeded cast from memory allocation Zheng Bin (1): 9p: Fix memory leak in v9fs_mount fs/9p/v9fs.c | 5 ++--- fs/9p/vfs_inode.c | 65 - fs/9p/vfs_inode_dotl.c | 9 +++-- net/9p/trans_fd.c | 2 +- 4 files changed, 18 insertions(+), 63 deletions(-)
Re: 9p/trans_fd lockup
Alexey Kardashevskiy wrote on Thu, Aug 06, 2020: > I am seeing another bug in 9p under syzkaller, the reprocase is: > > r0 = open$dir(&(0x7f40)='./file0\x00', 0x88142, 0x182) > > r1 = openat$null(0xff9c, &(0x7f000640)='/dev/null\x00', > 0x0, 0x0) > mount$9p_fd(0x0, &(0x7f00)='./file0\x00', > &(0x7fc0)='9p\x00', 0x0, &(0x7f000100)={'trans=fd,', > {'rfdno', 0x3d, r1}, 0x2$, {'wfdno', 0x3d, r0}}) > > > > The default behaviour of syzkaller is to call syscalls concurrently (I > think), at least it forks by default and executes the same sequence in > both threads. > > In this example both threads makes it to: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/9p/client.c?h=v5.8#n757 > > and sit there with the only difference which is thread#1 goes via > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/9p/client.c?h=v5.8#n767 > > I am pretty sure things should not have gone that far but I cannot > clearly see what needs fixing. Ideas? Thanks, Unkillable threads there happen with the current p9_client_rpc when there is no real server (or server bug etc); the code is really stupid. Basically what happens is that when you send a first signal (^C or whatever), the function catches the signal, sends a flush, and indefinitely waits for the flush to come back. If you send another signal there no more flush comes but it goes back to waiting -- it's using wait_event_killable but it's a lie it's not really killable it just loops on that wait until the flush finally comes, which will never come in your case. (the rpc that came by the way is probably version or whatever is first done on mount) Dmitry reported that to me ages ago and I have a fix which is just to stop waiting for the flush -- just make it asynchronous, send and forget. That removes the whole signal handling logic and it won't hang there anymore. I sent the patches to the list last year, but didn't get much feedback and didn't have time to run all the tests I wanted to run on it. I have some free time at the end of the month so I was planning to finish it for 5.10 (e.g. won't send it for 5.9 but once 5.9 initial merge window passed leave it in -next for a couple of months and push it for 5.10), so your timing is pretty good :) An extra pair of eyes would be more than appreciated. You can find the original mails there: https://lore.kernel.org/lkml/1544532108-21689-3-git-send-email-asmad...@codewreck.org/ They're also in my 9p-test branch on git://github.com/martinetd/linux Cheers & thanks for the attention, -- Dominique
[GIT PULL] 9p update for 5.8
Sorry for the late request, I took some time to fix my test setup and to be convinced these two patches are worth sending now and not wait until the next merge window. (the "weird" -2 at the end of the tag is because I had already used 9p-for-5.8 for the original -rc1 pull) The following changes since commit 11ba468877bb23f28956a35e896356252d63c983: Linux 5.8-rc5 (2020-07-12 16:34:50 -0700) are available in the Git repository at: https://github.com/martinetd/linux tags/9p-for-5.8-2 for you to fetch changes up to 74d6a5d5662975aed7f25952f62efbb6f6dadd29: 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work (2020-07-19 14:58:47 +0200) A couple of syzcaller fixes for 5.8 the first one in particular has been quite noisy ("broke" in -rc5) so this would be worth landing even this late even if users likely won't see a difference Christoph Hellwig (1): net/9p: validate fds in p9_fd_open Wang Hai (1): 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work net/9p/trans_fd.c | 39 ++- 1 file changed, 30 insertions(+), 9 deletions(-) -- Dominique
Re: [V9fs-developer] [PATCH kernel] 9p/trans_fd: Check file mode at opening
Greg Kurz wrote on Tue, Jul 28, 2020: > > The "fd" transport layer uses 2 file descriptors passed externally > > and calls kernel_write()/kernel_read() on these. If files were opened > > without FMODE_WRITE/FMODE_READ, WARN_ON_ONCE() will fire. There already is a fix in linux-next as a39c46067c84 ("net/9p: validate fds in p9_fd_open") > > This adds file mode checking in p9_fd_open; this returns -EBADF to > > preserve the original behavior. > > So this would cause open() to fail with EBADF, which might look a bit > weird to userspace since it didn't pass an fd... Is this to have a > different error than -EIO that is returned when either rfd or wfd > doesn't point to an open file descriptor ? If yes, why do we care ? FWIW the solution taken just returns EIO as it would if an invalid fd was given, but since it did pass an fd EBADF actually makes sense to me? However to the second question I'm not sure I care :) > > Found by syzkaller. I'm starting to understand where David comment came from the other day, I guess it's still time to change my mind and submit to linus now I've had time to test it... -- Dominique
Re: [PATCH] net/9p: validate fds in p9_fd_open
David Miller wrote on Wed, Jul 15, 2020: > From: Dominique Martinet > Date: Wed, 15 Jul 2020 15:47:56 +0200 > > It's honestly just a warn on something that would fail anyway so I'd > > rather let it live in -next first, I don't get why syzbot is so verbose > > about this - it sent a mail when it found a c repro and one more once it > > bisected the commit yesterday but it should not be sending more? > > I honestly find it hard to understand the resistence to fixing the > warning in mainline. > > I merge such fixes aggressively. Well, if it was something a user could ever see with normal (even exotic) 9p workloads, sure; I would want that in mainline asap and do what's required in a hurry. But this warning only happens when passing fd that are invalid, so the mount would fail with EIO anyway, and it's not a dos either -- I don't see the harm really. Someone who'd get errors anyway will just get slightly more verbose errors (and for people like me with kernel.panic_on_warn set it'll even crash their machines sure), and "normal" users won't ever see it -- I see no reason to rush this. It's not about the "extra work" of sending things to linus in a single patch PR (it's honestly a wonder 9p gets maintainers at all, the volume of patches doesn't really mandate it), but I need to fix tests first anyway as said previously. I've spent a couple of hours on it yesterday, and should be able to get things running again soonish -- meanwhile I'm not comfortable sending any patch anywhere anyway. Yes given the patch content it's probably fine but syzbot doesn't test that a 9p mount with a fd argument works, just that there's no warning / crash, so for all I know we could just be returning -EIO early and calling it a fix. I don't see any reason this would fail, but the point of tests is to... test things work the things we think they do? Anyways, if you care about this feel free to take the patch and send it along with your process earlier. I'm just stubborn in not wanting to send things I could test untested and it came at a bad time / don't think this is critical enough to manually test. Then again I probably just spent more time arguing about it than it would have taken to test... (if you do please fix the goto as pointed out in a review) Thanks, -- Dominique
Re: [PATCH] net/9p: validate fds in p9_fd_open
Christoph Hellwig wrote on Wed, Jul 15, 2020: > FYI, this is now generating daily syzbot reports, so I'd love to see > the fix going into Linus' tree ASAP.. Yes, I'm getting some syzbot warnings as well now. I had however only planned to get this in linux-next, since that is what the syzbot mails were complaining about, but I see this got into -rc5... It's honestly just a warn on something that would fail anyway so I'd rather let it live in -next first, I don't get why syzbot is so verbose about this - it sent a mail when it found a c repro and one more once it bisected the commit yesterday but it should not be sending more? (likewise it should pick up the fix tag even if it only gets in -next, or would it keep being noisy unless this gets merged to mainline?) FWIW this is along with the 5 other patches I have queued for 5.9 waiting for my tests as my infra is still down, I've stopped trying to make promises, but I could push at least just this one to -next if that really helps. Sorry for that, things should be smoother once I've taken the time to put things back in place. -- Dominique
Re: [PATCH net-next 01/20] net: 9p: kerneldoc fixes
Andrew Lunn wrote on Mon, Jul 13, 2020: > Simple fixes which require no deep knowledge of the code. > > Cc: Eric Van Hensbergen > Cc: Latchesar Ionkov > Cc: Dominique Martinet Ack. > Signed-off-by: Andrew Lunn Can't hurt & looks correct; thanks. David, assuming you'll take the whole set so ignoring this patch; please let me know if I should pick it up. -- Dominique
Re: [PATCH] net/9p: validate fds in p9_fd_open
Doug Nazar wrote on Fri, Jul 10, 2020: > On 2020-07-10 04:57, Christoph Hellwig wrote: > > >diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c > >index 13cd683a658ab6..1cd8ea0e493617 100644 > >--- a/net/9p/trans_fd.c > >+++ b/net/9p/trans_fd.c > >@@ -803,20 +803,28 @@ static int p9_fd_open(struct p9_client *client, int > >rfd, int wfd) > > return -ENOMEM; > > ts->rd = fget(rfd); > >+if (!ts->rd) > >+goto out_free_ts; > >+if (!(ts->rd->f_mode & FMODE_READ)) > >+goto out_put_wr; > > goto out_put_rd; > > unless I'm mistaken. Good catch, I've amended the commit so feel free to skip resending unless want to change something https://github.com/martinetd/linux/commit/28e987a0dc66744fb119e18150188fd8e3debd40 -- Dominique
Re: [PATCH] net/9p: validate fds in p9_fd_open
Christoph Hellwig wrote on Fri, Jul 10, 2020: > p9_fd_open just fgets file descriptors passed in from userspace, but > doesn't verify that they are valid for read or writing. This gets > cought down in the VFS when actually attemping a read or write, but a > new warning added in linux-next upsets syzcaller. > > Fix this by just verifying the fds early on. > > Reported-by: syzbot+e6f77e16ff68b2434...@syzkaller.appspotmail.com > Signed-off-by: Christoph Hellwig Looks good to me, I'll pick it up shortly. -- Dominique
Re: [PATCH] net/9p: Validate current->sighand in client.c
Alexander Kapshuk wrote on Sun, Jun 21, 2020: > Fix rcu not being dereferenced cleanly by using the task > helpers (un)lock_task_sighand instead of spin_lock_irqsave and > spin_unlock_irqrestore to ensure current->sighand is a valid pointer as > suggested in the email referenced below. Ack. I'll take this once symbol issue is resolved ; thanks for your time. -- Dominique
Re: [PATCH] net/9p: Validate current->sighand in client.c
Alexander Kapshuk wrote on Sun, Jun 21, 2020: > Thanks for your feedback. > Shall I simply resend the v2 patch with the commit message reworded as > you suggested, or should I make it a v3 patch instead? Yes please resend the same commit reworded. v2 or v3 doesn't matter much, the [PATCH whatever] part of the mail isn't included in the commit message; I don't receive so much patches to be fussy about that :) > One other thing I wanted to clarify is I got a message from kernel > test robot , > https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/TMTLPYU6A522JH2VCN3PNZVAP6EE5MDF/, > saying that on parisc my patch resulted in __lock_task_sighand being > undefined during modpost'ing. > I've noticed similar messages about other people's patches on the > linux-kernel mailing list with the responses stating that the issue > was at the compilation site rather than with the patch itself. > As far as I understand, that is the case with my patch also. So no > further action on that is required of me, is it? Thanks, I hadn't noticed this mail -- unfortunately I don't have anything setup to test pa risc, but __lock_task_sighand is defined in kernel/signal.c which is unconditionally compiled so shouldn't have anything to do with arch-specific failures, so I will run into the same problem when I test this. >From just looking at the code, it looks like a real problem to me - __lock_task_sighand is never passed to EXPORT_SYMBOL so when building 9p as a module we cannot use the function. Only exported symbols can be called from code that can be built as modules. That looks like an oversight to me more than something on purpose, but it does mean I cannot take the patch right now -- we need to first get the symbol exported before we can use it in 9p. As things stand I'd rather have this patch wait one cycle for this than revert to manipulating rcu directly like you did first -- if you're up for it you can send a patch to get it exported first and I'll pick this patch up next cycle, or I can take care of that too if you don't want to bother. Letting you tell me which you prefer, -- Dominique
Re: [PATCH] net/9p: Validate current->sighand in client.c
Alexander Kapshuk wrote on Sat, Jun 20, 2020: > Use (un)lock_task_sighand instead of spin_lock_irqsave and > spin_unlock_irqrestore to ensure current->sighand is a valid pointer as > suggested in the email referenced below. Thanks for v2! Patch itself looks good to me. I always add another `Link:` tag to the last version of the patch at the time of applying, so the message might be a bit confusing. Feel free to keep the link to the previous discussion but I'd rather just repeat a bit more of what we discussed (e.g. fix rcu not being dereferenced cleanly by using the task helpers as suggested) rather than just link to the thread Sorry for nitpicking but I think commit messages are important and it's better if they're understandable out of context, even if you give a link for further details for curious readers, it helps being able to just skim through git log. Either way I'll include the patch in my test run today or tomorrow, had promised it for a while... Cheers, -- Dominique
Re: [PATCH] net/9p: Fix sparse endian warning in trans_fd.c
Alexander Kapshuk wrote on Thu, Jun 18, 2020: > Address sparse endian warning: > net/9p/trans_fd.c:932:28: warning: incorrect type in assignment (different > base types) > net/9p/trans_fd.c:932:28:expected restricted __be32 [addressable] > [assigned] [usertype] s_addr > net/9p/trans_fd.c:932:28:got unsigned long > > Signed-off-by: Alexander Kapshuk INADDR_ANY is 0 so this really is noop but sure, less warnings is always good. I'll take this one for 5.9. Thanks! -- Dominique
Re: [PATCH] net/9p: Fix sparse rcu warnings in client.c
Alexander Kapshuk wrote on Thu, Jun 18, 2020: > Address sparse nonderef rcu warnings: > net/9p/client.c:790:17: warning: incorrect type in argument 1 (different > address spaces) > net/9p/client.c:790:17:expected struct spinlock [usertype] *lock > net/9p/client.c:790:17:got struct spinlock [noderef] * > net/9p/client.c:792:48: warning: incorrect type in argument 1 (different > address spaces) > net/9p/client.c:792:48:expected struct spinlock [usertype] *lock > net/9p/client.c:792:48:got struct spinlock [noderef] * > net/9p/client.c:872:17: warning: incorrect type in argument 1 (different > address spaces) > net/9p/client.c:872:17:expected struct spinlock [usertype] *lock > net/9p/client.c:872:17:got struct spinlock [noderef] * > net/9p/client.c:874:48: warning: incorrect type in argument 1 (different > address spaces) > net/9p/client.c:874:48:expected struct spinlock [usertype] *lock > net/9p/client.c:874:48:got struct spinlock [noderef] * > > Signed-off-by: Alexander Kapshuk Thanks for this patch. >From what I can see, there are tons of other parts of the code doing the same noderef access pattern to access current->sighand->siglock and I don't see much doing that. A couple of users justify this by saying SLAB_TYPESAFE_BY_RCU ensures we'll always get a usable lock which won't be reinitialized however we access it... It's a bit dubious we'll get the same lock than unlock to me, so I agree to some change though. After a second look I think we should use something like the following: if (!lock_task_sighand(current, &flags)) warn & skip (or some error, we'd null deref if this happened currently); recalc_sigpending(); unlock_task_sighand(current, &flags); As you can see, the rcu_read_lock() isn't kept until the unlock so I'm not sure it will be enough to please sparse, but I've convinced myself current->sighand cannot change while we hold the lock and there just are too many such patterns in the kernel. Please let me know if I missed something or if there is an ongoing effort to change how this works; I'll wait for a v2. -- Dominique
Re: [PATCH v2] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work
Wang Hai wrote on Fri, Jun 12, 2020: > p9_read_work and p9_fd_cancelled may be called concurrently. > In some cases, req->req_list may be deleted by both p9_read_work > and p9_fd_cancelled. > > We can fix it by ignoring replies associated with a cancelled > request and ignoring cancelled request if message has been received > before lock. > > Fixes: 60ff779c4abb ("9p: client: remove unused code and any reference to > "cancelled" function") > Reported-by: syzbot+77a25acfa0382e06a...@syzkaller.appspotmail.com > Signed-off-by: Wang Hai Thanks! looks good to me, I'll queue for 5.9 as well unless you're in a hurry. -- Dominique
Re: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work
wanghai (M) wrote on Fri, Jun 12, 2020: > You are right, I got a syzkaller bug. > > "p9_read_work+0x7c3/0xd90" points to list_del(&m->rreq->req_list); > > [ 62.733598] kasan: CONFIG_KASAN_INLINE enabled > [ 62.734484] kasan: GPF could be caused by NULL-ptr deref or user memory > access > [ 62.735670] general protection fault: [#1] SMP KASAN PTI > [ 62.736577] CPU: 3 PID: 82 Comm: kworker/3:1 Not tainted 4.19.124+ #2 > [ 62.737582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.10.2-1ubuntu1 04/01/2014 > [ 62.738988] Workqueue: events p9_read_work > [ 62.739642] RIP: 0010:p9_read_work+0x7c3/0xd90 > [ 62.740348] Code: 48 c1 e9 03 80 3c 01 00 0f 85 cb 05 00 00 48 8d 7a 08 48 > b9 00 00 00 00 00 fc ff df 49 8b 87 b8 00 00 00 48 89 fe 48 c1 ee 03 <80> 3c > 0e 00 0f 85 89 05 00 00 48 89 c6 48 b9 00 00 00 00 00 fc ff > [ 62.743236] RSP: 0018:8883ece17ca0 EFLAGS: 00010a06 > [ 62.744059] RAX: dead0200 RBX: 8883d45666b0 RCX: > dc00 > [ 62.745173] RDX: dead0100 RSI: 1bd5a021 RDI: > dead0108 > [ 62.746279] RBP: 8883d4566590 R08: ed107a8acf31 R09: > ed107a8acf31 > [ 62.747398] R10: 0001 R11: ed107a8acf30 R12: > 11107d9c2f9b > [ 62.748505] R13: 8883d45665d0 R14: 8883d4566608 R15: > 8883e1f1c000 > [ 62.749615] FS: () GS:8883ef18() > knlGS: > [ 62.750881] CS: 0010 DS: ES: CR0: 80050033 > [ 62.751784] CR2: CR3: 9c622003 CR4: > 007606e0 > [ 62.752898] DR0: DR1: DR2: > > [ 62.754011] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 62.755126] PKRU: 5554 > [ 62.755561] Call Trace: > [ 62.755963] ? p9_write_work+0xa00/0xa00 > [ 62.756592] process_one_work+0xae4/0x1b20 > [ 62.757252] ? apply_wqattrs_commit+0x3e0/0x3e0 > [ 62.757985] worker_thread+0x8c/0xe80 > [ 62.758600] ? __kthread_parkme+0xe9/0x190 > [ 62.759254] ? process_one_work+0x1b20/0x1b20 > [ 62.759950] kthread+0x341/0x410 > [ 62.760479] ? kthread_create_worker_on_cpu+0xf0/0xf0 > [ 62.761296] ret_from_fork+0x3a/0x50 > [ 62.761874] Modules linked in: > [ 62.762378] Dumping ftrace buffer: > [ 62.762942](ftrace buffer empty) > [ 62.763547] ---[ end trace 69672816613947a3 ]--- This looks like: https://syzkaller.appspot.com/bug?id=5df4f85d764ee89863d0294b4e0c87ef2fd2c624 I'm not sure how active this still is but please also add this Reported-by tag: Reported-by: syzbot+77a25acfa0382e06a...@syzkaller.appspotmail.com (can keep both) > Yes,In this case, all further 9p messages will not be read. > >p9_read_work probably should handle REQ_STATUS_FLSHD in a special case > >that just throws the message away without error as well. > > Can it be solved like this? > > --- a/net/9p/trans_fd.c > +++ b/net/9p/trans_fd.c > @@ -362,7 +362,7 @@ static void p9_read_work(struct work_struct *work) >    if (m->rreq->status == REQ_STATUS_SENT) { >    list_del(&m->rreq->req_list); >    p9_client_cb(m->client, m->rreq, REQ_STATUS_RCVD); > -  } else { > +  } else if (m->rreq->status != REQ_STATUS_FLSHD) { >    spin_unlock(&m->client->lock); >    p9_debug(P9_DEBUG_ERROR, > "Request tag %d errored out while > we were reading the reply\n", Yes that is probably correct. Please add a comment above saying we ignore replies associated with a cancelled request. > This patch "afd8d65411" just moved list_del into cancelled ops. It > is not actually the initial patch that caused the bug > > In 60ff779c4abb ("9p: client: remove unused code and any reference > to "cancelled" function") > > It moved spin_lock under "if (oldreq->status == REQ_STATUS_FLSH)" . > > After "if (oldreq->status == REQ_STATUS_FLSH)", oldreq may be > changed by other thread. Ok, thank you for explaining; I agree now. -- Dominique
Re: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work
Wang Hai wrote on Thu, Jun 11, 2020: > p9_read_work and p9_fd_cancelled may be called concurrently. Good catch. I'm sure this fixes some of the old syzbot bugs... I'll check other transports handle this properly as well. > Before list_del(&m->rreq->req_list) in p9_read_work is called, > the req->req_list may have been deleted in p9_fd_cancelled. > We can fix it by setting req->status to REQ_STATUS_FLSHD after > list_del(&req->req_list) in p9_fd_cancelled. hm if you do that read_work will fail with EIO and all further 9p messages will not be read? p9_read_work probably should handle REQ_STATUS_FLSHD in a special case that just throws the message away without error as well. > Before list_del(&req->req_list) in p9_fd_cancelled is called, > the req->req_list may have been deleted in p9_read_work. > We should return when req->status = REQ_STATUS_RCVD which means > we just received a response for oldreq, so we need do nothing > in p9_fd_cancelled. I'll need some time to convince myself the refcounting is correct in this case. Pre-ref counting this definitely was wrong, but now it might just work by chance I'll double-check. > Fixes: 60ff779c4abb ("9p: client: remove unused code and any reference > to "cancelled" function") I don't understand how this commit is related? At least make it afd8d65411 ("9P: Add cancelled() to the transport functions.") which adds the op, not something that removed a previous version of cancelled even earlier. > diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c > index f868cf6fba79..a563699629cb 100644 > --- a/net/9p/trans_fd.c > +++ b/net/9p/trans_fd.c > @@ -718,11 +718,18 @@ static int p9_fd_cancelled(struct p9_client *client, > struct p9_req_t *req) > { > p9_debug(P9_DEBUG_TRANS, "client %p req %p\n", client, req); > > - /* we haven't received a response for oldreq, > - * remove it from the list. > + /* If req->status == REQ_STATUS_RCVD, it means we just received a > + * response for oldreq, we need do nothing here. Else, remove it from > + * the list. (nitpick) this feels a bit hard to read, and does not give any information: you're just paraphrasing the C code. I would suggest moving the comment after the spinlock and say what we really do ; something as simple as "ignore cancelled request if message has been received before lock" is enough. >*/ > spin_lock(&client->lock); > + if (req->status == REQ_STATUS_RCVD) { > + spin_unlock(&client->lock); > + return 0; > + } > + > list_del(&req->req_list); > + req->status = REQ_STATUS_FLSHD; > spin_unlock(&client->lock); > p9_req_put(req); > -- Dominique
Re: [PATCH -next] 9p/xen: Add cleanup path in p9_trans_xen_init
YueHaibing wrote on Tue, Apr 30, 2019: > If xenbus_register_frontend() fails in p9_trans_xen_init, > we should call v9fs_unregister_trans() to do cleanup. > > Fixes: 868eb122739a ("xen/9pfs: introduce Xen 9pfs transport driver") > Signed-off-by: YueHaibing Thanks, queued both in my repo - sorry for the delay. -- Dominique
Re: [PATCH] 9p/net: fix memory leak in p9_client_create
zhengbin wrote on Wed, Mar 13, 2019: > If msize is less than 4096, we should close and put trans, destroy > tagpool, not just free client. This patch fixes that. > > Fixes: 574d356b7a02 ("9p/net: put a lower bound on msize") > Reported-by: Hulk Robot > Signed-off-by: zhengbin Ack, definitely. Thanks for the fix, I'll take this for 5.1 even if it's a bit late. Please do Cc linux-kernel@vger.k.o next time though, that list should get all the patches. -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Tom Herbert wrote on Fri, Feb 22, 2019: > > > So basically it sounds like you're interested in supporting TCP > > > connections that are half closed. I believe that the error in half > > > closed is EPIPE, so if the TCP socket returns that it can be ignored > > > and the socket can continue being attached and used to send data. > > > > Did you mean 'can continue being attached and used to receive data'? > > No, I meant shutdown on receive side when FIN is receved. TX is still > allowed to drain an queued bytes. To support shutdown on the TX side > would require additional logic since we need to effectively detach the > transmit path but retain the receive path. I'm not sure this is a > compelling use case to support. Hm, it must be a matter of how we see thing but from what I understand it's exactly the other way around. The remote closed the connection, so trying to send anything would just yield a RST, so TX doesn't make sense. On the other hand, anything that had been sent by the remote before the FIN and is on the local side's memory should still be receivable. When you think about it as a TCP stream it's really weird: data coming, data coming, data coming, FIN received. But in the networking stack that received FIN short-circuits all the data that was left around and immediately raises an EPIPE error. I don't see what makes this FIN packet so great that it should be processed before the data; we should only see that EPIPE when we're done reading the data before it or trying to send something. I'll check tomorrow/next week but I'm pretty sure the packets before that have been ack'd at a tcp level as well, so losing them in the application level is really unexpected. > > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see > > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on > > both the csock and kcm socket will do many needless wakeups on only the > > csock from what I can see, so I'd need some holdoff timer or something. > > I guess it's possible though. > > We might need to clear the error somehow. May a read of zero bytes? Can try. > > After a bit more debugging, this part works (__strp_recv() is called > > again); but the next packet that is treated properly is rejected because > > by the time __strp_recv() was called again a new skb was read and the > > length isn't large enough to go all the way into the new packet, so this > > test fails: > > } else if (len <= (ssize_t)head->len - > > skb->len - stm->strp.offset) { > > /* Length must be into new skb (and also > > * greater than zero) > > */ > > STRP_STATS_INCR(strp->stats.bad_hdr_len); > > strp_parser_err(strp, -EPROTO, desc); > > > > So I need to figure a way to say "call this function again without > > reading more data" somehow, or make this check more lax e.g. accept any > > len > 0 after a retry maybe... > > Removing that branch altogether seems to work at least but I'm not sure > > we'd want to? > > I like the check since it's conservative and covers the normal case. > Maybe just need some more logic? I can add a "retrying" state and not fail here if we ewre retrying for whatever reason perhaps... But I'm starting to wonder how this would work if my client didn't keep on sending data, I'll try to fail on the last client's packet and see how __strp_recv is called again through the timer, with the same skb perhaps? -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Tom Herbert wrote on Wed, Feb 20, 2019: > > When the client closes the socket, some messages are obviously still "in > > flight", and the server will recv a POLLERR notification on the csock at > > some point with many messages left. > > The documentation says to unattach the csock when you get POLLER. If I > > do that, the kcm socket will no longer give me any message, so all the > > messages still in flight at the time are lost. > > So basically it sounds like you're interested in supporting TCP > connections that are half closed. I believe that the error in half > closed is EPIPE, so if the TCP socket returns that it can be ignored > and the socket can continue being attached and used to send data. Did you mean 'can continue being attached and used to receive data'? I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see how to efficiently ignore EPIPE until POLLIN gets unset -- polling on both the csock and kcm socket will do many needless wakeups on only the csock from what I can see, so I'd need some holdoff timer or something. I guess it's possible though. > Another possibility is to add some linger semantics to an attached > socket. For instance, a large message might be sent so that part of > the messge is queued in TCP and part is queued in the KCM socket. > Unattach would probably break that message. We probably want to linger > option similar to SO_LINGER (or maybe just use the option on the TCP > socket) that means don't complete the detach until any message being > transmitted on the lower socket has been queued. That would certainly work, even if non-obvious from a user standpoint. > > > I'd like to see some retry on ENOMEM before this is merged though, so > > > while I'm there I'll resend this with a second patch doing that > > > retry,.. I think just not setting strp->interrupted and not reporting > > > the error up might be enough? Will have to try either way. > > > > I also tried playing with that without much success. > > I had assumed just not calling strp_parser_err() (which calls the > > abort_parser cb) would be enough, eventually calling strp_start_timer() > > like the !len case, but no can do. > > I think you need to ignore the ENOMEM and have a timer or other > callback to retry the operation in the future. Yes, that's what I had intended to try; basically just break out and schedule timer as said above. After a bit more debugging, this part works (__strp_recv() is called again); but the next packet that is treated properly is rejected because by the time __strp_recv() was called again a new skb was read and the length isn't large enough to go all the way into the new packet, so this test fails: } else if (len <= (ssize_t)head->len - skb->len - stm->strp.offset) { /* Length must be into new skb (and also * greater than zero) */ STRP_STATS_INCR(strp->stats.bad_hdr_len); strp_parser_err(strp, -EPROTO, desc); So I need to figure a way to say "call this function again without reading more data" somehow, or make this check more lax e.g. accept any len > 0 after a retry maybe... Removing that branch altogether seems to work at least but I'm not sure we'd want to? (grmbl at this slow VM and strparser not being possible to enable as a module, it takes ages to test) > > With that said, returning 0 from the parse function also raises POLLERR > > on the csock and hangs netparser, so things aren't that simple... > > Can you point to where this is happening. If the parse_msg callback > returns 0 that is suppose to indicate that more bytes are needed. I just blindly returned 0 "from time to time" in the kcm parser function, but looking at above it was failing on the same check. This somewhat makes sense for this one to fail here if a new packet was read, no sane parser function should give a length smaller than what they require to determine the length. Thanks, -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Dominique Martinet wrote on Fri, Feb 15, 2019: > With all that said I guess my patch should work correctly then, I'll try > to find some time to check the error does come back up the tcp socket in > my reproducer but I have no reason to believe it doesn't. Ok, so I can confirm this part - the 'csock' does come back up with POLLERR if the parse function returns ENOMEM in the current code. It also comes back up with POLLERR when the remote side closes the connection, which is expected, but I'm having a very hard time understanding how an application is supposed to deal with these POLLERR after having read the documentation and a bit of experimentation. I'm not sure how much it would matter for real life (if the other end closes the connection most servers would not care about what they said just before closing, but I can imagine some clients doing that in real life e.g. a POST request they don't care if it succeeds or not)... My test program works like this: - client connects, sends 10k messages and close()s the socket - server loops recving and close()s after 10k messages; it used to be recvmsg() directly but it's now using poll then recvmsg. When the client closes the socket, some messages are obviously still "in flight", and the server will recv a POLLERR notification on the csock at some point with many messages left. The documentation says to unattach the csock when you get POLLER. If I do that, the kcm socket will no longer give me any message, so all the messages still in flight at the time are lost. If I just ignore the csock like I used to, all the messages do come just fine, but as said previously on a real error this will just make recvmsg or the polling hang forever and I see no way to distinguish a "real" error vs. a connection shut down from the remote side with data left in the pipe. I thought of checking POLLERR on csock and POLLIN not set on kcmsock, but even that seems to happen fairly regularily - the kcm sock hasn't been filled up, it's still reading from the csock. On the other hand, checking POLLIN on the csock does say there is still data left, so I know there is data left on the csock, but this is also the case on a real error (e.g. if parser returns -ENOMEM) ... And this made me try to read from the csock after detaching it and I can resume manual tcp parsing for a few messages until read() fails with EPROTO ?! and I cannot seem to be able to get anything out of attaching it back to kcm (for e.g. an ENOMEM error that was transient)... I'm honestly not sure how the POLLERR notification mechanism works but I think it would be much easier to use KCM if we could somehow delay that error until KCM is done feeding from the csock (when netparser really stops reading from it like on real error, e.g. abort callback maybe?) I think it's fine if the csock is closed before the kcm sock message is read, but we should not lose messages like this. > I'd like to see some retry on ENOMEM before this is merged though, so > while I'm there I'll resend this with a second patch doing that > retry,.. I think just not setting strp->interrupted and not reporting > the error up might be enough? Will have to try either way. I also tried playing with that without much success. I had assumed just not calling strp_parser_err() (which calls the abort_parser cb) would be enough, eventually calling strp_start_timer() like the !len case, but no can do. With that said, returning 0 from the parse function also raises POLLERR on the csock and hangs netparser, so things aren't that simple... I could use a bit of help again. Thanks, -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Tom Herbert wrote on Thu, Feb 14, 2019: > On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet > wrote: > > Yes, the parser fails with -ENOMEM ; that is not handled gracefully at > > all: from a user point of view, the connection just hangs (recvmsg never > > returns), without so much as a message in dmesg either. > > That's by design. Here is the description in kcm.txt: > > "When a TCP socket is attached to a KCM multiplexor data ready > (POLLIN) and write space available (POLLOUT) events are handled by the > multiplexor. If there is a state change (disconnection) or other error > on a TCP socket, an error is posted on the TCP socket so that a > POLLERR event happens and KCM discontinues using the socket. When the > application gets the error notification for a TCP socket, it should > unattach the socket from KCM and then handle the error condition (the > typical response is to close the socket and create a new connection if > necessary)." Sigh, that's what I get for relying on pieces of code found on the internet. It does make "trivial" kcm sockets difficult to use though, the old ganesha code I adapted to kcm was the horrible (naive?) kind spawning one thread per connection and just waiting on read(), so making it just create a kcm socket from the tcp one and wait on recvmsg() until an error pops up does not seem an option. That being said I'm a bit confused, I thought part of the point of kcm was the multiplexing so a simple server could just keep attaching tcp sockets to a single kcm socket and only have a single trivial loop reading from the kcm socket ; but I guess there's no harm in also looking for POLLERR on the tcp sockets... It would still need to close them once clients disconnect I guess, this really only affects my naive server. > > strparser might be able to retry somehow. > > We could retry on -ENOMEM since it is potentially a transient error, Yes, I think we should aim to retry on -ENOMEM; I agree all errors are not meant to be retried on but there already are different cases based on what the parse cb returned; but that can be a different patch. > but generally for errors (like connection is simply broken) it seems > like it's working as intended. I suppose we could add a socket option > to fail the KCM socket when all attached lower sockets have failed, > but I not sure that would be a significant benefit for additional > complexity. Right, I agree it's probably not worth doing, now I'm aware of this I can probably motivate myself to change this old code to use epoll properly. I think it could make sense to have this option for simpler programs, but as things stand I guess we can require kcm users to do this much, thanks for the explanation, and sorry for having failed to notice it. With all that said I guess my patch should work correctly then, I'll try to find some time to check the error does come back up the tcp socket in my reproducer but I have no reason to believe it doesn't. I'd like to see some retry on ENOMEM before this is merged though, so while I'm there I'll resend this with a second patch doing that retry,.. I think just not setting strp->interrupted and not reporting the error up might be enough? Will have to try either way. Thanks for the feedback, -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Tom Herbert wrote on Thu, Feb 14, 2019: > > This second patch[2] (the current thread) now does an extra clone if > > there is an offset, but the problem really isn't in the clone but the > > pull itself that can fail and return NULL when there is memory pressure. > > For some reason I hadn't been able to reproduce that behaviour with > > strparser doing the pull, but I assume it would also be possible to hit > > in extreme situations, I'm not sure... > > This option looks the best to me, at least as a way to fix the issue > without requiring a change to the API. If the pull fails, doesn't that > just mean that the parser fails? Is there some behavior with this > patch that is not being handled gracefully? Yes, the parser fails with -ENOMEM ; that is not handled gracefully at all: from a user point of view, the connection just hangs (recvmsg never returns), without so much as a message in dmesg either. It took me a while to figure out what failed exactly as I did indeed expect strparser/kcm to handle that better, but ultimately as things stand if the parser fails it calls strp_parser_err() with the error which ends up in strp_abort_strp that should call sk->sk_error_report(sk) but in kcm case sk is the csk and I guess failing csk does not make a pending recv on the kcm sock to fail... I'm not sure whether propagating the error to the upper socket is the right thing to do, kcm is meant to be able to work with multiple underlying sockets so I feel we must be cautious about that, but strparser might be able to retry somehow. This is what I said below: > > [,,,] > > - the current patch, that I could only get to fail with KASAN, but does > > complexify kcm a bit; this also does not fix bpf sockmap at all. > > Would still require to fix the hang, so make strparser retry a few times > > if strp->cb.parse_msg failed maybe? Or at least get the error back to > > userspace somehow. Thanks, -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Tom Herbert wrote on Thu, Feb 14, 2019: > > The best alternative I see is adding a proper helper to get > > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as > > lost as I have been; I'm not quite sure how/where to add such a helper > > though as I've barely looked at the bpf code until now, but should we go > > for that? > > I'd rather not complicate the bpf code for this. Can we just always do > an pskb_pull after skb_clone? Which skb_clone are you thinking of? If you're referring to the one in strparser, that was my very first approach here[1], but Dordon shot it down saying that this is not an strparser bug but a kcm bug since there are ways for users to properly get the offset and use it -- and the ktls code does it right. Frankly, my opinion still is that it'd be better in strparser because there also is some bad use in bpf sockmap (they never look at the offset either, while kcm uses it for rcv but not for parse), but looking at it from an optimal performance point of view I agree the user can make better decision than strparser so I understand where he comes from. This second patch[2] (the current thread) now does an extra clone if there is an offset, but the problem really isn't in the clone but the pull itself that can fail and return NULL when there is memory pressure. For some reason I hadn't been able to reproduce that behaviour with strparser doing the pull, but I assume it would also be possible to hit in extreme situations, I'm not sure... So anyway, we basically have three choices that I can see: - push harder on strparser and go back to my first patch ; it's simple and makes using strparser easier/safer but has a small overhead for ktls, which uses the current strparser implementation correctly. I hadn't been able to get this to fail with KASAN last time I tried but I assume it should still be possible somehow. - the current patch, that I could only get to fail with KASAN, but does complexify kcm a bit; this also does not fix bpf sockmap at all. Would still require to fix the hang, so make strparser retry a few times if strp->cb.parse_msg failed maybe? Or at least get the error back to userspace somehow. - my last suggestion to document the kcm behaviour, and if possible add a bpf helper to make proper usage easier ; this will make kcm harder to use on end users but it's better than not being documented and seeing random unappropriate lengths being parsed. [1] first strparser patch http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmad...@codewreck.org [2] current thread's patch http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmad...@codewreck.org Thanks, -- Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
Dominique Martinet wrote on Wed, Oct 31, 2018: > Anyway, that probably explains I have no problem with bigger VM > (uselessly more memory available) or without KASAN (I guess there's > overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB > of ram, so if there's an allocation failure there I think there's a > problem ! . . . > > So, well, I'm not sure on the way forward. Adding a bpf helper and > document that kcm users should mind the offset? bump on this - I had mostly forgotten about it but the nfs-ganesha community that could make use of KCM reminded me of my patch that's waiting for this. Summary for people coming back after four months: - kcm is great, but the bpf function that's supposed to be called for each packet does not automatically adjust the offset so that it can assume the skb starts with the packet it needs to look at - there is some workaround code that is far from obvious and undocumented, see the original thread[1]: [1] http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf...@davejwatson-mba.dhcp.thefacebook.com - my patch here tried to automatically pull the corresponding packet to the front, but apparently with KASAN can trigger out of memory behaviours on "small" VMs, so even if it doesn't seem to impact performance much without KASAN I don't think it's really ok to potentially hang the connection due to oom under severe conditions. The best alternative I see is adding a proper helper to get "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as lost as I have been; I'm not quite sure how/where to add such a helper though as I've barely looked at the bpf code until now, but should we go for that? (it really feels wrong to me that some random person who just started by trying to use kcm has to put this much effort to keep the ball rolling, if nobody cares about kcm I'm also open to have it removed completely despite the obvious performance gain I benchmarked for ganesha[2] ; barely maintained feature is worse than no feature) [2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314 Thanks, -- Dominique
Re: [PATCH net-next] 9p: mark expected switch fall-through
Gustavo A. R. Silva wrote on Wed, Jan 23, 2019: > In preparation to enabling -Wimplicit-fallthrough, mark switch cases > where we are expecting to fall through. > > This patch fixes the following warning: > > net/9p/trans_xen.c:514:6: warning: this statement may fall through > [-Wimplicit-fallthrough=] > > Warning level 3 was used: -Wimplicit-fallthrough=3 > > This patch is part of the ongoing efforts to enabling > -Wimplicit-fallthrough > > Signed-off-by: Gustavo A. R. Silva Sure, taking this to my queue for -next -- Dominique
[GIT PULL] 9p updates for 4.21
Small pull request this time around with only two commits; some missing prototype warning fix and a syzkaller fix when a 9p server advertises a too small msize. The commit date is close-ish because I reworded a commit message to add a Cc to stable for the msize fix recently, but the patchs themselves have been in linux-next since Nov 20~ish Thanks, The following changes since commit ccda4af0f4b92f7b4c308d3acc262f4a7e3affad: Linux 4.20-rc2 (2018-11-11 17:12:31 -0600) are available in the Git repository at: git://github.com/martinetd/linux tags/9p-for-4.21 for you to fetch changes up to 574d356b7a02c7e1b01a1d9cba8a26b3c2888f45: 9p/net: put a lower bound on msize (2018-12-25 17:07:49 +0900) Pull request for inclusion in 4.21 Missing prototype warning fix and a syzkaller fix when a 9p server advertises a too small msize Adeodato SimĂł (1): net/9p: include trans_common.h to fix missing prototype warning. Dominique Martinet (1): 9p/net: put a lower bound on msize net/9p/client.c | 21 + net/9p/trans_common.c | 1 + -- Dominique Martinet
Re: [PATCH 1/3] 9p/net: implement asynchronous rpc
Tomas Bortoli wrote on Mon, Dec 17, 2018: > sorry for the delay, I've been quite busy these days. No problem. > The patches looks good to me and should indeed speed up the code a bit. > I quickly tested them against Syzkaller tuned for the 9p subsystem and > everything seems fine. Thanks, can I add your Reviewed-by on all three? > And by the way, which refcount races? There's a problem with trans_fd read_work and cancelled callback; I'm not so sure about refcount but we can definitely get double list_del as we're not checking the status. I think when we incorrectly remove from the list we also mismanage the refcount, but honestly need to test.. -- Dominique
[PATCH 1/3] 9p/net: implement asynchronous rpc
From: Dominique Martinet Add p9_client_async_rpc that will let us send an rpc without waiting for the reply, as well as an async handler hook. The work done in the hook could mostly be done in the client callback, but I prefered not to for a couple of reasons: - the callback can be called in an IRQ for some transports, and the handler needs to call p9_tag_remove which takes the client lock that has recently been reworked to not be irq-compatible (as it was never taken in IRQs until now) - the handled replies can be handled anytime, there is nothing the userspace would care about and want to be notified of quickly - the async request list and this function would be needed anyway for when we disconnect the client, in order to not leak async requests that haven't been replied to Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Tomas Bortoli Cc: Dmitry Vyukov --- I've been sitting on these patches for almost a month now because I wanted to fix the cancel race, but I think it's what the most recent syzbot email is about so if it could find it without this it probably won't hurt to at least get some reviews for these three patches first. I won't submit these for next cycle, so will only put them into -next after 4.20 is released; hopefully I'll have time to look at the two pending refcount races around that time. Until then, comments please! include/net/9p/client.h | 9 +- net/9p/client.c | 64 + 2 files changed, 72 insertions(+), 1 deletion(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index 947a570307a6..a4ded7666c73 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -89,7 +89,8 @@ enum p9_req_status_t { * @tc: the request fcall structure * @rc: the response fcall structure * @aux: transport specific data (provided for trans_fd migration) - * @req_list: link for higher level objects to chain requests + * @req_list: link used by trans_fd + * @async_list: link used to check on async requests */ struct p9_req_t { int status; @@ -100,6 +101,7 @@ struct p9_req_t { struct p9_fcall rc; void *aux; struct list_head req_list; + struct list_head async_list; }; /** @@ -110,6 +112,9 @@ struct p9_req_t { * @trans_mod: module API instantiated with this client * @status: connection state * @trans: tranport instance state and API + * @fcall_cache: kmem cache for client request data (msize-specific) + * @async_req_lock: protects @async_req_list + * @async_req_list: list of requests waiting a reply * @fids: All active FID handles * @reqs: All active requests. * @name: node name used as client id @@ -125,6 +130,8 @@ struct p9_client { enum p9_trans_status status; void *trans; struct kmem_cache *fcall_cache; + spinlock_t async_req_lock; + struct list_head async_req_list; union { struct { diff --git a/net/9p/client.c b/net/9p/client.c index 357214a51f13..0a67c3ccd4fd 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -729,6 +729,62 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c, return ERR_PTR(err); } +static struct p9_req_t * +p9_client_async_rpc(struct p9_client *c, int8_t type, const char *fmt, ...) +{ + va_list ap; + int err; + struct p9_req_t *req; + + va_start(ap, fmt); + req = p9_client_prepare_req(c, type, c->msize, fmt, ap); + va_end(ap); + if (IS_ERR(req)) + return req; + + err = c->trans_mod->request(c, req); + if (err < 0) { + /* bail out without flushing... */ + p9_req_put(req); + p9_tag_remove(c, req); + if (err != -ERESTARTSYS && err != -EFAULT) + c->status = Disconnected; + return ERR_PTR(safe_errno(err)); + } + + return req; +} + +static void p9_client_handle_async(struct p9_client *c, bool free_all) +{ + struct p9_req_t *req, *nreq; + + /* it's ok to miss some async replies here, do a quick check without +* lock first unless called with free_all +*/ + if (!free_all && list_empty(&c->async_req_list)) + return; + + spin_lock_irq(&c->async_req_lock); + list_for_each_entry_safe(req, nreq, &c->async_req_list, async_list) { + if (free_all && req->status < REQ_STATUS_RCVD) { + p9_debug(P9_DEBUG_ERROR, +"final async handler found req tag %d type %d status %d\n", +req->tc.tag, req->tc.id, req->status); + if (c->trans_mod->cancelled) + c->trans_mod->cancelled(c, req); + /* won't recei
[PATCH 2/3] 9p/net: make clunk asynchronous
From: Dominique Martinet clunk is defined as making the fid invalid whatever the server returns, and we should ignore errors so it is a good candidate for async call. The change should make 9p slightly faster (many vfs systeme calls won't need to wait for that clunk), but more importantly the flush rework means we won't clear pending signals and the current implementation of "retry twice then leak the fid" will stop working, so this needed improving. Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Tomas Bortoli Cc: Dmitry Vyukov --- include/net/9p/client.h | 4 ++ net/9p/client.c | 124 +++- 2 files changed, 62 insertions(+), 66 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index a4ded7666c73..75d7f83e5b94 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -91,6 +91,7 @@ enum p9_req_status_t { * @aux: transport specific data (provided for trans_fd migration) * @req_list: link used by trans_fd * @async_list: link used to check on async requests + * @clunked_fid: for clunk, points to fid */ struct p9_req_t { int status; @@ -102,6 +103,9 @@ struct p9_req_t { void *aux; struct list_head req_list; struct list_head async_list; + union { + struct p9_fid *clunked_fid; + }; }; /** diff --git a/net/9p/client.c b/net/9p/client.c index 0a67c3ccd4fd..a47b5a54573d 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -649,6 +649,51 @@ static int p9_check_zc_errors(struct p9_client *c, struct p9_req_t *req, return err; } +static struct p9_fid *p9_fid_create(struct p9_client *clnt) +{ + int ret; + struct p9_fid *fid; + + p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt); + fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL); + if (!fid) + return NULL; + + memset(&fid->qid, 0, sizeof(struct p9_qid)); + fid->mode = -1; + fid->uid = current_fsuid(); + fid->clnt = clnt; + fid->rdir = NULL; + fid->fid = 0; + + idr_preload(GFP_KERNEL); + spin_lock_irq(&clnt->lock); + ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1, + GFP_NOWAIT); + spin_unlock_irq(&clnt->lock); + idr_preload_end(); + + if (!ret) + return fid; + + kfree(fid); + return NULL; +} + +static void p9_fid_destroy(struct p9_fid *fid) +{ + struct p9_client *clnt; + unsigned long flags; + + p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid); + clnt = fid->clnt; + spin_lock_irqsave(&clnt->lock, flags); + idr_remove(&clnt->fids, fid->fid); + spin_unlock_irqrestore(&clnt->lock, flags); + kfree(fid->rdir); + kfree(fid); +} + static struct p9_req_t * p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...); @@ -778,6 +823,9 @@ static void p9_client_handle_async(struct p9_client *c, bool free_all) } if (free_all || req->status >= REQ_STATUS_RCVD) { /* Put old refs whatever reqs actually returned */ + if (req->tc.id == P9_TCLUNK) { + p9_fid_destroy(req->clunked_fid); + } list_del(&req->async_list); p9_tag_remove(c, req); } @@ -959,51 +1007,6 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type, return ERR_PTR(safe_errno(err)); } -static struct p9_fid *p9_fid_create(struct p9_client *clnt) -{ - int ret; - struct p9_fid *fid; - - p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt); - fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL); - if (!fid) - return NULL; - - memset(&fid->qid, 0, sizeof(struct p9_qid)); - fid->mode = -1; - fid->uid = current_fsuid(); - fid->clnt = clnt; - fid->rdir = NULL; - fid->fid = 0; - - idr_preload(GFP_KERNEL); - spin_lock_irq(&clnt->lock); - ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1, - GFP_NOWAIT); - spin_unlock_irq(&clnt->lock); - idr_preload_end(); - - if (!ret) - return fid; - - kfree(fid); - return NULL; -} - -static void p9_fid_destroy(struct p9_fid *fid) -{ - struct p9_client *clnt; - unsigned long flags; - - p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid); - clnt = fid->clnt; - spin_lock_irqsave(&clnt->lock, flags); - idr_remove(&clnt->fids, fid->fid); - spin_unlock_irqrestore(&clnt->lock, flags); - kfree(fid-
[PATCH 3/3] 9p/net: make flush asynchronous
From: Dominique Martinet Make the client flush asynchronous so we don't need to ignore signals in rpc functions: - on signal, send a flush request as we previously did but use the new asynchronous helper and return immediately - when the reply has been received or connection is destroyed, free both tags in the handler Signed-off-by: Dominique Martinet Cc: Eric Van Hensbergen Cc: Latchesar Ionkov Cc: Tomas Bortoli Cc: Dmitry Vyukov --- include/net/9p/client.h | 2 + net/9p/client.c | 172 ++-- 2 files changed, 78 insertions(+), 96 deletions(-) diff --git a/include/net/9p/client.h b/include/net/9p/client.h index 75d7f83e5b94..dcd40e7ef202 100644 --- a/include/net/9p/client.h +++ b/include/net/9p/client.h @@ -91,6 +91,7 @@ enum p9_req_status_t { * @aux: transport specific data (provided for trans_fd migration) * @req_list: link used by trans_fd * @async_list: link used to check on async requests + * @flushed_req: for flush, points to matching flushed req * @clunked_fid: for clunk, points to fid */ struct p9_req_t { @@ -104,6 +105,7 @@ struct p9_req_t { struct list_head req_list; struct list_head async_list; union { + struct p9_req_t *flushed_req; struct p9_fid *clunked_fid; }; }; diff --git a/net/9p/client.c b/net/9p/client.c index a47b5a54573d..666a722088e9 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -694,50 +694,6 @@ static void p9_fid_destroy(struct p9_fid *fid) kfree(fid); } -static struct p9_req_t * -p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...); - -/** - * p9_client_flush - flush (cancel) a request - * @c: client state - * @oldreq: request to cancel - * - * This sents a flush for a particular request and links - * the flush request to the original request. The current - * code only supports a single flush request although the protocol - * allows for multiple flush requests to be sent for a single request. - * - */ - -static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq) -{ - struct p9_req_t *req; - int16_t oldtag; - int err; - - err = p9_parse_header(&oldreq->tc, NULL, NULL, &oldtag, 1); - if (err) - return err; - - p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag); - - req = p9_client_rpc(c, P9_TFLUSH, "w", oldtag); - if (IS_ERR(req)) - return PTR_ERR(req); - - /* -* if we haven't received a response for oldreq, -* remove it from the list -*/ - if (oldreq->status == REQ_STATUS_SENT) { - if (c->trans_mod->cancelled) - c->trans_mod->cancelled(c, oldreq); - } - - p9_tag_remove(c, req); - return 0; -} - static struct p9_req_t *p9_client_prepare_req(struct p9_client *c, int8_t type, int req_size, const char *fmt, va_list ap) @@ -800,6 +756,39 @@ p9_client_async_rpc(struct p9_client *c, int8_t type, const char *fmt, ...) return req; } +/** + * p9_client_flush - flush (cancel) a request + * @c: client state + * @oldreq: request to cancel + * + * This sents a flush for a particular request and links + * the flush request to the original request. The current + * code only supports a single flush request although the protocol + * allows for multiple flush requests to be sent for a single request. + * + */ + +static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq) +{ + struct p9_req_t *req; + int16_t oldtag = oldreq->tc.tag; + + p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag); + req = p9_client_async_rpc(c, P9_TFLUSH, "w", oldtag); + if (IS_ERR(req)) { + return PTR_ERR(req); + } + + p9_debug(P9_DEBUG_MUX, "sent flush for oldreq %d type %d with flush tag %d\n", +oldtag, oldreq->tc.id, req->tc.tag); + req->flushed_req = oldreq; + spin_lock_irq(&c->lock); + list_add(&req->async_list, &c->async_req_list); + spin_unlock_irq(&c->lock); + + return 0; +} + static void p9_client_handle_async(struct p9_client *c, bool free_all) { struct p9_req_t *req, *nreq; @@ -823,6 +812,24 @@ static void p9_client_handle_async(struct p9_client *c, bool free_all) } if (free_all || req->status >= REQ_STATUS_RCVD) { /* Put old refs whatever reqs actually returned */ + if (req->tc.id == P9_TFLUSH) { + p9_debug(P9_DEBUG_MUX, +"flushing oldreq tag %d status %d, flush_req tag %d\n", +
9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)
Dmitry Vyukov wrote on Thu, Oct 11, 2018: > > Now we are talking! > > We generally assume that all modules are simply compiled into kernel. > > At least that's we have on syzbot. If somebody can't compile them in, > > we can suggest to add modprobe into init. > > So this boils down to just writing to /sys/module/rdma_rxe/parameters/add. > > This fails for me: > > root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add > [20992.905406] rdma_rxe: interface syz1 not found > bash: echo: write error: Invalid argument Works here, I just did: [root@f2 ~]# modprobe rdma_rxe [root@f2 ~]# echo -n ens3 > /sys/module/rdma_rxe/parameters/add dmesg says: [ 35.595534] rdma_rxe: set rxe0 active [ 35.595541] rdma_rxe: added rxe0 to ens3 Actually for a dummy interface if I try the echo directly the echo works, and a verb device is created, and I just confirmed I can use it... so not sure why rxe_cfg said EINVAL earlier... [root@f2 ~]# ip link add dummy0 type dummy [root@f2 ~]# ip link set dummy0 up [root@f2 ~]# ip addr add 10.1.1.1/24 dev dummy0 [root@f2 ~]# modprobe rdma_rxe [root@f2 ~]# echo -n dummy0 > /sys/module/rdma_rxe/parameters/add (then using my test client: [root@f2 src]# ./rcat -s INFO: trans_rdma.c (879), msk_cma_event_handler: CONNECT_REQUEST INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED INFO: trans_rdma.c (917), msk_cma_event_handler: DISCONNECT EVENT... [root@f2 src]# ./rcat -c 10.1.1.1 INFO: trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED ^C ) I assume your syz1 interface is a tap device as you were saying earlier? Got anything in dmesg? -- Dominique
9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)
Dmitry Vyukov wrote on Thu, Oct 11, 2018: > But again we don't need to support all of the available hardware. I agree with that, I just have no idea what the "librxe-rdmav16.so" lib could be doing and described something I am slightly more familiar with (e.g. libmlx5) I talked about a common subset of the verb abi because I didn't want to look into what it's doing, but if it's not enough there's always that possibility. > For example, we are testing net stack from external side using tun. > tun is a very simple, virtual abstraction of a network card. It allows > us to test all of generic net stack starting from L2 without messing > with any real drivers and their differences entirely. I had impression > that we are talking about something similar here too. Or not? That sounds about right, rxe is a software implementation that should work on most network interfaces ; at least from what I tried it worked on a VM's virtio net down to my laptop's wifi interface so it's a good start... I'm not saying all because I just tried a dummy interface and that returned EINVAL. The only point I disagree is the 'very simple', even getting that to work will be a far cry from a socket() call... :) > Also I am a bit missing context about rdma<->9p interface. Do we need > to setup all these ring buffers to satisfy the parts that 9p needs? Is > it that 9p actually reads data directly from these ring buffers? Or > there is some higher-level rdma interface that 9p uses? It needs an "RDMA_PS_TCP" connection established, that requires everything I described unfortunately... Once that's established we need to register some memory to the driver and post some recv buffers (even if we won't read it, the client would get errors if we aren't ready to receive anything - at least it does with real hardware), and also use some registered memory to send data. Thinking back though I think that my server implementation isn't very far from the raw level in what I'm doing, I recall libibverbs fallback implementation (e.g. if the driver lib doesn't implement it otherwise) of the functions I looked at like ibv_post_send to mostly be just serializing the arguments, slapping the command from an enum in front of it and sending it to the kernel, so it might be enough to just reimplement that shim in or figure a way to generate the binary commands once and then use these values; now I'm comparing two runs of strace of my test server I definitely see a pattern. I'll give it a try but don't expect something fast, and it's probably not going to be very pretty either... To give a concrete example, here are all the read/write/fcntl calls looking just at /dev/infiniband in a hello world program that just establishes connection (server side), receive and send two messages and quits: This part apparently sets up the listening connection of the server: 1430 1539262699.126025 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3 1430 1539262699.126155 write(3, "\0\0\0\0\30\0\4\0@m'\1\0\0\0\0\344\327\375\271\374\177\0\0?\1\2\0\0\0\0\0", 32) = 32 1430 1539262699.126192 write(3, "\24\0\0\0\210\0\0\0\0\0\0\\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\6\0\0\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144 1430 1539262699.126223 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.126250 write(3, "\23\0\0\0\20\0\20\1 \326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24 1430 1539262699.126274 write(3, "\1\0\0\0\20\0\4\0\324\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.126303 close(3)= 0 1430 1539262699.126360 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", O_RDWR|O_CLOEXEC) = 3 1430 1539262699.126429 write(3, "\0\0\0\0\30\0\4\0\240\217'\1\0\0\0\0t\330\375\271\374\177\0\0\6\1\2\0\0\0\0\0", 32) = 32 1430 1539262699.126472 write(3, "\24\0\0\0\210\0\0\0\0\0\0\0\34\0\0\0\n\0\4\323\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 144) = 144 1430 1539262699.126501 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.126534 write(3, "\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24 1430 1539262699.127119 write(3, "\7\0\0\0\10\0\0\0\0\0\0\0@\0\0\0", 16) = 16 1430 1539262699.127149 write(3, "\23\0\0\0\20\0\20\1`\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24 1430 1539262699.127319 fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) 1430 1539262699.127348 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE Then the client connects (had some epoll on read on fd 3, but no read?!) 1446 1539262706.
Re: BUG: corrupted list in p9_read_work
Dmitry Vyukov wrote on Thu, Oct 11, 2018: > > That's still the tricky part, I'm afraid... Making a separate server > > would have been easy because I could have reused some of my junk for the > > actual connection handling (some rdma helper library I wrote ages > > ago[1]), but if you're going to just embed C code you'll probably want > > something lower level? I've never seen syzkaller use any library call > > but I'm not even sure I would know how to create a qp without > > libibverbs, would standard stuff be OK ? > > Raw syscalls preferably. > What does 'rxe_cfg start ens3' do on syscall level? Some netlink? modprobe rdma_rxe (and a bunch of other rdma modules before that) then writes the interface name in /sys/module/rdma_rxe/parameters/add apparently; then checks it worked. this part could be done in C directly without too much trouble, but as long as the proper kernel configuration/modules are available > Any libraries and utilities are hell pain in linux world. Will it work > in Android userspace? gVisor? Who will explain all syzkaller users > where they get this for their who-knows-what distro, which is 10 years > old because of corp policies, and debug how their version of the > library has a slightly incompatible version? > For example, after figuring out that rxe_cfg actually comes from > rdma-core (which is a separate delight on linux), my debian > destribution failed to install it because of some conflicts around > /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about > such package. And we've just started :) The rdma ecosystem is a pain, I'll easily agree with that... > Syscalls tend to be simpler and more reliable. If it gives ENOSUPP, > ok, that's it. If it works, great, we can use it. I'll have to look into it a bit more; libibverbs abstracts a lot of stuff into per-nic userspace drivers (the files I cited in a previous mail) and basically with the mellanox cards I'm familiar with the whole user session looks like this: * common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and /dev/infiniband/uverbs0 (plus a bunch of files to figure out abi version, what user driver to load etc) * it and the userspace driver issue "commands" over these two files' fd to setup the connection ; some commands are standard but some are specific to the interface and defined in the driver. There are many facets to a connection in RDMA: a protection domain used to register memory with the nic, a queue pair that is the actual tx/rx connection, optionally a completion channel that will be another fd to listen on for events that tell you something happened and finally some memory regions to directly communicate with the nic from userspace depending on the specific driver. * then there's the actual usage, more commands through the uverbs0 char device to register the memory you'll use, and once that's done it's entierly up to the driver - for example the mellanox lib can do everything in userspace playing with the memory regions it registered, but I'd wager the rxe driver does more calls through the uverbs0 fd... Honestly I'm not keen on reimplementing all of this; the interface itself pretty much depends on your version of the kernel (there is a common ABI defined, but as far as specific nics are concerned if your kernel module doesn't match the user library version you can get some nasty surprises), and it's far from the black or white of a good ol' ENOSUPP error. I'll look if I can figure out if there is a common subset of verbs commands that are standard and sufficient to setup a listening connection and exchange data that should be supported for all devices and would let us reimplement just that, but while I hear your point about android and ten years in the future I think it's more likely than ten years in the future the verb abi will have changed but libibverbs will just have the new version implemented and hide the change :P -- Dominique
Re: Re: BUG: corrupted list in p9_read_work
Dominique Martinet wrote on Wed, Oct 10, 2018: > It works though, is it just picky because I didn't end it in .git? let's > try again, sorry for the noise... > > #syz test: git://github.com/martinetd/linux.git > e4ca13f7d075e551dc158df6af18fb412a1dba0a And I guess the commit hash needs to be in the default clone branch to work ? ('git fetch ' happily fetches the commit in a new clone for me... But that feels like a github specific behaviour maybe) Oh, well; made a branch for it, last try for me. #syz test: git://github.com/martinetd/linux.git for-syzbot -- Dominique
Re: [PATCH] 9p: potential NULL dereference
Dan Carpenter wrote on Wed, Sep 26, 2018: > p9_tag_alloc() is supposed to return error pointers, but we accidentally > return a NULL here. It would cause a NULL dereference in the caller. > > Fixes: 996d5b4db4b1 ("9p: Use a slab for allocating requests") > Signed-off-by: Dan Carpenter Good catch, the culprit commit is only in -next so just adding this to the queue right away. Thanks! -- Dominique
Re: KASAN: invalid-free in p9stat_free
syzbot wrote on Sun, Aug 26, 2018: > HEAD commit:e27bc174c9c6 Add linux-next specific files for 20180824 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=15dc19a640 > kernel config: https://syzkaller.appspot.com/x/.config?x=28446088176757ea > dashboard link: https://syzkaller.appspot.com/bug?extid=d4252148d198410b864f > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f8efba40 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1178256a40 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+d4252148d198410b8...@syzkaller.appspotmail.com > > random: sshd: uninitialized urandom read (32 bytes read) > random: sshd: uninitialized urandom read (32 bytes read) > random: sshd: uninitialized urandom read (32 bytes read) > random: sshd: uninitialized urandom read (32 bytes read) > == > BUG: KASAN: double-free or invalid-free in p9stat_free+0x35/0x100 > net/9p/protocol.c:48 That looks straight-forward enough, p9pdu_vreadf does p9stat_free on error then v9fs_dir_readdir does the same ; there is nothing else that could return an error without going through the first free so we could just remove the later one... There are a couple other users of the 'S' pdu read (that reads the stat struct and frees it on error), so it's probably best to keep the current behaviour as far as this is concerned, what we could do though is make the free function idempotent (write NULLs in the freed fields), but I do not see this being done often, do you know what the policy is about this kind of pattern nowadays? The struct is cleanly zeroed before being read so there is no risk of double-frees between iterations so zeroing pointers is not strictly required, but it does make things safer in general. -- Dominique Martinet
[PATCH] 9p/xen: fix check for xenbus_read error in front_probe
From: Dominique Martinet If the xen bus exists but does not expose the proper interface, it is possible to get a non-zero length but still some error, leading to strcmp failing trying to load invalid memory addresses e.g. fffe. There is then no need to check length when there is no error, as the xenbus driver guarantees that the string is nul-terminated. Signed-off-by: Dominique Martinet Cc: Stefano Stabellini Cc: Eric Van Hensbergen Cc: Latchesar Ionkov --- This is a trivial bug I stumbled on when setting up xen with p9fs and running the VM in pvm: it had enough in the bus to trigger the probe but then there was no version and it tried to return ENOENT but len was set to the lower-level message size. net/9p/trans_xen.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/9p/trans_xen.c b/net/9p/trans_xen.c index 1a5b38892eb4..f76beadddfc3 100644 --- a/net/9p/trans_xen.c +++ b/net/9p/trans_xen.c @@ -391,9 +391,9 @@ static int xen_9pfs_front_probe(struct xenbus_device *dev, unsigned int max_rings, max_ring_order, len = 0; versions = xenbus_read(XBT_NIL, dev->otherend, "versions", &len); - if (!len) - return -EINVAL; + if (IS_ERR(versions)) + return PTR_ERR(versions); if (strcmp(versions, "1")) { kfree(versions); return -EINVAL; } -- 2.17.1
Re: KCM - recvmsg() mangles packets?
Tom Herbert wrote on Thu, Aug 09, 2018: > > diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c > > index 625acb27efcc..348ff5945591 100644 > > --- a/net/strparser/strparser.c > > +++ b/net/strparser/strparser.c > > @@ -222,6 +222,16 @@ static int __strp_recv(read_descriptor_t *desc, struct > > sk_buff *orig_skb, > > if (!stm->strp.full_len) { > > ssize_t len; > > > > + /* Can only parse if there is no offset */ > > + if (unlikely(stm->strp.offset)) { > > + if (!pskb_pull(skb, stm->strp.offset)) { > > + > > STRP_STATS_INCR(strp->stats.mem_fail); > > + strp_parser_err(strp, -ENOMEM, > > desc); > > + break; > > + } > > + stm->strp.offset = 0; > > + } > > + > > Seems okay to me for a fix. Hmm, if you say so, I'll send this as a patch for broader comments right away. > Looks like strp.offset is only set in one place and read in one > place. With this pull maybe that just can go away? Good point, when strp.offset is set the full_len is also being init'd so we will necessarily do the pull next... But the way tls uses strparser is also kind of weird, since they modify the strp_msg's offset and full_len, I wouldn't want to assume we can't have full_len == 0 *again* later with a non zero offset... On the other hand they do handle non-zero offset in their parse function so they'd be ok with that... Ultimately it's probably closer to a design choice than anything else. I'll still send a v0 of the patch as is, because I feel it's easier to understand that we pull because the existing parse_msg functions do not handle it properly, and will write a note that I intend to move it up a few lines as a comment. Thanks, -- Dominique Martinet
Re: KCM - recvmsg() mangles packets?
Dominique Martinet wrote on Sun, Aug 05, 2018: > It's getting late but I'll try adding a pskb_pull in there tomorrow, it > would be better to make the bpf program start with an offset but I don't > think that'll be easy to change... I can confirm the following patch fixes the issue for me: -8<- diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c index 625acb27efcc..348ff5945591 100644 --- a/net/strparser/strparser.c +++ b/net/strparser/strparser.c @@ -222,6 +222,16 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb, if (!stm->strp.full_len) { ssize_t len; + /* Can only parse if there is no offset */ + if (unlikely(stm->strp.offset)) { + if (!pskb_pull(skb, stm->strp.offset)) { + STRP_STATS_INCR(strp->stats.mem_fail); + strp_parser_err(strp, -ENOMEM, desc); + break; + } + stm->strp.offset = 0; + } + len = (*strp->cb.parse_msg)(strp, head); if (!len) { 8<-- Now, I was looking at other users of strparser (I see sockmap, kcm and tls) and it looks like sockmap does not handle offsets either but tls does by using skb_copy_bits -- they're copying the tls header to a buffer on the stack. kcm cannot do that because we do not know how much data the user expects to read, and I'm not comfortable doing pskb_pull in the kcm callback either, but the cost of this pull is probably non-negligible if some user can make do without it... On the other hand, I do not see how to make the bpf program handle an offset in the skb as that offset is strparser-specific. Maybe add a flag in the cb that specifies wether the callback allows non-zero offset? I'll let you see if you can reproduce this and will wait for advices on how to solve this properly so we can work on a proper fix. Thanks, -- Dominique
Re: KCM - recvmsg() mangles packets?
Dominique Martinet wrote on Sun, Aug 05, 2018: > (I'm not sure about offset, since we pass the full skb to parse message, > wouldn't it look at the start of the buffer everytime? Well, offset > seems to be 0 everytime the first time that check fails so I can > probably ignore that for now...) Oh, this might actually not have been such a bad remark; if I have the client write two "messages" in a single write() kcm seems to reliably fail the same way... Conversely, if I setsockopt(s, IPPROTO_TCP, TCP_NODELAY...) on the sender socket, *and* make it wait till the kcm socket has been created to start sending data, then it dramatically reduces the probability of this happening (I had to let the reproducer run in a loop for 5 minutes, wheras it used to happen within seconds). So I think the problem is packet aggregation, and strparser not handling that properly... The first packet still fails with TCP_NODELAY but there's probably aggregation on the recv side as well before the socket is attached to the multiplexor... I guess the low probability failure that is still happening could be similar. (I also noticed that I've mistakedly believed that the problem was the first packet contained the 2nd packet's data because of an off-by-one in the receiver, it really is the second packet, it only has the first packet's length) I've moved my check from kcm_rcv_strparser to just before parse_msg in __strp_recv and surely enough it fails everytime there's an offset. It's getting late but I'll try adding a pskb_pull in there tomorrow, it would be better to make the bpf program start with an offset but I don't think that'll be easy to change... I'm almost done spamming, thanks for being a good rubber duck! :p -- Dominique Martinet
Re: KCM - recvmsg() mangles packets?
Dominique Martinet wrote on Sat, Aug 04, 2018: > I talked too fast, I can get this to fail on later packets e.g. > Got 18, expected 31 on 452nd message: 453453453453453453; flags: 80 > > The content is 453 in a loop so this really is the 453rd packet... > > But being slower e.g. doing that usleep after every single packets and I > could let the loop run until 100k without a hintch. > > > There really has to be something wrong, I just can't tell what from > looking at the code with my naive eyes. I've spent a few hours debugging this weekend, I'm still not confident I understand how this all works but it's a bit better than Friday. I've tried updating kcm_recvmsg()'s use of kcm_wait_data() to use skb_dequeue() instead of skb_peek(), with the idea that if the skb is off the list it would have less chance to be messed with, and that didn't seem to change anything so I was probably looking at the wrong place. So I went one step up to how we split the packet and the interface with strparser and added an extra check of the protocol in kcm_rcv_strparser. If I understand correctly, with stm = strp_msg(skb), stm->full_len contains the output of the bpf progran and must match something like this in the network byte-order version: ntohl(*((u32 *)(skb->data + stm->offset))) (I'm not sure about offset, since we pass the full skb to parse message, wouldn't it look at the start of the buffer everytime? Well, offset seems to be 0 everytime the first time that check fails so I can probably ignore that for now...) Just like the test program, that extra check seems to work but will fail everytime the test program detects and error, although the data I can see does not always match what the program sees, the packet would seem to already be incorrect when it is added to the kcm queue... So there must be something happening to the skb between parse_msg() and rcv_msg() in __strp_recv() ...? >From what I can see it's looking at a cloned skb so the headers are private but the data is shared, but it looks like the tcp socket (csock in kcm_attach) is properly locked at this time, so I'm not sure what could corrupt the buffer. I'll keep looking at this a bit more but any help is appreciated, Thanks, -- Dominique Martinet | Asmadeus
Re: KCM - recvmsg() mangles packets?
Dominique Martinet wrote on Sat, Aug 04, 2018: > Actually, now I'm looking closer to the timing, it looks specific to the > connection setup. This send loop works: > int i = 1; > while(i <= 1000) { > int len = (i++ * 1312739ULL) % 31 + 1; > my_msg.hdr.len = htonl(len); > for (int j = 0; j < len; ) { > j += snprintf(my_msg.data + j, len - j, > "%i", i - 1); > } > my_msg.data[len-1] = '\0'; > //printf("%d: writing %d\n", i-1, len); > len = write(s, &my_msg, sizeof(my_msg.hdr) + len); > if (error == -1) > err(EXIT_FAILURE, "write"); > if (i == 2) > usleep(1); > } > > But removing the usleep(1) after the first packet makes recvmsg() > "fail": it reads the content of the second packet with the size of the > first. I talked too fast, I can get this to fail on later packets e.g. Got 18, expected 31 on 452nd message: 453453453453453453; flags: 80 The content is 453 in a loop so this really is the 453rd packet... But being slower e.g. doing that usleep after every single packets and I could let the loop run until 100k without a hintch. There really has to be something wrong, I just can't tell what from looking at the code with my naive eyes. Maybe we need to lock both the tcp and the kcm sockets? Thanks, -- Dominique
Re: KCM - recvmsg() mangles packets?
Tom Herbert wrote on Fri, Aug 03, 2018: > On Fri, Aug 3, 2018 at 4:20 PM, Dominique Martinet > wrote: > > Tom Herbert wrote on Fri, Aug 03, 2018: > >> struct my_proto { > >>struct _hdr { > >>uint32_t len; > >> } hdr; > >> char data[32]; > >> } __attribute__((packed)); > >> > >> // use htons to use LE header size, since load_half does a first convertion > >> // from network byte order > >> const char *bpf_prog_string = " \ > >> ssize_t bpf_prog1(struct __sk_buff *skb) \ > >> { \ > >> return bpf_htons(load_half(skb, 0)) + 4; \ > >> }"; > > > > (Just to make sure I did fix it to htonl(load_word()) and I can confirm > > there is no difference) > > You also need to htonl for > > my_msg.hdr.len = (i++ * 1312739ULL) % 31 + 1; Thanks, but this looks correct to me - I was writing the header in little endian order here and doing the double-swap dance in the bpf prog because the protocol I was considering making a KCM implementation for uses that. Just to make sure, I rewrote it using network byte order e.g. these three points and this makes no difference: ---8<-- diff --git a/kcm.c b/kcm.c index cb48df1..d437226 100644 --- a/kcm.c +++ b/kcm.c @@ -36,7 +36,7 @@ struct my_proto { const char *bpf_prog_string = "\ ssize_t bpf_prog1(struct __sk_buff *skb) \ { \ - return bpf_htons(load_half(skb, 0)) + 4;\ + return load_word(skb, 0) + 4; \ }"; int servsock_init(int port) @@ -110,13 +110,15 @@ void client(int port) int i = 1; while(1) { - my_msg.hdr.len = (i++ * 1312739ULL) % 31 + 1; - for (int j = 0; j < my_msg.hdr.len; ) { - j += snprintf(my_msg.data + j, my_msg.hdr.len - j, "%i", i - 1); + int len = (i++ * 1312739ULL) % 31 + 1; + my_msg.hdr.len = htonl(len); + for (int j = 0; j < len; ) { + j += snprintf(my_msg.data + j, len - j, + "%i", i - 1); } - my_msg.data[my_msg.hdr.len-1] = '\0'; - //printf("%d: writing %d\n", i-1, my_msg.hdr.len); - len = write(s, &my_msg, sizeof(my_msg.hdr) + my_msg.hdr.len); + my_msg.data[len-1] = '\0'; + //printf("%d: writing %d\n", i-1, len); + len = write(s, &my_msg, sizeof(my_msg.hdr) + len); if (error == -1) err(EXIT_FAILURE, "write"); //usleep(1); @@ -171,9 +173,10 @@ void process(int kcmfd) len = recvmsg(kcmfd, &msg, 0); if (len == -1) err(EXIT_FAILURE, "recvmsg"); - if (len != my_msg.hdr.len + 4) { + if (len != ntohl(my_msg.hdr.len) + 4) { printf("Got %d, expected %d on %dth message: %s; flags: %x\n", - len - 4, my_msg.hdr.len, i, my_msg.data, msg.msg_flags); + len - 4, ntohl(my_msg.hdr.len), i, + my_msg.data, msg.msg_flags); exit(1); } i++; 8<--- Frankly I do not believe this is a rule problem, as if the length splitting was incorrect the program would not work at all, but just uncommenting the usleep on the sender side makes this work. Actually, now I'm looking closer to the timing, it looks specific to the connection setup. This send loop works: int i = 1; while(i <= 1000) { int len = (i++ * 1312739ULL) % 31 + 1; my_msg.hdr.len = htonl(len); for (int j = 0; j < len; ) { j += snprintf(my_msg.data + j, len - j, "%i", i - 1); } my_msg.data[len-1] = '\0'; //printf("%d: writing %d\n", i-1, len); len = write(s, &my_msg, sizeof(my_msg.hdr) + len); if (error == -1) err(EXIT_FAILURE, "write"); if (i == 2) usleep(1); } But removing the usleep(1) after the first packet makes recvmsg() "fail": it reads the content of the second packet with the size of the first. I assume that usleep gives the server time to finish setting up the kcm socket, because it does accept(); ioctl(SIOCKCMATTACH); recvmsg(); but the client does not wait to send packets so there could be some sort of race with the attach and multiple packets? FWIW I took the time to look at older kernel and this has been happening ever since KCM got introduced in 4.6 Thanks, -- Dominique
Re: KCM - recvmsg() mangles packets?
Tom Herbert wrote on Fri, Aug 03, 2018: > struct my_proto { >struct _hdr { >uint32_t len; > } hdr; > char data[32]; > } __attribute__((packed)); > > // use htons to use LE header size, since load_half does a first convertion > // from network byte order > const char *bpf_prog_string = " \ > ssize_t bpf_prog1(struct __sk_buff *skb) \ > { \ > return bpf_htons(load_half(skb, 0)) + 4; \ > }"; > > The length in hdr is uint32_t above, but this looks like it's being > read as a short. Err, I agree this is obviously wrong here (I can blame my lack of attention to this and the example I used), but this isn't the problem as the actual size is between 0 and 32 -- I could use any size I want here and the result would the same. A "real" problem with the conversion program would mean that my example would not work if I slow it down, but I can send as many packet as I want if I uncomment the usleep() on the client side or if I just throttle the network stack with a loud tcpdump writing to stdout -- that means the algorithm is working even if it's making some badly-sized conversions. (Just to make sure I did fix it to htonl(load_word()) and I can confirm there is no difference) Thanks, -- Dominique Martinet
KCM - recvmsg() mangles packets?
I've been playing with KCM on a 4.18.0-rc7 kernel and I'm running in a problem where the iovec filled by recvmsg() is mangled up: it is filled by the length of one packet, but contains (truncated) data from another packet, rendering KCM unuseable. (I haven't tried old kernels to see for how long this is broken/try to bisect; I might if there's no progress but this might be simpler than I think) I've attached a reproducer, a simple program that forks, creates a tcp server/client, attach the server socket to a kcm socket, and in an infinite loop sends varying-length messages from the client to the server. The loop stops when the server gets a message which length is not the length indicated in the packet header, rather fast (I can make it run for a while if I slow down emission, or if I run a verbose tcpdump for example) In the quiet version on a VM on my laptop, I get this output: [root@f2 ~]# gcc -g -l bcc -o kcm kcm.c [root@f2 ~]# ./kcm client is starting server is starting server is receiving data Got 14, expected 27 on 1th message: 22; flags: 80 The client sends message deterministacally, first one is 14 bytes filled with 1, second one is 27 bytes filled with 2, third one is 9 bytes filled with 3 etc (final digit is actually a \0 instead) As we can see, the server received 14 '2', and the header size matches the second message header, so something went wrong™. Flags 0x80 is MSG_EOR meaning recvmsg copied the full message. This happens even if I reduce the VMs CPU to 1, so I was thinking some irq messes with the sock between skb_peek and the actual copy of the data (as this deos work if I send slowly!), but even disabling irq/preempt doesn't seem to help so I'm not sure what to try next. Any idea? Thanks, -- Dominique Martinet /* * A sample program of KCM. * Originally https://gist.github.com/peo3/fd0e266a3852d3422c08854aba96bff5 * * $ gcc -lbcc kcm-sample.c * $ ./a.out 1 */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct my_proto { struct _hdr { uint32_t len; } hdr; char data[32]; } __attribute__((packed)); // use htons to use LE header size, since load_half does a first convertion // from network byte order const char *bpf_prog_string = "\ ssize_t bpf_prog1(struct __sk_buff *skb) \ { \ return bpf_htons(load_half(skb, 0)) + 4; \ }"; int servsock_init(int port) { int s, error; struct sockaddr_in addr; s = socket(AF_INET, SOCK_STREAM, 0); addr.sin_family = AF_INET; addr.sin_port = htons(port); addr.sin_addr.s_addr = INADDR_ANY; error = bind(s, (struct sockaddr *)&addr, sizeof(addr)); if (error == -1) err(EXIT_FAILURE, "bind"); error = listen(s, 10); if (error == -1) err(EXIT_FAILURE, "listen"); return s; } int bpf_init(void) { int fd, map_fd; void *mod; int key; long long value = 0; mod = bpf_module_create_c_from_string(bpf_prog_string, 0, NULL, 0); fd = bpf_prog_load( BPF_PROG_TYPE_SOCKET_FILTER, "bpf_prog1", bpf_function_start(mod, "bpf_prog1"), bpf_function_size(mod, "bpf_prog1"), bpf_module_license(mod), bpf_module_kern_version(mod), 0, NULL, 0); if (fd == -1) exit(1); return fd; } void client(int port) { int s, error; struct sockaddr_in addr; struct hostent *host; struct my_proto my_msg; int len; printf("client is starting\n"); s = socket(AF_INET, SOCK_STREAM, 0); if (s == -1) err(EXIT_FAILURE, "socket"); memset(&addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_port = htons(port); host = gethostbyname("localhost"); if (host == NULL) err(EXIT_FAILURE, "gethostbyname"); memcpy(&addr.sin_addr, host->h_addr, host->h_length); error = connect(s, (struct sockaddr *)&addr, sizeof(addr)); if (error == -1) err(EXIT_FAILURE, "connect"); len = sprintf(my_msg.data, "1234567890123456789012345678901"); my_msg.data[len] = '\0'; my_msg.hdr.len = len + 1; int i = 1; while(1) { my_msg.hdr.len = (i++ * 1312739ULL) % 31 + 1; for (int j = 0; j < my_msg.hdr.len; ) { j += snprintf(my_msg.data + j, my_msg.hdr.len - j, "%i", i - 1); } my_msg.data[my_msg.hdr.len-1] = '\0'; //printf("%d: writing %d\n", i-1, my_msg.hdr.len); len = write(s, &my_msg, sizeof(my_msg.hdr) + my_msg.hdr.len); if (error == -1) err(EXIT_FAILURE, "write"); //usleep(1); } close(s); } int kcm_init(void) { int kcmfd; kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED); if (kcmfd == -1) err(EXIT_FAILURE, "socket(AF_KCM)"); return kcmfd; } int kcm_attach(int kcmfd, int csock, int bpf_prog_fd) { int error; struct kcm_attach attach_info = { .fd = csock, .bpf_fd = bpf_prog_fd, }; error = ioctl(k
Re: [PATCH PATCH net-next 10/18] 9p: fix whitespace issues
(removed tons of Cc to make v9fs-developer@ happy - please do include linux-ker...@vger.kernel.org next time though, I only saw this patch by chance snooping at netdev) Stephen Hemminger wrote on Tue, Jul 24, 2018: > Remove trailing whitespace and blank lines at EOF > > Signed-off-by: Stephen Hemminger LGTM, I'm not sure if someone would pick the whole series but as other maintainers seem to have taken individual patches I've taken this one as there's potential conflicts with what I have (planning to remove the whole net/9p/util.c file) > --- > net/9p/client.c | 4 ++-- > net/9p/trans_virtio.c | 2 +- > net/9p/util.c | 1 - > 3 files changed, 3 insertions(+), 4 deletions(-) > > diff --git a/net/9p/client.c b/net/9p/client.c > index 5c1343195292..ff02826e0407 100644 > --- a/net/9p/client.c > +++ b/net/9p/client.c > @@ -341,7 +341,7 @@ struct p9_req_t *p9_tag_lookup(struct p9_client *c, u16 > tag) >* buffer to read the data into */ > tag++; > > - if(tag >= c->max_tag) > + if (tag >= c->max_tag) > return NULL; > > row = tag / P9_ROW_MAXTAG; > @@ -1576,7 +1576,7 @@ p9_client_read(struct p9_fid *fid, u64 offset, struct > iov_iter *to, int *err) > int count = iov_iter_count(to); > int rsize, non_zc = 0; > char *dataptr; > - > + > rsize = fid->iounit; > if (!rsize || rsize > clnt->msize-P9_IOHDRSZ) > rsize = clnt->msize - P9_IOHDRSZ; > diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c > index 05006cbb3361..279b24488d79 100644 > --- a/net/9p/trans_virtio.c > +++ b/net/9p/trans_virtio.c > @@ -446,7 +446,7 @@ p9_virtio_zc_request(struct p9_client *client, struct > p9_req_t *req, > out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM, > out_pages, out_nr_pages, offs, outlen); > } > - > + > /* >* Take care of in data >* For example TREAD have 11. > diff --git a/net/9p/util.c b/net/9p/util.c > index 59f278e64f58..55ad98277e85 100644 > --- a/net/9p/util.c > +++ b/net/9p/util.c > @@ -138,4 +138,3 @@ int p9_idpool_check(int id, struct p9_idpool *p) > return idr_find(&p->pool, id) != NULL; > } > EXPORT_SYMBOL(p9_idpool_check); > - -- Dominique Martinet
Re: tcp hang when socket fills up ?
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018: > Thanks for the testing! One more line is required, however: we have to get > the assured bit set for the connection, see the new patch below. I think it actually was better before. If I understand things correctly at this point (when we get in the case TCP_CONNTRACK_SYN_RECV) we will have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet. Leaving old state as it was will not set the assured bit, but that will be set on the next packet because old_state == new_state == established at that point and the connection will really be setup then. I don't think anything will blow up if we do either way, but strictly speaking I'm more comfortable with the former. I'll test the new patch regardless, I left work so can't reproduce anymore but will yell tomorrow if it does explode ;) > The tcp_conntracks state table could be fixed with introducing a new > state, but that part is exposed to userspace (ctnetlink) and ugly > compatibility code would be required for backward compatibility. I agree a new state is more work than it is worth, I'm happy to leave it as is. -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Dominique Martinet wrote on Wed, Apr 18, 2018: > Jozsef Kadlecsik wrote on Wed, Apr 18, 2018: > > Yes, the state transition is wrong for simultaneous open, because the > > tcp_conntracks table is not (cannot be) smart enough. Could you verify the > > next untested patch? > > Thanks for the patch; I'll give it a try (probably won't make it today > so will report tomorrow) Actually had time; I can confirm (added printks) we did get in that if that was pointed at, and we no longer get there now. The connection no longer gets in invalid state, so that looks like it nailed it. I'm now confused what this has to do with tcp_timestamp though, since setting that off also seemed to work around the issue, but if we get something like that in I'll be happy anyway. Big props to everyone involved, I would have taken much longer alone, -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018: > Yes, the state transition is wrong for simultaneous open, because the > tcp_conntracks table is not (cannot be) smart enough. Could you verify the > next untested patch? Thanks for the patch; I'll give it a try (probably won't make it today so will report tomorrow) -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Michal Kubecek wrote on Tue, Apr 17, 2018: > Data (21 bytes) packet in reply direction. And somewhere between the > first and second debugging print, we ended up with sender scale=0 and > that value is then preserved from now on. > > The only place between the two debug prints where we could change only > one of the td_sender values are the two calls to tcp_options() but > neither should be called now unless I missed something. I'll try to > think about it some more. Could it have something to do with the way I setup the connection? I don't think the "both remotes call connect() with carefully selected source/dest port" is a very common case.. If you look at the tcpdump outputs I attached the sequence usually is something like server > client SYN client > server SYN server > client SYNACK client > server ACK ultimately it IS a connection, but with an extra SYN packet in front of it (that first SYN opens up the conntrack of the nat so that the client's syn can come in, the client's conntrack will be that of a normal connection since its first SYN goes in directly after the server's (it didn't see the server's SYN)) Looking at my logs again, I'm seeing the same as you: This looks like the actual SYN/SYN/SYNACK/ACK: - 14.364090 seq=505004283 likely SYN coming out of server - 14.661731 seq=1913287797 on next line it says receiver end=505004284 so likely the matching SYN from client Which this time gets a proper SYNACK from server: 14.662020 seq=505004283 ack=1913287798 And following final dataless ACK: 14.687570 seq=1913287798 ack=505004284 Then as you point out some data ACK, where the scale poofs: 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 14.688824 tcp_in_window: 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7 As you say, only tcp_options() will clear only on side of the scales. We don't have sender->td_maxwin == 0 (printed) so I see no other way than we are in the last else if: - we have after(end, sender->td_end) (end=1913287819 > sender end=1913287798) - I assume the tcp state machine must be confused because of the SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, but since this is a data packet it doesn't have the tcp option for scale thus scale resets. At least peeling the logs myself helped me follow the process, I'll sprinkle some carefully crafted logs tomorrow to check if this is true and will let you figure what is best of trying to preserve scale if it was set before, setting a default to 14 or something else. Thanks! -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Dominique Martinet wrote on Mon, Apr 16, 2018: > . . . Oh, there is something interesting there, the connection doesn't > come up with -G? Hm, sorry, I take this last part back. I cannot reproduce -G not working reliably. I'll dig around the conntrack table a bit more. -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
rack -G --protonum tcp --src server --dst client --sport 15080 --dport 32979 tcp 6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979 src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=3 conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown. hang: # conntrack -G --protonum tcp --src server --dst client --sport 21308 --dport 37552 conntrack v1.4.4 (conntrack-tools): 0 flow entries have been shown. So something happened that makes it show up in -L (table dump) but not when querying...? And only when there is enough traffic: I have previously kept such a connection without workaround for hours just fine as long as I made sure not to display more than a screen at a time. Thanks again, -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
bled (after running my reproducer, e.g. outputing a big chunk of text quickly): reno wscale:7,7 rto:228 rtt:27.267/1.423 ato:40 mss:1386 pmtu:1500 rcvmss:1248 advmss:1460 cwnd:10 ssthresh:12 bytes_acked:17311070 bytes_received:40445 segs_out:13331 segs_in:7523 data_segs_out:13279 data_segs_in:947 send 4.1Mbps lastsnd:5 lastrcv:5 lastack:5 pacing_rate 4.9Mbps delivery_rate 431.4Kbps app_limited busy:36064ms unacked:1 retrans:0/6 rcv_rtt:9112.95 rcv_space:29233 rcv_ssthresh:39184 minrtt:25.566 So I guess lost:33 matching unacked:33 might be another hint? I'll need a bit more time reading the code to understand what this all implies ; feel free to beat me to it. Thanks, -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Eric Dumazet wrote on Fri, Apr 13, 2018: > Ah sorry, your trace was truncated, we need more packets _before_ the excerpt. Ah, sorry as well. I tried to go far back enough to include the replayed packets, but I see it didn't include the original ack for that packet. I'm resending both full traces from connect to fin this time. > That might be caused by some TS val/ecr breakage : > > Many acks were received by the server tcpdump, > but none of them was accepted by TCP stack, for some reason. > > Try to disable TCP timestamps, it will give some hint if bug does not > reproduce. I unfortunately cannot reproduce on a local network, but will give this a try on Monday and report. (sysctl net.ipv4.tcp_timestamps=0, correct?) Here's the traces again, server-capture first: 16:49:22.374452 IP .13317 > .31872: Flags [S], seq 2026966826, win 29200, options [mss 1460,sackOK,TS val 1313933281 ecr 0,nop,wscale 7], length 0 16:49:22.700553 IP .31872 > .13317: Flags [S], seq 882075258, win 29200, options [mss 1386,sackOK,TS val 1617125446 ecr 0,nop,wscale 7], length 0 16:49:22.700577 IP .13317 > .31872: Flags [S.], seq 2026966826, ack 882075259, win 29200, options [mss 1460,sackOK,TS val 1313933607 ecr 1617125446,nop,wscale 7], length 0 16:49:22.727131 IP .31872 > .13317: Flags [.], ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 0 16:49:22.729193 IP .31872 > .13317: Flags [P.], seq 1:22, ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 21 16:49:22.729230 IP .13317 > .31872: Flags [.], ack 22, win 229, options [nop,nop,TS val 1313933635 ecr 1617125472], length 0 16:49:22.732726 IP .13317 > .31872: Flags [P.], seq 1:22, ack 22, win 229, options [nop,nop,TS val 1313933639 ecr 1617125472], length 21 16:49:22.759809 IP .31872 > .13317: Flags [.], ack 22, win 229, options [nop,nop,TS val 1617125503 ecr 1313933639], length 0 16:49:22.759835 IP .13317 > .31872: Flags [P.], seq 22:734, ack 22, win 229, options [nop,nop,TS val 1313933666 ecr 1617125503], length 712 16:49:22.761643 IP .31872 > .13317: Flags [P.], seq 22:1270, ack 22, win 229, options [nop,nop,TS val 1617125504 ecr 1313933639], length 1248 16:49:22.801860 IP .13317 > .31872: Flags [.], ack 1270, win 248, options [nop,nop,TS val 1313933708 ecr 1617125504], length 0 16:49:22.826902 IP .31872 > .13317: Flags [.], ack 734, win 240, options [nop,nop,TS val 1617125572 ecr 1313933666], length 0 16:49:22.827576 IP .31872 > .13317: Flags [P.], seq 1270:1318, ack 734, win 240, options [nop,nop,TS val 1617125573 ecr 1313933708], length 48 16:49:22.827600 IP .13317 > .31872: Flags [.], ack 1318, win 248, options [nop,nop,TS val 1313933734 ecr 1617125573], length 0 16:49:22.833028 IP .13317 > .31872: Flags [P.], seq 734:1122, ack 1318, win 248, options [nop,nop,TS val 1313933739 ecr 1617125573], length 388 16:49:22.858455 IP .31872 > .13317: Flags [.], ack 1122, win 251, options [nop,nop,TS val 1617125604 ecr 1313933739], length 0 16:49:22.865866 IP .31872 > .13317: Flags [P.], seq 1318:1334, ack 1122, win 251, options [nop,nop,TS val 1617125612 ecr 1313933739], length 16 16:49:22.906865 IP .13317 > .31872: Flags [.], ack 1334, win 248, options [nop,nop,TS val 1313933813 ecr 1617125612], length 0 16:49:22.944474 IP .31872 > .13317: Flags [P.], seq 1334:1386, ack 1122, win 251, options [nop,nop,TS val 1617125678 ecr 1313933813], length 52 16:49:22.944497 IP .13317 > .31872: Flags [.], ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], length 0 16:49:22.944747 IP .13317 > .31872: Flags [P.], seq 1122:1174, ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], length 52 16:49:22.971083 IP .31872 > .13317: Flags [P.], seq 1386:1454, ack 1174, win 251, options [nop,nop,TS val 1617125716 ecr 1313933851], length 68 16:49:22.971607 IP .13317 > .31872: Flags [P.], seq 1174:1258, ack 1454, win 248, options [nop,nop,TS val 1313933878 ecr 1617125716], length 84 16:49:22.998201 IP .31872 > .13317: Flags [P.], seq 1454:2082, ack 1258, win 251, options [nop,nop,TS val 1617125742 ecr 1313933878], length 628 16:49:22.998987 IP .13317 > .31872: Flags [P.], seq 1258:1838, ack 2082, win 268, options [nop,nop,TS val 1313933905 ecr 1617125742], length 580 16:49:23.065102 IP .31872 > .13317: Flags [.], ack 1838, win 262, options [nop,nop,TS val 1617125810 ecr 1313933905], length 0 16:49:24.811297 IP .31872 > .13317: Flags [P.], seq 2082:3254, ack 1838, win 262, options [nop,nop,TS val 1617127527 ecr 1313933905], length 1172 16:49:24.812562 IP .13317 > .31872: Flags [P.], seq 1838:1874, ack 3254, win 287, options [nop,nop,TS val 1313935719 ecr 1617127527], length 36 16:49:24.838210 IP .31872 > .13317: Flags [.], ack 1874, win 262, options [nop,nop,TS val 1617127583 ecr 1313935719], length 0 16:49:24.838236 IP .13317 > .31872: Flags [P.], seq 1874:2534, ack 3254, win 287, options [nop,nop,TS val 1313935744 ecr 1617127583], length 660 16:49:24.839175 I
Re: tcp hang when socket fills up ?
seq 64980:66354, ack 4190, win 307, options [nop,nop,TS val 1313937638 ecr 1617129473], length 1374 16:49:26.754585 IP .31872 > .13317: Flags [.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], length 0 16:49:26.755260 IP .13317 > .31872: Flags [.], seq 66354:67728, ack 4190, win 307, options [nop,nop,TS val 1313937638 ecr 1617129473], length 1374 16:49:26.755271 IP .31872 > .13317: Flags [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 0 16:49:26.755493 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 36 16:49:26.756076 IP .13317 > .31872: Flags [.], seq 67728:69102, ack 4190, win 307, options [nop,nop,TS val 1313937639 ecr 1617129473], length 1374 16:49:26.756391 IP .13317 > .31872: Flags [.], seq 69102:70476, ack 4190, win 307, options [nop,nop,TS val 1313937639 ecr 1617129473], length 1374 16:49:26.756402 IP .31872 > .13317: Flags [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], length 0 16:49:26.757020 IP .13317 > .31872: Flags [.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374 16:49:26.757349 IP .13317 > .31872: Flags [.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 1617129473], length 1374 16:49:26.757360 IP .31872 > .13317: Flags [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], length 0 16:49:26.757878 IP .13317 > .31872: Flags [.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374 16:49:26.758457 IP .13317 > .31872: Flags [.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 1617129473], length 1374 16:49:26.758465 IP .31872 > .13317: Flags [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], length 0 16:49:26.828268 IP .13317 > .31872: Flags [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 1617129473], length 1374 16:49:26.868867 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], length 0 16:49:27.010733 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 ecr 1313937714], length 36 16:49:27.070035 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 1617129473], length 1374 16:49:27.070062 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:27.266769 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 ecr 1313937714], length 36 16:49:27.540998 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 1617129473], length 1374 16:49:27.541022 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:27.778687 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 ecr 1313937714], length 36 16:49:28.532876 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 1617129473], length 1374 16:49:28.532907 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:28.802879 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 ecr 1313937714], length 36 16:49:30.452830 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 1617129473], length 1374 16:49:30.452861 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:30.850734 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 ecr 1313937714], length 36 16:49:34.229420 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 1617129473], length 1374 16:49:34.229447 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:35.202854 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 ecr 1313937714], length 36 16:49:42.165044 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 1617129473], length 1374 16:49:42.165073 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:43.394773 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 ecr 1313937714], length 36 Once again, thank you both. -- Dominique Martinet | Asmadeus
Re: tcp hang when socket fills up ?
Note this is mostly me talking to myself, but in case anyone else hits the same issues I might as well post my meagre progress. Dominique Martinet wrote on Fri, Apr 06, 2018: > (current kernel: vanilla 4.14.29) reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64 kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen any patch about that kind of stuff recently so probably still valid) Other main difference is the qdisc, VM is using fq_codel, host is directly on wireless so mq with 4 pfifo_fast queues - it is harder to reproduce on the VM (even on another VM with the same kernel) so I'd put the difference down to the qdisc, but I was able to reproduce with both ultimately. (update: it actually was still fairly easy to reproduce until it got later (coworkers left?), going from ~5-15s to reproduce to multiple minutes, so this likely depends on net quality a lot. I couldn't reproduce by fiddling with netem on a local network though...) > [please find previous email for setup/tcpdump output] So I have a crash dump with a socket / inet_sock that are blocked, since this is a VM I even get gdb in bonus.. With the crash dump, I can confirm that the socket is not available for writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1), but that doesn't help much if I can't tell why we're not taking acks in With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I think that's the function that should handle my ack, and that gets the replay. First, abusing next, the flow seems to be something like (I folded if/else not taken) static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); struct tcp_sacktag_state sack_state; struct rate_sample rs = { .prior_delivered = 0 }; u32 prior_snd_una = tp->snd_una; bool is_sack_reneg = tp->is_sack_reneg; u32 ack_seq = TCP_SKB_CB(skb)->seq; u32 ack = TCP_SKB_CB(skb)->ack_seq; bool is_dupack = false; int prior_packets = tp->packets_out; u32 delivered = tp->delivered; u32 lost = tp->lost; int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */ u32 prior_fack; sack_state.first_sackt = 0; sack_state.rate = &rs; /* We very likely will need to access rtx queue. */ prefetch(sk->tcp_rtx_queue.rb_node); /* If the ack is older than previous acks * then we can probably ignore it. */ if (before(ack, prior_snd_una)) { } /* If the ack includes data we haven't sent yet, discard * this segment (RFC793 Section 3.9). */ if (after(ack, tp->snd_nxt)) if (after(ack, prior_snd_una)) { } prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una; rs.prior_in_flight = tcp_packets_in_flight(tp); /* ts_recent update must be made after we are sure that the packet * is in window. */ if (flag & FLAG_UPDATE_TS_RECENT) if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) { } else { u32 ack_ev_flags = CA_ACK_SLOWPATH; if (ack_seq != TCP_SKB_CB(skb)->end_seq) else NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPUREACKS); flag |= tcp_ack_update_window(sk, skb, ack, ack_seq); if (TCP_SKB_CB(skb)->sacked) if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) { } if (flag & FLAG_WIN_UPDATE) tcp_in_ack_event(sk, ack_ev_flags); } /* We passed data and got it acked, remove any soft error * log. Something worked... */ sk->sk_err_soft = 0; icsk->icsk_probes_out = 0; tp->rcv_tstamp = tcp_jiffies32; if (!prior_packets) goto no_queue; no_queue: /* If data was DSACKed, see if we can undo a cwnd reduction. */ if (flag & FLAG_DSACKING_ACK) tcp_fastretrans_alert(sk, prior_snd_una, is_dupack, &flag, &rexmit); /* If this ack opens up a zero window, clear backoff. It was * being used to time the probes, and is probably far higher than * it needs to be for normal retransmission. */ tcp_ack_probe(sk); if (tp->tlp_high_seq) tcp_process_tlp_ack(sk, ack, flag); return 1; And here is 'info local' towards th
tcp hang when socket fills up ?
p,nop,TS val 1617129506 ecr 1313937638], length 0 16:49:26.762199 IP .31872 > .13317: Flags [.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 0 16:49:26.763547 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], length 36 16:49:26.763553 IP .31872 > .13317: Flags [.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], length 0 16:49:26.764298 IP .31872 > .13317: Flags [.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], length 0 16:49:26.764676 IP .31872 > .13317: Flags [.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], length 0 16:49:26.807754 IP .13317 > .31872: Flags [.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 1617129473], length 1374 16:49:26.876467 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], length 0 16:49:27.048760 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 1617129473], length 1374 16:49:27.051791 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 ecr 1313937714], length 36 16:49:27.076444 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:27.371182 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 ecr 1313937714], length 36 16:49:27.519862 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 1617129473], length 1374 16:49:27.547662 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:27.883372 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 ecr 1313937714], length 36 16:49:28.511861 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 1617129473], length 1374 16:49:28.538891 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:28.907197 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 ecr 1313937714], length 36 16:49:30.431864 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 1617129473], length 1374 16:49:30.459127 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:30.955388 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 ecr 1313937714], length 36 16:49:34.207879 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 1617129473], length 1374 16:49:34.235726 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:35.256285 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 ecr 1313937714], length 36 16:49:42.143864 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 1617129473], length 1374 16:49:42.171531 IP .31872 > .13317: Flags [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 1313937714,nop,nop,sack 1 {32004:33378}], length 0 16:49:43.448262 IP .31872 > .13317: Flags [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 ecr 1313937714], length 36 (I still have the pcap files if someone wants to see them, but I'd rather not give my work IP publicly so will send it privately on request) At this point, only the same 3 packets keep being replayed over and over... According to 'ss' the Send-Q isn't empty on either side, the client has some ~1k to send but the server has much more (87k) After increasing the window size through net.*wmem sysctl it got stuck with over 1MB in Send-Q, which makes sense because the socket is full... So, what I don't get is, why are these acks continuously replayed? Given the timing it looks like the server doesn't take the client acks into account, despite having received that precise 33378 ack earlier and I believe it should accept a higher ack value anyway ? The ultimate question being, how can I go about debugging that? I'm working on getting perf probe/crash to work on the server so I can look at the kernel side now, I can reproduce this semi-easily so I'm sure I'll get down to it eventually, but if anyone has an idea I'm all ears. Thanks! -- Dominique Martinet | Asmadeus
Re: [V9fs-developer] [PATCH] net/9p: convert to new CQ API
Christoph Hellwig wrote on Thu, Mar 03, 2016: > New version with the nits fixed below. Now that checkpath started > a stupid warning about not using tabs for indentation which I've > ignored here and will take up in my usual fights against Joes > idicotic opinions separately.. Thanks for the nitpicks, I can confirm it works as expected as well so all good with me. I like the new CQ interface :) (if someone adds an Acked-by please use dominique.marti...@cea.fr for my mail; sorry for the split personality) -- Dominique
Re: [V9fs-developer] [PATCH] net/9p: convert to new CQ API
Hi, Couple of checkpatch complains: Christoph Hellwig wrote on Sat, Feb 27, 2016: > -struct p9_rdma_context { > - enum ib_wc_opcode wc_op; > +struct p9_rdma_context { trailing tab > - p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, status); > + p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, > wc->status); line over 80 chars That aside it looks good ; I need to check on the new API (hadn't noticed the change) but it looks nice. Will do the actual testing likely only next week only though; Eric has been taking my patches for 9p/RDMA so I suspect he'll take your's as well eventually (get_maintainer.pl has a long-ish list of CC for us usually) BTW I think it's easy enough to do the testing if you have a server that can dish it out. diod[1] and nfs-ganesha[2] are the only two I'm aware of but there might be more (using ganesha myself; happy to help you set it up in private if you need) [1] https://github.com/chaos/diod [2] https://github.com/nfs-ganesha/nfs-ganesha -- Dominique Martinet
[PATCH v2] 9p: trans_fd, bail out if recv fcall if missing
req->rc is pre-allocated early on with p9_tag_alloc and shouldn't be missing Signed-off-by: Dominique Martinet --- net/9p/trans_fd.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) Feel free to adapt error code/message if you can think of something better. diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index a270dcc..a6d89c0 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -356,13 +356,12 @@ static void p9_read_work(struct work_struct *work) } if (m->req->rc == NULL) { - m->req->rc = kmalloc(sizeof(struct p9_fcall) + - m->client->msize, GFP_NOFS); - if (!m->req->rc) { - m->req = NULL; - err = -ENOMEM; - goto error; - } + p9_debug(P9_DEBUG_ERROR, +"No recv fcall for tag %d (req %p), disconnecting!\n", +m->rc.tag, m->req); + m->req = NULL; + err = -EIO; + goto error; } m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall); memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [V9fs-developer] [PATCH] 9p: trans_fd, initialize recv fcall properly if not set
Eric Van Hensbergen wrote on Sat, Sep 05, 2015: > On Thu, Sep 3, 2015 at 4:38 AM, Dominique Martinet > wrote: > > To be honest, I think it might be better to just bail out if we get in > > this switch (m->req->rc == NULL after p9_tag_lookup) and not try to > > allocate more, because if we get there it's likely a race condition and > > silently re-allocating will end up in more troubles than trying to > > recover is worth. > > Thoughts ? > > > > Hmmm...trying to rattle my brain and remember why I put it in there > back in 2008. > It might have just been over-defensive programming -- or more likely it just > pre-dated all the zero copy infrastructure which pretty much guaranteed we had > an rc allocated and what is there is vestigial. I'm happy to accept a > patch which > makes this an assert, or perhaps just resets the connection because something > has gone horribly wrong (similar to the ENOMEM path that is there now). Yeah, it looks like the safety comes from the zero-copy stuff that came much later. Let's go with resetting the connection then. Hmm. EIO is a bit too generic so would be good to avoid that if possible, but can't think of anything better... Speaking of zero-copy, I believe it should be fairly straight-forward to implement for trans_fd now I've actually looked at it, since we do the payload read after a p9_tag_lookup, would just need m->req to point to a zc buffer. Write is similar, if there's a zc buffer just send it after the header. The cost is a couple more pointers in req and an extra if in both workers, that seems pretty reasonable. Well, I'm not using trans_fd much here (and unfortunately zero-copy isn't possible at all given the transport protocol for RDMA, at least for recv), but if anyone cares it probably could be done without too much hassle for the fd workers. -- Dominique -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] 9p: trans_fd, initialize recv fcall properly if not set
That code really should never be called (rc is allocated in tag_alloc), but if it had been it couldn't have worked... Signed-off-by: Dominique Martinet --- net/9p/trans_fd.c | 3 +++ 1 file changed, 3 insertions(+) To be honest, I think it might be better to just bail out if we get in this switch (m->req->rc == NULL after p9_tag_lookup) and not try to allocate more, because if we get there it's likely a race condition and silently re-allocating will end up in more troubles than trying to recover is worth. Thoughts ? diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index a270dcc..0d9831a 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -363,6 +363,9 @@ static void p9_read_work(struct work_struct *work) err = -ENOMEM; goto error; } + m->req->rc.capacity = m->client->msize; + m->req->rc.sdata = (char*)m->req->rc + + sizeof(struct p9_fcall); } m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall); memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] 9p: trans_fd, read rework to use p9_parse_header
Most of the changes here are no-op and just renaming to use a fcall struct, needed for p9_parse_header It fixes the unaligned memory access to read the tag and defers to common functions for part of the protocol knowledge (although header length is still hard-coded...) Reported-By: Rob Landley Signed-Off-By: Dominique Martinet --- net/9p/trans_fd.c | 75 +-- 1 file changed, 40 insertions(+), 35 deletions(-) It ended up alot bigger than I thought it'd be, submitting it anyway but happy with either version - letting Eric decide what's better :) diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index bced8c0..a270dcc 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -108,9 +108,7 @@ struct p9_poll_wait { * @unsent_req_list: accounting for requests that haven't been sent * @req: current request being processed (if any) * @tmp_buf: temporary buffer to read in header - * @rsize: amount to read for current frame - * @rpos: read position in current frame - * @rbuf: current read buffer + * @rc: temporary fcall for reading current frame * @wpos: write position for current frame * @wsize: amount of data to write for current frame * @wbuf: current write buffer @@ -131,9 +129,7 @@ struct p9_conn { struct list_head unsent_req_list; struct p9_req_t *req; char tmp_buf[7]; - int rsize; - int rpos; - char *rbuf; + struct p9_fcall rc; int wpos; int wsize; char *wbuf; @@ -305,49 +301,56 @@ static void p9_read_work(struct work_struct *work) if (m->err < 0) return; - p9_debug(P9_DEBUG_TRANS, "start mux %p pos %d\n", m, m->rpos); + p9_debug(P9_DEBUG_TRANS, "start mux %p pos %zd\n", m, m->rc.offset); - if (!m->rbuf) { - m->rbuf = m->tmp_buf; - m->rpos = 0; - m->rsize = 7; /* start by reading header */ + if (!m->rc.sdata) { + m->rc.sdata = m->tmp_buf; + m->rc.offset = 0; + m->rc.capacity = 7; /* start by reading header */ } clear_bit(Rpending, &m->wsched); - p9_debug(P9_DEBUG_TRANS, "read mux %p pos %d size: %d = %d\n", -m, m->rpos, m->rsize, m->rsize-m->rpos); - err = p9_fd_read(m->client, m->rbuf + m->rpos, - m->rsize - m->rpos); + p9_debug(P9_DEBUG_TRANS, "read mux %p pos %zd size: %zd = %zd\n", +m, m->rc.offset, m->rc.capacity, +m->rc.capacity - m->rc.offset); + err = p9_fd_read(m->client, m->rc.sdata + m->rc.offset, +m->rc.capacity - m->rc.offset); p9_debug(P9_DEBUG_TRANS, "mux %p got %d bytes\n", m, err); - if (err == -EAGAIN) { + if (err == -EAGAIN) goto end_clear; - } if (err <= 0) goto error; - m->rpos += err; + m->rc.offset += err; - if ((!m->req) && (m->rpos == m->rsize)) { /* header read in */ - u16 tag; + /* header read in */ + if ((!m->req) && (m->rc.offset == m->rc.capacity)) { p9_debug(P9_DEBUG_TRANS, "got new header\n"); - n = le32_to_cpu(*(__le32 *) m->rbuf); /* read packet size */ - if (n >= m->client->msize) { + err = p9_parse_header(&m->rc, NULL, NULL, NULL, 0); + if (err) { p9_debug(P9_DEBUG_ERROR, -"requested packet size too big: %d\n", n); +"error parsing header: %d\n", err); + goto error; + } + + if (m->rc.size >= m->client->msize) { + p9_debug(P9_DEBUG_ERROR, +"requested packet size too big: %d\n", +m->rc.size); err = -EIO; goto error; } - tag = le16_to_cpu(*(__le16 *) (m->rbuf+5)); /* read tag */ p9_debug(P9_DEBUG_TRANS, -"mux %p pkt: size: %d bytes tag: %d\n", m, n, tag); +"mux %p pkt: size: %d bytes tag: %d\n", +m, m->rc.size, m->rc.tag); - m->req = p9_tag_lookup(m->client, tag); + m->req = p9_tag_lookup(m->client, m->rc.tag); if (!m->req || (m->req->status != REQ_STATUS_SENT)) { p9_debug(P9_DEBUG_ERROR, "Unexpected packet tag %d\n", -tag); +