Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match

2021-04-19 Thread Dominique MARTINET
Arnd Bergmann wrote on Mon, Apr 19, 2021 at 02:16:36PM +0200:
> In some cases, you can use the device_link infrastructure to deal
> with dependencies between devices. Not sure if this would help
> in your case, but have a look at device_link_add() etc in drivers/base/core.c

I'll need to actually try to convince myself but if creating the link
forces driver registration then it should be workable.

> > In this particular case the problem is that since 7d981405d0fd ("soc:
> > imx8m: change to use platform driver") the soc probe tries to use the
> > nvmem driver for ocotp fuses for imx8m devices, which isn't ready yet.
> > So soc loading gets pushed back to the end of the list because it gets
> > defered and other drivers relying on soc_device_match get confused
> > because they wrongly think a device doesn't match a quirk when it
> > actually does.
> >
> > If there is a way to ensure the nvmem driver gets loaded before the soc,
> > that would also solve the problem nicely, and avoid the need to mess
> > with all the ~50 drivers which use it.
> >
> > Is there a way to control in what order drivers get loaded? Something in
> > the dtb perhaps?
> 
> For built-in drivers, load order depends on the initcall level and
> link order (how things are lined listed in the Makefile hierarchy).
> 
> For loadable modules, this is up to user space in the end.
> 
> Which of the drivers in this scenario are loadable modules?

All the drivers involved in my case are built-in (nvmem, soc and final
soc_device_match consumer e.g. caam_jr that crashes the kernel if soc is
not identified properly).

I frankly don't like the idea of moving nvmem/ above soc/ in
drivers/Makefile as a "solution" to this (especially as there is one
that seems to care about what soc they run on...), so I'll have a look
at links first, hopefully that will work out.


Thanks,
-- 
Dominique


Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match

2021-04-19 Thread Dominique MARTINET


Geert Uytterhoeven wrote on Mon, Apr 19, 2021 at 11:03:24AM +0200:
> > This is going to need quite some more work to be acceptable, in my
> > opinion, but I think it should be possible.
> 
> In general, this is very hard to do, IMHO. Some drivers may be used on
> multiple platforms, some of them registering an SoC device, some of
> them not registering an SoC device.  So there is no way to know the
> difference between "SoC device not registered, intentionally", and
> "SoC device not yet registered".

Hm, good point, I was probably a bit too optimistic if there are devices
which don't register any soc yet have drivers which want one; I don't
see how to make the difference indeed... And that does mean we can't
just defer all the time.

> soc_device_match() should only be used as a last resort, to identify
> systems that cannot be identified otherwise.  Typically this is used for
> quirks, which should only be enabled on a very specific subset of
> systems.  IMHO such systems should make sure soc_device_match()
> is available early, by registering their SoC device early.

I definitely agree there, my suggestion to defer was only because I know
of no other way to influence the ordering of drivers loading reliably
and gave up on soc being init'd early.

In this particular case the problem is that since 7d981405d0fd ("soc:
imx8m: change to use platform driver") the soc probe tries to use the
nvmem driver for ocotp fuses for imx8m devices, which isn't ready yet.
So soc loading gets pushed back to the end of the list because it gets
defered and other drivers relying on soc_device_match get confused
because they wrongly think a device doesn't match a quirk when it
actually does.

If there is a way to ensure the nvmem driver gets loaded before the soc,
that would also solve the problem nicely, and avoid the need to mess
with all the ~50 drivers which use it.


Is there a way to control in what order drivers get loaded? Something in
the dtb perhaps?


Thanks,
-- 
Dominique


Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match

2021-04-18 Thread Dominique MARTINET
Alice Guo (OSS) wrote on Mon, Apr 19, 2021 at 12:27:22PM +0800:
> From: Alice Guo 
> 
> Update all the code that use soc_device_match

A single patch might be difficult to accept for all components, a each
maintainer will probably want to have a say on their subsystem?

I would suggest to split these for a non-RFC version; a this will really
need to be case-by-case handling.

> because add support for soc_device_match returning -EPROBE_DEFER.

(English does not parse here for me)

I've only commented a couple of places in the code itself, but this
doesn't seem to add much support for errors, just sweep the problem
under the rug.

> Signed-off-by: Alice Guo 
> ---
> 
> diff --git a/drivers/bus/ti-sysc.c b/drivers/bus/ti-sysc.c
> index 5fae60f8c135..00c59aa217c1 100644
> --- a/drivers/bus/ti-sysc.c
> +++ b/drivers/bus/ti-sysc.c
> @@ -2909,7 +2909,7 @@ static int sysc_init_soc(struct sysc *ddata)
>   }
>  
>   match = soc_device_match(sysc_soc_feat_match);
> - if (!match)
> + if (!match || IS_ERR(match))
>   return 0;

This function handles errors, I would recommend returning the error as
is if soc_device_match returned one so the probe can be retried later.

>  
>   if (match->data)
> diff --git a/drivers/clk/renesas/r8a7795-cpg-mssr.c 
> b/drivers/clk/renesas/r8a7795-cpg-mssr.c
> index c32d2c678046..90a18336a4c3 100644
> --- a/drivers/clk/renesas/r8a7795-cpg-mssr.c
> +++ b/drivers/clk/renesas/r8a7795-cpg-mssr.c
> @@ -439,6 +439,7 @@ static const unsigned int r8a7795es2_mod_nullify[] 
> __initconst = {
>  
>  static int __init r8a7795_cpg_mssr_init(struct device *dev)
>  {
> + const struct soc_device_attribute *match;
>   const struct rcar_gen3_cpg_pll_config *cpg_pll_config;
>   u32 cpg_mode;
>   int error;
> @@ -453,7 +454,8 @@ static int __init r8a7795_cpg_mssr_init(struct device 
> *dev)
>   return -EINVAL;
>   }
>  
> - if (soc_device_match(r8a7795es1)) {
> + match = soc_device_match(r8a7795es1);
> + if (!IS_ERR(match) && match) {

Same, return the error.
Assuming an error means no match will just lead to hard to debug
problems because the driver potentially assumed the wrong device when
it's just not ready yet.

>   cpg_core_nullify_range(r8a7795_core_clks,
>  ARRAY_SIZE(r8a7795_core_clks),
>  R8A7795_CLK_S0D2, R8A7795_CLK_S0D12);
> [...]
> diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
> index eaaec0a55cc6..13a06b613379 100644
> --- a/drivers/iommu/ipmmu-vmsa.c
> +++ b/drivers/iommu/ipmmu-vmsa.c
> @@ -757,17 +757,20 @@ static const char * const devices_allowlist[] = {
>  
>  static bool ipmmu_device_is_allowed(struct device *dev)
>  {
> + const struct soc_device_attribute *match1, *match2;
>   unsigned int i;
>  
>   /*
>* R-Car Gen3 and RZ/G2 use the allow list to opt-in devices.
>* For Other SoCs, this returns true anyway.
>*/
> - if (!soc_device_match(soc_needs_opt_in))
> + match1 = soc_device_match(soc_needs_opt_in);
> + if (!IS_ERR(match1) && !match1)

I'm not sure what you intended to do, but !match1 already means there is
no error so the original code is identical.

In this case ipmmu_device_is_allowed does not allow errors so this is
one of the "difficult" drivers that require slightly more thinking.
It is only called in ipmmu_of_xlate which does return errors properly,
so in this case the most straightforward approach would be to make
ipmmu_device_is_allowed return an int and forward errors as well.



...
This is going to need quite some more work to be acceptable, in my
opinion, but I think it should be possible.

Thanks,
-- 
Dominique


Re: [RFC v1 PATCH 1/3] drivers: soc: add support for soc_device_match returning -EPROBE_DEFER

2021-04-18 Thread Dominique MARTINET
First comment overall for the whole serie:
Since it is the solution I had suggested when I reported the problem[1]
I have no qualm on the approach, comments for individual patches
follow.

[1] http://lore.kernel.org/r/YGGZJjAxA1IO+/v...@atmark-techno.com


Alice Guo (OSS) wrote on Mon, Apr 19, 2021 at 12:27:20PM +0800:
> From: Alice Guo 
> 
> In i.MX8M boards, the registration of SoC device is later than caam
> driver which needs it. Caam driver needs soc_device_match to provide
> -EPROBE_DEFER when no SoC device is registered and no
> early_soc_dev_attr.

This patch should be last in the set: you can't have soc_device_match
return an error before its callers handle it.

> Signed-off-by: Alice Guo 

As the one who reported the problem I would have been appreciated being
at least added to Ccs... I only happened to notice you posted this by
chance.

There is also not a single Fixes tag -- I believe this commit should
have Fixes: 7d981405d0fd ("soc: imx8m: change to use platform driver")
but I'm not sure how such tags should be handled in case of multiple
patches fixing something.

-- 
Dominique


Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-03-02 Thread Dominique Martinet
Jisheng Zhang wrote on Tue, Mar 02, 2021 at 03:39:40PM +0800:
> > Rather than make an exception for 0, how about just removing the if as
> > follow ?
> 
> IMHO, we may need to keep the "if" in current logic. When count
> reaches zero, we need to break the "while(iov_iter_count(to))" loop, so 
> removing
> the "if" modifying the logic.

We're not looking at the same loop, the break will happen properly
without the if because it's the return value of p9_client_read_once()
now.

In the old code I remember what you're saying and it makes sense, I
guess that was the reason for the special case.
It's not longer required, let's remove it.

-- 
Dominique


Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-03-02 Thread Dominique Martinet
Jisheng Zhang wrote on Mon, Mar 01, 2021 at 11:01:57AM +0800:
> Per my understanding of iov_iter, we need to call iov_iter_advance()
> even when the read out count is 0. I believe we can see this common style
> in other fs.

I'm not sure where you see this style, but I don't see exceptions for
0-sized read not advancing the iov in general, and I guess this makes
sense.


Rather than make an exception for 0, how about just removing the if as
follow ?

I've checked that the non_zc case (copy_to_iter with 0 size) also works
to the same effect, so I'm not sure why the check got added in the
first place... But then again this is old code so maybe the semantics
changed since 2015.



diff --git a/net/9p/client.c b/net/9p/client.c
index 4f62f299da0c..0a0039255c5b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1623,11 +1623,6 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, 
struct iov_iter *to,
}

p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
-   if (!count) {
-   p9_tag_remove(clnt, req);
-   return 0;
-   }
-
if (non_zc) {
int n = copy_to_iter(dataptr, count, to);




If you're ok with that, would you mind resending that way?

I'd also want the commit message to be reworded a bit, at least the
first line (summary) doesn't make sense right now: I have no idea
what you mean by "free what was emitted".
Just "9p: advance iov on empty read" or something similar would do.


> > cat version? coreutils' doesn't seem to do that on their git)
> 
> busybox cat

Ok, could reproduce with busybox cat, thanks.
As expected I can't reproduce with older kernels so will run a bisect
for the sake of it as time allows

-- 
Dominique


Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-02-28 Thread Dominique Martinet
Jisheng Zhang wrote on Mon, Mar 01, 2021 at 10:33:36AM +0800:
> I met below warning when cating a small size(about 80bytes) txt file
> on 9pfs(msize=2097152 is passed to 9p mount option), the reason is we
> miss iov_iter_advance() if the read count is 0, so we didn't truncate
> the pipe, then iov_iter_pipe() thinks the pipe is full. Fix it by
> calling iov_iter_advance() on the iov_iter "to" even if read count is 0

Hm, there are plenty of other error cases that don't call
iov_iter_advance() and shouldn't trigger this warning ; I'm not sure
just adding one particular call to this is a good solution.


How reproducible is this? From the description it should happen
everytime you cat a small file? (I'm surprised cat uses sendfile, what
cat version? coreutils' doesn't seem to do that on their git)

What kernel version do you get this on? Bonus points if you can confirm
this didn't use to happen, and full points for a bisect.


(cat on a small file is something I do all the time in my tests, I'd
like to be able to reproduce to understand the issue better as I'm not
familiar with that part of the code)

Thanks,
-- 
Dominique


Re: [PATCH] 9p: fix: Uninitialized variable p.

2020-12-29 Thread Dominique Martinet
YANG LI wrote on Wed, Dec 30, 2020:
> The pointer p is being used but it isn't being initialized,
> need to assign a NULL to it.

My understanding is p has to be initialized: the only way out of the
while(1) loop below sets it.


I don't mind fixing false positive warnings as it makes maintenance
easier for everyone, but:

 1/ the commit message needs to reflect that and at least name which
tool had a problem with it. I'm tempted to think this case is common
enough that the tool ought to be fixed instead...

 2/ the code needs to work in the p=NULL case if you set it that way
(right now, it doesn't, so in the event the code changes in the future
and there really comes a way to skip initialization this change would
actually hinder bug detection -- it'd need to add a p == NULL check
below, which is going to be useless code, but hopefully compilers will
optimize it away)


Thanks,
-- 
Dominique


[GIT PULL] 9p update for 5.11-rc1

2020-12-21 Thread Dominique Martinet


Hi Linus,

here's this cycle's update, finally finished on some very old patches
(originally april 2015!) to allow fixing open-unlink-fgetattr pattern.

Thanks to Eric, Greg and Jianyong for the bulk of the work, and Dan for
static analysis fixes on -next.


The following changes since commit 3cea11cd5e3b00d91caf0b4730194039b45c5891:

  Linux 5.10-rc2 (2020-11-01 14:43:51 -0800)

are available in the Git repository at:

  https://github.com/martinetd/linux tags/9p-for-5.11-rc1

for you to fetch changes up to cfd1d0f524a87b7d6d14b41a14fa4cbe522cf8cc:

  9p: Remove unnecessary IS_ERR() check (2020-12-01 08:19:02 +0100)


9p for 5.11-rc1

- fix long-standing limitation on open-unlink-fop pattern
- add refcount to p9_fid (fixes the above and will allow for more
cleanups and simplifications in the future)


Dan Carpenter (2):
  9p: Uninitialized variable in v9fs_writeback_fid()
  9p: Remove unnecessary IS_ERR() check

Dominique Martinet (2):
  9p: apply review requests for fid refcounting
  9p: Fix writeback fid incorrectly being attached to dentry

Eric Van Hensbergen (1):
  fs/9p: fix create-unlink-getattr idiom

Greg Kurz (2):
  fs/9p: track open fids
  fs/9p: search open fids first

Jianyong Wu (1):
  9p: add refcount to p9_fid struct

 fs/9p/fid.c | 65  
-
 fs/9p/fid.h | 11 ++-
 fs/9p/vfs_dentry.c  |  2 ++
 fs/9p/vfs_dir.c |  6 +-
 fs/9p/vfs_file.c|  7 ---
 fs/9p/vfs_inode.c   | 47  ++-
 fs/9p/vfs_inode_dotl.c  | 35 +--
 fs/9p/vfs_super.c   |  1 +
 fs/9p/xattr.c   | 16 +---
 include/net/9p/client.h |  7 +++
 net/9p/client.c | 14 +-
 11 files changed, 178 insertions(+), 33 deletions(-)

-- 
Dominique


Re: [PATCH net-next] net: 9p: Fix kerneldoc warnings of missing parameters etc

2020-11-01 Thread Dominique Martinet
Andrew Lunn wrote on Sun, Nov 01, 2020:
> > > Signed-off-by: Andrew Lunn 

Acked-by: Dominique Martinet 

> > 
> > Thanks, LGTM I'll take this for next cycle unless someone is grabbing
> > these
> 
> I hope to turn on W=1 by default soon in most of /net. That patch is
> likely to go to net-next.

That would be nice!

> What route do your patches normally take to Linus? Do you send a pull
> request to net-next? Or straight to Linus?

I normally send pull requests straight to Linus (because I also have
fs/9p which isn't part of net/) ; but since it's really low volume I
don't like bugging him everytime for such churn and am not really sure
what to do -- that's why I asked :)

> If this patch is not in net-next, i cannot enable it for 9p. So
> either:
> [...]
> 4) Jakub takes this patch into net-next, and i can then enable W=1 in
>9p along with all the other sub-directories. We will get to know
>about new warnings in net-next, and next, but not in your tree.

Developers should use next for development anyway; I think that's the
easiest way forward if you want to enable W=1 ASAP.

I mean, if I take the patch the fixes will get in next in the next few
days sure but it'll make enabling W=1 difficult for the net-next tree
without it.
I've added Jakub to direct recipients, could you take this one?


Thanks,
-- 
Dominique


Re: [PATCH net-next] net: 9p: Fix kerneldoc warnings of missing parameters etc

2020-10-31 Thread Dominique Martinet
Andrew Lunn wrote on Sat, Oct 31, 2020:
> net/9p/client.c:420: warning: Function parameter or member 'c' not described 
> in 'p9_client_cb'
> net/9p/client.c:420: warning: Function parameter or member 'req' not 
> described in 'p9_client_cb'
> net/9p/client.c:420: warning: Function parameter or member 'status' not 
> described in 'p9_client_cb'
> net/9p/client.c:568: warning: Function parameter or member 'uidata' not 
> described in 'p9_check_zc_errors'
> net/9p/trans_common.c:23: warning: Function parameter or member 'nr_pages' 
> not described in 'p9_release_pages'
> net/9p/trans_common.c:23: warning: Function parameter or member 'pages' not 
> described in 'p9_release_pages'
> net/9p/trans_fd.c:132: warning: Function parameter or member 'rreq' not 
> described in 'p9_conn'
> net/9p/trans_fd.c:132: warning: Function parameter or member 'wreq' not 
> described in 'p9_conn'
> net/9p/trans_fd.c:56: warning: Function parameter or member 'privport' not 
> described in 'p9_fd_opts'
> net/9p/trans_rdma.c:113: warning: Function parameter or member 'cqe' not 
> described in 'p9_rdma_context'
> net/9p/trans_rdma.c:129: warning: Function parameter or member 'privport' not 
> described in 'p9_rdma_opts'
> net/9p/trans_virtio.c:215: warning: Function parameter or member 'limit' not 
> described in 'pack_sg_list_p'
> net/9p/trans_virtio.c:83: warning: Function parameter or member 'chan_list' 
> not described in 'virtio_chan'
> net/9p/trans_virtio.c:83: warning: Function parameter or member 
> 'p9_max_pages' not described in 'virtio_chan'
> net/9p/trans_virtio.c:83: warning: Function parameter or member 
> 'ring_bufs_avail' not described in 'virtio_chan'
> net/9p/trans_virtio.c:83: warning: Function parameter or member 'tag' not 
> described in 'virtio_chan'
> net/9p/trans_virtio.c:83: warning: Function parameter or member 'vc_wq' not 
> described in 'virtio_chan'
> 
> Signed-off-by: Andrew Lunn 

Thanks, LGTM I'll take this for next cycle unless someone is grabbing
these
-- 
Dominique


[GIT PULL] 9p update for 5.10-rc1

2020-10-22 Thread Dominique Martinet


Hi Linus,

another harmless cycle.
(sorry latest commit's message isn't great, I was half expecting a v2
but it didn't come and I remembered too late/didn't want to reword it
myself; and it's still worth taking as is)

Thanks,


The following changes since commit 549738f15da0e5a00275977623be199fbbf7df50:

  Linux 5.9-rc8 (2020-10-04 16:04:34 -0700)

are available in the Git repository at:

  https://github.com/martinetd/linux tags/9p-for-5.10-rc1

for you to fetch changes up to 7ca1db21ef8e0e6725b4d25deed1ca196f7efb28:

  net: 9p: initialize sun_server.sun_path to have addr's value only when addr 
is valid (2020-10-12 10:05:47 +0200)


9p pull request for inclusion in 5.10

A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized
variable warning) and code cleanup (xen)


Anant Thazhemadam (1):
  net: 9p: initialize sun_server.sun_path to have addr's value only when 
addr is valid

Matthew Wilcox (Oracle) (1):
  9P: Cast to loff_t before multiplying

Ye Bin (1):
  9p/xen: Fix format argument warning

 fs/9p/vfs_file.c   | 4 ++--
 net/9p/trans_fd.c  | 2 +-
 net/9p/trans_xen.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)
-- 
Dominique


Re: [PATCH net] net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid

2020-10-12 Thread Dominique Martinet
Anant Thazhemadam wrote on Mon, Oct 12, 2020:
> You mentioned how a fully zeroed address isn't exactly faulty. By extension, 
> wouldn't that
> mean that an address that simply begins with a 0 isn't faulty as well?

That is correct.
If you have a look at the unix(7) man page that describes AF_UNIX, it
describes what 'abstract' addresses are and unix_mkname() in linux's
net/unix/af_unix.c shows how it's handled.

> This is an interesting point, because if the condition is modified to 
> checking for addr[0] directly,
> addresses that simply begin with 0 (but have more non-zero content following) 
> wouldn't be
> copied over either, right?

Yes, we would reject any address that starts with a nul byte -- but that
is already exactly what your patch does with strlen() already: a '\0' at
the start of the string is equivalent to strlen(addr) == 0.
The only difference is that checking for addr[0] won't run through all
the string if it doesn't start with a nul byte; but this is a one-time
thing at mount so it really doesn't matter.

> In the end, it comes down to what you define as a "valid" value that sun_path 
> can have.
> We've already agreed that a fully zeroed address wouldn't qualify as a valid 
> value for sun_path.
> Are addresses that aren't fully zeroed, but only begin with a 0 also to be 
> considered as an
> unacceptable value for sun_path?

Yes, because the strcpy() a few lines below would copy nothing, leaving
sun_server.sun_path uninitialized like your example.

At that point you could ask why not "fix" that strcpy to properly copy
the address passed instead but that doesn't really make sense given
where 'addr' comes from: it's passed from userspace as a nul-terminated
string, so nothing after the first '\0' is valid.

There probably are ways to work around that (e.g. iproute's ss will
display abstract addresses with a leading '@' instead) but given nobody
ever seemed to care I think it's safe to just return EINVAL there like
you did ; there's nothing wrong with your patch as far as I'm concerned.

-- 
Dominique


Re: [PATCH net] net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid

2020-10-12 Thread Dominique Martinet
Anant Thazhemadam wrote on Mon, Oct 12, 2020:
> In p9_fd_create_unix, checking is performed to see if the addr (passed
> as an argument) is NULL or not.
> However, no check is performed to see if addr is a valid address, i.e.,
> it doesn't entirely consist of only 0's.
> The initialization of sun_server.sun_path to be equal to this faulty
> addr value leads to an uninitialized variable, as detected by KMSAN.
> Checking for this (faulty addr) and returning a negative error number
> appropriately, resolves this issue.

I'm not sure I agree a fully zeroed address is faulty but I agree we can
probably refuse it given userspace can't pass useful abstract addresses
here.

Just one nitpick but this is otherwise fine - good catch!

> Reported-by: syzbot+75d51fe5bf4ebe988...@syzkaller.appspotmail.com
> Tested-by: syzbot+75d51fe5bf4ebe988...@syzkaller.appspotmail.com
> Signed-off-by: Anant Thazhemadam 
> ---
>  net/9p/trans_fd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> index c0762a302162..8f528e783a6c 100644
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -1023,7 +1023,7 @@ p9_fd_create_unix(struct p9_client *client, const char 
> *addr, char *args)
>  
>   csocket = NULL;
>  
> - if (addr == NULL)
> + if (!addr || !strlen(addr))

Since we don't care about the actual length here, how about checking for
addr[0] directly?
That'll spare a strlen() call in the valid case.

Well, I guess it doesn't really matter -- I'll queue this up anyway and
update if you resend.


Thanks,
-- 
Dominique


Re: INFO: task can't die in p9_fd_close

2020-08-27 Thread Dominique Martinet
Mark Brown wrote on Wed, Aug 26, 2020:
> On Wed, Aug 26, 2020 at 03:38:15AM -0700, syzbot wrote:
> 
> > console output: https://syzkaller.appspot.com/x/log.txt?x=10615b3690
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=a61d44f28687f508
> > dashboard link: https://syzkaller.appspot.com/bug?extid=fbe34b643e462f65e542
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15920a0590
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13a7853990
> > 
> > The issue was bisected to:
> > 
> > commit af3acca3e35c01920fe476f730dca7345d0a48df
> > Author: Daniel Baluta 
> > Date:   Tue Feb 20 12:53:10 2018 +
> > 
> > ASoC: ak5558: Fix style for SPDX identifier
> 
> This bisection is clearly not accurate, I'm guessing the bug is
> intermittent and it was just luck that landed it on this commit.

It's a bug that's been present since day 1 pretty much.

I have a fix that had been overcooking for a while which I had planned
to take in this cycle -- I'll submit to -next during next week, so
hopefully syzbot will be able to spend its time more usefully after
that.

-- 
Dominique


[GIT PULL] 9p update for 5.9-rc1

2020-08-15 Thread Dominique Martinet


Hi Linus,

the usual small stuff.

Thanks!


The following changes since commit 74d6a5d5662975aed7f25952f62efbb6f6dadd29:

  9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work 
(2020-07-19 14:58:47 +0200)

are available in the Git repository at:

  https://github.com/martinetd/linux tags/9p-for-5.9-rc1

for you to fetch changes up to 2ed0b7578170c3bab10cde09d4440897b305e40c:

  9p: Remove unneeded cast from memory allocation (2020-07-31 07:28:25 +0200)


9p pull request for inclusion in 5.9

- some code cleanup
- a couple of static analysis fixes
- setattr: try to pick a fid associated with the file rather than the
dentry, which might sometimes matter


Alexander Kapshuk (1):
  net/9p: Fix sparse endian warning in trans_fd.c

Jianyong Wu (2):
  9p: retrieve fid from file when file instance exist.
  9p: remove unused code in 9p

Li Heng (1):
  9p: Remove unneeded cast from memory allocation

Zheng Bin (1):
  9p: Fix memory leak in v9fs_mount

 fs/9p/v9fs.c   |  5 ++---
 fs/9p/vfs_inode.c  | 65 
-
 fs/9p/vfs_inode_dotl.c |  9 +++--
 net/9p/trans_fd.c  |  2 +-
 4 files changed, 18 insertions(+), 63 deletions(-)


Re: 9p/trans_fd lockup

2020-08-06 Thread Dominique Martinet
Alexey Kardashevskiy wrote on Thu, Aug 06, 2020:
> I am seeing another bug in 9p under syzkaller, the reprocase is:
> 
> r0 = open$dir(&(0x7f40)='./file0\x00', 0x88142, 0x182)
> 
> r1 = openat$null(0xff9c, &(0x7f000640)='/dev/null\x00',
> 0x0, 0x0)
> mount$9p_fd(0x0, &(0x7f00)='./file0\x00',
> &(0x7fc0)='9p\x00', 0x0, &(0x7f000100)={'trans=fd,',
> {'rfdno', 0x3d, r1}, 0x2$, {'wfdno', 0x3d, r0}})
> 
> 
> 
> The default behaviour of syzkaller is to call syscalls concurrently (I
> think), at least it forks by default and executes the same sequence in
> both threads.
> 
> In this example both threads makes it to:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/9p/client.c?h=v5.8#n757
> 
> and sit there with the only difference which is thread#1 goes via
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/9p/client.c?h=v5.8#n767
> 
> I am pretty sure things should not have gone that far but I cannot
> clearly see what needs fixing. Ideas? Thanks,

Unkillable threads there happen with the current p9_client_rpc when
there is no real server (or server bug etc); the code is really stupid.

Basically what happens is that when you send a first signal (^C or
whatever), the function catches the signal, sends a flush, and
indefinitely waits for the flush to come back.
If you send another signal there no more flush comes but it goes back to
waiting -- it's using wait_event_killable but it's a lie it's not really
killable it just loops on that wait until the flush finally comes, which
will never come in your case.
(the rpc that came by the way is probably version or whatever is first
done on mount)


Dmitry reported that to me ages ago and I have a fix which is just to
stop waiting for the flush -- just make it asynchronous, send and
forget. That removes the whole signal handling logic and it won't hang
there anymore.

I sent the patches to the list last year, but didn't get much feedback
and didn't have time to run all the tests I wanted to run on it.


I have some free time at the end of the month so I was planning to
finish it for 5.10 (e.g. won't send it for 5.9 but once 5.9 initial
merge window passed leave it in -next for a couple of months and push it
for 5.10), so your timing is pretty good :)
An extra pair of eyes would be more than appreciated.

You can find the original mails there:
https://lore.kernel.org/lkml/1544532108-21689-3-git-send-email-asmad...@codewreck.org/

They're also in my 9p-test branch on git://github.com/martinetd/linux


Cheers & thanks for the attention,
-- 
Dominique


[GIT PULL] 9p update for 5.8

2020-07-28 Thread Dominique Martinet
Sorry for the late request, I took some time to fix my test setup and to
be convinced these two patches are worth sending now and not wait until
the next merge window.

(the "weird" -2 at the end of the tag is because I had already used
9p-for-5.8 for the original -rc1 pull)


The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:

  Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)

are available in the Git repository at:

  https://github.com/martinetd/linux tags/9p-for-5.8-2

for you to fetch changes up to 74d6a5d5662975aed7f25952f62efbb6f6dadd29:

  9p/trans_fd: Fix concurrency del of req_list in
  p9_fd_cancelled/p9_read_work (2020-07-19 14:58:47 +0200)


A couple of syzcaller fixes for 5.8

the first one in particular has been quite noisy ("broke" in -rc5)
so this would be worth landing even this late even if users likely
won't see a difference


Christoph Hellwig (1):
  net/9p: validate fds in p9_fd_open

Wang Hai (1):
  9p/trans_fd: Fix concurrency del of req_list in 
p9_fd_cancelled/p9_read_work

 net/9p/trans_fd.c | 39 ++-
 1 file changed, 30 insertions(+), 9 deletions(-)
-- 
Dominique


Re: [V9fs-developer] [PATCH kernel] 9p/trans_fd: Check file mode at opening

2020-07-28 Thread Dominique Martinet
Greg Kurz wrote on Tue, Jul 28, 2020:
> > The "fd" transport layer uses 2 file descriptors passed externally
> > and calls kernel_write()/kernel_read() on these. If files were opened
> > without FMODE_WRITE/FMODE_READ, WARN_ON_ONCE() will fire.

There already is a fix in linux-next as a39c46067c84 ("net/9p: validate
fds in p9_fd_open")

> > This adds file mode checking in p9_fd_open; this returns -EBADF to
> > preserve the original behavior.
> 
> So this would cause open() to fail with EBADF, which might look a bit
> weird to userspace since it didn't pass an fd... Is this to have a
> different error than -EIO that is returned when either rfd or wfd
> doesn't point to an open file descriptor ? If yes, why do we care ?

FWIW the solution taken just returns EIO as it would if an invalid fd
was given, but since it did pass an fd EBADF actually makes sense to me?

However to the second question I'm not sure I care :)

> > Found by syzkaller.

I'm starting to understand where David comment came from the other day,
I guess it's still time to change my mind and submit to linus now I've
had time to test it...

-- 
Dominique


Re: [PATCH] net/9p: validate fds in p9_fd_open

2020-07-16 Thread Dominique Martinet
David Miller wrote on Wed, Jul 15, 2020:
> From: Dominique Martinet 
> Date: Wed, 15 Jul 2020 15:47:56 +0200
> > It's honestly just a warn on something that would fail anyway so I'd
> > rather let it live in -next first, I don't get why syzbot is so verbose
> > about this - it sent a mail when it found a c repro and one more once it
> > bisected the commit yesterday but it should not be sending more?
> 
> I honestly find it hard to understand the resistence to fixing the
> warning in mainline.
> 
> I merge such fixes aggressively.

Well, if it was something a user could ever see with normal (even
exotic) 9p workloads, sure; I would want that in mainline asap and do
what's required in a hurry.

But this warning only happens when passing fd that are invalid, so the
mount would fail with EIO anyway, and it's not a dos either -- I don't
see the harm really.
Someone who'd get errors anyway will just get slightly more verbose
errors (and for people like me with kernel.panic_on_warn set it'll even
crash their machines sure), and "normal" users won't ever see it -- I
see no reason to rush this.


It's not about the "extra work" of sending things to linus in a single
patch PR (it's honestly a wonder 9p gets maintainers at all, the volume
of patches doesn't really mandate it), but I need to fix tests first
anyway as said previously.
I've spent a couple of hours on it yesterday, and should be able to get
things running again soonish -- meanwhile I'm not comfortable sending
any patch anywhere anyway.

Yes given the patch content it's probably fine but syzbot doesn't test
that a 9p mount with a fd argument works, just that there's no warning /
crash, so for all I know we could just be returning -EIO early and
calling it a fix.
I don't see any reason this would fail, but the point of tests is to...
test things work the things we think they do?



Anyways, if you care about this feel free to take the patch and send it
along with your process earlier. I'm just stubborn in not wanting to
send things I could test untested and it came at a bad time / don't
think this is critical enough to manually test. Then again I probably
just spent more time arguing about it than it would have taken to
test...
(if you do please fix the goto as pointed out in a review)

Thanks,
-- 
Dominique


Re: [PATCH] net/9p: validate fds in p9_fd_open

2020-07-15 Thread Dominique Martinet
Christoph Hellwig wrote on Wed, Jul 15, 2020:
> FYI, this is now generating daily syzbot reports, so I'd love to see
> the fix going into Linus' tree ASAP..

Yes, I'm getting some syzbot warnings as well now.

I had however only planned to get this in linux-next, since that is what
the syzbot mails were complaining about, but I see this got into -rc5...


It's honestly just a warn on something that would fail anyway so I'd
rather let it live in -next first, I don't get why syzbot is so verbose
about this - it sent a mail when it found a c repro and one more once it
bisected the commit yesterday but it should not be sending more?

(likewise it should pick up the fix tag even if it only gets in -next,
or would it keep being noisy unless this gets merged to mainline?)


FWIW this is along with the 5 other patches I have queued for 5.9
waiting for my tests as my infra is still down, I've stopped trying to
make promises, but I could push at least just this one to -next if that
really helps.
Sorry for that, things should be smoother once I've taken the time to
put things back in place.
-- 
Dominique


Re: [PATCH net-next 01/20] net: 9p: kerneldoc fixes

2020-07-13 Thread Dominique Martinet
Andrew Lunn wrote on Mon, Jul 13, 2020:
> Simple fixes which require no deep knowledge of the code.
> 
> Cc: Eric Van Hensbergen 
> Cc: Latchesar Ionkov 
> Cc: Dominique Martinet 
Ack.

> Signed-off-by: Andrew Lunn 

Can't hurt & looks correct; thanks.

David, assuming you'll take the whole set so ignoring this patch; please
let me know if I should pick it up.
-- 
Dominique


Re: [PATCH] net/9p: validate fds in p9_fd_open

2020-07-11 Thread Dominique Martinet
Doug Nazar wrote on Fri, Jul 10, 2020:
> On 2020-07-10 04:57, Christoph Hellwig wrote:
> 
> >diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> >index 13cd683a658ab6..1cd8ea0e493617 100644
> >--- a/net/9p/trans_fd.c
> >+++ b/net/9p/trans_fd.c
> >@@ -803,20 +803,28 @@ static int p9_fd_open(struct p9_client *client, int 
> >rfd, int wfd)
> > return -ENOMEM;
> > ts->rd = fget(rfd);
> >+if (!ts->rd)
> >+goto out_free_ts;
> >+if (!(ts->rd->f_mode & FMODE_READ))
> >+goto out_put_wr;
> 
>   goto out_put_rd;
> 
> unless I'm mistaken.

Good catch, I've amended the commit so feel free to skip resending
unless want to change something
https://github.com/martinetd/linux/commit/28e987a0dc66744fb119e18150188fd8e3debd40

-- 
Dominique


Re: [PATCH] net/9p: validate fds in p9_fd_open

2020-07-10 Thread Dominique Martinet
Christoph Hellwig wrote on Fri, Jul 10, 2020:
> p9_fd_open just fgets file descriptors passed in from userspace, but
> doesn't verify that they are valid for read or writing.  This gets
> cought down in the VFS when actually attemping a read or write, but a
> new warning added in linux-next upsets syzcaller.
> 
> Fix this by just verifying the fds early on.
> 
> Reported-by: syzbot+e6f77e16ff68b2434...@syzkaller.appspotmail.com
> Signed-off-by: Christoph Hellwig 

Looks good to me, I'll pick it up shortly.

-- 
Dominique


Re: [PATCH] net/9p: Validate current->sighand in client.c

2020-06-21 Thread Dominique Martinet
Alexander Kapshuk wrote on Sun, Jun 21, 2020:
> Fix rcu not being dereferenced cleanly by using the task
> helpers (un)lock_task_sighand instead of spin_lock_irqsave and
> spin_unlock_irqrestore to ensure current->sighand is a valid pointer as
> suggested in the email referenced below.

Ack.
I'll take this once symbol issue is resolved ; thanks for your time.

-- 
Dominique


Re: [PATCH] net/9p: Validate current->sighand in client.c

2020-06-21 Thread Dominique Martinet
Alexander Kapshuk wrote on Sun, Jun 21, 2020:
> Thanks for your feedback.
> Shall I simply resend the v2 patch with the commit message reworded as
> you suggested, or should I make it a v3 patch instead?

Yes please resend the same commit reworded. v2 or v3 doesn't matter
much, the [PATCH whatever] part of the mail isn't included in the commit
message; I don't receive so much patches to be fussy about that :)

> One other thing I wanted to clarify is I got a message from kernel
> test robot ,
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/TMTLPYU6A522JH2VCN3PNZVAP6EE5MDF/,
> saying that on parisc my patch resulted in __lock_task_sighand being
> undefined during modpost'ing.
> I've noticed similar messages about other people's patches on the
> linux-kernel mailing list with the responses stating that the issue
> was at the compilation site rather than with the patch itself.
> As far as I understand, that is the case with my patch also. So no
> further action on that is required of me, is it?

Thanks, I hadn't noticed this mail -- unfortunately I don't have
anything setup to test pa risc, but __lock_task_sighand is defined in
kernel/signal.c which is unconditionally compiled so shouldn't have
anything to do with arch-specific failures, so I will run into the same
problem when I test this.

>From just looking at the code, it looks like a real problem to me -
__lock_task_sighand is never passed to EXPORT_SYMBOL so when building 9p
as a module we cannot use the function. Only exported symbols can be
called from code that can be built as modules.

That looks like an oversight to me more than something on purpose, but
it does mean I cannot take the patch right now -- we need to first get
the symbol exported before we can use it in 9p.


As things stand I'd rather have this patch wait one cycle for this than
revert to manipulating rcu directly like you did first -- if you're up
for it you can send a patch to get it exported first and I'll pick this
patch up next cycle, or I can take care of that too if you don't want to
bother.

Letting you tell me which you prefer,
-- 
Dominique


Re: [PATCH] net/9p: Validate current->sighand in client.c

2020-06-21 Thread Dominique Martinet
Alexander Kapshuk wrote on Sat, Jun 20, 2020:
> Use (un)lock_task_sighand instead of spin_lock_irqsave and
> spin_unlock_irqrestore to ensure current->sighand is a valid pointer as
> suggested in the email referenced below.

Thanks for v2! Patch itself looks good to me.

I always add another `Link:` tag to the last version of the patch at the
time of applying, so the message might be a bit confusing.
Feel free to keep the link to the previous discussion but I'd rather
just repeat a bit more of what we discussed (e.g. fix rcu not being
dereferenced cleanly by using the task helpers as suggested) rather than
just link to the thread

Sorry for nitpicking but I think commit messages are important and it's
better if they're understandable out of context, even if you give a link
for further details for curious readers, it helps being able to just
skim through git log.


Either way I'll include the patch in my test run today or tomorrow, had
promised it for a while...

Cheers,
-- 
Dominique


Re: [PATCH] net/9p: Fix sparse endian warning in trans_fd.c

2020-06-18 Thread Dominique Martinet
Alexander Kapshuk wrote on Thu, Jun 18, 2020:
> Address sparse endian warning:
> net/9p/trans_fd.c:932:28: warning: incorrect type in assignment (different 
> base types)
> net/9p/trans_fd.c:932:28:expected restricted __be32 [addressable] 
> [assigned] [usertype] s_addr
> net/9p/trans_fd.c:932:28:got unsigned long
> 
> Signed-off-by: Alexander Kapshuk 

INADDR_ANY is 0 so this really is noop but sure, less warnings is always
good. I'll take this one for 5.9.
Thanks!
-- 
Dominique


Re: [PATCH] net/9p: Fix sparse rcu warnings in client.c

2020-06-18 Thread Dominique Martinet
Alexander Kapshuk wrote on Thu, Jun 18, 2020:
> Address sparse nonderef rcu warnings:
> net/9p/client.c:790:17: warning: incorrect type in argument 1 (different 
> address spaces)
> net/9p/client.c:790:17:expected struct spinlock [usertype] *lock
> net/9p/client.c:790:17:got struct spinlock [noderef]  *
> net/9p/client.c:792:48: warning: incorrect type in argument 1 (different 
> address spaces)
> net/9p/client.c:792:48:expected struct spinlock [usertype] *lock
> net/9p/client.c:792:48:got struct spinlock [noderef]  *
> net/9p/client.c:872:17: warning: incorrect type in argument 1 (different 
> address spaces)
> net/9p/client.c:872:17:expected struct spinlock [usertype] *lock
> net/9p/client.c:872:17:got struct spinlock [noderef]  *
> net/9p/client.c:874:48: warning: incorrect type in argument 1 (different 
> address spaces)
> net/9p/client.c:874:48:expected struct spinlock [usertype] *lock
> net/9p/client.c:874:48:got struct spinlock [noderef]  *
> 
> Signed-off-by: Alexander Kapshuk 

Thanks for this patch.
>From what I can see, there are tons of other parts of the code doing the
same noderef access pattern to access current->sighand->siglock and I
don't see much doing that.
A couple of users justify this by saying SLAB_TYPESAFE_BY_RCU ensures
we'll always get a usable lock which won't be reinitialized however we
access it... It's a bit dubious we'll get the same lock than unlock to
me, so I agree to some change though.

After a second look I think we should use something like the following:

if (!lock_task_sighand(current, &flags))
warn & skip (or some error, we'd null deref if this happened currently);
recalc_sigpending();
unlock_task_sighand(current, &flags);

As you can see, the rcu_read_lock() isn't kept until the unlock so I'm
not sure it will be enough to please sparse, but I've convinced myself
current->sighand cannot change while we hold the lock and there just are
too many such patterns in the kernel.

Please let me know if I missed something or if there is an ongoing
effort to change how this works; I'll wait for a v2.

-- 
Dominique


Re: [PATCH v2] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work

2020-06-12 Thread Dominique Martinet
Wang Hai wrote on Fri, Jun 12, 2020:
> p9_read_work and p9_fd_cancelled may be called concurrently.
> In some cases, req->req_list may be deleted by both p9_read_work
> and p9_fd_cancelled.
> 
> We can fix it by ignoring replies associated with a cancelled
> request and ignoring cancelled request if message has been received
> before lock.
> 
> Fixes: 60ff779c4abb ("9p: client: remove unused code and any reference to 
> "cancelled" function")
> Reported-by: syzbot+77a25acfa0382e06a...@syzkaller.appspotmail.com
> Signed-off-by: Wang Hai 

Thanks! looks good to me, I'll queue for 5.9 as well unless you're in a
hurry.
-- 
Dominique


Re: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work

2020-06-11 Thread Dominique Martinet
wanghai (M) wrote on Fri, Jun 12, 2020:
> You are right, I got a syzkaller bug.
> 
> "p9_read_work+0x7c3/0xd90" points to list_del(&m->rreq->req_list);
> 
> [   62.733598] kasan: CONFIG_KASAN_INLINE enabled
> [   62.734484] kasan: GPF could be caused by NULL-ptr deref or user memory 
> access
> [   62.735670] general protection fault:  [#1] SMP KASAN PTI
> [   62.736577] CPU: 3 PID: 82 Comm: kworker/3:1 Not tainted 4.19.124+ #2
> [   62.737582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1ubuntu1 04/01/2014
> [   62.738988] Workqueue: events p9_read_work
> [   62.739642] RIP: 0010:p9_read_work+0x7c3/0xd90
> [   62.740348] Code: 48 c1 e9 03 80 3c 01 00 0f 85 cb 05 00 00 48 8d 7a 08 48 
> b9 00 00 00 00 00 fc ff df 49 8b 87 b8 00 00 00 48 89 fe 48 c1 ee 03 <80> 3c 
> 0e 00 0f 85 89 05 00 00 48 89 c6 48 b9 00 00 00 00 00 fc ff
> [   62.743236] RSP: 0018:8883ece17ca0 EFLAGS: 00010a06
> [   62.744059] RAX: dead0200 RBX: 8883d45666b0 RCX: 
> dc00
> [   62.745173] RDX: dead0100 RSI: 1bd5a021 RDI: 
> dead0108
> [   62.746279] RBP: 8883d4566590 R08: ed107a8acf31 R09: 
> ed107a8acf31
> [   62.747398] R10: 0001 R11: ed107a8acf30 R12: 
> 11107d9c2f9b
> [   62.748505] R13: 8883d45665d0 R14: 8883d4566608 R15: 
> 8883e1f1c000
> [   62.749615] FS:  () GS:8883ef18() 
> knlGS:
> [   62.750881] CS:  0010 DS:  ES:  CR0: 80050033
> [   62.751784] CR2:  CR3: 9c622003 CR4: 
> 007606e0
> [   62.752898] DR0:  DR1:  DR2: 
> 
> [   62.754011] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   62.755126] PKRU: 5554
> [   62.755561] Call Trace:
> [   62.755963]  ? p9_write_work+0xa00/0xa00
> [   62.756592]  process_one_work+0xae4/0x1b20
> [   62.757252]  ? apply_wqattrs_commit+0x3e0/0x3e0
> [   62.757985]  worker_thread+0x8c/0xe80
> [   62.758600]  ? __kthread_parkme+0xe9/0x190
> [   62.759254]  ? process_one_work+0x1b20/0x1b20
> [   62.759950]  kthread+0x341/0x410
> [   62.760479]  ? kthread_create_worker_on_cpu+0xf0/0xf0
> [   62.761296]  ret_from_fork+0x3a/0x50
> [   62.761874] Modules linked in:
> [   62.762378] Dumping ftrace buffer:
> [   62.762942](ftrace buffer empty)
> [   62.763547] ---[ end trace 69672816613947a3 ]---

This looks like:
https://syzkaller.appspot.com/bug?id=5df4f85d764ee89863d0294b4e0c87ef2fd2c624
I'm not sure how active this still is but please also add this
Reported-by tag:
Reported-by: syzbot+77a25acfa0382e06a...@syzkaller.appspotmail.com

(can keep both)


> Yes,In this case,  all further 9p messages will not be read.
> >p9_read_work probably should handle REQ_STATUS_FLSHD in a special case
> >that just throws the message away without error as well.
> 
> Can it be solved like this?
> 
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -362,7 +362,7 @@ static void p9_read_work(struct work_struct *work)
>     if (m->rreq->status == REQ_STATUS_SENT) {
>     list_del(&m->rreq->req_list);
>     p9_client_cb(m->client, m->rreq, REQ_STATUS_RCVD);
> -   } else {
> +   } else if (m->rreq->status != REQ_STATUS_FLSHD) {
>     spin_unlock(&m->client->lock);
>     p9_debug(P9_DEBUG_ERROR,
>  "Request tag %d errored out while
> we were reading the reply\n",

Yes that is probably correct.
Please add a comment above saying we ignore replies associated with a
cancelled request.

> This patch "afd8d65411" just moved list_del into cancelled ops. It
> is not actually the initial patch that caused the bug
> 
> In 60ff779c4abb ("9p: client: remove unused code and any reference
> to "cancelled" function")
> 
> It moved spin_lock under "if (oldreq->status == REQ_STATUS_FLSH)" .
> 
> After "if (oldreq->status == REQ_STATUS_FLSH)", oldreq may be
> changed by other thread.

Ok, thank you for explaining; I agree now.

-- 
Dominique


Re: [PATCH] 9p/trans_fd: Fix concurrency del of req_list in p9_fd_cancelled/p9_read_work

2020-06-11 Thread Dominique Martinet
Wang Hai wrote on Thu, Jun 11, 2020:
> p9_read_work and p9_fd_cancelled may be called concurrently.

Good catch. I'm sure this fixes some of the old syzbot bugs...
I'll check other transports handle this properly as well.

> Before list_del(&m->rreq->req_list) in p9_read_work is called,
> the req->req_list may have been deleted in p9_fd_cancelled.
> We can fix it by setting req->status to REQ_STATUS_FLSHD after
> list_del(&req->req_list) in p9_fd_cancelled.

hm if you do that read_work will fail with EIO and all further 9p
messages will not be read?
p9_read_work probably should handle REQ_STATUS_FLSHD in a special case
that just throws the message away without error as well.

> Before list_del(&req->req_list) in p9_fd_cancelled is called,
> the req->req_list may have been deleted in p9_read_work.
> We should return when req->status = REQ_STATUS_RCVD which means
> we just received a response for oldreq, so we need do nothing
> in p9_fd_cancelled.

I'll need some time to convince myself the refcounting is correct in
this case.
Pre-ref counting this definitely was wrong, but now it might just work
by chance I'll double-check.

> Fixes: 60ff779c4abb ("9p: client: remove unused code and any reference
> to "cancelled" function")

I don't understand how this commit is related?
At least make it afd8d65411 ("9P: Add cancelled() to the transport
functions.") which adds the op, not something that removed a previous
version of cancelled even earlier.

> diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> index f868cf6fba79..a563699629cb 100644
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -718,11 +718,18 @@ static int p9_fd_cancelled(struct p9_client *client, 
> struct p9_req_t *req)
>  {
>   p9_debug(P9_DEBUG_TRANS, "client %p req %p\n", client, req);
>  
> - /* we haven't received a response for oldreq,
> -  * remove it from the list.
> + /* If req->status == REQ_STATUS_RCVD, it means we just received a
> +  * response for oldreq, we need do nothing here. Else, remove it from
> +  * the list.

(nitpick) this feels a bit hard to read, and does not give any
information: you're just paraphrasing the C code.

I would suggest moving the comment after the spinlock and say what we
really do ; something as simple as "ignore cancelled request if message
has been received before lock" is enough.

>*/
>   spin_lock(&client->lock);
> + if (req->status == REQ_STATUS_RCVD) {
> + spin_unlock(&client->lock);
> + return 0;
> + }
> +
>   list_del(&req->req_list);
> + req->status = REQ_STATUS_FLSHD;
>   spin_unlock(&client->lock);
>   p9_req_put(req);
>  
-- 
Dominique


Re: [PATCH -next] 9p/xen: Add cleanup path in p9_trans_xen_init

2019-05-15 Thread Dominique Martinet
YueHaibing wrote on Tue, Apr 30, 2019:
> If xenbus_register_frontend() fails in p9_trans_xen_init,
> we should call v9fs_unregister_trans() to do cleanup.
> 
> Fixes: 868eb122739a ("xen/9pfs: introduce Xen 9pfs transport driver")
> Signed-off-by: YueHaibing 

Thanks, queued both in my repo - sorry for the delay.

-- 
Dominique


Re: [PATCH] 9p/net: fix memory leak in p9_client_create

2019-03-13 Thread Dominique Martinet
zhengbin wrote on Wed, Mar 13, 2019:
> If msize is less than 4096, we should close and put trans, destroy
> tagpool, not just free client. This patch fixes that.
> 
> Fixes: 574d356b7a02 ("9p/net: put a lower bound on msize")
> Reported-by: Hulk Robot 
> Signed-off-by: zhengbin 

Ack, definitely. Thanks for the fix, I'll take this for 5.1 even if it's
a bit late.

Please do Cc linux-kernel@vger.k.o next time though, that list should
get all the patches.

-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-22 Thread Dominique Martinet
Tom Herbert wrote on Fri, Feb 22, 2019:
> > > So basically it sounds like you're interested in supporting TCP
> > > connections that are half closed. I believe that the error in half
> > > closed is EPIPE, so if the TCP socket returns that it can be ignored
> > > and the socket can continue being attached and used to send data.
> >
> > Did you mean 'can continue being attached and used to receive data'?
>
> No, I meant shutdown on receive side when FIN is receved. TX is still
> allowed to drain an queued bytes. To support shutdown on the TX side
> would require additional logic since we need to effectively detach the
> transmit path but retain the receive path. I'm not sure this is a
> compelling use case to support.

Hm, it must be a matter of how we see thing but from what I understand
it's exactly the other way around. The remote closed the connection, so
trying to send anything would just yield a RST, so TX doesn't make
sense.
On the other hand, anything that had been sent by the remote before the
FIN and is on the local side's memory should still be receivable.

When you think about it as a TCP stream it's really weird: data coming,
data coming, data coming, FIN received.
But in the networking stack that received FIN short-circuits all the
data that was left around and immediately raises an EPIPE error.

I don't see what makes this FIN packet so great that it should be
processed before the data; we should only see that EPIPE when we're
done reading the data before it or trying to send something.

I'll check tomorrow/next week but I'm pretty sure the packets before
that have been ack'd at a tcp level as well, so losing them in the
application level is really unexpected.
 
> > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see
> > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on
> > both the csock and kcm socket will do many needless wakeups on only the
> > csock from what I can see, so I'd need some holdoff timer or something.
> > I guess it's possible though.
> 
> We might need to clear the error somehow. May a read of zero bytes?

Can try.

> > After a bit more debugging, this part works (__strp_recv() is called
> > again); but the next packet that is treated properly is rejected because
> > by the time __strp_recv() was called again a new skb was read and the
> > length isn't large enough to go all the way into the new packet, so this
> > test fails:
> > } else if (len <= (ssize_t)head->len -
> >   skb->len - stm->strp.offset) {
> > /* Length must be into new skb (and also
> >  * greater than zero)
> >  */
> > STRP_STATS_INCR(strp->stats.bad_hdr_len);
> > strp_parser_err(strp, -EPROTO, desc);
> >
> > So I need to figure a way to say "call this function again without
> > reading more data" somehow, or make this check more lax e.g. accept any
> > len > 0 after a retry maybe...
> > Removing that branch altogether seems to work at least but I'm not sure
> > we'd want to?
> 
> I like the check since it's conservative and covers the normal case.
> Maybe just need some more logic?

I can add a "retrying" state and not fail here if we ewre retrying for
whatever reason perhaps...
But I'm starting to wonder how this would work if my client didn't keep
on sending data, I'll try to fail on the last client's packet and see
how __strp_recv is called again through the timer, with the same skb
perhaps?

-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-21 Thread Dominique Martinet
Tom Herbert wrote on Wed, Feb 20, 2019:
> > When the client closes the socket, some messages are obviously still "in
> > flight", and the server will recv a POLLERR notification on the csock at
> > some point with many messages left.
> > The documentation says to unattach the csock when you get POLLER. If I
> > do that, the kcm socket will no longer give me any message, so all the
> > messages still in flight at the time are lost.
> 
> So basically it sounds like you're interested in supporting TCP
> connections that are half closed. I believe that the error in half
> closed is EPIPE, so if the TCP socket returns that it can be ignored
> and the socket can continue being attached and used to send data.

Did you mean 'can continue being attached and used to receive data'?

I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see
how to efficiently ignore EPIPE until POLLIN gets unset -- polling on
both the csock and kcm socket will do many needless wakeups on only the
csock from what I can see, so I'd need some holdoff timer or something.
I guess it's possible though.

> Another possibility is to add some linger semantics to an attached
> socket. For instance, a large message might be sent so that part of
> the messge is queued in TCP and part is queued in the KCM socket.
> Unattach would probably break that message. We probably want to linger
> option similar to SO_LINGER (or maybe just use the option on the TCP
> socket) that means don't complete the detach until any message being
> transmitted on the lower socket has been queued.

That would certainly work, even if non-obvious from a user standpoint.


> > > I'd like to see some retry on ENOMEM before this is merged though, so
> > > while I'm there I'll resend this with a second patch doing that
> > > retry,.. I think just not setting strp->interrupted and not reporting
> > > the error up might be enough? Will have to try either way.
> >
> > I also tried playing with that without much success.
> > I had assumed just not calling strp_parser_err() (which calls the
> > abort_parser cb) would be enough, eventually calling strp_start_timer()
> > like the !len case, but no can do.
> 
> I think you need to ignore the ENOMEM and have a timer or other
> callback to retry the operation in the future.

Yes, that's what I had intended to try; basically just break out and
schedule timer as said above.

After a bit more debugging, this part works (__strp_recv() is called
again); but the next packet that is treated properly is rejected because
by the time __strp_recv() was called again a new skb was read and the
length isn't large enough to go all the way into the new packet, so this
test fails:
} else if (len <= (ssize_t)head->len -
  skb->len - stm->strp.offset) {
/* Length must be into new skb (and also
 * greater than zero)
 */
STRP_STATS_INCR(strp->stats.bad_hdr_len);
strp_parser_err(strp, -EPROTO, desc);

So I need to figure a way to say "call this function again without
reading more data" somehow, or make this check more lax e.g. accept any
len > 0 after a retry maybe...
Removing that branch altogether seems to work at least but I'm not sure
we'd want to?
(grmbl at this slow VM and strparser not being possible to enable as a
module, it takes ages to test)


> > With that said, returning 0 from the parse function also raises POLLERR
> > on the csock and hangs netparser, so things aren't that simple...
> 
> Can you point to where this is happening. If the parse_msg callback
> returns 0 that is suppose to indicate that more bytes are needed.

I just blindly returned 0 "from time to time" in the kcm parser
function, but looking at above it was failing on the same check.
This somewhat makes sense for this one to fail here if a new packet was
read, no sane parser function should give a length smaller than what
they require to determine the length.


Thanks,
-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-19 Thread Dominique Martinet
Dominique Martinet wrote on Fri, Feb 15, 2019:
> With all that said I guess my patch should work correctly then, I'll try
> to find some time to check the error does come back up the tcp socket in
> my reproducer but I have no reason to believe it doesn't.

Ok, so I can confirm this part - the 'csock' does come back up with
POLLERR if the parse function returns ENOMEM in the current code.

It also comes back up with POLLERR when the remote side closes the
connection, which is expected, but I'm having a very hard time
understanding how an application is supposed to deal with these
POLLERR after having read the documentation and a bit of
experimentation.
I'm not sure how much it would matter for real life (if the other end
closes the connection most servers would not care about what they said
just before closing, but I can imagine some clients doing that in real
life e.g. a POST request they don't care if it succeeds or not)...
My test program works like this:
 - client connects, sends 10k messages and close()s the socket
 - server loops recving and close()s after 10k messages; it used to be
recvmsg() directly but it's now using poll then recvmsg.


When the client closes the socket, some messages are obviously still "in
flight", and the server will recv a POLLERR notification on the csock at
some point with many messages left.
The documentation says to unattach the csock when you get POLLER. If I
do that, the kcm socket will no longer give me any message, so all the
messages still in flight at the time are lost.

If I just ignore the csock like I used to, all the messages do come just
fine, but as said previously on a real error this will just make recvmsg
or the polling hang forever and I see no way to distinguish a "real"
error vs. a connection shut down from the remote side with data left in
the pipe.
I thought of checking POLLERR on csock and POLLIN not set on kcmsock,
but even that seems to happen fairly regularily - the kcm sock hasn't
been filled up, it's still reading from the csock.


On the other hand, checking POLLIN on the csock does say there is still
data left, so I know there is data left on the csock, but this is also
the case on a real error (e.g. if parser returns -ENOMEM)
... And this made me try to read from the csock after detaching it and I
can resume manual tcp parsing for a few messages until read() fails with
EPROTO ?! and I cannot seem to be able to get anything out of attaching
it back to kcm (for e.g. an ENOMEM error that was transient)...



I'm honestly not sure how the POLLERR notification mechanism works but I
think it would be much easier to use KCM if we could somehow delay that
error until KCM is done feeding from the csock (when netparser really
stops reading from it like on real error, e.g. abort callback maybe?)
I think it's fine if the csock is closed before the kcm sock message is
read, but we should not lose messages like this.



> I'd like to see some retry on ENOMEM before this is merged though, so
> while I'm there I'll resend this with a second patch doing that
> retry,.. I think just not setting strp->interrupted and not reporting
> the error up might be enough? Will have to try either way.

I also tried playing with that without much success.
I had assumed just not calling strp_parser_err() (which calls the
abort_parser cb) would be enough, eventually calling strp_start_timer()
like the !len case, but no can do.
With that said, returning 0 from the parse function also raises POLLERR
on the csock and hangs netparser, so things aren't that simple...


I could use a bit of help again.

Thanks,
-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-14 Thread Dominique Martinet
Tom Herbert wrote on Thu, Feb 14, 2019:
> On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
>  wrote:
> > Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> > all: from a user point of view, the connection just hangs (recvmsg never
> > returns), without so much as a message in dmesg either.
>
> That's by design. Here is the description in kcm.txt:
> 
> "When a TCP socket is attached to a KCM multiplexor data ready
> (POLLIN) and write space available (POLLOUT) events are handled by the
> multiplexor. If there is a state change (disconnection) or other error
> on a TCP socket, an error is posted on the TCP socket so that a
> POLLERR event happens and KCM discontinues using the socket. When the
> application gets the error notification for a TCP socket, it should
> unattach the socket from KCM and then handle the error condition (the
> typical response is to close the socket and create a new connection if
> necessary)."

Sigh, that's what I get for relying on pieces of code found on the
internet.

It does make "trivial" kcm sockets difficult to use though, the old
ganesha code I adapted to kcm was the horrible (naive?) kind spawning
one thread per connection and just waiting on read(), so making it just
create a kcm socket from the tcp one and wait on recvmsg() until an
error pops up does not seem an option.

That being said I'm a bit confused, I thought part of the point of kcm
was the multiplexing so a simple server could just keep attaching tcp
sockets to a single kcm socket and only have a single trivial loop
reading from the kcm socket ; but I guess there's no harm in also
looking for POLLERR on the tcp sockets... It would still need to close
them once clients disconnect I guess, this really only affects my naive
server.

> > strparser might be able to retry somehow.
> 
> We could retry on -ENOMEM since it is potentially a transient error,

Yes, I think we should aim to retry on -ENOMEM; I agree all errors are
not meant to be retried on but there already are different cases based
on what the parse cb returned; but that can be a different patch.

> but generally for errors (like connection is simply broken) it seems
> like it's working as intended. I suppose we could add a socket option
> to fail the KCM socket when all attached lower sockets have failed,
> but I not sure that would be a significant benefit for additional
> complexity.

Right, I agree it's probably not worth doing, now I'm aware of this I
can probably motivate myself to change this old code to use epoll
properly.

I think it could make sense to have this option for simpler programs,
but as things stand I guess we can require kcm users to do this much,
thanks for the explanation, and sorry for having failed to notice it.



With all that said I guess my patch should work correctly then, I'll try
to find some time to check the error does come back up the tcp socket in
my reproducer but I have no reason to believe it doesn't.

I'd like to see some retry on ENOMEM before this is merged though, so
while I'm there I'll resend this with a second patch doing that
retry,.. I think just not setting strp->interrupted and not reporting
the error up might be enough? Will have to try either way.



Thanks for the feedback,
-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-14 Thread Dominique Martinet
Tom Herbert wrote on Thu, Feb 14, 2019:
> > This second patch[2] (the current thread) now does an extra clone if
> > there is an offset, but the problem really isn't in the clone but the
> > pull itself that can fail and return NULL when there is memory pressure.
> > For some reason I hadn't been able to reproduce that behaviour with
> > strparser doing the pull, but I assume it would also be possible to hit
> > in extreme situations, I'm not sure...
>
> This option looks the best to me, at least as a way to fix the issue
> without requiring a change to the API. If the pull fails, doesn't that
> just mean that the parser fails? Is there some behavior with this
> patch that is not being handled gracefully?

Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
all: from a user point of view, the connection just hangs (recvmsg never
returns), without so much as a message in dmesg either.

It took me a while to figure out what failed exactly as I did indeed
expect strparser/kcm to handle that better, but ultimately as things
stand if the parser fails it calls strp_parser_err() with the error
which ends up in strp_abort_strp that should call
sk->sk_error_report(sk) but in kcm case sk is the csk and I guess
failing csk does not make a pending recv on the kcm sock to fail...

I'm not sure whether propagating the error to the upper socket is the
right thing to do, kcm is meant to be able to work with multiple
underlying sockets so I feel we must be cautious about that, but
strparser might be able to retry somehow.
This is what I said below:
> > [,,,]
> >  - the current patch, that I could only get to fail with KASAN, but does
> > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > Would still require to fix the hang, so make strparser retry a few times
> > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > userspace somehow.

Thanks,
-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-14 Thread Dominique Martinet
Tom Herbert wrote on Thu, Feb 14, 2019:
> > The best alternative I see is adding a proper helper to get
> > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> > lost as I have been; I'm not quite sure how/where to add such a helper
> > though as I've barely looked at the bpf code until now, but should we go
> > for that?
> 
> I'd rather not complicate the bpf code for this. Can we just always do
> an pskb_pull after skb_clone?

Which skb_clone are you thinking of?
If you're referring to the one in strparser, that was my very first
approach here[1], but Dordon shot it down saying that this is not an
strparser bug but a kcm bug since there are ways for users to properly
get the offset and use it -- and the ktls code does it right.

Frankly, my opinion still is that it'd be better in strparser because
there also is some bad use in bpf sockmap (they never look at the offset
either, while kcm uses it for rcv but not for parse), but looking at it
from an optimal performance point of view I agree the user can make
better decision than strparser so I understand where he comes from.


This second patch[2] (the current thread) now does an extra clone if
there is an offset, but the problem really isn't in the clone but the
pull itself that can fail and return NULL when there is memory pressure.
For some reason I hadn't been able to reproduce that behaviour with
strparser doing the pull, but I assume it would also be possible to hit
in extreme situations, I'm not sure...

So anyway, we basically have three choices that I can see:
 - push harder on strparser and go back to my first patch ; it's simple
and makes using strparser easier/safer but has a small overhead for
ktls, which uses the current strparser implementation correctly.
I hadn't been able to get this to fail with KASAN last time I tried but
I assume it should still be possible somehow.

 - the current patch, that I could only get to fail with KASAN, but does
complexify kcm a bit; this also does not fix bpf sockmap at all.
Would still require to fix the hang, so make strparser retry a few times
if strp->cb.parse_msg failed maybe? Or at least get the error back to
userspace somehow.

 - my last suggestion to document the kcm behaviour, and if possible add
a bpf helper to make proper usage easier ; this will make kcm harder to
use on end users but it's better than not being documented and seeing
random unappropriate lengths being parsed.



[1] first strparser patch
http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmad...@codewreck.org
[2] current thread's patch
http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmad...@codewreck.org


Thanks,
-- 
Dominique


Re: [PATCH v2] kcm: remove any offset before parsing messages

2019-02-14 Thread Dominique Martinet
Dominique Martinet wrote on Wed, Oct 31, 2018:
> Anyway, that probably explains I have no problem with bigger VM
> (uselessly more memory available) or without KASAN (I guess there's
> overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB
> of ram, so if there's an allocation failure there I think there's a
> problem ! . . .
> 
> So, well, I'm not sure on the way forward. Adding a bpf helper and
> document that kcm users should mind the offset?

bump on this - I had mostly forgotten about it but the nfs-ganesha
community that could make use of KCM reminded me of my patch that's
waiting for this.

Summary for people coming back after four months:
 - kcm is great, but the bpf function that's supposed to be called for
each packet does not automatically adjust the offset so that it can
assume the skb starts with the packet it needs to look at

 - there is some workaround code that is far from obvious and
undocumented, see the original thread[1]:
[1] 
http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf...@davejwatson-mba.dhcp.thefacebook.com

 - my patch here tried to automatically pull the corresponding packet to
the front, but apparently with KASAN can trigger out of memory
behaviours on "small" VMs, so even if it doesn't seem to impact
performance much without KASAN I don't think it's really ok to
potentially hang the connection due to oom under severe conditions.


The best alternative I see is adding a proper helper to get
"kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
lost as I have been; I'm not quite sure how/where to add such a helper
though as I've barely looked at the bpf code until now, but should we go
for that?


(it really feels wrong to me that some random person who just started by
trying to use kcm has to put this much effort to keep the ball rolling,
if nobody cares about kcm I'm also open to have it removed completely
despite the obvious performance gain I benchmarked for ganesha[2] ;
barely maintained feature is worse than no feature)

[2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314

Thanks,
-- 
Dominique


Re: [PATCH net-next] 9p: mark expected switch fall-through

2019-01-22 Thread Dominique Martinet
Gustavo A. R. Silva wrote on Wed, Jan 23, 2019:
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> This patch fixes the following warning:
> 
> net/9p/trans_xen.c:514:6: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
> 
> Warning level 3 was used: -Wimplicit-fallthrough=3
> 
> This patch is part of the ongoing efforts to enabling
> -Wimplicit-fallthrough
> 
> Signed-off-by: Gustavo A. R. Silva 

Sure, taking this to my queue for -next

-- 
Dominique


[GIT PULL] 9p updates for 4.21

2019-01-01 Thread Dominique Martinet
Small pull request this time around with only two commits; some missing
prototype warning fix and a syzkaller fix when a 9p server advertises a
too small msize.

The commit date is close-ish because I reworded a commit message to
add a Cc to stable for the msize fix recently, but the patchs themselves
have been in linux-next since Nov 20~ish


Thanks,

The following changes since commit ccda4af0f4b92f7b4c308d3acc262f4a7e3affad:

  Linux 4.20-rc2 (2018-11-11 17:12:31 -0600)

are available in the Git repository at:

  git://github.com/martinetd/linux tags/9p-for-4.21

for you to fetch changes up to 574d356b7a02c7e1b01a1d9cba8a26b3c2888f45:

  9p/net: put a lower bound on msize (2018-12-25 17:07:49 +0900)


Pull request for inclusion in 4.21

Missing prototype warning fix and a syzkaller fix when a 9p server
advertises a too small msize


Adeodato SimĂł (1):
  net/9p: include trans_common.h to fix missing prototype warning.

Dominique Martinet (1):
  9p/net: put a lower bound on msize

 net/9p/client.c   | 21 +
 net/9p/trans_common.c |  1 +

-- 
Dominique Martinet


Re: [PATCH 1/3] 9p/net: implement asynchronous rpc

2018-12-17 Thread Dominique Martinet
Tomas Bortoli wrote on Mon, Dec 17, 2018:
> sorry for the delay, I've been quite busy these days.

No problem.

> The patches looks good to me and should indeed speed up the code a bit.
> I quickly tested them against Syzkaller tuned for the 9p subsystem and
> everything seems fine.

Thanks, can I add your Reviewed-by on all three?

> And by the way, which refcount races?

There's a problem with trans_fd read_work and cancelled callback; I'm
not so sure about refcount but we can definitely get double list_del
as we're not checking the status. I think when we incorrectly remove
from the list we also mismanage the refcount, but honestly need to
test..

-- 
Dominique


[PATCH 1/3] 9p/net: implement asynchronous rpc

2018-12-11 Thread Dominique Martinet
From: Dominique Martinet 

Add p9_client_async_rpc that will let us send an rpc without waiting
for the reply, as well as an async handler hook.

The work done in the hook could mostly be done in the client callback,
but I prefered not to for a couple of reasons:
 - the callback can be called in an IRQ for some transports, and the
handler needs to call p9_tag_remove which takes the client lock that has
recently been reworked to not be irq-compatible (as it was never taken
in IRQs until now)
 - the handled replies can be handled anytime, there is nothing the
userspace would care about and want to be notified of quickly
 - the async request list and this function would be needed anyway for
when we disconnect the client, in order to not leak async requests that
haven't been replied to

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Tomas Bortoli 
Cc: Dmitry Vyukov 
---

I've been sitting on these patches for almost a month now because I
wanted to fix the cancel race, but I think it's what the most recent
syzbot email is about so if it could find it without this it probably
won't hurt to at least get some reviews for these three patches first.

I won't submit these for next cycle, so will only put them into -next
after 4.20 is released; hopefully I'll have time to look at the two
pending refcount races around that time.
Until then, comments please!


 include/net/9p/client.h |  9 +-
 net/9p/client.c | 64 +
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 947a570307a6..a4ded7666c73 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -89,7 +89,8 @@ enum p9_req_status_t {
  * @tc: the request fcall structure
  * @rc: the response fcall structure
  * @aux: transport specific data (provided for trans_fd migration)
- * @req_list: link for higher level objects to chain requests
+ * @req_list: link used by trans_fd
+ * @async_list: link used to check on async requests
  */
 struct p9_req_t {
int status;
@@ -100,6 +101,7 @@ struct p9_req_t {
struct p9_fcall rc;
void *aux;
struct list_head req_list;
+   struct list_head async_list;
 };
 
 /**
@@ -110,6 +112,9 @@ struct p9_req_t {
  * @trans_mod: module API instantiated with this client
  * @status: connection state
  * @trans: tranport instance state and API
+ * @fcall_cache: kmem cache for client request data (msize-specific)
+ * @async_req_lock: protects @async_req_list
+ * @async_req_list: list of requests waiting a reply
  * @fids: All active FID handles
  * @reqs: All active requests.
  * @name: node name used as client id
@@ -125,6 +130,8 @@ struct p9_client {
enum p9_trans_status status;
void *trans;
struct kmem_cache *fcall_cache;
+   spinlock_t async_req_lock;
+   struct list_head async_req_list;
 
union {
struct {
diff --git a/net/9p/client.c b/net/9p/client.c
index 357214a51f13..0a67c3ccd4fd 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -729,6 +729,62 @@ static struct p9_req_t *p9_client_prepare_req(struct 
p9_client *c,
return ERR_PTR(err);
 }
 
+static struct p9_req_t *
+p9_client_async_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
+{
+   va_list ap;
+   int err;
+   struct p9_req_t *req;
+
+   va_start(ap, fmt);
+   req = p9_client_prepare_req(c, type, c->msize, fmt, ap);
+   va_end(ap);
+   if (IS_ERR(req))
+   return req;
+
+   err = c->trans_mod->request(c, req);
+   if (err < 0) {
+   /* bail out without flushing... */
+   p9_req_put(req);
+   p9_tag_remove(c, req);
+   if (err != -ERESTARTSYS && err != -EFAULT)
+   c->status = Disconnected;
+   return ERR_PTR(safe_errno(err));
+   }
+
+   return req;
+}
+
+static void p9_client_handle_async(struct p9_client *c, bool free_all)
+{
+   struct p9_req_t *req, *nreq;
+
+   /* it's ok to miss some async replies here, do a quick check without
+* lock first unless called with free_all
+*/
+   if (!free_all && list_empty(&c->async_req_list))
+   return;
+
+   spin_lock_irq(&c->async_req_lock);
+   list_for_each_entry_safe(req, nreq, &c->async_req_list, async_list) {
+   if (free_all && req->status < REQ_STATUS_RCVD) {
+   p9_debug(P9_DEBUG_ERROR,
+"final async handler found req tag %d type %d 
status %d\n",
+req->tc.tag, req->tc.id, req->status);
+   if (c->trans_mod->cancelled)
+   c->trans_mod->cancelled(c, req);
+   /* won't recei

[PATCH 2/3] 9p/net: make clunk asynchronous

2018-12-11 Thread Dominique Martinet
From: Dominique Martinet 

clunk is defined as making the fid invalid whatever the server returns,
and we should ignore errors so it is a good candidate for async call.

The change should make 9p slightly faster (many vfs systeme calls won't
need to wait for that clunk), but more importantly the flush rework
means we won't clear pending signals and the current implementation of
"retry twice then leak the fid" will stop working, so this needed
improving.

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Tomas Bortoli 
Cc: Dmitry Vyukov 
---
 include/net/9p/client.h |   4 ++
 net/9p/client.c | 124 +++-
 2 files changed, 62 insertions(+), 66 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index a4ded7666c73..75d7f83e5b94 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -91,6 +91,7 @@ enum p9_req_status_t {
  * @aux: transport specific data (provided for trans_fd migration)
  * @req_list: link used by trans_fd
  * @async_list: link used to check on async requests
+ * @clunked_fid: for clunk, points to fid
  */
 struct p9_req_t {
int status;
@@ -102,6 +103,9 @@ struct p9_req_t {
void *aux;
struct list_head req_list;
struct list_head async_list;
+   union {
+   struct p9_fid *clunked_fid;
+   };
 };
 
 /**
diff --git a/net/9p/client.c b/net/9p/client.c
index 0a67c3ccd4fd..a47b5a54573d 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -649,6 +649,51 @@ static int p9_check_zc_errors(struct p9_client *c, struct 
p9_req_t *req,
return err;
 }
 
+static struct p9_fid *p9_fid_create(struct p9_client *clnt)
+{
+   int ret;
+   struct p9_fid *fid;
+
+   p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
+   fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL);
+   if (!fid)
+   return NULL;
+
+   memset(&fid->qid, 0, sizeof(struct p9_qid));
+   fid->mode = -1;
+   fid->uid = current_fsuid();
+   fid->clnt = clnt;
+   fid->rdir = NULL;
+   fid->fid = 0;
+
+   idr_preload(GFP_KERNEL);
+   spin_lock_irq(&clnt->lock);
+   ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1,
+   GFP_NOWAIT);
+   spin_unlock_irq(&clnt->lock);
+   idr_preload_end();
+
+   if (!ret)
+   return fid;
+
+   kfree(fid);
+   return NULL;
+}
+
+static void p9_fid_destroy(struct p9_fid *fid)
+{
+   struct p9_client *clnt;
+   unsigned long flags;
+
+   p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid);
+   clnt = fid->clnt;
+   spin_lock_irqsave(&clnt->lock, flags);
+   idr_remove(&clnt->fids, fid->fid);
+   spin_unlock_irqrestore(&clnt->lock, flags);
+   kfree(fid->rdir);
+   kfree(fid);
+}
+
 static struct p9_req_t *
 p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...);
 
@@ -778,6 +823,9 @@ static void p9_client_handle_async(struct p9_client *c, 
bool free_all)
}
if (free_all || req->status >= REQ_STATUS_RCVD) {
/* Put old refs whatever reqs actually returned */
+   if (req->tc.id == P9_TCLUNK) {
+   p9_fid_destroy(req->clunked_fid);
+   }
list_del(&req->async_list);
p9_tag_remove(c, req);
}
@@ -959,51 +1007,6 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client 
*c, int8_t type,
return ERR_PTR(safe_errno(err));
 }
 
-static struct p9_fid *p9_fid_create(struct p9_client *clnt)
-{
-   int ret;
-   struct p9_fid *fid;
-
-   p9_debug(P9_DEBUG_FID, "clnt %p\n", clnt);
-   fid = kmalloc(sizeof(struct p9_fid), GFP_KERNEL);
-   if (!fid)
-   return NULL;
-
-   memset(&fid->qid, 0, sizeof(struct p9_qid));
-   fid->mode = -1;
-   fid->uid = current_fsuid();
-   fid->clnt = clnt;
-   fid->rdir = NULL;
-   fid->fid = 0;
-
-   idr_preload(GFP_KERNEL);
-   spin_lock_irq(&clnt->lock);
-   ret = idr_alloc_u32(&clnt->fids, fid, &fid->fid, P9_NOFID - 1,
-   GFP_NOWAIT);
-   spin_unlock_irq(&clnt->lock);
-   idr_preload_end();
-
-   if (!ret)
-   return fid;
-
-   kfree(fid);
-   return NULL;
-}
-
-static void p9_fid_destroy(struct p9_fid *fid)
-{
-   struct p9_client *clnt;
-   unsigned long flags;
-
-   p9_debug(P9_DEBUG_FID, "fid %d\n", fid->fid);
-   clnt = fid->clnt;
-   spin_lock_irqsave(&clnt->lock, flags);
-   idr_remove(&clnt->fids, fid->fid);
-   spin_unlock_irqrestore(&clnt->lock, flags);
-   kfree(fid-

[PATCH 3/3] 9p/net: make flush asynchronous

2018-12-11 Thread Dominique Martinet
From: Dominique Martinet 

Make the client flush asynchronous so we don't need to ignore signals in
rpc functions:
 - on signal, send a flush request as we previously did but use the new
asynchronous helper and return immediately
 - when the reply has been received or connection is destroyed, free
both tags in the handler

Signed-off-by: Dominique Martinet 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
Cc: Tomas Bortoli 
Cc: Dmitry Vyukov 
---
 include/net/9p/client.h |   2 +
 net/9p/client.c | 172 ++--
 2 files changed, 78 insertions(+), 96 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 75d7f83e5b94..dcd40e7ef202 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -91,6 +91,7 @@ enum p9_req_status_t {
  * @aux: transport specific data (provided for trans_fd migration)
  * @req_list: link used by trans_fd
  * @async_list: link used to check on async requests
+ * @flushed_req: for flush, points to matching flushed req
  * @clunked_fid: for clunk, points to fid
  */
 struct p9_req_t {
@@ -104,6 +105,7 @@ struct p9_req_t {
struct list_head req_list;
struct list_head async_list;
union {
+   struct p9_req_t *flushed_req;
struct p9_fid *clunked_fid;
};
 };
diff --git a/net/9p/client.c b/net/9p/client.c
index a47b5a54573d..666a722088e9 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -694,50 +694,6 @@ static void p9_fid_destroy(struct p9_fid *fid)
kfree(fid);
 }
 
-static struct p9_req_t *
-p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...);
-
-/**
- * p9_client_flush - flush (cancel) a request
- * @c: client state
- * @oldreq: request to cancel
- *
- * This sents a flush for a particular request and links
- * the flush request to the original request.  The current
- * code only supports a single flush request although the protocol
- * allows for multiple flush requests to be sent for a single request.
- *
- */
-
-static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
-{
-   struct p9_req_t *req;
-   int16_t oldtag;
-   int err;
-
-   err = p9_parse_header(&oldreq->tc, NULL, NULL, &oldtag, 1);
-   if (err)
-   return err;
-
-   p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag);
-
-   req = p9_client_rpc(c, P9_TFLUSH, "w", oldtag);
-   if (IS_ERR(req))
-   return PTR_ERR(req);
-
-   /*
-* if we haven't received a response for oldreq,
-* remove it from the list
-*/
-   if (oldreq->status == REQ_STATUS_SENT) {
-   if (c->trans_mod->cancelled)
-   c->trans_mod->cancelled(c, oldreq);
-   }
-
-   p9_tag_remove(c, req);
-   return 0;
-}
-
 static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
  int8_t type, int req_size,
  const char *fmt, va_list ap)
@@ -800,6 +756,39 @@ p9_client_async_rpc(struct p9_client *c, int8_t type, 
const char *fmt, ...)
return req;
 }
 
+/**
+ * p9_client_flush - flush (cancel) a request
+ * @c: client state
+ * @oldreq: request to cancel
+ *
+ * This sents a flush for a particular request and links
+ * the flush request to the original request.  The current
+ * code only supports a single flush request although the protocol
+ * allows for multiple flush requests to be sent for a single request.
+ *
+ */
+
+static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
+{
+   struct p9_req_t *req;
+   int16_t oldtag = oldreq->tc.tag;
+
+   p9_debug(P9_DEBUG_9P, ">>> TFLUSH tag %d\n", oldtag);
+   req = p9_client_async_rpc(c, P9_TFLUSH, "w", oldtag);
+   if (IS_ERR(req)) {
+   return PTR_ERR(req);
+   }
+
+   p9_debug(P9_DEBUG_MUX, "sent flush for oldreq %d type %d with flush tag 
%d\n",
+oldtag, oldreq->tc.id, req->tc.tag);
+   req->flushed_req = oldreq;
+   spin_lock_irq(&c->lock);
+   list_add(&req->async_list, &c->async_req_list);
+   spin_unlock_irq(&c->lock);
+
+   return 0;
+}
+
 static void p9_client_handle_async(struct p9_client *c, bool free_all)
 {
struct p9_req_t *req, *nreq;
@@ -823,6 +812,24 @@ static void p9_client_handle_async(struct p9_client *c, 
bool free_all)
}
if (free_all || req->status >= REQ_STATUS_RCVD) {
/* Put old refs whatever reqs actually returned */
+   if (req->tc.id == P9_TFLUSH) {
+   p9_debug(P9_DEBUG_MUX,
+"flushing oldreq tag %d status %d, 
flush_req tag %d\n",
+  

9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

2018-10-11 Thread Dominique Martinet
Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> > Now we are talking!
> > We generally assume that all modules are simply compiled into kernel.
> > At least that's we have on syzbot. If somebody can't compile them in,
> > we can suggest to add modprobe into init.
> > So this boils down to just writing to /sys/module/rdma_rxe/parameters/add.
> 
> This fails for me:
> 
> root@syzkaller:~# echo -n syz1 > /sys/module/rdma_rxe/parameters/add
> [20992.905406] rdma_rxe: interface syz1 not found
> bash: echo: write error: Invalid argument

Works here, I just did:

[root@f2 ~]# modprobe rdma_rxe
[root@f2 ~]# echo -n ens3 > /sys/module/rdma_rxe/parameters/add 

dmesg says:
[   35.595534] rdma_rxe: set rxe0 active
[   35.595541] rdma_rxe: added rxe0 to ens3


Actually for a dummy interface if I try the echo directly the echo
works, and a verb device is created, and I just confirmed I can use
it... so not sure why rxe_cfg said EINVAL earlier...

[root@f2 ~]# ip link add dummy0 type dummy
[root@f2 ~]# ip link set dummy0 up
[root@f2 ~]# ip addr add 10.1.1.1/24 dev dummy0
[root@f2 ~]# modprobe rdma_rxe
[root@f2 ~]# echo -n dummy0 > /sys/module/rdma_rxe/parameters/add 


(then using my test client:
[root@f2 src]# ./rcat -s
INFO:  trans_rdma.c (879), msk_cma_event_handler: CONNECT_REQUEST
INFO:  trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
INFO:  trans_rdma.c (917), msk_cma_event_handler: DISCONNECT EVENT...

[root@f2 src]# ./rcat -c 10.1.1.1
INFO:  trans_rdma.c (862), msk_cma_event_handler: ESTABLISHED
^C
)


I assume your syz1 interface is a tap device as you were saying earlier?
Got anything in dmesg?

-- 
Dominique


9p/RDMA for syzkaller (Was: BUG: corrupted list in p9_read_work)

2018-10-11 Thread Dominique Martinet
Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> But again we don't need to support all of the available hardware.

I agree with that, I just have no idea what the "librxe-rdmav16.so" lib
could be doing and described something I am slightly more familiar with
(e.g. libmlx5)
I talked about a common subset of the verb abi because I didn't want to
look into what it's doing, but if it's not enough there's always that
possibility.


> For example, we are testing net stack from external side using tun.
> tun is a very simple, virtual abstraction of a network card. It allows
> us to test all of generic net stack starting from L2 without messing
> with any real drivers and their differences entirely. I had impression
> that we are talking about something similar here too. Or not?

That sounds about right, rxe is a software implementation that should
work on most network interfaces ; at least from what I tried it worked
on a VM's virtio net down to my laptop's wifi interface so it's a good
start... I'm not saying all because I just tried a dummy interface and
that returned EINVAL.
The only point I disagree is the 'very simple', even getting that to
work will be a far cry from a socket() call... :)


> Also I am a bit missing context about rdma<->9p interface. Do we need
> to setup all these ring buffers to satisfy the parts that 9p needs? Is
> it that 9p actually reads data directly from these ring buffers? Or
> there is some higher-level rdma interface that 9p uses?

It needs an "RDMA_PS_TCP" connection established, that requires
everything I described unfortunately...
Once that's established we need to register some memory to the driver
and post some recv buffers (even if we won't read it, the client would
get errors if we aren't ready to receive anything - at least it does
with real hardware), and also use some registered memory to send data.

Thinking back though I think that my server implementation isn't very
far from the raw level in what I'm doing, I recall libibverbs fallback
implementation (e.g. if the driver lib doesn't implement it otherwise)
of the functions I looked at like ibv_post_send to mostly be just
serializing the arguments, slapping the command from an enum in front of
it and sending it to the kernel, so it might be enough to just
reimplement that shim in or figure a way to generate the binary commands
once and then use these values; now I'm comparing two runs of strace of
my test server I definitely see a pattern.

I'll give it a try but don't expect something fast, and it's probably
not going to be very pretty either...

To give a concrete example, here are all the read/write/fcntl calls
looking just at /dev/infiniband in a hello world program that just
establishes connection (server side), receive and send two messages and
quits:


This part apparently sets up the listening connection of the server:

1430  1539262699.126025 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", 
O_RDWR|O_CLOEXEC) = 3
1430  1539262699.126155 write(3, 
"\0\0\0\0\30\0\4\0@m'\1\0\0\0\0\344\327\375\271\374\177\0\0?\1\2\0\0\0\0\0", 
32) = 32
1430  1539262699.126192 write(3, 
"\24\0\0\0\210\0\0\0\0\0\0\\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\6\0\0\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
 144) = 144
1430  1539262699.126223 write(3, "\23\0\0\0\20\0\20\1 
\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.126250 write(3, "\23\0\0\0\20\0\20\1 
\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
1430  1539262699.126274 write(3, 
"\1\0\0\0\20\0\4\0\324\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.126303 close(3)= 0
1430  1539262699.126360 openat(AT_FDCWD, "/dev/infiniband/rdma_cm", 
O_RDWR|O_CLOEXEC) = 3
1430  1539262699.126429 write(3, 
"\0\0\0\0\30\0\4\0\240\217'\1\0\0\0\0t\330\375\271\374\177\0\0\6\1\2\0\0\0\0\0",
 32) = 32
1430  1539262699.126472 write(3, 
"\24\0\0\0\210\0\0\0\0\0\0\0\34\0\0\0\n\0\4\323\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
 144) = 144
1430  1539262699.126501 write(3, 
"\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.126534 write(3, 
"\23\0\0\0\20\0\20\1p\326\375\271\374\177\0\0\0\0\0\0\2\0\0\0", 24) = 24
1430  1539262699.127119 write(3, "\7\0\0\0\10\0\0\0\0\0\0\0@\0\0\0", 16) = 16
1430  1539262699.127149 write(3, 
"\23\0\0\0\20\0\20\1`\327\375\271\374\177\0\0\0\0\0\0\0\0\0\0", 24) = 24
1430  1539262699.127319 fcntl(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
1430  1539262699.127348 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE 


Then the client connects (had some epoll on read on fd 3, but no read?!)

1446  1539262706.

Re: BUG: corrupted list in p9_read_work

2018-10-11 Thread Dominique Martinet
Dmitry Vyukov wrote on Thu, Oct 11, 2018:
> > That's still the tricky part, I'm afraid... Making a separate server
> > would have been easy because I could have reused some of my junk for the
> > actual connection handling (some rdma helper library I wrote ages
> > ago[1]), but if you're going to just embed C code you'll probably want
> > something lower level? I've never seen syzkaller use any library call
> > but I'm not even sure I would know how to create a qp without
> > libibverbs, would standard stuff be OK ?
> 
> Raw syscalls preferably.
> What does 'rxe_cfg start ens3' do on syscall level? Some netlink?

modprobe rdma_rxe (and a bunch of other rdma modules before that) then
writes the interface name in /sys/module/rdma_rxe/parameters/add
apparently; then checks it worked.
this part could be done in C directly without too much trouble, but as
long as the proper kernel configuration/modules are available

> Any libraries and utilities are hell pain in linux world. Will it work
> in Android userspace? gVisor? Who will explain all syzkaller users
> where they get this for their who-knows-what distro, which is 10 years
> old because of corp policies, and debug how their version of the
> library has a slightly incompatible version?
> For example, after figuring out that rxe_cfg actually comes from
> rdma-core (which is a separate delight on linux), my debian
> destribution failed to install it because of some conflicts around
> /etc/modprobe.d/mlx4.conf, and my ubuntu distro does not know about
> such package. And we've just started :)

The rdma ecosystem is a pain, I'll easily agree with that...

> Syscalls tend to be simpler and more reliable. If it gives ENOSUPP,
> ok, that's it. If it works, great, we can use it.

I'll have to look into it a bit more; libibverbs abstracts a lot of
stuff into per-nic userspace drivers (the files I cited in a previous
mail) and basically with the mellanox cards I'm familiar with the whole
user session looks like this:
 * common libibverbs/rdmacm code opens /dev/infiniband/rdma_cm and
/dev/infiniband/uverbs0 (plus a bunch of files to figure out abi
version, what user driver to load etc) 
 * it and the userspace driver issue "commands" over these two files' fd
to setup the connection ; some commands are standard but some are
specific to the interface and defined in the driver.
There are many facets to a connection in RDMA: a protection domain used
to register memory with the nic, a queue pair that is the actual tx/rx
connection, optionally a completion channel that will be another fd to
listen on for events that tell you something happened and finally some
memory regions to directly communicate with the nic from userspace
depending on the specific driver.
 * then there's the actual usage, more commands through the uverbs0 char
device to register the memory you'll use, and once that's done it's
entierly up to the driver - for example the mellanox lib can do
everything in userspace playing with the memory regions it registered,
but I'd wager the rxe driver does more calls through the uverbs0 fd...

Honestly I'm not keen on reimplementing all of this; the interface
itself pretty much depends on your version of the kernel (there is a
common ABI defined, but as far as specific nics are concerned if your
kernel module doesn't match the user library version you can get some
nasty surprises), and it's far from the black or white of a good ol'
ENOSUPP error.


I'll look if I can figure out if there is a common subset of verbs
commands that are standard and sufficient to setup a listening
connection and exchange data that should be supported for all devices
and would let us reimplement just that, but while I hear your point
about android and ten years in the future I think it's more likely than
ten years in the future the verb abi will have changed but libibverbs
will just have the new version implemented and hide the change :P

-- 
Dominique


Re: Re: BUG: corrupted list in p9_read_work

2018-10-10 Thread Dominique Martinet
Dominique Martinet wrote on Wed, Oct 10, 2018:
> It works though, is it just picky because I didn't end it in .git? let's
> try again, sorry for the noise...
> 
> #syz test: git://github.com/martinetd/linux.git 
> e4ca13f7d075e551dc158df6af18fb412a1dba0a

And I guess the commit hash needs to be in the default clone branch to
work ?
('git fetch  ' happily fetches the commit in a new clone for
me... But that feels like a github specific behaviour maybe)

Oh, well; made a branch for it, last try for me.

#syz test: git://github.com/martinetd/linux.git for-syzbot

-- 
Dominique


Re: [PATCH] 9p: potential NULL dereference

2018-09-26 Thread Dominique Martinet
Dan Carpenter wrote on Wed, Sep 26, 2018:
> p9_tag_alloc() is supposed to return error pointers, but we accidentally
> return a NULL here.  It would cause a NULL dereference in the caller.
> 
> Fixes: 996d5b4db4b1 ("9p: Use a slab for allocating requests")
> Signed-off-by: Dan Carpenter 

Good catch, the culprit commit is only in -next so just adding this to
the queue right away.

Thanks!
-- 
Dominique


Re: KASAN: invalid-free in p9stat_free

2018-08-26 Thread Dominique Martinet
syzbot wrote on Sun, Aug 26, 2018:
> HEAD commit:e27bc174c9c6 Add linux-next specific files for 20180824
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15dc19a640
> kernel config:  https://syzkaller.appspot.com/x/.config?x=28446088176757ea
> dashboard link: https://syzkaller.appspot.com/bug?extid=d4252148d198410b864f
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f8efba40
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1178256a40
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+d4252148d198410b8...@syzkaller.appspotmail.com
> 
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> ==
> BUG: KASAN: double-free or invalid-free in p9stat_free+0x35/0x100
> net/9p/protocol.c:48

That looks straight-forward enough, p9pdu_vreadf does p9stat_free on
error then v9fs_dir_readdir does the same ; there is nothing else that
could return an error without going through the first free so we could
just remove the later one...

There are a couple other users of the 'S' pdu read (that reads the stat
struct and frees it on error), so it's probably best to keep the current
behaviour as far as this is concerned, what we could do though is make
the free function idempotent (write NULLs in the freed fields), but I do
not see this being done often, do you know what the policy is about
this kind of pattern nowadays?

The struct is cleanly zeroed before being read so there is no risk of
double-frees between iterations so zeroing pointers is not strictly
required, but it does make things safer in general.


-- 
Dominique Martinet


[PATCH] 9p/xen: fix check for xenbus_read error in front_probe

2018-08-14 Thread Dominique Martinet
From: Dominique Martinet 

If the xen bus exists but does not expose the proper interface, it is
possible to get a non-zero length but still some error, leading to
strcmp failing trying to load invalid memory addresses e.g.
fffe.

There is then no need to check length when there is no error, as the
xenbus driver guarantees that the string is nul-terminated.

Signed-off-by: Dominique Martinet 
Cc: Stefano Stabellini 
Cc: Eric Van Hensbergen 
Cc: Latchesar Ionkov 
---

This is a trivial bug I stumbled on when setting up xen with p9fs and
running the VM in pvm: it had enough in the bus to trigger the probe
but then there was no version and it tried to return ENOENT but len
was set to the lower-level message size.

 net/9p/trans_xen.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/9p/trans_xen.c b/net/9p/trans_xen.c
index 1a5b38892eb4..f76beadddfc3 100644
--- a/net/9p/trans_xen.c
+++ b/net/9p/trans_xen.c
@@ -391,9 +391,9 @@ static int xen_9pfs_front_probe(struct xenbus_device *dev,
unsigned int max_rings, max_ring_order, len = 0;
 
versions = xenbus_read(XBT_NIL, dev->otherend, "versions", &len);
-   if (!len)
-   return -EINVAL;
+   if (IS_ERR(versions))
+   return PTR_ERR(versions);
if (strcmp(versions, "1")) {
kfree(versions);
return -EINVAL;
}
-- 
2.17.1



Re: KCM - recvmsg() mangles packets?

2018-08-09 Thread Dominique Martinet
Tom Herbert wrote on Thu, Aug 09, 2018:
> > diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
> > index 625acb27efcc..348ff5945591 100644
> > --- a/net/strparser/strparser.c
> > +++ b/net/strparser/strparser.c
> > @@ -222,6 +222,16 @@ static int __strp_recv(read_descriptor_t *desc, struct 
> > sk_buff *orig_skb,
> > if (!stm->strp.full_len) {
> > ssize_t len;
> >
> > +   /* Can only parse if there is no offset */
> > +   if (unlikely(stm->strp.offset)) {
> > +   if (!pskb_pull(skb, stm->strp.offset)) {
> > +   
> > STRP_STATS_INCR(strp->stats.mem_fail);
> > +   strp_parser_err(strp, -ENOMEM, 
> > desc);
> > +   break;
> > +   }
> > +   stm->strp.offset = 0;
> > +   }
> > +
> 
> Seems okay to me for a fix.

Hmm, if you say so, I'll send this as a patch for broader comments right
away.

> Looks like strp.offset is only set in one place and read in one
> place. With this pull maybe that just can go away?

Good point, when strp.offset is set the full_len is also being init'd so
we will necessarily do the pull next...

But the way tls uses strparser is also kind of weird, since they modify
the strp_msg's offset and full_len, I wouldn't want to assume we can't
have full_len == 0 *again* later with a non zero offset...
On the other hand they do handle non-zero offset in their parse function
so they'd be ok with that... Ultimately it's probably closer to a design
choice than anything else.


I'll still send a v0 of the patch as is, because I feel it's easier to
understand that we pull because the existing parse_msg functions do not
handle it properly, and will write a note that I intend to move it up a
few lines as a comment.


Thanks,
-- 
Dominique Martinet


Re: KCM - recvmsg() mangles packets?

2018-08-05 Thread Dominique Martinet
Dominique Martinet wrote on Sun, Aug 05, 2018:
> It's getting late but I'll try adding a pskb_pull in there tomorrow, it
> would be better to make the bpf program start with an offset but I don't
> think that'll be easy to change...

I can confirm the following patch fixes the issue for me:
-8<-
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 625acb27efcc..348ff5945591 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -222,6 +222,16 @@ static int __strp_recv(read_descriptor_t *desc, struct 
sk_buff *orig_skb,
if (!stm->strp.full_len) {
ssize_t len;
 
+   /* Can only parse if there is no offset */
+   if (unlikely(stm->strp.offset)) {
+   if (!pskb_pull(skb, stm->strp.offset)) {
+   STRP_STATS_INCR(strp->stats.mem_fail);
+   strp_parser_err(strp, -ENOMEM, desc);
+   break;
+   }
+   stm->strp.offset = 0;
+   }
+
len = (*strp->cb.parse_msg)(strp, head);
 
if (!len) {
8<--

Now, I was looking at other users of strparser (I see sockmap, kcm and
tls) and it looks like sockmap does not handle offsets either but tls
does by using skb_copy_bits -- they're copying the tls header to a
buffer on the stack.

kcm cannot do that because we do not know how much data the user expects
to read, and I'm not comfortable doing pskb_pull in the kcm callback
either, but the cost of this pull is probably non-negligible if some
user can make do without it...

On the other hand, I do not see how to make the bpf program handle an
offset in the skb as that offset is strparser-specific.

Maybe add a flag in the cb that specifies wether the callback allows
non-zero offset?


I'll let you see if you can reproduce this and will wait for advices on
how to solve this properly so we can work on a proper fix.


Thanks,
-- 
Dominique


Re: KCM - recvmsg() mangles packets?

2018-08-05 Thread Dominique Martinet
Dominique Martinet wrote on Sun, Aug 05, 2018:
> (I'm not sure about offset, since we pass the full skb to parse message,
> wouldn't it look at the start of the buffer everytime? Well, offset
> seems to be 0 everytime the first time that check fails so I can
> probably ignore that for now...)

Oh, this might actually not have been such a bad remark; if I have the
client write two "messages" in a single write() kcm seems to reliably
fail the same way...

Conversely, if I setsockopt(s, IPPROTO_TCP, TCP_NODELAY...) on the
sender socket, *and* make it wait till the kcm socket has been created
to start sending data, then it dramatically reduces the probability of
this happening (I had to let the reproducer run in a loop for 5 minutes,
wheras it used to happen within seconds).

So I think the problem is packet aggregation, and strparser not handling
that properly... The first packet still fails with TCP_NODELAY but
there's probably aggregation on the recv side as well before the socket
is attached to the multiplexor... I guess the low probability failure
that is still happening could be similar.
(I also noticed that I've mistakedly believed that the problem was the
first packet contained the 2nd packet's data because of an off-by-one in
the receiver, it really is the second packet, it only has the first
packet's length)



I've moved my check from kcm_rcv_strparser to just before parse_msg in
__strp_recv and surely enough it fails everytime there's an offset.


It's getting late but I'll try adding a pskb_pull in there tomorrow, it
would be better to make the bpf program start with an offset but I don't
think that'll be easy to change...



I'm almost done spamming, thanks for being a good rubber duck! :p
-- 
Dominique Martinet



Re: KCM - recvmsg() mangles packets?

2018-08-04 Thread Dominique Martinet
Dominique Martinet wrote on Sat, Aug 04, 2018:
> I talked too fast, I can get this to fail on later packets e.g.
> Got 18, expected 31 on 452nd message: 453453453453453453; flags: 80
> 
> The content is 453 in a loop so this really is the 453rd packet...
> 
> But being slower e.g. doing that usleep after every single packets and I
> could let the loop run until 100k without a hintch.
> 
> 
> There really has to be something wrong, I just can't tell what from
> looking at the code with my naive eyes.

I've spent a few hours debugging this weekend, I'm still not confident I
understand how this all works but it's a bit better than Friday.


I've tried updating kcm_recvmsg()'s use of kcm_wait_data() to use
skb_dequeue() instead of skb_peek(), with the idea that if the skb is
off the list it would have less chance to be messed with, and that
didn't seem to change anything so I was probably looking at the wrong
place.


So I went one step up to how we split the packet and the interface with
strparser and added an extra check of the protocol in kcm_rcv_strparser.

If I understand correctly, with stm = strp_msg(skb), stm->full_len
contains the output of the bpf progran and must match something like
this in the network byte-order version:
ntohl(*((u32 *)(skb->data + stm->offset)))

(I'm not sure about offset, since we pass the full skb to parse message,
wouldn't it look at the start of the buffer everytime? Well, offset
seems to be 0 everytime the first time that check fails so I can
probably ignore that for now...)


Just like the test program, that extra check seems to work but will fail
everytime the test program detects and error, although the data I can
see does not always match what the program sees, the packet would seem
to already be incorrect when it is added to the kcm queue... So there
must be something happening to the skb between parse_msg() and rcv_msg()
in __strp_recv() ...?

>From what I can see it's looking at a cloned skb so the headers are
private but the data is shared, but it looks like the tcp socket (csock
in kcm_attach) is properly locked at this time, so I'm not sure what
could corrupt the buffer.



I'll keep looking at this a bit more but any help is appreciated,

Thanks,
-- 
Dominique Martinet | Asmadeus


Re: KCM - recvmsg() mangles packets?

2018-08-03 Thread Dominique Martinet
Dominique Martinet wrote on Sat, Aug 04, 2018:
> Actually, now I'm looking closer to the timing, it looks specific to the
> connection setup. This send loop works:
> int i = 1;
> while(i <= 1000) {
> int len = (i++ * 1312739ULL) % 31 + 1;
> my_msg.hdr.len = htonl(len);
> for (int j = 0; j < len; ) {
> j += snprintf(my_msg.data + j, len - j,
>   "%i", i - 1);
> }
> my_msg.data[len-1] = '\0';
> //printf("%d: writing %d\n", i-1, len);
> len = write(s, &my_msg, sizeof(my_msg.hdr) + len);
> if (error == -1)
> err(EXIT_FAILURE, "write");
> if (i == 2)
> usleep(1);
> }
> 
> But removing the usleep(1) after the first packet makes recvmsg()
> "fail": it reads the content of the second packet with the size of the
> first.

I talked too fast, I can get this to fail on later packets e.g.
Got 18, expected 31 on 452nd message: 453453453453453453; flags: 80

The content is 453 in a loop so this really is the 453rd packet...

But being slower e.g. doing that usleep after every single packets and I
could let the loop run until 100k without a hintch.


There really has to be something wrong, I just can't tell what from
looking at the code with my naive eyes.
Maybe we need to lock both the tcp and the kcm sockets?


Thanks,
-- 
Dominique


Re: KCM - recvmsg() mangles packets?

2018-08-03 Thread Dominique Martinet
Tom Herbert wrote on Fri, Aug 03, 2018:
> On Fri, Aug 3, 2018 at 4:20 PM, Dominique Martinet
>  wrote:
> > Tom Herbert wrote on Fri, Aug 03, 2018:
> >> struct my_proto {
> >>struct _hdr {
> >>uint32_t len;
> >> } hdr;
> >> char data[32];
> >> } __attribute__((packed));
> >>
> >> // use htons to use LE header size, since load_half does a first convertion
> >> // from network byte order
> >> const char *bpf_prog_string = " \
> >> ssize_t bpf_prog1(struct __sk_buff *skb) \
> >> { \
> >> return bpf_htons(load_half(skb, 0)) + 4; \
> >> }";
> >
> > (Just to make sure I did fix it to htonl(load_word()) and I can confirm
> > there is no difference)
> 
> You also need to htonl for
> 
> my_msg.hdr.len = (i++ * 1312739ULL) % 31 + 1;

Thanks, but this looks correct to me - I was writing the header in
little endian order here and doing the double-swap dance in the bpf prog
because the protocol I was considering making a KCM implementation for
uses that.

Just to make sure, I rewrote it using network byte order e.g. these
three points and this makes no difference:
---8<--
diff --git a/kcm.c b/kcm.c
index cb48df1..d437226 100644
--- a/kcm.c
+++ b/kcm.c
@@ -36,7 +36,7 @@ struct my_proto {
 const char *bpf_prog_string = "\
 ssize_t bpf_prog1(struct __sk_buff *skb)   \
 {  \
-   return bpf_htons(load_half(skb, 0)) + 4;\
+   return load_word(skb, 0) + 4;   \
 }";
 
 int servsock_init(int port)
@@ -110,13 +110,15 @@ void client(int port)
 
int i = 1;
while(1) {
-   my_msg.hdr.len = (i++ * 1312739ULL) % 31 + 1;
-   for (int j = 0; j < my_msg.hdr.len; ) {
-   j += snprintf(my_msg.data + j, my_msg.hdr.len - j, 
"%i", i - 1);
+   int len = (i++ * 1312739ULL) % 31 + 1;
+   my_msg.hdr.len = htonl(len);
+   for (int j = 0; j < len; ) {
+   j += snprintf(my_msg.data + j, len - j,
+ "%i", i - 1);
}
-   my_msg.data[my_msg.hdr.len-1] = '\0';
-   //printf("%d: writing %d\n", i-1, my_msg.hdr.len);
-   len = write(s, &my_msg, sizeof(my_msg.hdr) + my_msg.hdr.len);
+   my_msg.data[len-1] = '\0';
+   //printf("%d: writing %d\n", i-1, len);
+   len = write(s, &my_msg, sizeof(my_msg.hdr) + len);
if (error == -1)
err(EXIT_FAILURE, "write");
//usleep(1);
@@ -171,9 +173,10 @@ void process(int kcmfd)
len = recvmsg(kcmfd, &msg, 0);
if (len == -1)
err(EXIT_FAILURE, "recvmsg");
-   if (len != my_msg.hdr.len + 4) {
+   if (len != ntohl(my_msg.hdr.len) + 4) {
printf("Got %d, expected %d on %dth message: %s; flags:
%x\n",
-  len - 4, my_msg.hdr.len, i, my_msg.data, 
msg.msg_flags);
+  len - 4, ntohl(my_msg.hdr.len), i,
+  my_msg.data, msg.msg_flags);
exit(1);
}
i++;
8<---

Frankly I do not believe this is a rule problem, as if the length
splitting was incorrect the program would not work at all, but just
uncommenting the usleep on the sender side makes this work.

Actually, now I'm looking closer to the timing, it looks specific to the
connection setup. This send loop works:
int i = 1;
while(i <= 1000) {
int len = (i++ * 1312739ULL) % 31 + 1;
my_msg.hdr.len = htonl(len);
for (int j = 0; j < len; ) {
j += snprintf(my_msg.data + j, len - j,
  "%i", i - 1);
}
my_msg.data[len-1] = '\0';
//printf("%d: writing %d\n", i-1, len);
len = write(s, &my_msg, sizeof(my_msg.hdr) + len);
if (error == -1)
err(EXIT_FAILURE, "write");
if (i == 2)
usleep(1);
}

But removing the usleep(1) after the first packet makes recvmsg()
"fail": it reads the content of the second packet with the size of the
first.


I assume that usleep gives the server time to finish setting up the kcm
socket, because it does accept(); ioctl(SIOCKCMATTACH); recvmsg(); but
the client does not wait to send packets so there could be some sort of
race with the attach and multiple packets?


FWIW I took the time to look at older kernel and this has been happening
ever since KCM got introduced in 4.6


Thanks,
-- 
Dominique


Re: KCM - recvmsg() mangles packets?

2018-08-03 Thread Dominique Martinet
Tom Herbert wrote on Fri, Aug 03, 2018:
> struct my_proto {
>struct _hdr {
>uint32_t len;
> } hdr;
> char data[32];
> } __attribute__((packed));
> 
> // use htons to use LE header size, since load_half does a first convertion
> // from network byte order
> const char *bpf_prog_string = " \
> ssize_t bpf_prog1(struct __sk_buff *skb) \
> { \
> return bpf_htons(load_half(skb, 0)) + 4; \
> }";
> 
> The length in hdr is uint32_t above, but this looks like it's being
> read as a short.

Err, I agree this is obviously wrong here (I can blame my lack of
attention to this and the example I used), but this isn't the problem as
the actual size is between 0 and 32 -- I could use any size I want here
and the result would the same.

A "real" problem with the conversion program would mean that my example
would not work if I slow it down, but I can send as many packet as I
want if I uncomment the usleep() on the client side or if I just
throttle the network stack with a loud tcpdump writing to stdout -- that
means the algorithm is working even if it's making some badly-sized
conversions.

(Just to make sure I did fix it to htonl(load_word()) and I can confirm
there is no difference)


Thanks,
-- 
Dominique Martinet


KCM - recvmsg() mangles packets?

2018-08-03 Thread Dominique Martinet
I've been playing with KCM on a 4.18.0-rc7 kernel and I'm running in a
problem where the iovec filled by recvmsg() is mangled up: it is filled
by the length of one packet, but contains (truncated) data from another
packet, rendering KCM unuseable.

(I haven't tried old kernels to see for how long this is broken/try to
bisect; I might if there's no progress but this might be simpler than I
think)


I've attached a reproducer, a simple program that forks, creates a tcp
server/client, attach the server socket to a kcm socket, and in an
infinite loop sends varying-length messages from the client to the
server.
The loop stops when the server gets a message which length is not the
length indicated in the packet header, rather fast (I can make it run
for a while if I slow down emission, or if I run a verbose tcpdump for
example)

In the quiet version on a VM on my laptop, I get this output:
[root@f2 ~]# gcc -g -l bcc -o kcm kcm.c
[root@f2 ~]# ./kcm 
client is starting
server is starting
server is receiving data
Got 14, expected 27 on 1th message: 22; flags: 80

The client sends message deterministacally, first one is 14 bytes filled
with 1, second one is 27 bytes filled with 2, third one is 9 bytes
filled with 3 etc (final digit is actually a \0 instead)

As we can see, the server received 14 '2', and the header size matches
the second message header, so something went wrong™.
Flags 0x80 is MSG_EOR meaning recvmsg copied the full message.



This happens even if I reduce the VMs CPU to 1, so I was thinking some
irq messes with the sock between skb_peek and the actual copy of the
data (as this deos work if I send slowly!), but even disabling
irq/preempt doesn't seem to help so I'm not sure what to try next.

Any idea?


Thanks,
-- 
Dominique Martinet
/*
 * A sample program of KCM.
 * Originally https://gist.github.com/peo3/fd0e266a3852d3422c08854aba96bff5
 *
 * $ gcc -lbcc kcm-sample.c
 * $ ./a.out 1
 */
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 


struct my_proto {
	struct _hdr {
		uint32_t len;
	} hdr;
	char data[32];
} __attribute__((packed));

// use htons to use LE header size, since load_half does a first convertion
// from network byte order
const char *bpf_prog_string = "\
ssize_t bpf_prog1(struct __sk_buff *skb)		\
{			\
	return bpf_htons(load_half(skb, 0)) + 4;	\
}";

int servsock_init(int port)
{
	int s, error;
	struct sockaddr_in addr;

	s = socket(AF_INET, SOCK_STREAM, 0);

	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);
	addr.sin_addr.s_addr = INADDR_ANY;
	error = bind(s, (struct sockaddr *)&addr, sizeof(addr));
	if (error == -1)
		err(EXIT_FAILURE, "bind");

	error = listen(s, 10);
	if (error == -1)
		err(EXIT_FAILURE, "listen");

	return s;
}

int bpf_init(void)
{
	int fd, map_fd;
	void *mod;
	int key;
	long long value = 0;

	mod = bpf_module_create_c_from_string(bpf_prog_string, 0, NULL, 0);
	fd = bpf_prog_load(
		BPF_PROG_TYPE_SOCKET_FILTER,
		"bpf_prog1",
		bpf_function_start(mod, "bpf_prog1"),
		bpf_function_size(mod, "bpf_prog1"),
		bpf_module_license(mod),
		bpf_module_kern_version(mod),
		0, NULL, 0);

	if (fd == -1)
		exit(1);
	return fd;	
}

void client(int port)
{
	int s, error;
	struct sockaddr_in addr;
	struct hostent *host;
	struct my_proto my_msg;
	int len;

	printf("client is starting\n");

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		err(EXIT_FAILURE, "socket");

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);
	host = gethostbyname("localhost");
	if (host == NULL)
		err(EXIT_FAILURE, "gethostbyname");
	memcpy(&addr.sin_addr, host->h_addr, host->h_length);

	error = connect(s, (struct sockaddr *)&addr, sizeof(addr));
	if (error == -1)
		err(EXIT_FAILURE, "connect");

	len = sprintf(my_msg.data, "1234567890123456789012345678901");
	my_msg.data[len] = '\0';
	my_msg.hdr.len = len + 1;

	int i = 1;
	while(1) {
		my_msg.hdr.len = (i++ * 1312739ULL) % 31 + 1;
		for (int j = 0; j < my_msg.hdr.len; ) {
			j += snprintf(my_msg.data + j, my_msg.hdr.len - j, "%i", i - 1);
		}
		my_msg.data[my_msg.hdr.len-1] = '\0';
		//printf("%d: writing %d\n", i-1, my_msg.hdr.len);
		len = write(s, &my_msg, sizeof(my_msg.hdr) + my_msg.hdr.len);
		if (error == -1)
			err(EXIT_FAILURE, "write");
		//usleep(1);
	}

	close(s);
}

int kcm_init(void)
{
	int kcmfd;

	kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
	if (kcmfd == -1)
		err(EXIT_FAILURE, "socket(AF_KCM)");

	return kcmfd;
}

int kcm_attach(int kcmfd, int csock, int bpf_prog_fd)
{
	int error;
	struct kcm_attach attach_info = {
		.fd = csock,
		.bpf_fd = bpf_prog_fd,
	};

	error = ioctl(k

Re: [PATCH PATCH net-next 10/18] 9p: fix whitespace issues

2018-07-29 Thread Dominique Martinet
(removed tons of Cc to make v9fs-developer@ happy - please do include
linux-ker...@vger.kernel.org next time though, I only saw this patch by
chance snooping at netdev)

Stephen Hemminger wrote on Tue, Jul 24, 2018:
> Remove trailing whitespace and blank lines at EOF
> 
> Signed-off-by: Stephen Hemminger 

LGTM, I'm not sure if someone would pick the whole series but as other
maintainers seem to have taken individual patches I've taken this one as
there's potential conflicts with what I have (planning to remove the
whole net/9p/util.c file)

> ---
>  net/9p/client.c   | 4 ++--
>  net/9p/trans_virtio.c | 2 +-
>  net/9p/util.c | 1 -
>  3 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 5c1343195292..ff02826e0407 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -341,7 +341,7 @@ struct p9_req_t *p9_tag_lookup(struct p9_client *c, u16 
> tag)
>* buffer to read the data into */
>   tag++;
>  
> - if(tag >= c->max_tag) 
> + if (tag >= c->max_tag)
>   return NULL;
>  
>   row = tag / P9_ROW_MAXTAG;
> @@ -1576,7 +1576,7 @@ p9_client_read(struct p9_fid *fid, u64 offset, struct 
> iov_iter *to, int *err)
>   int count = iov_iter_count(to);
>   int rsize, non_zc = 0;
>   char *dataptr;
> - 
> +
>   rsize = fid->iounit;
>   if (!rsize || rsize > clnt->msize-P9_IOHDRSZ)
>   rsize = clnt->msize - P9_IOHDRSZ;
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 05006cbb3361..279b24488d79 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -446,7 +446,7 @@ p9_virtio_zc_request(struct p9_client *client, struct 
> p9_req_t *req,
>   out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
> out_pages, out_nr_pages, offs, outlen);
>   }
> - 
> +
>   /*
>* Take care of in data
>* For example TREAD have 11.
> diff --git a/net/9p/util.c b/net/9p/util.c
> index 59f278e64f58..55ad98277e85 100644
> --- a/net/9p/util.c
> +++ b/net/9p/util.c
> @@ -138,4 +138,3 @@ int p9_idpool_check(int id, struct p9_idpool *p)
>   return idr_find(&p->pool, id) != NULL;
>  }
>  EXPORT_SYMBOL(p9_idpool_check);
> -

-- 
Dominique Martinet


Re: tcp hang when socket fills up ?

2018-04-18 Thread Dominique Martinet
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> Thanks for the testing! One more line is required, however: we have to get 
> the assured bit set for the connection, see the new patch below.

I think it actually was better before. If I understand things correctly
at this point (when we get in the case TCP_CONNTRACK_SYN_RECV) we will
have seen SYN(out) SYN(in) SYNACK(out), but not the final ACK(in) yet.

Leaving old state as it was will not set the assured bit, but that will
be set on the next packet because old_state == new_state == established
at that point and the connection will really be setup then.


I don't think anything will blow up if we do either way, but strictly
speaking I'm more comfortable with the former.
I'll test the new patch regardless, I left work so can't reproduce
anymore but will yell tomorrow if it does explode ;)


> The tcp_conntracks state table could be fixed with introducing a new 
> state, but that part is exposed to userspace (ctnetlink) and ugly 
> compatibility code would be required for backward compatibility.

I agree a new state is more work than it is worth, I'm happy to leave it
as is.

-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-18 Thread Dominique Martinet
Dominique Martinet wrote on Wed, Apr 18, 2018:
> Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> > Yes, the state transition is wrong for simultaneous open, because the 
> > tcp_conntracks table is not (cannot be) smart enough. Could you verify the 
> > next untested patch?
> 
> Thanks for the patch; I'll give it a try (probably won't make it today
> so will report tomorrow)

Actually had time; I can confirm (added printks) we did get in that if
that was pointed at, and we no longer get there now.
The connection no longer gets in invalid state, so that looks like it
nailed it.

I'm now confused what this has to do with tcp_timestamp though, since
setting that off also seemed to work around the issue, but if we get
something like that in I'll be happy anyway.


Big props to everyone involved, I would have taken much longer alone,
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-18 Thread Dominique Martinet
Jozsef Kadlecsik wrote on Wed, Apr 18, 2018:
> Yes, the state transition is wrong for simultaneous open, because the 
> tcp_conntracks table is not (cannot be) smart enough. Could you verify the 
> next untested patch?

Thanks for the patch; I'll give it a try (probably won't make it today
so will report tomorrow)

-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-17 Thread Dominique Martinet
Michal Kubecek wrote on Tue, Apr 17, 2018:
> Data (21 bytes) packet in reply direction. And somewhere between the
> first and second debugging print, we ended up with sender scale=0 and
> that value is then preserved from now on.
> 
> The only place between the two debug prints where we could change only
> one of the td_sender values are the two calls to tcp_options() but
> neither should be called now unless I missed something. I'll try to
> think about it some more.

Could it have something to do with the way I setup the connection?
I don't think the "both remotes call connect() with carefully selected
source/dest port" is a very common case..

If you look at the tcpdump outputs I attached the sequence usually is
something like
 server > client SYN
 client > server SYN
 server > client SYNACK
 client > server ACK

ultimately it IS a connection, but with an extra SYN packet in front of
it (that first SYN opens up the conntrack of the nat so that the
client's syn can come in, the client's conntrack will be that of a
normal connection since its first SYN goes in directly after the
server's (it didn't see the server's SYN))


Looking at my logs again, I'm seeing the same as you:

This looks like the actual SYN/SYN/SYNACK/ACK:
 - 14.364090 seq=505004283 likely SYN coming out of server
 - 14.661731 seq=1913287797 on next line it says receiver
end=505004284 so likely the matching SYN from client
Which this time gets a proper SYNACK from server:
14.662020 seq=505004283 ack=1913287798
And following final dataless ACK:
14.687570 seq=1913287798 ack=505004284

Then as you point out some data ACK, where the scale poofs:
14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287819
14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
14.688824 tcp_in_window: 
14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
end=1913287819
14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7

As you say, only tcp_options() will clear only on side of the scales.
We don't have sender->td_maxwin == 0 (printed) so I see no other way
than we are in the last else if:
 - we have after(end, sender->td_end) (end=1913287819 > sender
end=1913287798)
 - I assume the tcp state machine must be confused because of the
SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
but since this is a data packet it doesn't have the tcp option for scale
thus scale resets.


At least peeling the logs myself helped me follow the process, I'll
sprinkle some carefully crafted logs tomorrow to check if this is true
and will let you figure what is best of trying to preserve scale if it
was set before, setting a default to 14 or something else.

Thanks!
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Dominique Martinet wrote on Mon, Apr 16, 2018:
> . . . Oh, there is something interesting there, the connection doesn't
> come up with -G?

Hm, sorry, I take this last part back. I cannot reproduce -G not working
reliably.
I'll dig around the conntrack table a bit more.


-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
rack -G --protonum tcp --src server --dst client --sport 15080 --dport 
32979
tcp  6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979 
src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=3
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown.

hang:
# conntrack -G --protonum tcp --src server --dst client --sport 21308 --dport 
37552
conntrack v1.4.4 (conntrack-tools): 0 flow entries have been shown.

So something happened that makes it show up in -L (table dump) but not
when querying...?
And only when there is enough traffic: I have previously kept such a
connection without workaround for hours just fine as long as I made sure
not to display more than a screen at a time.




Thanks again,
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
bled (after running my
reproducer, e.g. outputing a big chunk of text quickly):
 reno wscale:7,7 rto:228 rtt:27.267/1.423 ato:40 mss:1386 pmtu:1500
 rcvmss:1248 advmss:1460 cwnd:10 ssthresh:12 bytes_acked:17311070
 bytes_received:40445 segs_out:13331 segs_in:7523
 data_segs_out:13279 data_segs_in:947 send 4.1Mbps lastsnd:5
 lastrcv:5 lastack:5 pacing_rate 4.9Mbps delivery_rate 431.4Kbps
 app_limited busy:36064ms unacked:1 retrans:0/6 rcv_rtt:9112.95
 rcv_space:29233 rcv_ssthresh:39184 minrtt:25.566

So I guess lost:33 matching unacked:33 might be another hint?


I'll need a bit more time reading the code to understand what this all
implies ; feel free to beat me to it.

Thanks,
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet
Eric Dumazet wrote on Fri, Apr 13, 2018:
> Ah sorry, your trace was truncated, we need more packets _before_ the excerpt.

Ah, sorry as well. I tried to go far back enough to include the replayed
packets, but I see it didn't include the original ack for that packet.
I'm resending both full traces from connect to fin this time.

> That might be caused by some TS val/ecr breakage :
> 
> Many acks were received by the server tcpdump,
> but none of them was accepted by TCP stack, for some reason.
> 
> Try to disable TCP timestamps, it will give some hint if bug does not 
> reproduce.

I unfortunately cannot reproduce on a local network, but will give this
a try on Monday and report. (sysctl net.ipv4.tcp_timestamps=0, correct?)


Here's the traces again, server-capture first:
16:49:22.374452 IP .13317 > .31872: Flags 
[S], seq 2026966826, win 29200, options [mss 1460,sackOK,TS val 1313933281 ecr 
0,nop,wscale 7], length 0
16:49:22.700553 IP .31872 > .13317: Flags 
[S], seq 882075258, win 29200, options [mss 1386,sackOK,TS val 1617125446 ecr 
0,nop,wscale 7], length 0
16:49:22.700577 IP .13317 > .31872: Flags 
[S.], seq 2026966826, ack 882075259, win 29200, options [mss 1460,sackOK,TS val 
1313933607 ecr 1617125446,nop,wscale 7], length 0
16:49:22.727131 IP .31872 > .13317: Flags 
[.], ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 1313933607], length 0
16:49:22.729193 IP .31872 > .13317: Flags 
[P.], seq 1:22, ack 1, win 229, options [nop,nop,TS val 1617125472 ecr 
1313933607], length 21
16:49:22.729230 IP .13317 > .31872: Flags 
[.], ack 22, win 229, options [nop,nop,TS val 1313933635 ecr 1617125472], 
length 0
16:49:22.732726 IP .13317 > .31872: Flags 
[P.], seq 1:22, ack 22, win 229, options [nop,nop,TS val 1313933639 ecr 
1617125472], length 21
16:49:22.759809 IP .31872 > .13317: Flags 
[.], ack 22, win 229, options [nop,nop,TS val 1617125503 ecr 1313933639], 
length 0
16:49:22.759835 IP .13317 > .31872: Flags 
[P.], seq 22:734, ack 22, win 229, options [nop,nop,TS val 1313933666 ecr 
1617125503], length 712
16:49:22.761643 IP .31872 > .13317: Flags 
[P.], seq 22:1270, ack 22, win 229, options [nop,nop,TS val 1617125504 ecr 
1313933639], length 1248
16:49:22.801860 IP .13317 > .31872: Flags 
[.], ack 1270, win 248, options [nop,nop,TS val 1313933708 ecr 1617125504], 
length 0
16:49:22.826902 IP .31872 > .13317: Flags 
[.], ack 734, win 240, options [nop,nop,TS val 1617125572 ecr 1313933666], 
length 0
16:49:22.827576 IP .31872 > .13317: Flags 
[P.], seq 1270:1318, ack 734, win 240, options [nop,nop,TS val 1617125573 ecr 
1313933708], length 48
16:49:22.827600 IP .13317 > .31872: Flags 
[.], ack 1318, win 248, options [nop,nop,TS val 1313933734 ecr 1617125573], 
length 0
16:49:22.833028 IP .13317 > .31872: Flags 
[P.], seq 734:1122, ack 1318, win 248, options [nop,nop,TS val 1313933739 ecr 
1617125573], length 388
16:49:22.858455 IP .31872 > .13317: Flags 
[.], ack 1122, win 251, options [nop,nop,TS val 1617125604 ecr 1313933739], 
length 0
16:49:22.865866 IP .31872 > .13317: Flags 
[P.], seq 1318:1334, ack 1122, win 251, options [nop,nop,TS val 1617125612 ecr 
1313933739], length 16
16:49:22.906865 IP .13317 > .31872: Flags 
[.], ack 1334, win 248, options [nop,nop,TS val 1313933813 ecr 1617125612], 
length 0
16:49:22.944474 IP .31872 > .13317: Flags 
[P.], seq 1334:1386, ack 1122, win 251, options [nop,nop,TS val 1617125678 ecr 
1313933813], length 52
16:49:22.944497 IP .13317 > .31872: Flags 
[.], ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 1617125678], 
length 0
16:49:22.944747 IP .13317 > .31872: Flags 
[P.], seq 1122:1174, ack 1386, win 248, options [nop,nop,TS val 1313933851 ecr 
1617125678], length 52
16:49:22.971083 IP .31872 > .13317: Flags 
[P.], seq 1386:1454, ack 1174, win 251, options [nop,nop,TS val 1617125716 ecr 
1313933851], length 68
16:49:22.971607 IP .13317 > .31872: Flags 
[P.], seq 1174:1258, ack 1454, win 248, options [nop,nop,TS val 1313933878 ecr 
1617125716], length 84
16:49:22.998201 IP .31872 > .13317: Flags 
[P.], seq 1454:2082, ack 1258, win 251, options [nop,nop,TS val 1617125742 ecr 
1313933878], length 628
16:49:22.998987 IP .13317 > .31872: Flags 
[P.], seq 1258:1838, ack 2082, win 268, options [nop,nop,TS val 1313933905 ecr 
1617125742], length 580
16:49:23.065102 IP .31872 > .13317: Flags 
[.], ack 1838, win 262, options [nop,nop,TS val 1617125810 ecr 1313933905], 
length 0
16:49:24.811297 IP .31872 > .13317: Flags 
[P.], seq 2082:3254, ack 1838, win 262, options [nop,nop,TS val 1617127527 ecr 
1313933905], length 1172
16:49:24.812562 IP .13317 > .31872: Flags 
[P.], seq 1838:1874, ack 3254, win 287, options [nop,nop,TS val 1313935719 ecr 
1617127527], length 36
16:49:24.838210 IP .31872 > .13317: Flags 
[.], ack 1874, win 262, options [nop,nop,TS val 1617127583 ecr 1313935719], 
length 0
16:49:24.838236 IP .13317 > .31872: Flags 
[P.], seq 1874:2534, ack 3254, win 287, options [nop,nop,TS val 1313935744 ecr 
1617127583], length 660
16:49:24.839175 I

Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet
seq 64980:66354, ack 4190, win 307, options [nop,nop,TS val 1313937638 ecr 
1617129473], length 1374
16:49:26.754585 IP .31872 > .13317: Flags 
[.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], 
length 0
16:49:26.755260 IP .13317 > .31872: Flags 
[.], seq 66354:67728, ack 4190, win 307, options [nop,nop,TS val 1313937638 ecr 
1617129473], length 1374
16:49:26.755271 IP .31872 > .13317: Flags 
[.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], 
length 0
16:49:26.755493 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 
ecr 1313937638], length 36
16:49:26.756076 IP .13317 > .31872: Flags 
[.], seq 67728:69102, ack 4190, win 307, options [nop,nop,TS val 1313937639 ecr 
1617129473], length 1374
16:49:26.756391 IP .13317 > .31872: Flags 
[.], seq 69102:70476, ack 4190, win 307, options [nop,nop,TS val 1313937639 ecr 
1617129473], length 1374
16:49:26.756402 IP .31872 > .13317: Flags 
[.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], 
length 0
16:49:26.757020 IP .13317 > .31872: Flags 
[.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 
1617129473], length 1374
16:49:26.757349 IP .13317 > .31872: Flags 
[.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 
1617129473], length 1374
16:49:26.757360 IP .31872 > .13317: Flags 
[.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], 
length 0
16:49:26.757878 IP .13317 > .31872: Flags 
[.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 
1617129473], length 1374
16:49:26.758457 IP .13317 > .31872: Flags 
[.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 
1617129473], length 1374
16:49:26.758465 IP .31872 > .13317: Flags 
[.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], 
length 0
16:49:26.828268 IP .13317 > .31872: Flags 
[.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 
1617129473], length 1374
16:49:26.868867 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], 
length 0
16:49:27.010733 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 
ecr 1313937714], length 36
16:49:27.070035 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 
1617129473], length 1374
16:49:27.070062 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:27.266769 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 
ecr 1313937714], length 36
16:49:27.540998 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 
1617129473], length 1374
16:49:27.541022 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:27.778687 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 
ecr 1313937714], length 36
16:49:28.532876 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 
1617129473], length 1374
16:49:28.532907 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:28.802879 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 
ecr 1313937714], length 36
16:49:30.452830 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 
1617129473], length 1374
16:49:30.452861 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:30.850734 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 
ecr 1313937714], length 36
16:49:34.229420 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 
1617129473], length 1374
16:49:34.229447 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:35.202854 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 
ecr 1313937714], length 36
16:49:42.165044 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 
1617129473], length 1374
16:49:42.165073 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:43.394773 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 
ecr 1313937714], length 36


Once again, thank you both.
-- 
Dominique Martinet | Asmadeus


Re: tcp hang when socket fills up ?

2018-04-13 Thread Dominique Martinet

Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen
any patch about that kind of stuff recently so probably still valid)

Other main difference is the qdisc, VM is using fq_codel, host is
directly on wireless so mq with 4 pfifo_fast queues - it is harder to
reproduce on the VM (even on another VM with the same kernel) so I'd
put the difference down to the qdisc, but I was able to reproduce with
both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that are blocked, since
this is a VM I even get gdb in bonus..

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in

With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I
think that's the function that should handle my ack, and that gets the
replay.

First, abusing next, the flow seems to be something like
(I folded if/else not taken)

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{   
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_sacktag_state sack_state;
struct rate_sample rs = { .prior_delivered = 0 };
u32 prior_snd_una = tp->snd_una;
bool is_sack_reneg = tp->is_sack_reneg;
u32 ack_seq = TCP_SKB_CB(skb)->seq;
u32 ack = TCP_SKB_CB(skb)->ack_seq;
bool is_dupack = false;
int prior_packets = tp->packets_out;
u32 delivered = tp->delivered;
u32 lost = tp->lost;
int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover
losses */
u32 prior_fack;

sack_state.first_sackt = 0;
sack_state.rate = &rs;

/* We very likely will need to access rtx queue. */
prefetch(sk->tcp_rtx_queue.rb_node);

/* If the ack is older than previous acks
 * then we can probably ignore it.
 */
if (before(ack, prior_snd_una)) {
}

/* If the ack includes data we haven't sent yet, discard
 * this segment (RFC793 Section 3.9).
 */
if (after(ack, tp->snd_nxt))

if (after(ack, prior_snd_una)) {
}

prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) :
tp->snd_una;
rs.prior_in_flight = tcp_packets_in_flight(tp);

/* ts_recent update must be made after we are sure that the
packet   
 * is in window.
 */
if (flag & FLAG_UPDATE_TS_RECENT)

if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
} else {
u32 ack_ev_flags = CA_ACK_SLOWPATH;

if (ack_seq != TCP_SKB_CB(skb)->end_seq)
else
NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPPUREACKS);

flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

if (TCP_SKB_CB(skb)->sacked)

if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
}

if (flag & FLAG_WIN_UPDATE)

tcp_in_ack_event(sk, ack_ev_flags);
}

/* We passed data and got it acked, remove any soft error
 * log. Something worked...
 */
sk->sk_err_soft = 0;
icsk->icsk_probes_out = 0;
tp->rcv_tstamp = tcp_jiffies32;
if (!prior_packets)
goto no_queue;

no_queue:
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK)
tcp_fastretrans_alert(sk, prior_snd_una, is_dupack,
&flag,
  &rexmit);
/* If this ack opens up a zero window, clear backoff.  It was
 * being used to time the probes, and is probably far higher
 than
 * it needs to be for normal retransmission.
 */
tcp_ack_probe(sk);

if (tp->tlp_high_seq)
tcp_process_tlp_ack(sk, ack, flag);
return 1;


And here is 'info local' towards th

tcp hang when socket fills up ?

2018-04-06 Thread Dominique Martinet
p,nop,TS val 1617129506 ecr 1313937638], 
length 0
16:49:26.762199 IP .31872 > .13317: Flags 
[.], ack 67728, win 1348, options [nop,nop,TS val 1617129507 ecr 1313937638], 
length 0
16:49:26.763547 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 67728, win 1348, options [nop,nop,TS val 1617129507 
ecr 1313937638], length 36
16:49:26.763553 IP .31872 > .13317: Flags 
[.], ack 70476, win 1393, options [nop,nop,TS val 1617129508 ecr 1313937639], 
length 0
16:49:26.764298 IP .31872 > .13317: Flags 
[.], ack 73224, win 1438, options [nop,nop,TS val 1617129509 ecr 1313937641], 
length 0
16:49:26.764676 IP .31872 > .13317: Flags 
[.], ack 75972, win 1444, options [nop,nop,TS val 1617129510 ecr 1313937643], 
length 0
16:49:26.807754 IP .13317 > .31872: Flags 
[.], seq 75972:77346, ack 4190, win 307, options [nop,nop,TS val 1313937714 ecr 
1617129473], length 1374
16:49:26.876467 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617129620 ecr 1313937714], 
length 0
16:49:27.048760 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937955 ecr 
1617129473], length 1374
16:49:27.051791 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617129762 
ecr 1313937714], length 36
16:49:27.076444 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:27.371182 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 
ecr 1313937714], length 36
16:49:27.519862 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 ecr 
1617129473], length 1374
16:49:27.547662 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:27.883372 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 
ecr 1313937714], length 36
16:49:28.511861 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 ecr 
1617129473], length 1374
16:49:28.538891 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:28.907197 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 
ecr 1313937714], length 36
16:49:30.431864 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 ecr 
1617129473], length 1374
16:49:30.459127 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:30.955388 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 
ecr 1313937714], length 36
16:49:34.207879 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 ecr 
1617129473], length 1374
16:49:34.235726 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:35.256285 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 
ecr 1313937714], length 36
16:49:42.143864 IP .13317 > .31872: Flags 
[.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 ecr 
1617129473], length 1374
16:49:42.171531 IP .31872 > .13317: Flags 
[.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 
1313937714,nop,nop,sack 1 {32004:33378}], length 0
16:49:43.448262 IP .31872 > .13317: Flags 
[P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 
ecr 1313937714], length 36

(I still have the pcap files if someone wants to see them, but I'd
rather not give my work IP publicly so will send it privately on
request)


At this point, only the same 3 packets keep being replayed over and
over... According to 'ss' the Send-Q isn't empty on either side, the
client has some ~1k to send but the server has much more (87k)
After increasing the window size through net.*wmem sysctl it got stuck
with over 1MB in Send-Q, which makes sense because the socket is full...


So, what I don't get is, why are these acks continuously replayed? Given
the timing it looks like the server doesn't take the client acks into
account, despite having received that precise 33378 ack earlier and I
believe it should accept a higher ack value anyway ?

The ultimate question being, how can I go about debugging that?
I'm working on getting perf probe/crash to work on the server so I can
look at the kernel side now, I can reproduce this semi-easily so I'm
sure I'll get down to it eventually, but if anyone has an idea I'm all
ears.


Thanks!
-- 
Dominique Martinet | Asmadeus


Re: [V9fs-developer] [PATCH] net/9p: convert to new CQ API

2016-03-08 Thread Dominique Martinet
Christoph Hellwig wrote on Thu, Mar 03, 2016:
> New version with the nits fixed below.  Now that checkpath started
> a stupid warning about not using tabs for indentation which I've
> ignored here and will take up in my usual fights against Joes
> idicotic opinions separately..

Thanks for the nitpicks, I can confirm it works as expected as well so
all good with me.
I like the new CQ interface :)

(if someone adds an Acked-by please use dominique.marti...@cea.fr for my
mail; sorry for the split personality)

-- 
Dominique


Re: [V9fs-developer] [PATCH] net/9p: convert to new CQ API

2016-02-27 Thread Dominique Martinet
Hi,

Couple of checkpatch complains:

Christoph Hellwig wrote on Sat, Feb 27, 2016:
> -struct p9_rdma_context {
> - enum ib_wc_opcode wc_op;
> +struct p9_rdma_context { 

trailing tab

> - p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, status);
> + p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, 
> wc->status);

line over 80 chars


That aside it looks good ; I need to check on the new API (hadn't
noticed the change) but it looks nice.

Will do the actual testing likely only next week only though;
Eric has been taking my patches for 9p/RDMA so I suspect he'll take
your's as well eventually (get_maintainer.pl has a long-ish list of CC
for us usually)


BTW I think it's easy enough to do the testing if you have a server that
can dish it out. diod[1] and nfs-ganesha[2] are the only two I'm aware
of but there might be more (using ganesha myself; happy to help you set
it up in private if you need)

[1] https://github.com/chaos/diod
[2] https://github.com/nfs-ganesha/nfs-ganesha

-- 
Dominique Martinet


[PATCH v2] 9p: trans_fd, bail out if recv fcall if missing

2015-09-07 Thread Dominique Martinet
req->rc is pre-allocated early on with p9_tag_alloc and shouldn't be missing

Signed-off-by: Dominique Martinet 
---
 net/9p/trans_fd.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

Feel free to adapt error code/message if you can think of something better.

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index a270dcc..a6d89c0 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -356,13 +356,12 @@ static void p9_read_work(struct work_struct *work)
}
 
if (m->req->rc == NULL) {
-   m->req->rc = kmalloc(sizeof(struct p9_fcall) +
-   m->client->msize, GFP_NOFS);
-   if (!m->req->rc) {
-   m->req = NULL;
-   err = -ENOMEM;
-   goto error;
-   }
+   p9_debug(P9_DEBUG_ERROR,
+"No recv fcall for tag %d (req %p), 
disconnecting!\n",
+m->rc.tag, m->req);
+   m->req = NULL;
+   err = -EIO;
+   goto error;
}
m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall);
memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [V9fs-developer] [PATCH] 9p: trans_fd, initialize recv fcall properly if not set

2015-09-05 Thread Dominique Martinet
Eric Van Hensbergen wrote on Sat, Sep 05, 2015:
> On Thu, Sep 3, 2015 at 4:38 AM, Dominique Martinet
>  wrote:
> > To be honest, I think it might be better to just bail out if we get in
> > this switch (m->req->rc == NULL after p9_tag_lookup) and not try to
> > allocate more, because if we get there it's likely a race condition and
> > silently re-allocating will end up in more troubles than trying to
> > recover is worth.
> > Thoughts ?
> >
> 
> Hmmm...trying to rattle my brain and remember why I put it in there
> back in 2008.
> It might have just been over-defensive programming -- or more likely it just
> pre-dated all the zero copy infrastructure which pretty much guaranteed we had
> an rc allocated and what is there is vestigial.  I'm happy to accept a
> patch which
> makes this an assert, or perhaps just resets the connection because something
> has gone horribly wrong (similar to the ENOMEM path that is there now).

Yeah, it looks like the safety comes from the zero-copy stuff that came
much later.
Let's go with resetting the connection then. Hmm. EIO is a bit too
generic so would be good to avoid that if possible, but can't think of
anything better...


Speaking of zero-copy, I believe it should be fairly straight-forward to
implement for trans_fd now I've actually looked at it, since we do the
payload read after a p9_tag_lookup, would just need m->req to point to a
zc buffer. Write is similar, if there's a zc buffer just send it after
the header.
The cost is a couple more pointers in req and an extra if in both
workers, that seems pretty reasonable.

Well, I'm not using trans_fd much here (and unfortunately zero-copy
isn't possible at all given the transport protocol for RDMA, at least
for recv), but if anyone cares it probably could be done without too
much hassle for the fd workers.

-- 
Dominique
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] 9p: trans_fd, initialize recv fcall properly if not set

2015-09-03 Thread Dominique Martinet
That code really should never be called (rc is allocated in
tag_alloc), but if it had been it couldn't have worked...

Signed-off-by: Dominique Martinet 
---
 net/9p/trans_fd.c | 3 +++
 1 file changed, 3 insertions(+)

To be honest, I think it might be better to just bail out if we get in
this switch (m->req->rc == NULL after p9_tag_lookup) and not try to
allocate more, because if we get there it's likely a race condition and
silently re-allocating will end up in more troubles than trying to
recover is worth.
Thoughts ?

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index a270dcc..0d9831a 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -363,6 +363,9 @@ static void p9_read_work(struct work_struct *work)
err = -ENOMEM;
goto error;
}
+   m->req->rc.capacity = m->client->msize;
+   m->req->rc.sdata = (char*)m->req->rc +
+   sizeof(struct p9_fcall);
}
m->rc.sdata = (char *)m->req->rc + sizeof(struct p9_fcall);
memcpy(m->rc.sdata, m->tmp_buf, m->rc.capacity);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] 9p: trans_fd, read rework to use p9_parse_header

2015-09-03 Thread Dominique Martinet
Most of the changes here are no-op and just renaming to use a
fcall struct, needed for p9_parse_header

It fixes the unaligned memory access to read the tag and defers to
common functions for part of the protocol knowledge (although header
length is still hard-coded...)

Reported-By: Rob Landley 
Signed-Off-By: Dominique Martinet 
---
 net/9p/trans_fd.c | 75 +--
 1 file changed, 40 insertions(+), 35 deletions(-)

It ended up alot bigger than I thought it'd be, submitting it anyway
but happy with either version - letting Eric decide what's better :)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index bced8c0..a270dcc 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -108,9 +108,7 @@ struct p9_poll_wait {
  * @unsent_req_list: accounting for requests that haven't been sent
  * @req: current request being processed (if any)
  * @tmp_buf: temporary buffer to read in header
- * @rsize: amount to read for current frame
- * @rpos: read position in current frame
- * @rbuf: current read buffer
+ * @rc: temporary fcall for reading current frame
  * @wpos: write position for current frame
  * @wsize: amount of data to write for current frame
  * @wbuf: current write buffer
@@ -131,9 +129,7 @@ struct p9_conn {
struct list_head unsent_req_list;
struct p9_req_t *req;
char tmp_buf[7];
-   int rsize;
-   int rpos;
-   char *rbuf;
+   struct p9_fcall rc;
int wpos;
int wsize;
char *wbuf;
@@ -305,49 +301,56 @@ static void p9_read_work(struct work_struct *work)
if (m->err < 0)
return;
 
-   p9_debug(P9_DEBUG_TRANS, "start mux %p pos %d\n", m, m->rpos);
+   p9_debug(P9_DEBUG_TRANS, "start mux %p pos %zd\n", m, m->rc.offset);
 
-   if (!m->rbuf) {
-   m->rbuf = m->tmp_buf;
-   m->rpos = 0;
-   m->rsize = 7; /* start by reading header */
+   if (!m->rc.sdata) {
+   m->rc.sdata = m->tmp_buf;
+   m->rc.offset = 0;
+   m->rc.capacity = 7; /* start by reading header */
}
 
clear_bit(Rpending, &m->wsched);
-   p9_debug(P9_DEBUG_TRANS, "read mux %p pos %d size: %d = %d\n",
-m, m->rpos, m->rsize, m->rsize-m->rpos);
-   err = p9_fd_read(m->client, m->rbuf + m->rpos,
-   m->rsize - m->rpos);
+   p9_debug(P9_DEBUG_TRANS, "read mux %p pos %zd size: %zd = %zd\n",
+m, m->rc.offset, m->rc.capacity,
+m->rc.capacity - m->rc.offset);
+   err = p9_fd_read(m->client, m->rc.sdata + m->rc.offset,
+m->rc.capacity - m->rc.offset);
p9_debug(P9_DEBUG_TRANS, "mux %p got %d bytes\n", m, err);
-   if (err == -EAGAIN) {
+   if (err == -EAGAIN)
goto end_clear;
-   }
 
if (err <= 0)
goto error;
 
-   m->rpos += err;
+   m->rc.offset += err;
 
-   if ((!m->req) && (m->rpos == m->rsize)) { /* header read in */
-   u16 tag;
+   /* header read in */
+   if ((!m->req) && (m->rc.offset == m->rc.capacity)) {
p9_debug(P9_DEBUG_TRANS, "got new header\n");
 
-   n = le32_to_cpu(*(__le32 *) m->rbuf); /* read packet size */
-   if (n >= m->client->msize) {
+   err = p9_parse_header(&m->rc, NULL, NULL, NULL, 0);
+   if (err) {
p9_debug(P9_DEBUG_ERROR,
-"requested packet size too big: %d\n", n);
+"error parsing header: %d\n", err);
+   goto error;
+   }
+
+   if (m->rc.size >= m->client->msize) {
+   p9_debug(P9_DEBUG_ERROR,
+"requested packet size too big: %d\n",
+m->rc.size);
err = -EIO;
goto error;
}
 
-   tag = le16_to_cpu(*(__le16 *) (m->rbuf+5)); /* read tag */
p9_debug(P9_DEBUG_TRANS,
-"mux %p pkt: size: %d bytes tag: %d\n", m, n, tag);
+"mux %p pkt: size: %d bytes tag: %d\n",
+m, m->rc.size, m->rc.tag);
 
-   m->req = p9_tag_lookup(m->client, tag);
+   m->req = p9_tag_lookup(m->client, m->rc.tag);
if (!m->req || (m->req->status != REQ_STATUS_SENT)) {
p9_debug(P9_DEBUG_ERROR, "Unexpected packet tag %d\n",
-tag);
+